聚合函式與分組:PostgreSQL 的資料匯總與多維分析 | PostgreSQL
聚合函式(Aggregate Function) 是 SQL 中將多筆資料列合併為單一結果的核心機制。PostgreSQL 在標準 SQL 聚合基礎上提供了豐富的擴充能力,包含 FILTER 條件聚合、GROUPING SETS / ROLLUP / CUBE 多維度分組、Ordered-Set Aggregates 百分位分析,以及自訂聚合函式與 Parallel Aggregate 並行執行,讓資料匯總與統計分析能力遠超一般關聯式資料庫。
聚合函式的內部執行流程
PostgreSQL 的每個聚合函式由兩個核心函式組成:
- Transition Function(狀態轉換函式):逐筆處理輸入資料,將當前行的值與累積狀態合併
- Final Function(最終函式):在所有資料列處理完畢後,將最終狀態轉換為輸出結果
以 avg(x) 為例:
state_type = (sum NUMERIC, count BIGINT)
Transition Function:逐筆累積
new_state = (state.sum + x, state.count + 1)
Final Function:計算結果
result = state.sum / state.count
可透過系統目錄 pg_aggregate 觀察任何聚合函式的內部函式:
-- 查看 avg(numeric) 的轉換函式與最終函式
SELECT aggfnoid::regprocedure,
aggtransfn::regprocedure,
aggfinalfn::regprocedure
FROM pg_aggregate
WHERE aggfnoid = 'avg(numeric)'::regprocedure;
基本聚合函式與 NULL 處理
-- 範例資料
CREATE TABLE sales (
id SERIAL PRIMARY KEY,
rep_id INT,
product TEXT,
amount NUMERIC,
sale_date DATE
);
INSERT INTO sales (rep_id, product, amount, sale_date) VALUES
(1, 'A', 100, '2026-01-01'),
(1, 'B', 200, '2026-01-02'),
(1, 'A', NULL, '2026-01-03'), -- NULL 值
(2, 'B', 150, '2026-01-01'),
(2, 'A', 300, '2026-01-04'),
(2, 'C', 250, '2026-01-05');
-- 基本聚合與 NULL 行為
SELECT
rep_id,
COUNT(*) AS total_rows, -- 計算所有列,包含 NULL
COUNT(amount) AS non_null_count, -- 排除 NULL 列
SUM(amount) AS total_amount, -- NULL 被忽略
AVG(amount) AS avg_amount, -- NULL 不計入分母
MIN(amount) AS min_amount,
MAX(amount) AS max_amount
FROM sales
GROUP BY rep_id;
-- 結果(rep_id=1 的 amount 含一個 NULL):
-- rep_id | total_rows | non_null_count | total_amount | avg_amount | min | max
-- 1 | 3 | 2 | 300 | 150 | 100 | 200
-- 2 | 3 | 3 | 700 | 233.33 | 150 | 300
NULL 處理要點:
| 函式 | NULL 行為 |
|---|---|
COUNT(*) | 計算所有列(含 NULL) |
COUNT(column) | 排除 NULL 列 |
SUM / AVG / MIN / MAX | 自動忽略 NULL 值 |
| 全部為 NULL 時 | SUM 回傳 NULL(非 0) |
-- 安全的 NULL 處理
SELECT COALESCE(SUM(amount), 0) AS safe_total
FROM sales
WHERE rep_id = 99; -- 無資料時回傳 0 而非 NULL
HashAggregate vs GroupAggregate
查詢規劃器在執行 GROUP BY 時會選擇兩種策略之一:
| 策略 | 原理 | 適用情境 |
|---|---|---|
| HashAggregate | 建立 Hash Table,以 GROUP BY 鍵直接累積 | 分組數少、資料量適中 |
| GroupAggregate | 資料預先排序,逐組掃描累積 | 分組數極多、需排序輸出 |
-- 觀察執行策略
EXPLAIN (ANALYZE, BUFFERS)
SELECT department, SUM(salary)
FROM employees
GROUP BY department;
-- "HashAggregate" → 使用 Hash 策略
-- "GroupAggregate" → 使用排序後群組策略
PostgreSQL 13 的重要改進:HashAggregate 支援磁碟溢出(disk spilling),超過 work_mem 時不再強制切換為 GroupAggregate:
-- 監控 HashAggregate 磁碟溢出
EXPLAIN (ANALYZE, BUFFERS)
SELECT col, COUNT(*) FROM large_table GROUP BY col;
-- 若出現 "Batches: N" (N > 1) 代表發生磁碟溢出
FILTER 子句:條件聚合
FILTER 子句(PostgreSQL 9.4+)是 SQL 標準的條件聚合語法,比傳統 CASE WHEN 更清晰:
-- 傳統寫法(CASE WHEN)
SELECT
rep_id,
SUM(CASE WHEN product = 'A' THEN amount ELSE 0 END) AS sum_a,
SUM(CASE WHEN product = 'B' THEN amount ELSE 0 END) AS sum_b
FROM sales
GROUP BY rep_id;
-- 現代寫法(FILTER 子句,推薦)
SELECT
rep_id,
SUM(amount) FILTER (WHERE product = 'A') AS sum_a,
SUM(amount) FILTER (WHERE product = 'B') AS sum_b,
COUNT(*) FILTER (WHERE amount > 200) AS high_value_count
FROM sales
GROUP BY rep_id;
FILTER 可與任何聚合函式搭配使用,包含 array_agg、string_agg 等。
GROUPING SETS:多維度分組
GROUPING SETS 允許在單一查詢中同時計算多個分組層級的聚合結果,邏輯上等同於多個 GROUP BY 的 UNION ALL,但 PostgreSQL 內部會共用一次資料掃描:
-- 範例資料
CREATE TABLE revenue (
year INT,
quarter INT,
region TEXT,
amount NUMERIC
);
INSERT INTO revenue VALUES
(2026, 1, 'North', 1000), (2026, 1, 'South', 800),
(2026, 2, 'North', 1200), (2026, 2, 'South', 900),
(2025, 1, 'North', 950), (2025, 1, 'South', 750);
-- 指定分組集合
SELECT year, quarter, region, SUM(amount) AS total
FROM revenue
GROUP BY GROUPING SETS (
(year, quarter, region), -- 最細粒度
(year, quarter), -- 年季合計
(year), -- 年合計
() -- 全部總計
)
ORDER BY year NULLS LAST, quarter NULLS LAST, region NULLS LAST;
-- 結果包含 4 個層級的小計:
-- year | quarter | region | total
-- 2025 | 1 | North | 950
-- 2025 | 1 | South | 750
-- 2025 | 1 | NULL | 1700 ← 年季小計
-- 2025 | NULL | NULL | 1700 ← 年小計
-- 2026 | 1 | North | 1000
-- ...
-- NULL | NULL | NULL | 5600 ← 總計
ROLLUP:階層式小計
ROLLUP(a, b, c) 等同於 GROUPING SETS((a,b,c), (a,b), (a), ()),適合階層式報表:
SELECT year, quarter, SUM(amount) AS total
FROM revenue
GROUP BY ROLLUP(year, quarter)
ORDER BY year NULLS LAST, quarter NULLS LAST;
-- 結果:
-- year | quarter | total
-- 2025 | 1 | 1700
-- 2025 | NULL | 1700 ← 2025 年合計
-- 2026 | 1 | 1800
-- 2026 | 2 | 2100
-- 2026 | NULL | 3900 ← 2026 年合計
-- NULL | NULL | 5600 ← 總計
CUBE:所有維度組合
CUBE(a, b) 產生所有可能的子集組合(2^n 個分組集合):
SELECT year, region, SUM(amount) AS total
FROM revenue
GROUP BY CUBE(year, region)
ORDER BY year NULLS LAST, region NULLS LAST;
-- CUBE(year, region) 產生 4 個分組集合:
-- (year, region), (year), (region), ()
-- 結果:
-- year | region | total
-- 2025 | North | 950
-- 2025 | South | 750
-- 2025 | NULL | 1700 ← 2025 年合計
-- 2026 | North | 2200
-- 2026 | South | 1700
-- 2026 | NULL | 3900 ← 2026 年合計
-- NULL | North | 3150 ← North 地區合計
-- NULL | South | 2450 ← South 地區合計
-- NULL | NULL | 5600 ← 總計
GROUPING() 函式判斷小計行
GROUPING(column) 回傳 0(此欄位在當前分組集合中)或 1(此欄位為 NULL 因為是小計行),解決小計 NULL 與資料 NULL 的歧義:
SELECT
CASE GROUPING(year)
WHEN 1 THEN '所有年份'
ELSE year::TEXT
END AS year_label,
CASE GROUPING(region)
WHEN 1 THEN '所有地區'
ELSE region
END AS region_label,
SUM(amount) AS total,
GROUPING(year, region) AS grouping_code
FROM revenue
GROUP BY CUBE(year, region)
ORDER BY GROUPING(year), year NULLS LAST,
GROUPING(region), region NULLS LAST;
-- grouping_code:二進位組合值
-- 0 = 兩欄位都在分組中
-- 1 = region 為小計
-- 2 = year 為小計
-- 3 = 兩欄位都為小計(總計行)
特殊聚合函式
array_agg — 收集為陣列
SELECT
rep_id,
array_agg(product) AS all_products,
array_agg(product ORDER BY product) AS sorted_products,
array_agg(DISTINCT product) AS unique_products,
array_agg(amount) FILTER (WHERE amount > 0) AS positive_amounts
FROM sales
GROUP BY rep_id;
-- rep_id | all_products | sorted_products | unique_products
-- 1 | {A,B,NULL} | {A,B} | {A,B}
-- 2 | {B,A,C} | {A,B,C} | {A,B,C}
-- 注意:array_agg 預設包含 NULL;DISTINCT 會排除 NULL
string_agg — 字串連接聚合
SELECT
rep_id,
string_agg(product, ', ') AS product_list,
string_agg(product, ', ' ORDER BY product) AS sorted_list,
string_agg(DISTINCT product, ' | ') AS unique_list
FROM sales
WHERE amount IS NOT NULL
GROUP BY rep_id;
-- rep_id | product_list | sorted_list | unique_list
-- 1 | A, B | A, B | A | B
-- 2 | B, A, C | A, B, C | A | B | C
json_agg / jsonb_agg — 聚合為 JSON 陣列
-- 聚合整列為 JSON 陣列
SELECT
rep_id,
json_agg(
json_build_object(
'product', product,
'amount', amount
) ORDER BY sale_date
) AS sales_detail
FROM sales
GROUP BY rep_id;
-- 結果(rep_id=2):
-- [{"product":"B","amount":150},{"product":"A","amount":300},{"product":"C","amount":250}]
json_object_agg — 聚合為 JSON 物件
-- 以產品為 Key,金額為 Value 建立 JSON 物件
SELECT
rep_id,
json_object_agg(product, amount) AS product_map
FROM sales
WHERE amount IS NOT NULL
GROUP BY rep_id;
-- {"A": 100, "B": 200}
-- {"A": 300, "B": 150, "C": 250}
Ordered-Set Aggregates
Ordered-Set Aggregates(有序集合聚合)要求輸入資料按指定順序排列,語法使用 WITHIN GROUP (ORDER BY ...)。
percentile_cont — 連續百分位數
-- 全體薪資百分位分析
SELECT
percentile_cont(0.25) WITHIN GROUP (ORDER BY salary) AS p25,
percentile_cont(0.5) WITHIN GROUP (ORDER BY salary) AS median,
percentile_cont(0.75) WITHIN GROUP (ORDER BY salary) AS p75,
percentile_cont(0.9) WITHIN GROUP (ORDER BY salary) AS p90
FROM salaries;
-- 各部門中位數
SELECT
dept,
percentile_cont(0.5) WITHIN GROUP (ORDER BY salary) AS median_salary
FROM salaries
GROUP BY dept;
percentile_disc — 離散百分位數
-- percentile_disc 回傳資料集中實際存在的值(不插值)
SELECT
percentile_disc(0.5) WITHIN GROUP (ORDER BY salary) AS median_discrete,
percentile_cont(0.5) WITHIN GROUP (ORDER BY salary) AS median_continuous
FROM salaries;
-- 兩者結果可能不同,disc 版本必定是資料集中的某個實際值
mode() — 眾數
-- 最常出現的值
SELECT mode() WITHIN GROUP (ORDER BY dept) AS most_common_dept
FROM salaries;
rank / percent_rank 作為 Ordered-Set Aggregate
-- 查詢假設薪資 80000 在資料集中的排名
SELECT
rank(80000) WITHIN GROUP (ORDER BY salary) AS rank_at_80k,
percent_rank(80000) WITHIN GROUP (ORDER BY salary) AS pct_rank_at_80k
FROM salaries;
統計聚合函式
-- 薪資統計分析
SELECT
dept,
COUNT(*) AS n,
AVG(salary) AS mean,
stddev(salary) AS std_dev, -- 樣本標準差
stddev_pop(salary) AS std_dev_pop, -- 母體標準差
variance(salary) AS var_sample, -- 樣本變異數
var_pop(salary) AS var_pop -- 母體變異數
FROM salaries
GROUP BY dept;
-- 迴歸與相關性分析
SELECT
corr(y, x) AS correlation, -- 相關係數(-1 到 1)
regr_slope(y, x) AS slope, -- 線性迴歸斜率
regr_intercept(y, x) AS intercept, -- 線性迴歸截距
regr_r2(y, x) AS r_squared, -- 決定係數 R²
covar_samp(y, x) AS covariance -- 樣本共變異數
FROM metrics;
自訂聚合函式(CREATE AGGREGATE)
PostgreSQL 允許自訂聚合函式,需提供 Transition Function 與 Final Function:
-- 自訂幾何平均數(Geometric Mean)聚合
-- 幾何平均 = exp(avg(ln(x)))
-- 1. Transition Function
CREATE FUNCTION geo_mean_transfn(state NUMERIC[], x NUMERIC)
RETURNS NUMERIC[] AS $$
BEGIN
IF x IS NULL OR x <= 0 THEN RETURN state; END IF;
RETURN ARRAY[
COALESCE(state[1], 0) + LN(x), -- 累積 ln 總和
COALESCE(state[2], 0) + 1 -- 計數
];
END;
$$ LANGUAGE plpgsql IMMUTABLE;
-- 2. Final Function
CREATE FUNCTION geo_mean_finalfn(state NUMERIC[])
RETURNS NUMERIC AS $$
BEGIN
IF state IS NULL OR state[2] = 0 THEN RETURN NULL; END IF;
RETURN EXP(state[1] / state[2]);
END;
$$ LANGUAGE plpgsql IMMUTABLE;
-- 3. 建立 Aggregate
CREATE AGGREGATE geometric_mean(NUMERIC) (
SFUNC = geo_mean_transfn, -- 狀態轉換函式
STYPE = NUMERIC[], -- 狀態類型
FINALFUNC = geo_mean_finalfn, -- 最終函式
INITCOND = '{0, 0}' -- 初始狀態
);
-- 使用自訂聚合
SELECT rep_id, geometric_mean(amount) AS geo_mean_sales
FROM sales
WHERE amount > 0
GROUP BY rep_id;
Parallel Aggregate
PostgreSQL 9.6+ 支援聚合並行化,執行流程:
Finalize Aggregate ← 合併各 Worker 的部分結果
└── Gather ← 收集 Worker 輸出
├── Partial Aggregate (Worker 1) ← 各自累積部分狀態
├── Partial Aggregate (Worker 2)
└── Partial Aggregate (Worker 3)
自訂聚合若要支援並行,需額外提供 COMBINEFUNC(合併函式):
-- 合併函式:將兩個 Worker 的部分狀態合併
CREATE FUNCTION geo_mean_combinefn(state1 NUMERIC[], state2 NUMERIC[])
RETURNS NUMERIC[] AS $$
BEGIN
RETURN ARRAY[
COALESCE(state1[1], 0) + COALESCE(state2[1], 0),
COALESCE(state1[2], 0) + COALESCE(state2[2], 0)
];
END;
$$ LANGUAGE plpgsql IMMUTABLE PARALLEL SAFE;
-- 支援並行的聚合定義
CREATE AGGREGATE geometric_mean_parallel(NUMERIC) (
SFUNC = geo_mean_transfn,
STYPE = NUMERIC[],
FINALFUNC = geo_mean_finalfn,
COMBINEFUNC = geo_mean_combinefn, -- 並行合併函式
INITCOND = '{0, 0}',
PARALLEL = SAFE
);
聚合與窗口函式結合
聚合函式可作為窗口函式使用,但不能在同一 SELECT 層級混用普通聚合與窗口聚合。常見模式是透過 CTE 分層:
-- 計算每筆銷售佔該業務員總金額的百分比
SELECT
rep_id, product, amount,
SUM(amount) OVER (PARTITION BY rep_id) AS rep_total,
ROUND(amount / SUM(amount) OVER (PARTITION BY rep_id) * 100, 1) AS pct_of_rep
FROM sales
WHERE amount IS NOT NULL;
-- 聚合後再使用窗口函式(先 GROUP BY,再 OVER)
WITH monthly_sales AS (
SELECT
rep_id,
DATE_TRUNC('month', sale_date) AS month,
SUM(amount) AS monthly_total
FROM sales
GROUP BY rep_id, DATE_TRUNC('month', sale_date)
)
SELECT
rep_id, month, monthly_total,
SUM(monthly_total) OVER (
PARTITION BY rep_id ORDER BY month
) AS cumulative_total,
LAG(monthly_total) OVER (
PARTITION BY rep_id ORDER BY month
) AS prev_month
FROM monthly_sales;
HAVING vs WHERE
SQL 執行順序:
FROM → WHERE → GROUP BY → 聚合計算 → HAVING → SELECT → ORDER BY
↑ ↑
聚合前過濾 聚合後過濾
-- WHERE 在聚合前過濾列,HAVING 在聚合後過濾結果
SELECT rep_id, SUM(amount) AS total
FROM sales
WHERE sale_date >= '2026-01-01' -- 先過濾列
GROUP BY rep_id
HAVING SUM(amount) > 200; -- 再過濾聚合結果
COUNT(DISTINCT) 的效能代價
COUNT(DISTINCT col) 在 PostgreSQL 中有顯著的效能代價:
- 無法並行化:DISTINCT 聚合目前無法在 Partial Aggregate 階段並行處理
- 多個 DISTINCT 代價倍增:每個 DISTINCT 聚合獨立去重
-- 昂貴的多 DISTINCT 聚合
SELECT
COUNT(DISTINCT rep_id) AS unique_reps,
COUNT(DISTINCT product) AS unique_products,
COUNT(DISTINCT sale_date) AS unique_dates
FROM sales;
-- 替代方案:子查詢預先去重
WITH base AS (
SELECT DISTINCT rep_id, product, sale_date FROM sales
)
SELECT
COUNT(DISTINCT rep_id) AS unique_reps,
COUNT(DISTINCT product) AS unique_products,
COUNT(DISTINCT sale_date) AS unique_dates
FROM base;
-- 大資料量:考慮 HyperLogLog 擴充(hll extension)
-- SELECT hll_cardinality(hll_add_agg(hll_hash_text(col::TEXT))) FROM t;
版本演進
| 版本 | 新增功能 |
|---|---|
| PG 9.4 | FILTER 子句、WITHIN GROUP(Ordered-Set Aggregates) |
| PG 9.6 | Parallel Aggregate 基礎支援 |
| PG 10 | 改善 HashAggregate 記憶體管理 |
| PG 12 | Parallel DISTINCT 部分改善 |
| PG 13 | HashAggregate 支援磁碟溢出,超過 work_mem 不再強制切 GroupAggregate |
| PG 14 | json_object_agg、jsonb_object_agg 加入 DISTINCT 與 ORDER BY |
| PG 16 | Aggregate pushdown 改善、並行聚合更多場景支援 |
常見陷阱
GROUP BY 使用 SELECT 別名
-- PostgreSQL 允許,但標準 SQL 不允許
SELECT
DATE_TRUNC('month', sale_date) AS month,
SUM(amount) AS total
FROM sales
GROUP BY month; -- PostgreSQL 擴充語法,其他資料庫可能報錯
ROLLUP / CUBE 中 NULL 的歧義
-- 分組小計的 NULL 與資料本身的 NULL 無法區分
-- 應使用 GROUPING() 函式
SELECT
GROUPING(region) AS is_subtotal,
region,
SUM(amount)
FROM revenue
GROUP BY ROLLUP(region);
-- is_subtotal = 1 → 小計行(region 為 NULL 是因為 ROLLUP)
-- is_subtotal = 0 → 實際資料(region 若為 NULL 是資料本身)
SUM 全部為 NULL 時回傳 NULL
-- SUM 在所有值為 NULL 時回傳 NULL,不是 0
SELECT SUM(amount) FROM sales WHERE 1 = 0;
-- 結果:NULL
-- 安全處理
SELECT COALESCE(SUM(amount), 0) FROM sales WHERE 1 = 0;
-- 結果:0
總結
聚合函式 是 PostgreSQL 資料匯總與分析的核心能力:
- 基本聚合(SUM / AVG / COUNT / MIN / MAX)自動忽略 NULL,
COUNT(*)除外 - FILTER 子句 取代
CASE WHEN,語義更清晰(PG 9.4+) - GROUPING SETS / ROLLUP / CUBE 在單一查詢中計算多維度小計,搭配
GROUPING()區分小計行 - Ordered-Set Aggregates 提供百分位(
percentile_cont/percentile_disc)、眾數(mode)等統計功能 - 自訂聚合(
CREATE AGGREGATE)透過 Transition + Final Function 實現任意聚合邏輯 - Parallel Aggregate 需提供
COMBINEFUNC才能並行執行自訂聚合 COUNT(DISTINCT)效能代價高,大資料量考慮 HyperLogLog 近似計數
下一篇,我們將深入探討 CTE 與子查詢——PostgreSQL 的查詢組合與遞迴查詢能力。