聚合函式與分組:PostgreSQL 的資料匯總與多維分析 | PostgreSQL

2026/06/24
聚合函式與分組:PostgreSQL 的資料匯總與多維分析 | PostgreSQL

聚合函式(Aggregate Function) 是 SQL 中將多筆資料列合併為單一結果的核心機制。PostgreSQL 在標準 SQL 聚合基礎上提供了豐富的擴充能力,包含 FILTER 條件聚合、GROUPING SETS / ROLLUP / CUBE 多維度分組、Ordered-Set Aggregates 百分位分析,以及自訂聚合函式與 Parallel Aggregate 並行執行,讓資料匯總與統計分析能力遠超一般關聯式資料庫。

聚合函式的內部執行流程

PostgreSQL 的每個聚合函式由兩個核心函式組成:

  1. Transition Function(狀態轉換函式):逐筆處理輸入資料,將當前行的值與累積狀態合併
  2. Final Function(最終函式):在所有資料列處理完畢後,將最終狀態轉換為輸出結果

avg(x) 為例:

state_type = (sum NUMERIC, count BIGINT)

Transition Function:逐筆累積
  new_state = (state.sum + x, state.count + 1)

Final Function:計算結果
  result = state.sum / state.count

可透過系統目錄 pg_aggregate 觀察任何聚合函式的內部函式:

-- 查看 avg(numeric) 的轉換函式與最終函式
SELECT aggfnoid::regprocedure,
       aggtransfn::regprocedure,
       aggfinalfn::regprocedure
FROM   pg_aggregate
WHERE  aggfnoid = 'avg(numeric)'::regprocedure;

基本聚合函式與 NULL 處理

-- 範例資料
CREATE TABLE sales (
    id         SERIAL PRIMARY KEY,
    rep_id     INT,
    product    TEXT,
    amount     NUMERIC,
    sale_date  DATE
);

INSERT INTO sales (rep_id, product, amount, sale_date) VALUES
(1, 'A', 100,  '2026-01-01'),
(1, 'B', 200,  '2026-01-02'),
(1, 'A', NULL, '2026-01-03'),  -- NULL 值
(2, 'B', 150,  '2026-01-01'),
(2, 'A', 300,  '2026-01-04'),
(2, 'C', 250,  '2026-01-05');
-- 基本聚合與 NULL 行為
SELECT
    rep_id,
    COUNT(*)          AS total_rows,      -- 計算所有列,包含 NULL
    COUNT(amount)     AS non_null_count,  -- 排除 NULL 列
    SUM(amount)       AS total_amount,    -- NULL 被忽略
    AVG(amount)       AS avg_amount,      -- NULL 不計入分母
    MIN(amount)       AS min_amount,
    MAX(amount)       AS max_amount
FROM   sales
GROUP  BY rep_id;

-- 結果(rep_id=1 的 amount 含一個 NULL):
-- rep_id | total_rows | non_null_count | total_amount | avg_amount | min | max
--      1 |          3 |              2 |          300 |        150 | 100 | 200
--      2 |          3 |              3 |          700 |     233.33 | 150 | 300

NULL 處理要點

函式NULL 行為
COUNT(*)計算所有列(含 NULL)
COUNT(column)排除 NULL 列
SUM / AVG / MIN / MAX自動忽略 NULL 值
全部為 NULL 時SUM 回傳 NULL(非 0)
-- 安全的 NULL 處理
SELECT COALESCE(SUM(amount), 0) AS safe_total
FROM   sales
WHERE  rep_id = 99;  -- 無資料時回傳 0 而非 NULL

HashAggregate vs GroupAggregate

查詢規劃器在執行 GROUP BY 時會選擇兩種策略之一:

策略原理適用情境
HashAggregate建立 Hash Table,以 GROUP BY 鍵直接累積分組數少、資料量適中
GroupAggregate資料預先排序,逐組掃描累積分組數極多、需排序輸出
-- 觀察執行策略
EXPLAIN (ANALYZE, BUFFERS)
SELECT department, SUM(salary)
FROM   employees
GROUP  BY department;

-- "HashAggregate" → 使用 Hash 策略
-- "GroupAggregate" → 使用排序後群組策略

PostgreSQL 13 的重要改進:HashAggregate 支援磁碟溢出(disk spilling),超過 work_mem 時不再強制切換為 GroupAggregate:

-- 監控 HashAggregate 磁碟溢出
EXPLAIN (ANALYZE, BUFFERS)
SELECT col, COUNT(*) FROM large_table GROUP BY col;
-- 若出現 "Batches: N" (N > 1) 代表發生磁碟溢出

FILTER 子句:條件聚合

FILTER 子句(PostgreSQL 9.4+)是 SQL 標準的條件聚合語法,比傳統 CASE WHEN 更清晰:

-- 傳統寫法(CASE WHEN)
SELECT
    rep_id,
    SUM(CASE WHEN product = 'A' THEN amount ELSE 0 END) AS sum_a,
    SUM(CASE WHEN product = 'B' THEN amount ELSE 0 END) AS sum_b
FROM   sales
GROUP  BY rep_id;

-- 現代寫法(FILTER 子句,推薦)
SELECT
    rep_id,
    SUM(amount) FILTER (WHERE product = 'A') AS sum_a,
    SUM(amount) FILTER (WHERE product = 'B') AS sum_b,
    COUNT(*)    FILTER (WHERE amount > 200)   AS high_value_count
FROM   sales
GROUP  BY rep_id;

FILTER 可與任何聚合函式搭配使用,包含 array_aggstring_agg 等。

GROUPING SETS:多維度分組

GROUPING SETS 允許在單一查詢中同時計算多個分組層級的聚合結果,邏輯上等同於多個 GROUP BYUNION ALL,但 PostgreSQL 內部會共用一次資料掃描:

-- 範例資料
CREATE TABLE revenue (
    year    INT,
    quarter INT,
    region  TEXT,
    amount  NUMERIC
);

INSERT INTO revenue VALUES
(2026, 1, 'North', 1000), (2026, 1, 'South', 800),
(2026, 2, 'North', 1200), (2026, 2, 'South', 900),
(2025, 1, 'North', 950),  (2025, 1, 'South', 750);
-- 指定分組集合
SELECT year, quarter, region, SUM(amount) AS total
FROM   revenue
GROUP  BY GROUPING SETS (
    (year, quarter, region),  -- 最細粒度
    (year, quarter),           -- 年季合計
    (year),                    -- 年合計
    ()                         -- 全部總計
)
ORDER  BY year NULLS LAST, quarter NULLS LAST, region NULLS LAST;

-- 結果包含 4 個層級的小計:
-- year  | quarter | region | total
-- 2025  |       1 | North  |   950
-- 2025  |       1 | South  |   750
-- 2025  |       1 | NULL   |  1700  ← 年季小計
-- 2025  | NULL    | NULL   |  1700  ← 年小計
-- 2026  |       1 | North  |  1000
-- ...
-- NULL  | NULL    | NULL   |  5600  ← 總計

ROLLUP:階層式小計

ROLLUP(a, b, c) 等同於 GROUPING SETS((a,b,c), (a,b), (a), ()),適合階層式報表:

SELECT year, quarter, SUM(amount) AS total
FROM   revenue
GROUP  BY ROLLUP(year, quarter)
ORDER  BY year NULLS LAST, quarter NULLS LAST;

-- 結果:
-- year  | quarter | total
-- 2025  |       1 |  1700
-- 2025  |    NULL |  1700  ← 2025 年合計
-- 2026  |       1 |  1800
-- 2026  |       2 |  2100
-- 2026  |    NULL |  3900  ← 2026 年合計
-- NULL  |    NULL |  5600  ← 總計

CUBE:所有維度組合

CUBE(a, b) 產生所有可能的子集組合(2^n 個分組集合):

SELECT year, region, SUM(amount) AS total
FROM   revenue
GROUP  BY CUBE(year, region)
ORDER  BY year NULLS LAST, region NULLS LAST;

-- CUBE(year, region) 產生 4 個分組集合:
-- (year, region), (year), (region), ()

-- 結果:
-- year  | region | total
-- 2025  | North  |   950
-- 2025  | South  |   750
-- 2025  | NULL   |  1700  ← 2025 年合計
-- 2026  | North  |  2200
-- 2026  | South  |  1700
-- 2026  | NULL   |  3900  ← 2026 年合計
-- NULL  | North  |  3150  ← North 地區合計
-- NULL  | South  |  2450  ← South 地區合計
-- NULL  | NULL   |  5600  ← 總計

GROUPING() 函式判斷小計行

GROUPING(column) 回傳 0(此欄位在當前分組集合中)或 1(此欄位為 NULL 因為是小計行),解決小計 NULL 與資料 NULL 的歧義:

SELECT
    CASE GROUPING(year)
        WHEN 1 THEN '所有年份'
        ELSE year::TEXT
    END AS year_label,
    CASE GROUPING(region)
        WHEN 1 THEN '所有地區'
        ELSE region
    END AS region_label,
    SUM(amount) AS total,
    GROUPING(year, region) AS grouping_code
FROM   revenue
GROUP  BY CUBE(year, region)
ORDER  BY GROUPING(year), year NULLS LAST,
         GROUPING(region), region NULLS LAST;

-- grouping_code:二進位組合值
-- 0 = 兩欄位都在分組中
-- 1 = region 為小計
-- 2 = year 為小計
-- 3 = 兩欄位都為小計(總計行)

特殊聚合函式

array_agg — 收集為陣列

SELECT
    rep_id,
    array_agg(product)                          AS all_products,
    array_agg(product ORDER BY product)         AS sorted_products,
    array_agg(DISTINCT product)                 AS unique_products,
    array_agg(amount) FILTER (WHERE amount > 0) AS positive_amounts
FROM   sales
GROUP  BY rep_id;

-- rep_id | all_products | sorted_products | unique_products
--      1 | {A,B,NULL}   | {A,B}           | {A,B}
--      2 | {B,A,C}      | {A,B,C}         | {A,B,C}
-- 注意:array_agg 預設包含 NULL;DISTINCT 會排除 NULL

string_agg — 字串連接聚合

SELECT
    rep_id,
    string_agg(product, ', ')                  AS product_list,
    string_agg(product, ', ' ORDER BY product) AS sorted_list,
    string_agg(DISTINCT product, ' | ')        AS unique_list
FROM   sales
WHERE  amount IS NOT NULL
GROUP  BY rep_id;

-- rep_id | product_list | sorted_list | unique_list
--      1 | A, B         | A, B        | A | B
--      2 | B, A, C      | A, B, C     | A | B | C

json_agg / jsonb_agg — 聚合為 JSON 陣列

-- 聚合整列為 JSON 陣列
SELECT
    rep_id,
    json_agg(
        json_build_object(
            'product', product,
            'amount',  amount
        ) ORDER BY sale_date
    ) AS sales_detail
FROM   sales
GROUP  BY rep_id;

-- 結果(rep_id=2):
-- [{"product":"B","amount":150},{"product":"A","amount":300},{"product":"C","amount":250}]

json_object_agg — 聚合為 JSON 物件

-- 以產品為 Key,金額為 Value 建立 JSON 物件
SELECT
    rep_id,
    json_object_agg(product, amount) AS product_map
FROM   sales
WHERE  amount IS NOT NULL
GROUP  BY rep_id;

-- {"A": 100, "B": 200}
-- {"A": 300, "B": 150, "C": 250}

Ordered-Set Aggregates

Ordered-Set Aggregates(有序集合聚合)要求輸入資料按指定順序排列,語法使用 WITHIN GROUP (ORDER BY ...)

percentile_cont — 連續百分位數

-- 全體薪資百分位分析
SELECT
    percentile_cont(0.25) WITHIN GROUP (ORDER BY salary) AS p25,
    percentile_cont(0.5)  WITHIN GROUP (ORDER BY salary) AS median,
    percentile_cont(0.75) WITHIN GROUP (ORDER BY salary) AS p75,
    percentile_cont(0.9)  WITHIN GROUP (ORDER BY salary) AS p90
FROM   salaries;

-- 各部門中位數
SELECT
    dept,
    percentile_cont(0.5) WITHIN GROUP (ORDER BY salary) AS median_salary
FROM   salaries
GROUP  BY dept;

percentile_disc — 離散百分位數

-- percentile_disc 回傳資料集中實際存在的值(不插值)
SELECT
    percentile_disc(0.5) WITHIN GROUP (ORDER BY salary) AS median_discrete,
    percentile_cont(0.5) WITHIN GROUP (ORDER BY salary) AS median_continuous
FROM   salaries;
-- 兩者結果可能不同,disc 版本必定是資料集中的某個實際值

mode() — 眾數

-- 最常出現的值
SELECT mode() WITHIN GROUP (ORDER BY dept) AS most_common_dept
FROM   salaries;

rank / percent_rank 作為 Ordered-Set Aggregate

-- 查詢假設薪資 80000 在資料集中的排名
SELECT
    rank(80000)         WITHIN GROUP (ORDER BY salary) AS rank_at_80k,
    percent_rank(80000) WITHIN GROUP (ORDER BY salary) AS pct_rank_at_80k
FROM   salaries;

統計聚合函式

-- 薪資統計分析
SELECT
    dept,
    COUNT(*)               AS n,
    AVG(salary)            AS mean,
    stddev(salary)         AS std_dev,      -- 樣本標準差
    stddev_pop(salary)     AS std_dev_pop,  -- 母體標準差
    variance(salary)       AS var_sample,   -- 樣本變異數
    var_pop(salary)        AS var_pop       -- 母體變異數
FROM   salaries
GROUP  BY dept;
-- 迴歸與相關性分析
SELECT
    corr(y, x)             AS correlation,  -- 相關係數(-1 到 1)
    regr_slope(y, x)       AS slope,        -- 線性迴歸斜率
    regr_intercept(y, x)   AS intercept,    -- 線性迴歸截距
    regr_r2(y, x)          AS r_squared,    -- 決定係數 R²
    covar_samp(y, x)       AS covariance    -- 樣本共變異數
FROM   metrics;

自訂聚合函式(CREATE AGGREGATE)

PostgreSQL 允許自訂聚合函式,需提供 Transition Function 與 Final Function:

-- 自訂幾何平均數(Geometric Mean)聚合
-- 幾何平均 = exp(avg(ln(x)))

-- 1. Transition Function
CREATE FUNCTION geo_mean_transfn(state NUMERIC[], x NUMERIC)
RETURNS NUMERIC[] AS $$
BEGIN
    IF x IS NULL OR x <= 0 THEN RETURN state; END IF;
    RETURN ARRAY[
        COALESCE(state[1], 0) + LN(x),  -- 累積 ln 總和
        COALESCE(state[2], 0) + 1         -- 計數
    ];
END;
$$ LANGUAGE plpgsql IMMUTABLE;

-- 2. Final Function
CREATE FUNCTION geo_mean_finalfn(state NUMERIC[])
RETURNS NUMERIC AS $$
BEGIN
    IF state IS NULL OR state[2] = 0 THEN RETURN NULL; END IF;
    RETURN EXP(state[1] / state[2]);
END;
$$ LANGUAGE plpgsql IMMUTABLE;

-- 3. 建立 Aggregate
CREATE AGGREGATE geometric_mean(NUMERIC) (
    SFUNC     = geo_mean_transfn,       -- 狀態轉換函式
    STYPE     = NUMERIC[],              -- 狀態類型
    FINALFUNC = geo_mean_finalfn,       -- 最終函式
    INITCOND  = '{0, 0}'                -- 初始狀態
);

-- 使用自訂聚合
SELECT rep_id, geometric_mean(amount) AS geo_mean_sales
FROM   sales
WHERE  amount > 0
GROUP  BY rep_id;

Parallel Aggregate

PostgreSQL 9.6+ 支援聚合並行化,執行流程:

Finalize Aggregate             ← 合併各 Worker 的部分結果
  └── Gather                   ← 收集 Worker 輸出
        ├── Partial Aggregate (Worker 1)  ← 各自累積部分狀態
        ├── Partial Aggregate (Worker 2)
        └── Partial Aggregate (Worker 3)

自訂聚合若要支援並行,需額外提供 COMBINEFUNC(合併函式):

-- 合併函式:將兩個 Worker 的部分狀態合併
CREATE FUNCTION geo_mean_combinefn(state1 NUMERIC[], state2 NUMERIC[])
RETURNS NUMERIC[] AS $$
BEGIN
    RETURN ARRAY[
        COALESCE(state1[1], 0) + COALESCE(state2[1], 0),
        COALESCE(state1[2], 0) + COALESCE(state2[2], 0)
    ];
END;
$$ LANGUAGE plpgsql IMMUTABLE PARALLEL SAFE;

-- 支援並行的聚合定義
CREATE AGGREGATE geometric_mean_parallel(NUMERIC) (
    SFUNC       = geo_mean_transfn,
    STYPE       = NUMERIC[],
    FINALFUNC   = geo_mean_finalfn,
    COMBINEFUNC = geo_mean_combinefn,  -- 並行合併函式
    INITCOND    = '{0, 0}',
    PARALLEL    = SAFE
);

聚合與窗口函式結合

聚合函式可作為窗口函式使用,但不能在同一 SELECT 層級混用普通聚合與窗口聚合。常見模式是透過 CTE 分層:

-- 計算每筆銷售佔該業務員總金額的百分比
SELECT
    rep_id, product, amount,
    SUM(amount) OVER (PARTITION BY rep_id) AS rep_total,
    ROUND(amount / SUM(amount) OVER (PARTITION BY rep_id) * 100, 1) AS pct_of_rep
FROM   sales
WHERE  amount IS NOT NULL;
-- 聚合後再使用窗口函式(先 GROUP BY,再 OVER)
WITH monthly_sales AS (
    SELECT
        rep_id,
        DATE_TRUNC('month', sale_date) AS month,
        SUM(amount)                    AS monthly_total
    FROM   sales
    GROUP  BY rep_id, DATE_TRUNC('month', sale_date)
)
SELECT
    rep_id, month, monthly_total,
    SUM(monthly_total) OVER (
        PARTITION BY rep_id ORDER BY month
    ) AS cumulative_total,
    LAG(monthly_total) OVER (
        PARTITION BY rep_id ORDER BY month
    ) AS prev_month
FROM   monthly_sales;

HAVING vs WHERE

SQL 執行順序:

  FROM → WHERE → GROUP BY → 聚合計算 → HAVING → SELECT → ORDER BY
              ↑                              ↑
          聚合前過濾                     聚合後過濾
-- WHERE 在聚合前過濾列,HAVING 在聚合後過濾結果
SELECT rep_id, SUM(amount) AS total
FROM   sales
WHERE  sale_date >= '2026-01-01'   -- 先過濾列
GROUP  BY rep_id
HAVING SUM(amount) > 200;           -- 再過濾聚合結果

COUNT(DISTINCT) 的效能代價

COUNT(DISTINCT col) 在 PostgreSQL 中有顯著的效能代價:

  1. 無法並行化:DISTINCT 聚合目前無法在 Partial Aggregate 階段並行處理
  2. 多個 DISTINCT 代價倍增:每個 DISTINCT 聚合獨立去重
-- 昂貴的多 DISTINCT 聚合
SELECT
    COUNT(DISTINCT rep_id)    AS unique_reps,
    COUNT(DISTINCT product)   AS unique_products,
    COUNT(DISTINCT sale_date) AS unique_dates
FROM   sales;

-- 替代方案:子查詢預先去重
WITH base AS (
    SELECT DISTINCT rep_id, product, sale_date FROM sales
)
SELECT
    COUNT(DISTINCT rep_id)    AS unique_reps,
    COUNT(DISTINCT product)   AS unique_products,
    COUNT(DISTINCT sale_date) AS unique_dates
FROM   base;

-- 大資料量:考慮 HyperLogLog 擴充(hll extension)
-- SELECT hll_cardinality(hll_add_agg(hll_hash_text(col::TEXT))) FROM t;

版本演進

版本新增功能
PG 9.4FILTER 子句、WITHIN GROUP(Ordered-Set Aggregates)
PG 9.6Parallel Aggregate 基礎支援
PG 10改善 HashAggregate 記憶體管理
PG 12Parallel DISTINCT 部分改善
PG 13HashAggregate 支援磁碟溢出,超過 work_mem 不再強制切 GroupAggregate
PG 14json_object_aggjsonb_object_agg 加入 DISTINCTORDER BY
PG 16Aggregate pushdown 改善、並行聚合更多場景支援

常見陷阱

GROUP BY 使用 SELECT 別名

-- PostgreSQL 允許,但標準 SQL 不允許
SELECT
    DATE_TRUNC('month', sale_date) AS month,
    SUM(amount) AS total
FROM   sales
GROUP  BY month;  -- PostgreSQL 擴充語法,其他資料庫可能報錯

ROLLUP / CUBE 中 NULL 的歧義

-- 分組小計的 NULL 與資料本身的 NULL 無法區分
-- 應使用 GROUPING() 函式
SELECT
    GROUPING(region) AS is_subtotal,
    region,
    SUM(amount)
FROM   revenue
GROUP  BY ROLLUP(region);
-- is_subtotal = 1 → 小計行(region 為 NULL 是因為 ROLLUP)
-- is_subtotal = 0 → 實際資料(region 若為 NULL 是資料本身)

SUM 全部為 NULL 時回傳 NULL

-- SUM 在所有值為 NULL 時回傳 NULL,不是 0
SELECT SUM(amount) FROM sales WHERE 1 = 0;
-- 結果:NULL

-- 安全處理
SELECT COALESCE(SUM(amount), 0) FROM sales WHERE 1 = 0;
-- 結果:0

總結

聚合函式 是 PostgreSQL 資料匯總與分析的核心能力:

  • 基本聚合(SUM / AVG / COUNT / MIN / MAX)自動忽略 NULL,COUNT(*) 除外
  • FILTER 子句 取代 CASE WHEN,語義更清晰(PG 9.4+)
  • GROUPING SETS / ROLLUP / CUBE 在單一查詢中計算多維度小計,搭配 GROUPING() 區分小計行
  • Ordered-Set Aggregates 提供百分位(percentile_cont / percentile_disc)、眾數(mode)等統計功能
  • 自訂聚合CREATE AGGREGATE)透過 Transition + Final Function 實現任意聚合邏輯
  • Parallel Aggregate 需提供 COMBINEFUNC 才能並行執行自訂聚合
  • COUNT(DISTINCT) 效能代價高,大資料量考慮 HyperLogLog 近似計數

下一篇,我們將深入探討 CTE 與子查詢——PostgreSQL 的查詢組合與遞迴查詢能力。

BenZ Software Developer

熱愛技術的軟體開發者,在這裡分享程式開發經驗與學習筆記。