Thread

  1. [Patch] Add WHERE clause support to REFRESH MATERIALIZED VIEW

    Adam Brusselback <adambrusselback@gmail.com> — 2025-12-08T20:58:27Z

    Attached is a patch implementing support for a WHERE clause in REFRESH
    MATERIALIZED VIEW.
    
    The syntax allows for targeted refreshes:
        REFRESH MATERIALIZED VIEW mv WHERE invoice_id = ANY('{1,2,3}');
        REFRESH MATERIALIZED VIEW CONCURRENTLY mv WHERE customer_id = 42;
        REFRESH MATERIALIZED VIEW mv WHERE order_date >= '2023-01-01';
    
    I was inspired to implement this feature after watching the Hacking
    Postgres discussion on the topic:
    https://www.youtube.com/watch?v=6cZvHjDrmlQ
    
    This allows the user to restrict the refresh operation to a subset of the
    view. The qualification is applied to the view's output columns. The
    optimizer can then push this condition down to the underlying base tables,
    avoiding a full scan when only a known subset of data has changed.
    
    Implementation notes:
    
    1. The grammar accepts an optional WHERE clause. We forbid volatile
    functions in the clause to ensure correctness.
    
    2. Non-Concurrent Partial Refresh: When `CONCURRENTLY` is not specified,
    the operation performs an in-place modification using a `ROW EXCLUSIVE`
    lock.
        *   This mode requires a unique index to ensure constraint violations
    are handled correctly (e.g., when a row's values change such that it
    "drifts" into or out of the `WHERE` clause scope).
        *   It executes a Prune + Upsert strategy:
            * `DELETE` all rows in the materialized view that match the `WHERE`
    clause.
            * `INSERT` the new data from the source query.
        *   It uses `ON CONFLICT DO UPDATE` during the insert phase to handle
    concurrency edge cases, ensuring the refresh is robust against constraint
    violations.
    
    3. Concurrent Partial Refresh: When `CONCURRENTLY` is specified, it uses
    the existing diff/merge infrastructure (`refresh_by_match_merge`), limiting
    the scope of the diff (and the temporary table population) to the rows
    matching the predicate. This requires an `EXCLUSIVE` lock and a unique
    index, consistent with existing concurrent refresh behavior. It is much
    slower than `Non-Concurrent Partial Refresh`
    
    4. The execution logic uses SPI to inject the predicate into the source
    queries during execution.
    
    I have attached a benchmark suite to validate performance and correctness:
    
    *   `setup.sql`: Creates a schema `mv_benchmark` modeling an invoicing
    system (`invoices` and `invoice_lines`). It includes an aggregated
    materialized view (`invoice_summary`) and a control table
    (`invoice_summary_table`).
    *   `workload_*.sql`: pgbench scripts simulating a high-churn environment
    (45% inserts, 10% updates, 45% deletes) to maintain roughly stable dataset
    sizes while generating significant refresh work.
    *   `run_benchmark_comprehensive.sh`: Orchestrates the benchmark across
    multiple scale factors and concurrency levels.
    
    The benchmark compares strategies for keeping a summary up to date (vs
    baseline):
    *   Partial Refresh: Triggers on the base table collect modified IDs and
    execute `REFRESH MATERIALIZED VIEW ... WHERE ...`.
    *   Materialized Table (Control): A standard table maintained via complex
    PL/pgSQL triggers (the traditional manual workaround).
    *   Full Refresh (Legacy): Manually refresh the view after changes.
    
    Results are below:
    Concurrency: 1 client(s)
    ----------------------------------------------------------------------------------
    Scale       Batch | Baseline TPS | Full (Rel)   Partial (Rel) Table (Rel)
    ---------- ------ | ------------ | ------------ ------------ ------------
    20000           1 | 5309.05      | 0.002x        0.437x        0.470x
    20000          50 | 1209.32      | 0.010x        0.600x        0.598x
    20000        1000 | 56.05        | 0.164x        0.594x        0.576x
    400000          1 | 5136.91      | 0    x        0.450x        0.487x
    400000         50 | 1709.17      | 0    x        0.497x        0.482x
    400000       1000 | 110.35       | 0.006x        0.507x        0.460x
    
    Concurrency: 4 client(s)
    ----------------------------------------------------------------------------------
    Scale       Batch | Baseline TPS | Full (Rel)   Partial (Rel) Table (Rel)
    ---------- ------ | ------------ | ------------ ------------ ------------
    20000           1 | 19197.50     | 0x            0.412x        0.435x
    20000          50 | 1016.14      | 0.007x        0.966x        1.036x
    20000        1000 | 9.94         | 0.708x        1.401x        1.169x
    400000          1 | 19637.36     | 0x            0.436x        0.483x
    400000         50 | 4669.32      | 0x            0.574x        0.566x
    400000       1000 | 23.26        | 0.029x        1.147x        0.715x
    
    Concurrency: 8 client(s)
    ----------------------------------------------------------------------------------
    Scale       Batch | Baseline TPS | Full (Rel)   Partial (Rel) Table (Rel)
    ---------- ------ | ------------ | ------------ ------------ ------------
    20000           1 | 30358.32     | 0x            0.440x        0.457x
    20000          50 | 262.75       | 0.026x        2.943x        2.740x
    20000        1000 | 11.28        | 0.575x        0.840x        0.578x
    400000          1 | 36007.15     | 0x            0.430x        0.464x
    400000         50 | 6664.58      | 0x            0.563x        0.494x
    400000       1000 | 11.61        | 0.058x        1.000x        1.277x
    
    
    
    In these tests, the partial refresh behaves as O(delta) rather than
    O(total), performing comparably to the manual PL/pgSQL approach but with
    significantly lower code complexity for the user.
    
    I recognize that adding a WHERE clause to REFRESH is an extension to the
    SQL standard. I believe the syntax is intuitive, but I am open to
    discussion regarding alternative implementation strategies or syntax if the
    community feels a different approach is warranted.
    
    New regression tests are included in the patch.
    
    This is my first time submitting a patch to PostgreSQL, so please bear with
    me if I've missed anything or made any procedural mistakes. I'm happy to
    address any feedback.
    
    Thanks,
    Adam Brusselback