Thread

  1. Re: [SQL/PGQ] Early pruning for GRAPH_TABLE path generation

    Henson Choi <assam258@gmail.com> — 2026-05-09T14:52:16Z

    Hi Junwang,
    
    Thanks for an interesting thread, and for picking this up — it has
    been a useful occasion to re-examine assumptions I had carried over
    from a different graph model.
    
    My background is on AgensGraph, where labels form an inheritance
    hierarchy (rather than Neo4j-style multi-label tag sets), so a path
    pattern rewrites against a single parent table per position and the
    fan-out across labels is left to the planner's existing inheritance
    machinery. The rewriter never enumerates per-element combinations, so
    an N^K blow-up like Satya's was simply not on the radar there.
    
    SQL/PGQ instead designates graph identity through the schema itself,
    so each pattern position carries a set of candidate element tables
    and the rewriter has to materialize each schema-feasible join
    combination — a (vertex, edge, vertex, ...) sequence of element
    tables — as its own SELECT branch, with all such branches joined by
    UNION ALL. The number of *feasible* join combinations is bounded by
    the schema graph's connectivity and is usually small; the current
    rewriter's naive shape, however, is to enumerate the full N^K
    candidate space first and reject infeasible combinations only
    afterwards. The patch tackles exactly that part.
    
    To be clear, I do not mean this as a criticism of the original
    implementer. For sizeable features I tend to prefer a deliberately
    naive implementation first — getting the full functionality working
    end-to-end before optimizing — and that approach is even more
    valuable in team settings, since a working baseline lets additional
    contributors join much earlier. The current shape of
    generate_queries_for_path_pattern_recurse() reads exactly like such a
    baseline, and this patch is precisely the kind of focused follow-up
    improvement it invites.
    
    When I thought about this independently, I had reached for the dual
    formulation — "extend the prefix only along feasible edges" rather
    than "enumerate everything and prune" — and my intuition was that the
    two should produce the same surviving set, though I have not worked
    that through rigorously. Either way, your "early pruning" framing is
    the better one for the patch itself: it stays close to the existing
    code structure and lets the change read as a small refinement of
    generate_queries_for_path_pattern_recurse() rather than a
    reorganization of how candidates are walked. That makes review and
    incremental landing materially easier.
    
    The same reframing also explains, after the fact, why the discrepancy
    between Ashutosh's quick runs and Satya's 81 GB case was so large:
    the cost being measured is in large part infeasible enumeration,
    which a forward-moved check eliminates regardless of pattern length
    or schema size in realistic graphs.
    
    I have not gone through the patch at the code level yet, but the
    shape of the problem and of your fix is clear enough from the
    description. My personal take, with that caveat, is that early
    pruning, together with the CHECK_FOR_INTERRUPTS that Satya already
    proposed in the earlier thread [1], feels like the right pair to
    land first; the explicit hard limit, in turn, is probably best
    revisited only if a real case turns up where even those two combined
    are not enough. That keeps the immediate change narrow and
    semantics-preserving while leaving the policy question open for
    evidence rather than upfront estimation.
    
    There is a fairly recent precedent in PostgreSQL itself that I think
    supports this ordering. I have been following — and participating in
    — Tatsuo Ishii's Row Pattern Recognition thread [2] since earlier
    this year, and that work ran into a structurally similar concern: in
    its earlier implementation the engine could undergo a combinatorial
    explosion of pattern-candidate states. A memory-based limit was
    floated as one mitigation, but it never actually converged into a
    design — the conversation around an explicit limit simply faded out
    as a layered set of pruning and early-termination rules accumulated
    incrementally and brought the state space back under control. CFI
    and stack-depth checks, by contrast, were treated as independently
    necessary throughout and stayed in the design.
    
    The shape of this thread feels analogous, and the same staged
    approach (structural fix + CFI first, hard limit only if a residual
    case genuinely demands it) is, I suspect, what will hold up
    here as well.
    
    I'll set aside time separately to do a proper line-by-line review and
    follow up.
    
    Regards,
    Henson
    
    [1]
    https://www.postgresql.org/message-id/CAHg%2BQDe8JU%2BERqA2xwjrg_ZptvH_v0T6PS9_P_ZgyYzD5h-Grw%40mail.gmail.com
    [2]
    https://www.postgresql.org/message-id/flat/20230625.210509.1276733411677577841.t-ishii%40sranhm.sra.co.jp