Thread

  1. Re: [SQL/PGQ] Early pruning for GRAPH_TABLE path generation

    Henson Choi <assam258@gmail.com> — 2026-05-12T10:28:02Z

    Hi Junwang,
    
    Thanks for the v1 patch -- I walked through it carefully. The
    direction holds up well; the notes below are non-blocking, with one
    concrete request and a handful of nits.
    
    ## On the approach
    
    "Enumerate and prune early" is the right shape for a first cut:
    the DFS keeps its existing structure, and we just stop extending a
    prefix as soon as the newly appended element makes adjacency
    impossible. Label/kind filtering already keeps slot sizes small in
    typical cases, so a structural fix at the DFS boundary captures the
    remaining blow-up without reshuffling the candidate set itself.
    
    Glad to see CHECK_FOR_INTERRUPTS folded in alongside the structural
    fix -- that pairing alone should be enough; a hard limit only needs
    to come back if a residual case demands it.
    
    ## On correctness
    
    The DFS is a nested loop: outer picks a vertex for the vertex slot,
    inner picks an edge for the adjacent edge slot. The new helper just
    compares vertex.elemoid against edge.srcvertexid/destvertexid -- both
    catalog-derived, fixed before the DFS starts. The pre-patch code
    evaluates the same equality at the end of the full N x M x ...
    expansion (via edge_qual == NULL in generate_query_for_graph_path);
    the patch hoists it into the inner loop body. Same equality, same
    surviving set. Direction handling is preserved the same way --
    reverse_ok starts true only for EDGE_PATTERN_ANY, so the undirected
    case keeps its src<->dest cross comparison.
    
    The patch is also stack-neutral: both pre-existing recursive
    functions already carry check_stack_depth() at entry
    (generate_queries_for_path_pattern_recurse,
    generate_setop_from_pathqueries), and the new helpers are
    non-recursive.
    
    ## One request: regression test
    
    Three cases would together cover the new code paths well:
    
    1. A long chain pattern that survives pruning but used to enumerate
       the full N^K product (the g5 chain from your benchmark is a
       natural fit). Not currently covered in
       src/test/regress/sql/graph_table.sql -- worth adding.
    
    2. A cyclic variant whose closing edge has both endpoints already
       in the prefix -- this is the only path that fires both if-blocks
       of graph_path_edge_is_feasible() simultaneously. Already covered
       by the existing `(a)-[b]->(a)-[b]->(a)` tests in graph_table.sql,
       since same-name vertex patterns are merged into a single path
       factor before DFS; worth noting in the commit message rather
       than adding a new test.
    
    3. A pattern where an edge slot has all its candidates pruned --
       to cover the empty-pathqueries route into
       generate_query_for_empty_path_pattern(). Already covered by the
       reversed-direction `MATCH (o IS orders)-[IS customer_orders]->
       (c IS customers)` test in graph_table.sql, which yields zero
       rows because no edge in the catalog satisfies that adjacency.
       Same note: worth flagging as newly reachable via pruning.
    
    ## Nits (all stylistic -- feel free to skip)
    
    1. In graph_path_prefix_is_feasible_with_new_element(), the
       `if (!IS_EDGE_PATTERN(prev_pe->path_factor->kind)) return true;`
       branch is unreachable -- gram.y enforces V-E alternation, and
       the path-factor linkage code in
       generate_queries_for_path_pattern() re-asserts it. Tightening
       to `Assert(IS_EDGE_PATTERN(...))` would match the rest of the
       file.
    
    2. `normal_ok` / `reverse_ok` reads slightly off from this file's
       diction. Consider `feasible` / `rev_feasible` -- the asymmetric
       `rev_` prefix matches the existing `edge_qual` / `rev_edge_qual`
       in generate_query_for_graph_path(), and echoes the function name
       itself.
    
    3. A one-line comment above the V-E alternation check (whether
       kept as `if` or tightened per nit 1) citing parse analysis would
       save the next reader a trip through gram.y.
    
    4. In the prune branch of generate_queries_for_path_pattern_recurse(),
       a one-liner noting that an exhausted level returns an empty list,
       which the caller routes to generate_query_for_empty_path_pattern(),
       would help -- pre-existing behavior, but far more reachable after
       the patch.
    
    ## On the trailing edge_qual guard
    
    The `if (edge_qual == NULL) return NULL;` in
    generate_query_for_graph_path() becomes unreachable after the patch,
    but worth keeping as a defensive backstop -- it is the literal dual
    of the helper's check, and removing it would only tighten coupling
    for no measurable gain. A short comment marking it as such would help.
    
    Regards,
    Henson
    
    2026년 5월 12일 (화) 오전 9:57, Junwang Zhao <zhjwpku@gmail.com>님이 작성:
    
    > Hi Henson,
    >
    > On Sat, May 9, 2026 at 10:52 PM Henson Choi <assam258@gmail.com> wrote:
    > >
    > > Hi Junwang,
    > >
    > > Thanks for an interesting thread, and for picking this up — it has
    > > been a useful occasion to re-examine assumptions I had carried over
    > > from a different graph model.
    > >
    > > My background is on AgensGraph, where labels form an inheritance
    > > hierarchy (rather than Neo4j-style multi-label tag sets), so a path
    > > pattern rewrites against a single parent table per position and the
    > > fan-out across labels is left to the planner's existing inheritance
    > > machinery. The rewriter never enumerates per-element combinations, so
    > > an N^K blow-up like Satya's was simply not on the radar there.
    >
    > Thanks for the input, good to know.
    >
    > >
    > > SQL/PGQ instead designates graph identity through the schema itself,
    > > so each pattern position carries a set of candidate element tables
    > > and the rewriter has to materialize each schema-feasible join
    > > combination — a (vertex, edge, vertex, ...) sequence of element
    > > tables — as its own SELECT branch, with all such branches joined by
    > > UNION ALL. The number of *feasible* join combinations is bounded by
    > > the schema graph's connectivity and is usually small; the current
    > > rewriter's naive shape, however, is to enumerate the full N^K
    > > candidate space first and reject infeasible combinations only
    > > afterwards. The patch tackles exactly that part.
    > >
    > > To be clear, I do not mean this as a criticism of the original
    > > implementer. For sizeable features I tend to prefer a deliberately
    > > naive implementation first — getting the full functionality working
    > > end-to-end before optimizing — and that approach is even more
    > > valuable in team settings, since a working baseline lets additional
    > > contributors join much earlier. The current shape of
    > > generate_queries_for_path_pattern_recurse() reads exactly like such a
    > > baseline, and this patch is precisely the kind of focused follow-up
    > > improvement it invites.
    > >
    > > When I thought about this independently, I had reached for the dual
    > > formulation — "extend the prefix only along feasible edges" rather
    > > than "enumerate everything and prune" — and my intuition was that the
    > > two should produce the same surviving set, though I have not worked
    > > that through rigorously. Either way, your "early pruning" framing is
    > > the better one for the patch itself: it stays close to the existing
    > > code structure and lets the change read as a small refinement of
    > > generate_queries_for_path_pattern_recurse() rather than a
    > > reorganization of how candidates are walked. That makes review and
    > > incremental landing materially easier.
    > >
    > > The same reframing also explains, after the fact, why the discrepancy
    > > between Ashutosh's quick runs and Satya's 81 GB case was so large:
    > > the cost being measured is in large part infeasible enumeration,
    > > which a forward-moved check eliminates regardless of pattern length
    > > or schema size in realistic graphs.
    > >
    > > I have not gone through the patch at the code level yet, but the
    > > shape of the problem and of your fix is clear enough from the
    > > description. My personal take, with that caveat, is that early
    > > pruning, together with the CHECK_FOR_INTERRUPTS that Satya already
    > > proposed in the earlier thread [1], feels like the right pair to
    > > land first; the explicit hard limit, in turn, is probably best
    > > revisited only if a real case turns up where even those two combined
    > > are not enough. That keeps the immediate change narrow and
    > > semantics-preserving while leaving the policy question open for
    > > evidence rather than upfront estimation.
    >
    > Agreed, I add a CHECK_FOR_INTERRUPTS to the patch now, see
    > the attached.
    >
    > >
    > > There is a fairly recent precedent in PostgreSQL itself that I think
    > > supports this ordering. I have been following — and participating in
    > > — Tatsuo Ishii's Row Pattern Recognition thread [2] since earlier
    > > this year, and that work ran into a structurally similar concern: in
    > > its earlier implementation the engine could undergo a combinatorial
    > > explosion of pattern-candidate states. A memory-based limit was
    > > floated as one mitigation, but it never actually converged into a
    > > design — the conversation around an explicit limit simply faded out
    > > as a layered set of pruning and early-termination rules accumulated
    > > incrementally and brought the state space back under control. CFI
    > > and stack-depth checks, by contrast, were treated as independently
    > > necessary throughout and stayed in the design.
    > >
    > > The shape of this thread feels analogous, and the same staged
    > > approach (structural fix + CFI first, hard limit only if a residual
    > > case genuinely demands it) is, I suspect, what will hold up
    > > here as well.
    >
    > Thanks for the insightful analysis, it's extremely helpful.
    >
    > >
    > > I'll set aside time separately to do a proper line-by-line review and
    > > follow up.
    > >
    > > Regards,
    > > Henson
    > >
    > > [1]
    > https://www.postgresql.org/message-id/CAHg%2BQDe8JU%2BERqA2xwjrg_ZptvH_v0T6PS9_P_ZgyYzD5h-Grw%40mail.gmail.com
    > > [2]
    > https://www.postgresql.org/message-id/flat/20230625.210509.1276733411677577841.t-ishii%40sranhm.sra.co.jp
    >
    >
    >
    > --
    > Regards
    > Junwang Zhao
    >