Thread

  1. Re: Refactor query normalization into core query jumbling

    Sami Imseih <samimseih@gmail.com> — 2025-12-19T21:36:16Z

    > While this is technically correct so the compiler does not complain (because
    > clocations is a non const pointer in JumbleState and the added const does not
    > apply to what clocations points to), I think that adding const here is misleading.
    
    Yes, I am not happy with this. I initially thought it would be OK to
    modify the JumbleState as long as it is done in a core function, but
    after much thought, this is neither a good idea nor safe. If a pointer
    is marked const, we should not modify anything reachable through it.
    
    So, I went back to think about this, and the core problem as I see it
    is that multiple hooks on the chain can modify the constant lengths. For
    example, pg_stat_statements can now modify the constant lengths in
    JumbleState via fill_in_constant_lengths, and another hook on the chain
    can then modify the constant locations of that same JumbleState in a
    different way.
    
    At this time, constant lengths are the only part of the JumbleState that
    is not set by core. So, I don't think the post_parse_analyze hook
    receiving JumbleState as const is the right approach, nor do we need it.
    
    I think we should allow JumbleState to define a normalization callback,
    which defaults to a core normalization function rather than an
    extension specific one:
    
    ```
        jstate->normalize_query = GenerateNormalizedQuery;
    ```
    
    This way, any extension that wishes to return a normalized string from
    the same JumbleState can invoke this callback and get consistent results.
    pg_stat_statements and other extensions that need to normalize a query
    string based on the locations in a JumbleState do not need to care about
    the internals of normalization; they simply invoke the callback and
    receive the final string.
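
    As a rough illustration of the callback idea (not the patch itself),
    here is a standalone sketch. Everything prefixed with Demo/demo_ is a
    simplified, hypothetical stand-in rather than the real PostgreSQL
    definition: the struct layout, the callback signature, and the "$n"
    replacement logic are all assumptions, and constant lengths are assumed
    to be filled in already.

    ```c
    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Simplified stand-ins, NOT the real PostgreSQL structs. */
    typedef struct DemoLocationLen
    {
        int         location;   /* byte offset of the constant in the query */
        int         length;     /* length of the constant in bytes */
    } DemoLocationLen;

    typedef struct DemoJumbleState DemoJumbleState;

    /* Hypothetical callback type: returns a malloc'd normalized string. */
    typedef char *(*demo_normalize_fn) (DemoJumbleState *jstate,
                                        const char *query);

    struct DemoJumbleState
    {
        DemoLocationLen *clocations;        /* sorted, non-overlapping */
        int              clocations_count;
        demo_normalize_fn normalize_query;  /* set by "core" */
    };

    /* Stand-in for the core default: replace each constant with $1, $2, ... */
    char *
    demo_generate_normalized_query(DemoJumbleState *jstate, const char *query)
    {
        size_t      qlen = strlen(query);
        /* "$" plus at most 10 digits per placeholder, plus the final NUL */
        size_t      cap = qlen + (size_t) jstate->clocations_count * 11 + 1;
        char       *out = malloc(cap);
        size_t      n = 0;          /* bytes written so far */
        size_t      src = 0;        /* next unread byte of the query */

        assert(out != NULL);
        for (int i = 0; i < jstate->clocations_count; i++)
        {
            size_t      loc = (size_t) jstate->clocations[i].location;
            size_t      len = (size_t) jstate->clocations[i].length;

            assert(loc >= src && loc + len <= qlen);
            memcpy(out + n, query + src, loc - src);    /* text before constant */
            n += loc - src;
            n += (size_t) snprintf(out + n, cap - n, "$%d", i + 1);
            src = loc + len;                            /* skip the constant */
        }
        memcpy(out + n, query + src, qlen - src + 1);   /* tail, including NUL */
        return out;
    }

    /* "Core" installs the default callback when it sets up the state. */
    void
    demo_init_jumble_state(DemoJumbleState *jstate,
                           DemoLocationLen *locs, int count)
    {
        jstate->clocations = locs;
        jstate->clocations_count = count;
        jstate->normalize_query = demo_generate_normalized_query;
    }

    /* An extension only invokes the callback; it never touches clocations. */
    char *
    demo_extension_normalize(DemoJumbleState *jstate, const char *query)
    {
        return jstate->normalize_query(jstate, query);
    }
    ```

    With the two constants of "SELECT a FROM t WHERE b = 42 AND c = 'x'"
    recorded in clocations, every caller going through
    demo_extension_normalize gets "SELECT a FROM t WHERE b = $1 AND c = $2",
    regardless of which extension asks.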
    
    So v2-0001 implements this callback and moves the normalization logic
    into core. Both changes must be done in the same patch.  The comments
    are also updated where they are no longer applicable or could be improved.
    
    One additional improvement that this patch did not include is a bool in
    JumbleState that tracks whether a normalized string has already been
    generated. This way, repeated calls to the callback would not need to
    regenerate the string; only the first call would perform the work,
    while subsequent calls could simply return the previously computed
    normalized string.
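
    A minimal sketch of that caching idea, again standalone and purely
    illustrative: the struct, the has_normalized flag, and the cached_demo_*
    functions are hypothetical names, and the "expensive" step is reduced to
    a string copy with a call counter so the behavior is observable.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical state; not the real JumbleState layout. */
    typedef struct CachedJumbleState
    {
        bool        has_normalized;     /* string already generated? */
        char       *normalized_query;   /* cached result, owned by the state */
        int         generate_calls;     /* demo-only counter */
    } CachedJumbleState;

    /* Stand-in for the real normalization work; here it only copies. */
    char *
    cached_demo_generate(CachedJumbleState *jstate, const char *query)
    {
        size_t      len = strlen(query) + 1;
        char       *copy = malloc(len);

        assert(copy != NULL);
        memcpy(copy, query, len);
        jstate->generate_calls++;
        return copy;
    }

    /* First call does the work; later calls return the cached string. */
    const char *
    cached_demo_normalize(CachedJumbleState *jstate, const char *query)
    {
        if (!jstate->has_normalized)
        {
            jstate->normalized_query = cached_demo_generate(jstate, query);
            jstate->has_normalized = true;
        }
        return jstate->normalized_query;
    }
    ```

    This only stays correct if the query text and the constant locations do
    not change after the first call, which is one more reason to keep
    extensions from modifying them.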
    
    I do like the simplicity of this approach: pg_stat_statements no longer
    has to own the normalization code, and different extensions sharing the
    same JumbleState can play well together. I am not yet sure this is the
    correct direction, and I am open to other ideas.
    
    --
    Sami Imseih
    Amazon Web Services (AWS)