Thread
-
Re: Refactor query normalization into core query jumbling
Sami Imseih <samimseih@gmail.com> — 2025-12-24T17:32:45Z
> > For the second part of making more jumbling utilities global, this can > > be taken up in another thread. I am now thinking we would be better > > off actually taking all the generic jumbling routines and separating > > them into their own C file. So we can have a new file like primjumble.c > > which will have all the "primitive" jumbling utilities, and queryjumblefuncs.c > > can worry about its core business of jumbling a Query tree. Anyhow, > > these are just some high level thoughts on this part for now. > > Hmm. Moving from pgss to the core backup the code that updates the > clocations of an existing JumbleState does not solve the issue that > we should not modify the internals of the JumbleState computed. So > this does not move the ball at all. This is for the point "I am wondering if we should not expose a bit more the jumble query APIs" [0], which as I mention is a separate topic and is not about extensions being able to modify JumbleState. > The updates of clocations depend on GenerateNormalizedQuery(), which > depends on the pre-work done in CleanQuerytext() to get a query string > and its location for a multi-query string case. If we really want to > make a JumbleState not touchable at all, we could either push more > work of CleanQuerytext() and GenerateNormalizedQuery() into core. When you say “push more work into core,” I understand that to mean this work should occur during jumbling. If so, there are several complications. 1/ CleanQueryText() requires access to the query text, which we do not have during core jumbling as of 2ecbb0a4935. 2/ GenerateNormalizedQuery() should be invoked on demand by the extension that needs the normalized string, for example when pg_stat_statements needs to store a query string. Both operations are expensive, especially GenerateNormalizedQuery(). Doing this work on every JumbleQuery() would introduce unnecessary overhead. > Or a second thing that we can do, which would be actually much > simpler, is to rework entirely fill_in_constant_lengths() so as we > generate a *copy* of the location data PGSS wants rather than force it > into a JumbleState yes, that's an idea that did cross my mind. I have hesitation of copying clocations, but that may not be such a big deal, since it will only be occurring on demand when the extension requests a normalized string. > that would be pushed across more stacks of the > post-parse-analyze hook. Doing that would allow us to pull the plug > into making JumbleState and the original copies of clocations a set of > consts when given to extensions. Yes correct. The hook signature will turn into, so all extensions now just have a constant pointer to the jumblestate. They can of course work hard to cast away constants, but at that point, they are doing things the wrong way. ``` typedef void (*post_parse_analyze_hook_type) (ParseState *pstate, Query *query, const JumbleState *jstate); ``` This of course will be a big breaking change to all the extensions out there using this hook. This is not just about a signature change of the hook, but an author now has to figure out how to deal with a constant JumbleState in their code, which may not be trivial. My proposal in v3 was a bit more softer, and keeps the hook backwards compatible. Basically, we keep JumbleState a non-constant, but provide core APIs for any operation, such as generate_normalized_query, that needs to modify the state. So, my approach was not about enforcing a read-only JumbleState, but about providing the API to dissuade an author from modifying a JumbleState. If we need a stronger API contract, then we can go with the hook receving JumbleState as a constant and implement the copy of clocations to return back to extensions that need to normalize a query. We also have to keep in mind that if there are future requirements where the state has to be modified by an extension, we will need to implement more copy functions for those field. [0] https://www.postgresql.org/message-id/aUXeTEpSymo6n7lF%40paquier.xyz -- Sami Imseih Amazon Web Services (AWS)