Thread

  1. Should we optimize the `ORDER BY random() LIMIT x` case?

    Aleksander Alekseev <aleksander@timescale.com> — 2025-05-14T23:41:15Z

    Hi,
    
    If I didn't miss anything, currently we don't seem to support sampling
    the result of an arbitrary SELECT query efficiently.
    
    To give one specific example:
    
    ````
    CREATE TABLE temperature(
      ts TIMESTAMP NOT NULL,
      city TEXT NOT NULL,
      temperature INT NOT NULL);
    
    CREATE TABLE humidity(
      ts TIMESTAMP NOT NULL,
      city TEXT NOT NULL,
      humidity INT NOT NULL);
    
    -- imagine having much more data ...
    INSERT INTO temperature (ts, city, temperature)
    SELECT ts + (INTERVAL '60 minutes' * random()), city, 30*random()
    FROM generate_series('2022-01-01' :: TIMESTAMP,
                         '2022-01-31', '1 day') AS ts,
         unnest(array['City A', 'City B']) AS city;
    
    INSERT INTO humidity (ts, city, humidity)
    SELECT ts + (INTERVAL '60 minutes' * random()), city, 100*random()
    FROM generate_series('2022-01-01' :: TIMESTAMP,
                         '2022-01-31', '1 day') AS ts,
         unnest(array['City A', 'City B']) AS city;
    
    -- "AS OF" join:
    SELECT t.ts, t.city, t.temperature, h.humidity
    FROM temperature AS t
    LEFT JOIN LATERAL
      ( SELECT * FROM humidity
        WHERE city = t.city AND ts <= t.ts
        ORDER BY ts DESC LIMIT 1
      ) AS h ON TRUE
    WHERE t.ts < '2022-01-05';
    ```
    
    One can do `SELECT (the query above) ORDER BY random() LIMIT x` but
    this produces an inefficient plan. Alternatively one could create
    temporary tables using `CREATE TEMP TABLE ... AS SELECT * FROM tbl
    TABLESAMPLE BERNOULLI(20)` but this is inconvenient and would be
    suboptimal even if we supported global temporary tables.
    
    1. Do you think there might be value in addressing this issue?
    2. If yes, how would you suggest addressing it from the UI point of
    view - by adding a special syntax, some sort of aggregate function, or
    ...?
    
    -- 
    Best regards,
    Aleksander Alekseev
    
    
    
    
  2. Re: Should we optimize the `ORDER BY random() LIMIT x` case?

    wenhui qiu <qiuwenhuifx@gmail.com> — 2025-05-15T02:06:38Z

    Hi Aleksander
     if we can optimize the query, that would be great,Then we won't need to
    pull a lot of data to the program end and randomly pick the needed data
    there.
    
    On Thu, 15 May 2025 at 07:41, Aleksander Alekseev <aleksander@timescale.com>
    wrote:
    
    > Hi,
    >
    > If I didn't miss anything, currently we don't seem to support sampling
    > the result of an arbitrary SELECT query efficiently.
    >
    > To give one specific example:
    >
    > ````
    > CREATE TABLE temperature(
    >   ts TIMESTAMP NOT NULL,
    >   city TEXT NOT NULL,
    >   temperature INT NOT NULL);
    >
    > CREATE TABLE humidity(
    >   ts TIMESTAMP NOT NULL,
    >   city TEXT NOT NULL,
    >   humidity INT NOT NULL);
    >
    > -- imagine having much more data ...
    > INSERT INTO temperature (ts, city, temperature)
    > SELECT ts + (INTERVAL '60 minutes' * random()), city, 30*random()
    > FROM generate_series('2022-01-01' :: TIMESTAMP,
    >                      '2022-01-31', '1 day') AS ts,
    >      unnest(array['City A', 'City B']) AS city;
    >
    > INSERT INTO humidity (ts, city, humidity)
    > SELECT ts + (INTERVAL '60 minutes' * random()), city, 100*random()
    > FROM generate_series('2022-01-01' :: TIMESTAMP,
    >                      '2022-01-31', '1 day') AS ts,
    >      unnest(array['City A', 'City B']) AS city;
    >
    > -- "AS OF" join:
    > SELECT t.ts, t.city, t.temperature, h.humidity
    > FROM temperature AS t
    > LEFT JOIN LATERAL
    >   ( SELECT * FROM humidity
    >     WHERE city = t.city AND ts <= t.ts
    >     ORDER BY ts DESC LIMIT 1
    >   ) AS h ON TRUE
    > WHERE t.ts < '2022-01-05';
    > ```
    >
    > One can do `SELECT (the query above) ORDER BY random() LIMIT x` but
    > this produces an inefficient plan. Alternatively one could create
    > temporary tables using `CREATE TEMP TABLE ... AS SELECT * FROM tbl
    > TABLESAMPLE BERNOULLI(20)` but this is inconvenient and would be
    > suboptimal even if we supported global temporary tables.
    >
    > 1. Do you think there might be value in addressing this issue?
    > 2. If yes, how would you suggest addressing it from the UI point of
    > view - by adding a special syntax, some sort of aggregate function, or
    > ...?
    >
    > --
    > Best regards,
    > Aleksander Alekseev
    >
    >
    >
    
  3. Re: Should we optimize the `ORDER BY random() LIMIT x` case?

    Andrei Lepikhov <lepihov@gmail.com> — 2025-05-15T07:54:35Z

    On 15/5/2025 01:41, Aleksander Alekseev wrote:
    > One can do `SELECT (the query above) ORDER BY random() LIMIT x` but
    > this produces an inefficient plan. Alternatively one could create
    > temporary tables using `CREATE TEMP TABLE ... AS SELECT * FROM tbl
    > TABLESAMPLE BERNOULLI(20)` but this is inconvenient and would be
    > suboptimal even if we supported global temporary tables.
    > 
    > 1. Do you think there might be value in addressing this issue?
    > 2. If yes, how would you suggest addressing it from the UI point of
    > view - by adding a special syntax, some sort of aggregate function, or
    > ...?
    I think I got your point, but just to be sure:
    Do you want to have some random sampling from an arbitrary subquery with 
    the guarantee that N distinct (by tid) tuples will be produced or all 
    the tuples if the underlying subquery produces less than N?
    
    What kind of optimisation trick may the optimiser use here to provide an 
    optimal plan? As I see it, it will need to think that all the tuples 
    should be returned from the subquery. The only profit is to skip sorting 
    the massive sample.
    
    As a palliative, you may laterally join your subquery with a stored 
    procedure, which will process the incoming tuple and implement the logic 
    of random sampling.
    
    Implementation of that in the core will need a new "skip result" node 
    and new syntax, which may be too much if a workaround is found.
    
    -- 
    regards, Andrei Lepikhov
    
    
    
    
  4. Re: Should we optimize the `ORDER BY random() LIMIT x` case?

    Aleksander Alekseev <aleksander@timescale.com> — 2025-05-15T09:17:53Z

    wenhui, Andrei,
    
    > if we can optimize the query, that would be great
    
    Thanks for the feedback!
    
    > I think I got your point, but just to be sure:
    > Do you want to have some random sampling from an arbitrary subquery with
    > the guarantee that N distinct (by tid) tuples will be produced or all
    > the tuples if the underlying subquery produces less than N?
    
    Right.
    
    > What kind of optimisation trick may the optimiser use here to provide an
    > optimal plan? As I see it, it will need to think that all the tuples
    > should be returned from the subquery. The only profit is to skip sorting
    > the massive sample.
    
    Doesn't look like a generic optimization trick will help us. I was
    thinking about a custom aggregate function, e.g. `SELECT sample(*, 10)
    ...`. However I doubt that aggregate functions are flexible enough. Or
    alternatively a rewrite rule. I never dealt with those before so I
    have no idea what I'm talking about :D
    
    > As a palliative, you may laterally join your subquery with a stored
    > procedure, which will process the incoming tuple and implement the logic
    > of random sampling.
    
    Yes, that's a good observation. The query:
    
    ```
    SELECT t.* FROM mytable AS t
    INNER JOIN LATERAL (
        SELECT 1 WHERE random() > 0.99
    ) AS r ON TRUE
    ```
    
    Semples 1% of the tuples efficiently. I wonder if there is also an
    efficient way to sample "no more than N" tuples without knowing the
    size of the entire set / stream beforehand. Implementing such an
    algorithm in C is not difficult [1]. Doesn't seem so in SQL.
    
    > Implementation of that in the core will need a new "skip result" node
    > and new syntax, which may be too much if a workaround is found.
    
    Agree. I'm not too worried about the new node. The new syntax
    apparently is going to be mandatory. Otherwise we will have to check
    "wait what if it's ORDER BY random() LIMIT?" for every single query. I
    wonder what the community's opinion on having such a syntax is.
    
    [1] e.g. https://samwho.dev/reservoir-sampling/
    -- 
    Best regards,
    Aleksander Alekseev
    
    
    
    
  5. Re: Should we optimize the `ORDER BY random() LIMIT x` case?

    Andrei Lepikhov <lepihov@gmail.com> — 2025-05-15T09:32:35Z

    On 15/5/2025 11:17, Aleksander Alekseev wrote:
    >> What kind of optimisation trick may the optimiser use here to provide an
    >> optimal plan? As I see it, it will need to think that all the tuples
    >> should be returned from the subquery. The only profit is to skip sorting
    >> the massive sample.
    > 
    > Doesn't look like a generic optimization trick will help us. I was
    > thinking about a custom aggregate function, e.g. `SELECT sample(*, 10)
    > ...`. However I doubt that aggregate functions are flexible enough. Or
    > alternatively a rewrite rule. I never dealt with those before so I
    > have no idea what I'm talking about :D
    A custom SRF seems great to me. You may propose such an aggregate in the 
    core - it seems it doesn't even need any syntax changes. For example:
    SELECT * FROM (SELECT sample(q, 10, <type>) FROM (SELECT ...) AS q);
    or something like that.
    
    -- 
    regards, Andrei Lepikhov
    
    
    
    
  6. Re: Should we optimize the `ORDER BY random() LIMIT x` case?

    Aleksander Alekseev <aleksander@timescale.com> — 2025-05-16T12:01:04Z

    Hi,
    
    > A custom SRF seems great to me. You may propose such an aggregate in the
    > core - it seems it doesn't even need any syntax changes. For example:
    > SELECT * FROM (SELECT sample(q, 10, <type>) FROM (SELECT ...) AS q);
    > or something like that.
    
    I experimented with this idea a bit and it looks like this is not going to work.
    
    In our case sample() should behave both as an aggregate function and
    as SRF. An aggregate function computes a single result [1]. We
    probably could bypass this by returning an array but it would be
    inconvenient for the user. The problem with SRFs is that we never know
    when SRF is called the last time e.g. there is no SRF_IS_LASTCALL().
    So there is no way to aggregate the tuples until we receive the last
    one and then return the sample. I don't think SRF can know this in a
    general case (JOINs, SELECT .. LIMIT .., etc). Unless I'm missing
    something of course. Perhaps somebody smarter than me could show us
    the exact way to implement sample() as an SRF, but at the moment
    personally I don't see it.
    
    On top of that it's worth noting that the query you showed above
    wouldn't be such a great UI in my opinion.
    
    If I'm right about the limitations of aggregate functions and SRFs
    this leaves us the following options:
    
    1. Changing the constraints of aggregate functions or SRFs. However I
    don't think we want to do it for such a single niche scenario.
    2. Custom syntax and a custom node.
    3. To give up
    
    Thoughts?
    
    [1]: https://www.postgresql.org/docs/current/functions-aggregate.html
    -- 
    Best regards,
    Aleksander Alekseev
    
    
    
    
  7. Re: Should we optimize the `ORDER BY random() LIMIT x` case?

    Tom Lane <tgl@sss.pgh.pa.us> — 2025-05-16T13:01:50Z

    Aleksander Alekseev <aleksander@timescale.com> writes:
    > If I'm right about the limitations of aggregate functions and SRFs
    > this leaves us the following options:
    
    > 1. Changing the constraints of aggregate functions or SRFs. However I
    > don't think we want to do it for such a single niche scenario.
    > 2. Custom syntax and a custom node.
    > 3. To give up
    
    Seems to me the obvious answer is to extend TABLESAMPLE (or at least, some
    of the tablesample methods) to allow it to work on a subquery.
    
    			regards, tom lane
    
    
    
    
  8. Re: Should we optimize the `ORDER BY random() LIMIT x` case?

    Vik Fearing <vik@postgresfriends.org> — 2025-05-16T21:10:49Z

    On 16/05/2025 15:01, Tom Lane wrote:
    > Aleksander Alekseev <aleksander@timescale.com> writes:
    >> If I'm right about the limitations of aggregate functions and SRFs
    >> this leaves us the following options:
    >> 1. Changing the constraints of aggregate functions or SRFs. However I
    >> don't think we want to do it for such a single niche scenario.
    >> 2. Custom syntax and a custom node.
    >> 3. To give up
    > Seems to me the obvious answer is to extend TABLESAMPLE (or at least, some
    > of the tablesample methods) to allow it to work on a subquery.
    
    
    Isn't this a job for <fetch first clause>?
    
    
    Example:
    
    SELECT ...
    FROM ... JOIN ...
    FETCH SAMPLE FIRST 10 ROWS ONLY
    
    
    Then the nodeLimit could do some sort of reservoir sampling.
    
    
    There are several enhancements to <fetch first clause> coming down the 
    pipe, this could be one of them.
    
    -- 
    
    Vik Fearing
    
    
    
    
    
  9. Re: Should we optimize the `ORDER BY random() LIMIT x` case?

    Tom Lane <tgl@sss.pgh.pa.us> — 2025-05-16T21:21:08Z

    Vik Fearing <vik@postgresfriends.org> writes:
    > On 16/05/2025 15:01, Tom Lane wrote:
    >> Seems to me the obvious answer is to extend TABLESAMPLE (or at least, some
    >> of the tablesample methods) to allow it to work on a subquery.
    
    > Isn't this a job for <fetch first clause>?
    > FETCH SAMPLE FIRST 10 ROWS ONLY
    
    How is that an improvement on TABLESAMPLE?  Or did the committee
    forget that they already have that feature?
    
    TABLESAMPLE seems strictly better to me here because it affords
    the opportunity to specify one of several methods, which seems
    like it would be useful in this context.
    
    			regards, tom lane
    
    
    
    
  10. Re: Should we optimize the `ORDER BY random() LIMIT x` case?

    Vik Fearing <vik@postgresfriends.org> — 2025-05-16T21:42:59Z

    On 16/05/2025 23:21, Tom Lane wrote:
    > Vik Fearing <vik@postgresfriends.org> writes:
    >> On 16/05/2025 15:01, Tom Lane wrote:
    >>> Seems to me the obvious answer is to extend TABLESAMPLE (or at least, some
    >>> of the tablesample methods) to allow it to work on a subquery.
    >> Isn't this a job for <fetch first clause>?
    >> FETCH SAMPLE FIRST 10 ROWS ONLY
    > How is that an improvement on TABLESAMPLE?  Or did the committee
    > forget that they already have that feature?
    >
    > TABLESAMPLE seems strictly better to me here because it affords
    > the opportunity to specify one of several methods, which seems
    > like it would be useful in this context.
    
    
    TABLESAMPLE is hitched to a <table primary> which can be basically 
    anything resembling a relation.  So it appears the standard already 
    allows this and we just need to implement it.
    
    -- 
    
    Vik Fearing
    
    
    
    
    
  11. Re: Should we optimize the `ORDER BY random() LIMIT x` case?

    Nico Williams <nico@cryptonector.com> — 2025-05-16T21:50:49Z

    On Fri, May 16, 2025 at 09:01:50AM -0400, Tom Lane wrote:
    > Seems to me the obvious answer is to extend TABLESAMPLE (or at least, some
    > of the tablesample methods) to allow it to work on a subquery.
    
    The key here is that we need one bit of state between rows: the count of
    rows seen so far.
    
    
    
    
  12. Re: Should we optimize the `ORDER BY random() LIMIT x` case?

    Nico Williams <nico@cryptonector.com> — 2025-05-16T21:53:29Z

    On Fri, May 16, 2025 at 11:10:49PM +0200, Vik Fearing wrote:
    > Isn't this a job for <fetch first clause>?
    > 
    > Example:
    > 
    > SELECT ...
    > FROM ... JOIN ...
    > FETCH SAMPLE FIRST 10 ROWS ONLY
    > 
    > Then the nodeLimit could do some sort of reservoir sampling.
    
    The query might return fewer than N rows.  What reservoir sampling
    requires is this bit of state: the count of input rows so far.
    
    The only way I know of to keep such state in a SQL query is with a
    RECURSIVE CTE, but unfortunately that would require unbounded CTE size,
    and it would require a way to query next rows one per-iteration.
    
    Nico
    -- 
    
    
    
    
  13. Re: Should we optimize the `ORDER BY random() LIMIT x` case?

    Nico Williams <nico@cryptonector.com> — 2025-05-16T22:06:49Z

    On Thu, May 15, 2025 at 02:41:15AM +0300, Aleksander Alekseev wrote:
    > SELECT t.ts, t.city, t.temperature, h.humidity
    > FROM temperature AS t
    > LEFT JOIN LATERAL
    >   ( SELECT * FROM humidity
    >     WHERE city = t.city AND ts <= t.ts
    >     ORDER BY ts DESC LIMIT 1
    >   ) AS h ON TRUE
    > WHERE t.ts < '2022-01-05';
    > ```
    > 
    > One can do `SELECT (the query above) ORDER BY random() LIMIT x` but
    > this produces an inefficient plan. Alternatively one could create
    
    I assume it gathers the whole result and _then_ orders by random(), yes?
    
    I think the obvious thing to do is to have the optimizer recognize
    `ORDER BY random() LIMIT x` and do the reservoir sampling thing as the
    underlying query produces rows.  The engine needs to keep track of how
    many rows it's seen so far so that with each new row it can use
    `random() < 1/n` to decide whether to keep the row.
    
    > temporary tables using `CREATE TEMP TABLE ... AS SELECT * FROM tbl
    > TABLESAMPLE BERNOULLI(20)` but this is inconvenient and would be
    > suboptimal even if we supported global temporary tables.
    
    Speaking of which, what would it take to have global temp tables?
    
    > 1. Do you think there might be value in addressing this issue?
    
    Yes!  I think `ORDER BY random() LIMIT x` is succinct syntax, and
    semantically it should yield as uniformly distributed a result set as
    possible even when the size of the underlying query is not knowable a
    priori.  And it should be as optimal as possible considering that the
    underlying query's result set is in principle -and many useful real use
    cases- unbounded.
    
    And reservoir sampling is a suitable existing technique to get an
    optimal and uniform random finite sampling of an indefinite size row
    set.
    
    > 2. If yes, how would you suggest addressing it from the UI point of
    > view - by adding a special syntax, some sort of aggregate function, or
    > ...?
    
    `ORDER BY random() LIMIT x` is a) in the standard, b) semantically what
    you're asking for, c) succint syntax, d) implementable optimally,
    therefore IMO `ORDER BY random() LIMIT x` is the correct syntax.
    
    Nico
    -- 
    
    
    
    
  14. Re: Should we optimize the `ORDER BY random() LIMIT x` case?

    Aleksander Alekseev <aleksander@timescale.com> — 2025-05-19T10:25:00Z

    Tom, Nico, Vik,
    
    > Seems to me the obvious answer is to extend TABLESAMPLE (or at least, some
    > of the tablesample methods) to allow it to work on a subquery.
    
    Currently our BERNOULLI and SYSTEM sampling methods don't allow
    specifying the maximum number of rows that should be returned, only
    the percentage. So it's going to do the same as:
    
    ```
    SELECT t.* FROM mytable AS t
    INNER JOIN LATERAL (
        SELECT 1 WHERE random() < 0.99
    ) AS r ON TRUE
    ```
    
    Just to clarify, do you propose to add a RESERVOIR sampling method as well?
    
    > I assume it gathers the whole result and _then_ orders by random(), yes?
    
    Sure.
    
    > I think the obvious thing to do is to have the optimizer recognize
    > `ORDER BY random() LIMIT x` and do the reservoir sampling thing as the
    > underlying query produces rows.  The engine needs to keep track of how
    > many rows it's seen so far so that with each new row it can use
    > `random() < 1/n` to decide whether to keep the row.
    
    I agree this would be most convenient for the user. Unfortunately this
    will require us to check every SELECT query: "oh, isn't it by any
    chance ORDER BY random() LIMIT x?". I don't think we can't afford such
    a performance degradation, even a small one, for an arguably rare
    case.
    
    > > temporary tables using `CREATE TEMP TABLE ... AS SELECT * FROM tbl
    > > TABLESAMPLE BERNOULLI(20)` but this is inconvenient and would be
    > > suboptimal even if we supported global temporary tables.
    >
    > Speaking of which, what would it take to have global temp tables?
    
    Considering the fact that several people (me included [1]) tried to
    address this issue in several ways for at least the past decade
    without success, I would say this is a fairly difficult problem. I
    remember seeing a global temp tables patch on the commitfests maybe a
    year or two ago, but it's not there anymore. Check the past commitfest
    to find out how it ended up.
    
    > TABLESAMPLE is hitched to a <table primary> which can be basically
    > anything resembling a relation.  So it appears the standard already
    > allows this and we just need to implement it.
    
    Vik, many thanks for sharing this. I don't have a strong opinion on
    `FETCH SAMPLE FIRST 10 ROWS ONLY` but since TABLESAMPLE should / may
    work for subqueries anyway we could focus on it for now.
    
    ***
    
    As a separate observation, I noticed that we could do something like this:
    
    ```
    -- imagine replacing inefficient array_sample(array_agg(t), 10)
    -- with more efficient array_sample_reservoir(t, 10)
    SELECT (unnest(agg)).* AS k FROM
    (  SELECT array_sample(array_agg(t), 10) AS agg FROM (
       ... here goes the subquery ...
       ) AS t
    );
    ```
    
    ... if only we supported such a column expansion for not registered
    records. Currently such a query fails with:
    
    ```
    ERROR:  record type has not been registered
    ```
    
    It has all the needed records, just fails to display them in a form of
    separate columns. Note that in this case I don't care if column names
    are going to be something like '???' or 'unnamed1'. I can imagine how
    it could be convenient in more cases than just this one.
    
    Any thoughts on whether we should support this?
    
    [1]: https://www.postgresql.org/message-id/flat/20160729141552.4062e19b%40fujitsu
    -- 
    Best regards,
    Aleksander Alekseev
    
    
    
    
  15. Re: Should we optimize the `ORDER BY random() LIMIT x` case?

    Nico Williams <nico@cryptonector.com> — 2025-05-19T15:38:19Z

    On Mon, May 19, 2025 at 01:25:00PM +0300, Aleksander Alekseev wrote:
    > I agree this would be most convenient for the user. Unfortunately this
    > will require us to check every SELECT query: "oh, isn't it by any
    > chance ORDER BY random() LIMIT x?". I don't think we can't afford such
    > a performance degradation, even a small one, for an arguably rare
    > case.
    
    Can the detection of such queries be done by the yacc/bison parser
    grammar?
    
    Nico
    -- 
    
    
    
    
  16. Re: Should we optimize the `ORDER BY random() LIMIT x` case?

    Andrei Lepikhov <lepihov@gmail.com> — 2025-05-19T18:04:06Z

    On 5/19/25 12:25, Aleksander Alekseev wrote:
    > ```
    > -- imagine replacing inefficient array_sample(array_agg(t), 10)
    > -- with more efficient array_sample_reservoir(t, 10)
    > SELECT (unnest(agg)).* AS k FROM
    > (  SELECT array_sample(array_agg(t), 10) AS agg FROM (
    >     ... here goes the subquery ...
    >     ) AS t
    > );
    > ```
    > 
    > ... if only we supported such a column expansion for not registered
    > records. Currently such a query fails with:
    > 
    > ```
    > ERROR:  record type has not been registered
    > ```
    I know about this issue. Having resolved it in a limited number of local 
    cases (like FDW push-down of row types), I still do not have a universal 
    solution worth proposing upstream. Do you have any public implementation 
    of the array_sample_reservoir to play with?
    
    -- 
    regards, Andrei Lepikhov
    
    
    
    
  17. Re: Should we optimize the `ORDER BY random() LIMIT x` case?

    Vik Fearing <vik@postgresfriends.org> — 2025-05-19T19:01:39Z

    On 19/05/2025 12:25, Aleksander Alekseev wrote:
    > Tom, Nico, Vik,
    >
    >> TABLESAMPLE is hitched to a <table primary> which can be basically
    >> anything resembling a relation.  So it appears the standard already
    >> allows this and we just need to implement it.
    > Vik, many thanks for sharing this. I don't have a strong opinion on
    > `FETCH SAMPLE FIRST 10 ROWS ONLY` but since TABLESAMPLE should / may
    > work for subqueries anyway we could focus on it for now.
    
    
    Yeah, putting this into <fetch first clause> was a dumb idea on my part, 
    and Tom correctly corrected me.  I do not yet have the required number 
    of years in the sql standards studies to know the whole thing by heart.
    
    
    I think we (as a community) should work on expanding our <sample clause> 
    to work with any <table primary> and not just a base table.  Especially 
    since we already have two existing extensions by Petr to the standard 
    for that clause.  We can easily make more, which might even make their 
    way back into the standard.
    
    -- 
    
    Vik Fearing
    
    
    
    
    
  18. Re: Should we optimize the `ORDER BY random() LIMIT x` case?

    Nico Williams <nico@cryptonector.com> — 2025-05-19T20:53:35Z

    On Mon, May 19, 2025 at 10:38:19AM -0500, Nico Williams wrote:
    > On Mon, May 19, 2025 at 01:25:00PM +0300, Aleksander Alekseev wrote:
    > > I agree this would be most convenient for the user. Unfortunately this
    > > will require us to check every SELECT query: "oh, isn't it by any
    > > chance ORDER BY random() LIMIT x?". I don't think we can't afford such
    > > a performance degradation, even a small one, for an arguably rare
    > > case.
    > 
    > Can the detection of such queries be done by the yacc/bison parser
    > grammar?
    
    Maybe the `sortby` rule could check if the expression is `random()`,
    then `sort_clause` could check if `$3` is a one-item `sortby_list` of
    just `random()` and mark `$$` as special -- this should be cheap, yes?
    We'd still need to check for `LIMIT` somewhere else.
    
    
    
    
  19. Re: Should we optimize the `ORDER BY random() LIMIT x` case?

    Aleksander Alekseev <aleksander@timescale.com> — 2025-05-20T08:29:30Z

    Nico,
    
    > > Can the detection of such queries be done by the yacc/bison parser
    > > grammar?
    >
    > Maybe the `sortby` rule could check if the expression is `random()`,
    > then `sort_clause` could check if `$3` is a one-item `sortby_list` of
    > just `random()` and mark `$$` as special -- this should be cheap, yes?
    > We'd still need to check for `LIMIT` somewhere else.
    
    Although partially implementing an optimizer on the parser level is
    possible, it doesn't strike me as a right long-term solution. Firstly,
    it's a hacky solution because maybe it will work for random() but for
    some reason will not work for -random()*2. Secondly, imagine adding a
    dozen optimizations like this in the upcoming years and all of them
    interacting with each other. Imagine you are the person who has to
    maintain this and not break things when adding another optimization.
    All in all, this would be a poor design choice in my opinion.
    
    -- 
    Best regards,
    Aleksander Alekseev
    
    
    
    
  20. Re: Should we optimize the `ORDER BY random() LIMIT x` case?

    Aleksander Alekseev <aleksander@timescale.com> — 2025-05-20T10:46:38Z

    Andrei,
    
    > > ```
    > > -- imagine replacing inefficient array_sample(array_agg(t), 10)
    > > -- with more efficient array_sample_reservoir(t, 10)
    > > SELECT (unnest(agg)).* AS k FROM
    > > (  SELECT array_sample(array_agg(t), 10) AS agg FROM (
    > >     ... here goes the subquery ...
    > >     ) AS t
    > > );
    > > ```
    > >
    > > ... if only we supported such a column expansion for not registered
    > > records. Currently such a query fails with:
    > >
    > > ```
    > > ERROR:  record type has not been registered
    > > ```
    > I know about this issue. Having resolved it in a limited number of local
    > cases (like FDW push-down of row types), I still do not have a universal
    > solution worth proposing upstream. Do you have any public implementation
    > of the array_sample_reservoir to play with?
    
    array_sample_reservoir() is purely a figment of my imagination at the
    moment. Semantically it does the same as array_sample(array_agg(t), N)
    except the fact that array_sample(..., N) requires the array to have
    at least N items. You can experiment with array_sample(array_agg(...),
    N) as long as the subquery returns much more than N rows.
    
    -- 
    Best regards,
    Aleksander Alekseev