Thread

  1. Extensible storage manager API - SMGR hook Redux

    Matthias van de Meent <boekewurm+postgres@gmail.com> — 2023-06-30T12:26:44Z

    Hi hackers,
    
    At Neon, we've been working on removing the file system dependency
    from PostgreSQL and replacing it with a distributed storage layer. For
    now, we've seen most success in this by replacing the implementation
    of the smgr API, but it did require some core modifications like those
    proposed early last year  by Anastasia [0].
    
    As mentioned in the previous thread, there are several reasons why you
    would want to use a non-default storage manager: storage-level
    compression, encryption, and disk limit quotas [0]; offloading of cold
    relation data was also mentioned [1].
    
    In the thread on Anastasia's patch, Yura Sokolov mentioned that
    instead of a hook-based smgr extension, a registration-based smgr
    would be preferred, with integration into namespaces. Please find
    attached an as of yet incomplete patch that starts to do that.
    
    The patch is yet incomplete (as it isn't derived from Anastasia's
    patch), but I would like comments on this regardless, as this is a
    fairly fundamental component of PostgreSQL that is being modified, and
    it is often better to get comments early in the development cycle. One
    significant issue that I've seen so far are that catcache is not
    guaranteed to be available in all backends that need to do smgr
    operations, and I've not yet found a good solution.
    
    Changes compared to HEAD:
    - smgrsw is now dynamically allocated and grows as new storage
    managers are loaded (during shared_preload_libraries)
    - CREATE TABLESPACE has new optional syntax USING smgrname (option [, ...])
    - tablespace storage is (planned) fully managed by smgr through some
    new smgr apis
    
    Changes compared to Anastasia's patch:
    - extensions do not get to hook and replace the api of the smgr code
    directly - they are hidden behind the smgr registry.
    
    Successes:
    - 0001 passes tests (make check-world)
    - 0002 builds without warnings (make)
    
    TODO:
    - fix dependency failures when catcache is unavailable
    - tablespace redo is currently broken with 0002
    - fix tests for 0002
    - ensure that pg_dump etc. works with the new tablespace storage manager options
    
    Looking forward to any comments, suggestions and reviews.
    
    Kind regards,
    
    Matthias van de Meent
    Neon (https://neon.tech/)
    
    
    [0] https://www.postgresql.org/message-id/CAP4vRV6JKXyFfEOf%3Dn%2Bv5RGsZywAQ3CTM8ESWvgq%2BS87Tmgx_g%40mail.gmail.com
    [1] https://www.postgresql.org/message-id/D365F19F-BC3E-4F96-A91E-8DB13049749E@yandex-team.ru
    
  2. Re: Extensible storage manager API - SMGR hook Redux

    Andres Freund <andres@anarazel.de> — 2023-07-01T01:50:07Z

    Hi,
    
    On 2023-06-30 14:26:44 +0200, Matthias van de Meent wrote:
    > diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
    > index 4c49393fc5..8685b9fde6 100644
    > --- a/src/backend/postmaster/postmaster.c
    > +++ b/src/backend/postmaster/postmaster.c
    > @@ -1002,6 +1002,11 @@ PostmasterMain(int argc, char *argv[])
    >  	 */
    >  	ApplyLauncherRegister();
    >  
    > +	/*
    > +	 * Register built-in managers that are not part of static arrays
    > +	 */
    > +	register_builtin_dynamic_managers();
    > +
    >  	/*
    >  	 * process any libraries that should be preloaded at postmaster start
    >  	 */
    
    That doesn't strike me as a good place to initialize this, we'll need it in
    multiple places that way.  How about putting it into BaseInit()?
    
    
    > -static const f_smgr smgrsw[] = {
    > +static f_smgr *smgrsw;
    
    This adds another level of indirection. I would rather limit the number of
    registerable smgrs than do that.
    
    
    
    > +SMgrId
    > +smgr_register(const f_smgr *smgr, Size smgrrelation_size)
    > +{
    
    > +	MemoryContextSwitchTo(old);
    > +
    > +	pg_compiler_barrier();
    
    Huh, what's that about?
    
    
    > @@ -59,14 +63,8 @@ typedef struct SMgrRelationData
    >  	 * Fields below here are intended to be private to smgr.c and its
    >  	 * submodules.  Do not touch them from elsewhere.
    >  	 */
    > -	int			smgr_which;		/* storage manager selector */
    > -
    > -	/*
    > -	 * for md.c; per-fork arrays of the number of open segments
    > -	 * (md_num_open_segs) and the segments themselves (md_seg_fds).
    > -	 */
    > -	int			md_num_open_segs[MAX_FORKNUM + 1];
    > -	struct _MdfdVec *md_seg_fds[MAX_FORKNUM + 1];
    > +	SMgrId		smgr_which;		/* storage manager selector */
    > +	int			smgrrelation_size;	/* size of this struct, incl. smgr-specific data */
    
    It looked to me like you determined this globally - why do we need it in every
    entry then?
    
    Greetings,
    
    Andres Freund
    
    
    
    
  3. Re: Extensible storage manager API - SMGR hook Redux

    Tristan Partin <tristan@neon.tech> — 2023-07-11T20:57:34Z

    > Subject: [PATCH v1 1/2] Expose f_smgr to extensions for manual implementation
    
    From what I can see, all the md* APIs that were exposed in md.h can now
    be made static in md.c. The only other references to those APIs were in
    smgr.c.
    
    > Subject: [PATCH v1 2/2] Prototype: Allow tablespaces to specify which SMGR
    >  they use
    
    > -typedef uint8 SMgrId;
    > +/*
    > + * volatile ID of the smgr. Across various configurations IDs may vary,
    > + * true identity is the name of each smgr.
    > + */
    > +typedef int SMgrId;
    > 
    > -#define MaxSMgrId UINT8_MAX
    > +#define MaxSMgrId              INT_MAX
    
    In a future revision of this patch, seems worthwhile to just start as
    int instead of a uint8 to avoid this song and dance. Maybe int8 instead
    of int?
    
    > +static SMgrId recent_smgrid = -1;
    
    You could use InvalidSmgrId here.
    
    > +void smgrvalidatetspopts(const char *smgrname, List *opts)
    > +{
    > +       SMgrId smgrid = get_smgr_by_name(smgrname, false);
    > +
    > +       smgrsw[smgrid].smgr_validate_tspopts(opts);
    > +}
    > +
    > +void smgrcreatetsp(const char *smgrname, Oid tsp, List *opts, bool isredo)
    > +{
    > +       SMgrId smgrid = get_smgr_by_name(smgrname, false);
    > +
    > +       smgrsw[smgrid].smgr_create_tsp(tsp, opts, isredo);
    > +}
    > +
    > +void smgrdroptsp(const char *smgrname, Oid tsp, bool isredo)
    > +{
    > +       SMgrId smgrid = get_smgr_by_name(smgrname, false);
    > +
    > +       smgrsw[smgrid].smgr_drop_tsp(tsp, isredo);
    > +}
    
    Do you not need to check if smgrid is the InvalidSmgrId? I didn't see
    any other validation anywhere.
    
    > +       char       *smgr;
    > +       List       *smgropts; /* list of DefElem nodes */
    
    smgrname would probably work better alongside tablespacename in that
    struct.
    
    > @@ -221,7 +229,7 @@ mdexists(SMgrRelation reln, ForkNumber forknum)
    >         if (!InRecovery)
    >                 mdclose(reln, forknum);
    > 
    > -       return (mdopenfork(reln, forknum, EXTENSION_RETURN_NULL) != NULL);
    > +       return (mdopenfork(mdreln, forknum, EXTENSION_RETURN_NULL) != NULL);
    >  }
    
    Was this a victim of a bad rebase? Seems like it belongs in the previous
    patch.
    
    > +void mddroptsp(Oid tsp, bool isredo)
    > +{
    > +
    > +}
    
    Some functions in this file have the return type on the previous line.
    
    This is a pretty slick patchset. Excited to read more dicussion and how
    it evolves.
    
    -- 
    Tristan Partin
    Neon (https://neon.tech)
    
    
    
    
  4. Re: Extensible storage manager API - SMGR hook Redux

    Tristan Partin <tristan@neon.tech> — 2023-09-19T20:54:53Z

    Found these warnings while compiling while only 0001 is applied.
    
    [1166/2337] Compiling C object src/backend/postgres_lib.a.p/storage_smgr_md.c.o
    ../src/backend/storage/smgr/md.c: In function ‘mdexists’:
    ../src/backend/storage/smgr/md.c:224:28: warning: passing argument 1 of ‘mdopenfork’ from incompatible pointer type [-Wincompatible-pointer-types]
      224 |         return (mdopenfork(reln, forknum, EXTENSION_RETURN_NULL) != NULL);
          |                            ^~~~
          |                            |
          |                            SMgrRelation {aka SMgrRelationData *}
    ../src/backend/storage/smgr/md.c:167:43: note: expected ‘MdSMgrRelation’ {aka ‘MdSMgrRelationData *’} but argument is of type ‘SMgrRelation’ {aka ‘SMgrRelationData *’}
      167 | static MdfdVec *mdopenfork(MdSMgrRelation reln, ForkNumber forknum, int behavior);
          |                            ~~~~~~~~~~~~~~~^~~~
    ../src/backend/storage/smgr/md.c: In function ‘mdcreate’:
    ../src/backend/storage/smgr/md.c:287:40: warning: passing argument 1 of ‘register_dirty_segment’ from incompatible pointer type [-Wincompatible-pointer-types]
      287 |                 register_dirty_segment(reln, forknum, mdfd);
          |                                        ^~~~
          |                                        |
          |                                        SMgrRelation {aka SMgrRelationData *}
    ../src/backend/storage/smgr/md.c:168:51: note: expected ‘MdSMgrRelation’ {aka ‘MdSMgrRelationData *’} but argument is of type ‘SMgrRelation’ {aka ‘SMgrRelationData *’}
      168 | static void register_dirty_segment(MdSMgrRelation reln, ForkNumber forknum,
    
    Here is a diff to be applied to 0001 which fixes the warnings that get 
    generated when compiling. I did see that one of the warnings gets fixed 
    0002 (the mdexists() one). I am assuming that change was just missed 
    while rebasing the patchset or something. I did not see a fix for
    mdcreate() in 0002 however.
    
    -- 
    Tristan Partin
    Neon (https://neon.tech)
    
  5. Fwd: Extensible storage manager API - SMGR hook Redux

    Kirill Reshke <reshkekirill@gmail.com> — 2023-12-04T16:50:41Z

    Sorry for double-posting, I accidentally replied to Matthias, not the
    mailing list :(
    
    ---------- Forwarded message ---------
    From: Kirill Reshke <reshkekirill@gmail.com>
    Date: Mon, 4 Dec 2023 at 19:46
    Subject: Re: Extensible storage manager API - SMGR hook Redux
    To: Matthias van de Meent <boekewurm+postgres@gmail.com>
    
    
    Hi!
    
    On Fri, 30 Jun 2023 at 15:27, Matthias van de Meent <
    boekewurm+postgres@gmail.com> wrote:
    
    > Hi hackers,
    >
    > At Neon, we've been working on removing the file system dependency
    > from PostgreSQL and replacing it with a distributed storage layer. For
    > now, we've seen most success in this by replacing the implementation
    > of the smgr API, but it did require some core modifications like those
    > proposed early last year  by Anastasia [0].
    >
    > As mentioned in the previous thread, there are several reasons why you
    > would want to use a non-default storage manager: storage-level
    > compression, encryption, and disk limit quotas [0]; offloading of cold
    > relation data was also mentioned [1].
    >
    > In the thread on Anastasia's patch, Yura Sokolov mentioned that
    > instead of a hook-based smgr extension, a registration-based smgr
    > would be preferred, with integration into namespaces. Please find
    > attached an as of yet incomplete patch that starts to do that.
    >
    > The patch is yet incomplete (as it isn't derived from Anastasia's
    > patch), but I would like comments on this regardless, as this is a
    > fairly fundamental component of PostgreSQL that is being modified, and
    > it is often better to get comments early in the development cycle. One
    > significant issue that I've seen so far are that catcache is not
    > guaranteed to be available in all backends that need to do smgr
    > operations, and I've not yet found a good solution.
    >
    > Changes compared to HEAD:
    > - smgrsw is now dynamically allocated and grows as new storage
    > managers are loaded (during shared_preload_libraries)
    > - CREATE TABLESPACE has new optional syntax USING smgrname (option [, ...])
    > - tablespace storage is (planned) fully managed by smgr through some
    > new smgr apis
    >
    > Changes compared to Anastasia's patch:
    > - extensions do not get to hook and replace the api of the smgr code
    > directly - they are hidden behind the smgr registry.
    >
    > Successes:
    > - 0001 passes tests (make check-world)
    > - 0002 builds without warnings (make)
    >
    > TODO:
    > - fix dependency failures when catcache is unavailable
    > - tablespace redo is currently broken with 0002
    > - fix tests for 0002
    > - ensure that pg_dump etc. works with the new tablespace storage manager
    > options
    >
    > Looking forward to any comments, suggestions and reviews.
    >
    > Kind regards,
    >
    > Matthias van de Meent
    > Neon (https://neon.tech/)
    >
    >
    > [0]
    > https://www.postgresql.org/message-id/CAP4vRV6JKXyFfEOf%3Dn%2Bv5RGsZywAQ3CTM8ESWvgq%2BS87Tmgx_g%40mail.gmail.com
    > [1]
    > https://www.postgresql.org/message-id/D365F19F-BC3E-4F96-A91E-8DB13049749E@yandex-team.ru
    
    
    So, 0002 patch uses the `get_tablespace` function, which searches Catalog
    to tablespace SMGR id. I wonder how `smgr_redo` would work with it?
    Is it possible to query the system catalog during crash recovery? As far as
    i understand the answer is "no", correct me if I'm wrong.
    Furthermore, why do we only allow tablespace to have its own SMGR
    implementation, can we have per-relation SMGR? Maybe we can do it in a way
    similar to custom RMGR (meaning, write SMGR OID into WAL and use it in
    crash recovery etc.)?
    
  6. Re: Extensible storage manager API - SMGR hook Redux

    Matthias van de Meent <boekewurm+postgres@gmail.com> — 2023-12-04T19:21:12Z

    On Mon, 4 Dec 2023 at 17:51, Kirill Reshke <reshkekirill@gmail.com> wrote:
    >
    > So, 0002 patch uses the `get_tablespace` function, which searches Catalog to tablespace SMGR id. I wonder how `smgr_redo` would work with it?
    
    That's a very good point I hadn't considered in detail yet. Quite
    clearly, the current code is wrong in assuming that the catalog is
    accessible, and it should probably be stored in a way similar to
    pg_filenode.map in a file managed outside the buffer pool.
    
    > Is it possible to query the system catalog during crash recovery? As far as i understand the answer is "no", correct me if I'm wrong.
    
    Yes, you're correct, we can't access buffers like this during
    recovery. That's going to need some more effort.
    
    > Furthermore, why do we only allow tablespace to have its own SMGR implementation, can we have per-relation SMGR? Maybe we can do it in a way similar to custom RMGR (meaning, write SMGR OID into WAL and use it in crash recovery etc.)?
    
    AMs (and by extension, their RMGRs) that use Postgres' buffer pool
    have control over how they want to layout their blocks and files, but
    generally don't care about where those blocks and files are located,
    as long as they _can_ be retrieved.
    
    Tablespaces, however, describe 'drives' or 'storage pools' in which
    the tables/relations are stored, which to me seems to be the more
    logical place to configure the SMGR abstraction of how and where to
    store the actual data, as SMGRs manage the low-level relation block IO
    (= file accesses), and tablespaces manage where files are stored.
    
    Note that nothing prevents you from using one tablespace (thus
    different SMGR) per relation, apart from bloated catalogs and the
    superuser permissions required for creating those tablespaces. It'd be
    difficult to manage, but not impossible.
    
    Kind regards,
    
    Matthias van de Meent
    Neon (https://neon.tech)
    
    
    
    
  7. Re: Extensible storage manager API - SMGR hook Redux

    Kirill Reshke <reshkekirill@gmail.com> — 2023-12-04T21:02:59Z

    On Mon, 4 Dec 2023 at 22:21, Matthias van de Meent <
    boekewurm+postgres@gmail.com> wrote:
    
    > On Mon, 4 Dec 2023 at 17:51, Kirill Reshke <reshkekirill@gmail.com> wrote:
    > >
    > > So, 0002 patch uses the `get_tablespace` function, which searches
    > Catalog to tablespace SMGR id. I wonder how `smgr_redo` would work with it?
    >
    > That's a very good point I hadn't considered in detail yet. Quite
    > clearly, the current code is wrong in assuming that the catalog is
    > accessible, and it should probably be stored in a way similar to
    > pg_filenode.map in a file managed outside the buffer pool.
    >
    > Hmm, pg_filenode.map  is a nice idea. So, simply maintain TableSpaceOId ->
    smgr id mapping in a separate file and update the whole file on any
    changes, right?
    Looks reasonable to me, but it is clear that this solution can be really
    slow in some patterns, like if we create many-many tablespaces(the way you
    suggested it in the per-relation SMGR feature). Maybe we can store data in
    files somehow separately, and only update one chunk per operation.
    
    Anyway, if we use a `pg_filenode.map` - like solution, we need to reuse its
    code infrasture, right? For example, it seems that code that calculates
    checksums can be reused.
    So, we need to refactor code here, define something like FileMap API maybe.
    Or is it not really worth it? We can just write similar code twice.
    
  8. Re: Extensible storage manager API - SMGR hook Redux

    Matthias van de Meent <boekewurm+postgres@gmail.com> — 2023-12-04T21:30:36Z

    On Mon, 4 Dec 2023 at 22:03, Kirill Reshke <reshkekirill@gmail.com> wrote:
    >
    > On Mon, 4 Dec 2023 at 22:21, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
    >>
    >> On Mon, 4 Dec 2023 at 17:51, Kirill Reshke <reshkekirill@gmail.com> wrote:
    >> >
    >> > So, 0002 patch uses the `get_tablespace` function, which searches Catalog to tablespace SMGR id. I wonder how `smgr_redo` would work with it?
    >>
    >> That's a very good point I hadn't considered in detail yet. Quite
    >> clearly, the current code is wrong in assuming that the catalog is
    >> accessible, and it should probably be stored in a way similar to
    >> pg_filenode.map in a file managed outside the buffer pool.
    >>
    > Hmm, pg_filenode.map  is a nice idea. So, simply maintain TableSpaceOId -> smgr id mapping in a separate file and update the whole file on any changes, right?
    > Looks reasonable to me, but it is clear that this solution can be really slow in some patterns, like if we create many-many tablespaces(the way you suggested it in the per-relation SMGR feature). Maybe we can store data in files somehow separately, and only update one chunk per operation.
    
    Yes, but that's a later issue... I'm not sure many-many tablespaces is
    actually a good thing. There are already very few reasons to store
    tables in more than just the default tablespace. For temporary
    relations, there is indeed a guc to automatically put them into one
    tablespace; and I can see a similar thing being useful for temporary
    relations, too. Then there I can see high-performant local disks vs
    lower-performant (but cheaper) local disks also as something
    reasonable. But that only gets us to ~6 tablespaces, assuming separate
    tablespaces for each combination of (normal, temp, unlogged) * (fast,
    cheap). I'm not sure there are many other reasons to add tablespaces,
    let alone making one for each table.
    
    Note that you can select which tablespace a table is stored in, so I
    see very little reason to actually do something about large numbers of
    tablespaces being prohibitively expensive performance-wise.
    
    Why do you want to have a whole new storage configuration for each of
    your relations?
    
    > Anyway, if we use a `pg_filenode.map` - like solution, we need to reuse its code infrasture, right? For example, it seems that code that calculates checksums can be reused.
    > So, we need to refactor code here, define something like FileMap API maybe. Or is it not really worth it? We can just write similar code twice.
    
    I'm not sure about that. I really doubt we'll need things that are
    that similar: right now, the tablespace->smgr mapping could be
    considered to be implied by the symlinks in /pg_tblspc/. Non-MD
    tablespaces could add a file <oid>.tblspc that detail their
    configuration, which would also fix the issue of spcoid->smgr mapping.
    
    Kind regards,
    
    Matthias van de Meent
    Neon (https://neon.tech)
    
    
    
    
  9. Re: Extensible storage manager API - SMGR hook Redux

    Tristan Partin <tristan@neon.tech> — 2024-01-12T19:57:22Z

    Thought I would show off what is possible with this patchset.
    
    Heikki, a couple of months ago in our internal Slack, said:
    
    > [I would like] a debugging tool that checks that we're not missing any 
    > fsyncs. I bumped into a few missing fsync bugs with unlogged tables 
    > lately and a tool like that would've been very helpful.
    
    My task was to create such a tool, and off I went. I started with the 
    storage manager extension patch that Matthias sent to the list last 
    year[0].
    
    Andres, in another thread[1], said:
    
    > I've been thinking that we need a validation layer for fsyncs, it's too hard
    > to get right without testing, and crash testing is not likel enough to catch
    > problems quickly / resource intensive.
    >
    > My thought was that we could keep a shared hash table of all files created /
    > dirtied at the fd layer, with the filename as key and the value the current
    > LSN. We'd delete files from it when they're fsynced. At checkpoints we go
    > through the list and see if there's any files from before the redo that aren't
    > yet fsynced.  All of this in assert builds only, of course.
    
    I took this idea and ran with it. I call it the fsync_checker™️. It is an 
    extension that prints relations that haven't been fsynced prior to 
    a CHECKPOINT. Note that this idea doesn't work in practice because 
    relations might not be fsynced, but they might be WAL-logged, like in 
    the case of createdb. See log_smgrcreate(). I can't think of an easy way 
    to solve this problem looking at the codebase as it stands.
    
    Here is a description of the patches:
    
    0001:
    
    This is essentially just the patch that Matthias posted earlier, but 
    rebased and adjusted a little bit so storage managers can "inherit" from 
    other storage managers.
    
    0002:
    
    This is an extension of 0001, which allows for extensions to set 
    a global storage manager. This is pretty hacky, and if it was going to 
    be pulled into mainline, it would need some better protection. For 
    instance, only one extension should be able to set the global storage 
    manager. We wouldn't want extensions stepping over each other, etc.
    
    0003:
    
    Adds a hook for extensions to inspect a checkpoint before it actually 
    occurs. The purpose for the fsync_checker is so that I can iterate over
    all the relations the extension tracks to find files that haven't been
    synced prior to the completion of the checkpoint.
    
    0004:
    
    This is the actual fsync_checker extension itself. It must be preloaded.
    
    Hopefully this is a good illustration of how the initial patch could be 
    used, even though it isn't perfect.
    
    [0]: https://www.postgresql.org/message-id/CAEze2WgMySu2suO_TLvFyGY3URa4mAx22WeoEicnK%3DPCNWEMrA%40mail.gmail.com
    [1]: https://www.postgresql.org/message-id/20220127182838.ba3434dp2pe5vcia%40alap3.anarazel.de
    
    -- 
    Tristan Partin
    Neon (https://neon.tech)
    
  10. Re: Extensible storage manager API - SMGR hook Redux

    Aleksander Alekseev <aleksander@timescale.com> — 2024-03-15T14:15:42Z

    Hi,
    
    > Thought I would show off what is possible with this patchset.
    >
    > [...]
    
    Just wanted to let you know that cfbot doesn't seem to be entirely
    happy with the patch [1]. Please consider submitting an updated
    version.
    
    Best regards,
    Aleksander Alekseev (wearing co-CFM hat)
    
    [1]: http://cfbot.cputube.org/
    
    
    
    
  11. Re: Extensible storage manager API - SMGR hook Redux

    Nitin Jadhav <nitinjadhavpostgres@gmail.com> — 2024-09-21T18:24:58Z

    Hi,
    
    I reviewed the discussion and took a look at the patch sets. It seems
    like many things are combined here. Based on the subject, I initially
    thought it aimed to provide the infrastructure to easily extend
    storage managers. This would allow anyone to create their own storage
    managers using this infrastructure. While it addresses this, it also
    includes additional features like fsync_checker, which I believe
    should be a separate feature. Even though it might use the same
    infrastructure, it appears to be a different functionality. I think we
    should focus solely on providing the infrastructure here.
    
    We need to decide on our approach—whether to use a hook-based method
    or a registration-based method—and I believe this requires further
    discussion.
    
    The hook-based approach is simple and works well for anyone writing
    their own storage manager. However, it has its limitations as we can
    either use the default storage manager or a custom-built one for all
    the work load, but we cannot choose between multiple storage managers.
    On the other hand, the registration-based approach allows choosing
    between multiple storage managers based on the workload, though it
    requires a lot of changes.
    
    Are we planning to support other storage managers in PostgreSQL in the
    near future? If not, it is better to go with the hook-based approach.
    Otherwise, the registration-based approach is preferable as it offers
    more flexibility to users and enhances PostgreSQL’s functionality.
    
    Could you please share your thoughts on this? Also, let me know if
    this topic has already been discussed and if any conclusions were
    reached.
    
    Best Regards,
    Nitin Jadhav
    Azure Database for PostgreSQL
    Microsoft
    
    On Sat, Jan 13, 2024 at 1:27 AM Tristan Partin <tristan@neon.tech> wrote:
    >
    > Thought I would show off what is possible with this patchset.
    >
    > Heikki, a couple of months ago in our internal Slack, said:
    >
    > > [I would like] a debugging tool that checks that we're not missing any
    > > fsyncs. I bumped into a few missing fsync bugs with unlogged tables
    > > lately and a tool like that would've been very helpful.
    >
    > My task was to create such a tool, and off I went. I started with the
    > storage manager extension patch that Matthias sent to the list last
    > year[0].
    >
    > Andres, in another thread[1], said:
    >
    > > I've been thinking that we need a validation layer for fsyncs, it's too hard
    > > to get right without testing, and crash testing is not likel enough to catch
    > > problems quickly / resource intensive.
    > >
    > > My thought was that we could keep a shared hash table of all files created /
    > > dirtied at the fd layer, with the filename as key and the value the current
    > > LSN. We'd delete files from it when they're fsynced. At checkpoints we go
    > > through the list and see if there's any files from before the redo that aren't
    > > yet fsynced.  All of this in assert builds only, of course.
    >
    > I took this idea and ran with it. I call it the fsync_checker™️. It is an
    > extension that prints relations that haven't been fsynced prior to
    > a CHECKPOINT. Note that this idea doesn't work in practice because
    > relations might not be fsynced, but they might be WAL-logged, like in
    > the case of createdb. See log_smgrcreate(). I can't think of an easy way
    > to solve this problem looking at the codebase as it stands.
    >
    > Here is a description of the patches:
    >
    > 0001:
    >
    > This is essentially just the patch that Matthias posted earlier, but
    > rebased and adjusted a little bit so storage managers can "inherit" from
    > other storage managers.
    >
    > 0002:
    >
    > This is an extension of 0001, which allows for extensions to set
    > a global storage manager. This is pretty hacky, and if it was going to
    > be pulled into mainline, it would need some better protection. For
    > instance, only one extension should be able to set the global storage
    > manager. We wouldn't want extensions stepping over each other, etc.
    >
    > 0003:
    >
    > Adds a hook for extensions to inspect a checkpoint before it actually
    > occurs. The purpose for the fsync_checker is so that I can iterate over
    > all the relations the extension tracks to find files that haven't been
    > synced prior to the completion of the checkpoint.
    >
    > 0004:
    >
    > This is the actual fsync_checker extension itself. It must be preloaded.
    >
    > Hopefully this is a good illustration of how the initial patch could be
    > used, even though it isn't perfect.
    >
    > [0]: https://www.postgresql.org/message-id/CAEze2WgMySu2suO_TLvFyGY3URa4mAx22WeoEicnK%3DPCNWEMrA%40mail.gmail.com
    > [1]: https://www.postgresql.org/message-id/20220127182838.ba3434dp2pe5vcia%40alap3.anarazel.de
    >
    > --
    > Tristan Partin
    > Neon (https://neon.tech)
    
    
    
    
  12. Re: Extensible storage manager API - SMGR hook Redux

    Xun Gong <gongxun0928@gmail.com> — 2024-12-10T06:49:53Z

    Thank you for your detailed review and insights. I share your view that a
    registration-based approach for custom storage managers (smgr) is more
    versatile. This method allows for the implementation of custom table access
    methods, facilitating the use of various storage services (such as file
    services or object storage), different file organization formats (files in
    one directory  or many sub directories), and flexible deletion logic
    (direct deletion or mark-and-sweep).
    
    While I acknowledge that the registration-based approach requires more
    modifications, I believe the benefits in terms of extensibility and
    functionality are significant. I have seen similar explorations in the
    implementation of AO tables in Greenplum, where a dedicated smgr was
    created due to its distinct file organization compared to heap tables.
    
    I look forward to further discussions on this topic
    
    Nitin Jadhav <nitinjadhavpostgres@gmail.com> 于2024年12月10日周二 11:25写道:
    
    > Hi,
    >
    > I reviewed the discussion and took a look at the patch sets. It seems
    > like many things are combined here. Based on the subject, I initially
    > thought it aimed to provide the infrastructure to easily extend
    > storage managers. This would allow anyone to create their own storage
    > managers using this infrastructure. While it addresses this, it also
    > includes additional features like fsync_checker, which I believe
    > should be a separate feature. Even though it might use the same
    > infrastructure, it appears to be a different functionality. I think we
    > should focus solely on providing the infrastructure here.
    >
    > We need to decide on our approach—whether to use a hook-based method
    > or a registration-based method—and I believe this requires further
    > discussion.
    >
    > The hook-based approach is simple and works well for anyone writing
    > their own storage manager. However, it has its limitations as we can
    > either use the default storage manager or a custom-built one for all
    > the work load, but we cannot choose between multiple storage managers.
    > On the other hand, the registration-based approach allows choosing
    > between multiple storage managers based on the workload, though it
    > requires a lot of changes.
    >
    > Are we planning to support other storage managers in PostgreSQL in the
    > near future? If not, it is better to go with the hook-based approach.
    > Otherwise, the registration-based approach is preferable as it offers
    > more flexibility to users and enhances PostgreSQL’s functionality.
    >
    > Could you please share your thoughts on this? Also, let me know if
    > this topic has already been discussed and if any conclusions were
    > reached.
    >
    > Best Regards,
    > Nitin Jadhav
    > Azure Database for PostgreSQL
    > Microsoft
    >
    > On Sat, Jan 13, 2024 at 1:27 AM Tristan Partin <tristan@neon.tech> wrote:
    > >
    > > Thought I would show off what is possible with this patchset.
    > >
    > > Heikki, a couple of months ago in our internal Slack, said:
    > >
    > > > [I would like] a debugging tool that checks that we're not missing any
    > > > fsyncs. I bumped into a few missing fsync bugs with unlogged tables
    > > > lately and a tool like that would've been very helpful.
    > >
    > > My task was to create such a tool, and off I went. I started with the
    > > storage manager extension patch that Matthias sent to the list last
    > > year[0].
    > >
    > > Andres, in another thread[1], said:
    > >
    > > > I've been thinking that we need a validation layer for fsyncs, it's
    > too hard
    > > > to get right without testing, and crash testing is not likel enough to
    > catch
    > > > problems quickly / resource intensive.
    > > >
    > > > My thought was that we could keep a shared hash table of all files
    > created /
    > > > dirtied at the fd layer, with the filename as key and the value the
    > current
    > > > LSN. We'd delete files from it when they're fsynced. At checkpoints we
    > go
    > > > through the list and see if there's any files from before the redo
    > that aren't
    > > > yet fsynced.  All of this in assert builds only, of course.
    > >
    > > I took this idea and ran with it. I call it the fsync_checker™️. It is an
    > > extension that prints relations that haven't been fsynced prior to
    > > a CHECKPOINT. Note that this idea doesn't work in practice because
    > > relations might not be fsynced, but they might be WAL-logged, like in
    > > the case of createdb. See log_smgrcreate(). I can't think of an easy way
    > > to solve this problem looking at the codebase as it stands.
    > >
    > > Here is a description of the patches:
    > >
    > > 0001:
    > >
    > > This is essentially just the patch that Matthias posted earlier, but
    > > rebased and adjusted a little bit so storage managers can "inherit" from
    > > other storage managers.
    > >
    > > 0002:
    > >
    > > This is an extension of 0001, which allows for extensions to set
    > > a global storage manager. This is pretty hacky, and if it was going to
    > > be pulled into mainline, it would need some better protection. For
    > > instance, only one extension should be able to set the global storage
    > > manager. We wouldn't want extensions stepping over each other, etc.
    > >
    > > 0003:
    > >
    > > Adds a hook for extensions to inspect a checkpoint before it actually
    > > occurs. The purpose for the fsync_checker is so that I can iterate over
    > > all the relations the extension tracks to find files that haven't been
    > > synced prior to the completion of the checkpoint.
    > >
    > > 0004:
    > >
    > > This is the actual fsync_checker extension itself. It must be preloaded.
    > >
    > > Hopefully this is a good illustration of how the initial patch could be
    > > used, even though it isn't perfect.
    > >
    > > [0]:
    > https://www.postgresql.org/message-id/CAEze2WgMySu2suO_TLvFyGY3URa4mAx22WeoEicnK%3DPCNWEMrA%40mail.gmail.com
    > > [1]:
    > https://www.postgresql.org/message-id/20220127182838.ba3434dp2pe5vcia%40alap3.anarazel.de
    > >
    > > --
    > > Tristan Partin
    > > Neon (https://neon.tech)
    >
    >
    >
    >
    >
    
  13. Re: Extensible storage manager API - SMGR hook Redux

    Andreas Karlsson <andreas@proxel.se> — 2025-02-03T11:27:23Z

    Hi!
    
    We at Percona are very interested in this patch for our transparent data 
    encryption extension. So we would love to collaborate with you, and 
    anyone else interested, on making the SMGR extensible.
    
    I have attached rebased and a bit cleaned up versions of Tristan's 
    patches plus a couple of patches we have been working on in-house 
    (mainly my colleague Zsolt). I also have some questions which I would 
    like to discuss.
    
    0001-0004
    
    The same patches as Tristan posted but rebased and cleaned up a bit to 
    better follow the code style. I also removed a couple of dead variables 
    which seemed like left overs.
    
    0005
    
    Since we support having both encrypted and unencrypted relations we use 
    the RelFileLocator to look up if a relation is encrypted. And to 
    preserve that information when smgrcreate() creates a new relfile for a 
    relation we pass along the old RelFileLocator.
    
    For our use case it is possible that we could solve this in other ways. 
    For example if we decide to go with configuring the SMGR per schema this 
    will probably not be necessary at all.
    
    0006
    
    The patch introduces the concept of "chaining" SMGRs where we have tail 
    (e.g. md or a theoretical Ceph SMGR) and modifier (e.g. TDE or the 
    fsync_checker). Something like this would be useful for our case since 
    it would be nice to be able to use the same encryption code for md and 
    for some other potential replacement for md which uses some kind object 
    storage for example.
    
    As a bonus this allowed us to make the functions implementing md static.
    
    It is currently controlled via a GUC, smgr_chain, but this will of 
    course depend on how we decide to implement configuring which SMGR to use.
    
    Questions
    
    - What is up with the barrier when loading SMGRs? That does not seem 
    necessary or am I missing something? I believe Andres also spotted this.
    
    - How should we configure which SMGR to use for each relation? People 
    have talked about doing it per tablespace or using hooks and we have a 
    patch which uses a GUC for this. I have personally not researched these 
    options enough to have an opinion yet.
    
    - Is our idea about chaining SMGRs useful? In its current form or some 
    variant inspired by it?
    
    - We need to benchmark this to make sure we do not introduce too much 
    overhead, especially for people who just want to use md. I saw for 
    example that Andres had some complaint about extra indirection which we 
    may have to address.
    
    Andreas
    
  14. Re: Extensible storage manager API - SMGR hook Redux

    Andreas Karlsson <andreas@proxel.se> — 2025-02-03T12:41:15Z

    On 9/21/24 8:24 PM, Nitin Jadhav wrote:
    > I reviewed the discussion and took a look at the patch sets. It seems
    > like many things are combined here. Based on the subject, I initially
    > thought it aimed to provide the infrastructure to easily extend
    > storage managers. This would allow anyone to create their own storage
    > managers using this infrastructure. While it addresses this, it also
    > includes additional features like fsync_checker, which I believe
    > should be a separate feature. Even though it might use the same
    > infrastructure, it appears to be a different functionality. I think we
    > should focus solely on providing the infrastructure here.
    
    I personally think that it is fine that there are patches which provide 
    a PoC implementation of the new API. It is hard to verify if an API is 
    correct if there are zero alternative implementations. And there is also 
    a case to be made for having one of them in contrib just to make it hard 
    for us to break the API for external users.
    
    That said I have not made up my mind yet if this is a good extension for 
    contrib.
    
    > We need to decide on our approach—whether to use a hook-based method
    > or a registration-based method—and I believe this requires further
    > discussion.
    
    100% agreed.
    
    > The hook-based approach is simple and works well for anyone writing
    > their own storage manager. However, it has its limitations as we can
    > either use the default storage manager or a custom-built one for all
    > the work load, but we cannot choose between multiple storage managers.
    > On the other hand, the registration-based approach allows choosing
    > between multiple storage managers based on the workload, though it
    > requires a lot of changes.
    > 
    > Are we planning to support other storage managers in PostgreSQL in the
    > near future? If not, it is better to go with the hook-based approach.
    > Otherwise, the registration-based approach is preferable as it offers
    > more flexibility to users and enhances PostgreSQL’s functionality.
    > 
    > Could you please share your thoughts on this? Also, let me know if
    > this topic has already been discussed and if any conclusions were
    > reached.
    
    I do not think there is any plan for core to support multiple storage 
    managers, but there are open source thrid party extensions which plan to 
    implement this API once it has been merged.
    
    Andreas
    
    
    
    
    
  15. Re: Extensible storage manager API - SMGR hook Redux

    Andreas Karlsson <andreas@proxel.se> — 2025-03-07T11:52:08Z

    Hi,
    
    Here is a rebased version of it to make the CI happy. I plan to work 
    more on this next week but am happy with any feedback on what is already 
    there.
    
    Andreas
    
  16. Re: Extensible storage manager API - SMGR hook Redux

    Kirill Reshke <reshkekirill@gmail.com> — 2025-03-10T12:30:23Z

    On Fri, 7 Mar 2025 at 16:52, Andreas Karlsson <andreas@proxel.se> wrote:
    >
    > Hi,
    
    Hi!
    
    > Here is a rebased version of it to make the CI happy.
    
    Looks like CI is still unhappy with this change[0]
    
    0001:
    
    >+
    >+SMgrId
    >+smgr_register(const f_smgr *smgr, Size smgrrelation_size)
    ...
    
    > + Assert(smgr->smgr_open != NULL);
    > + Assert(smgr->smgr_close != NULL);
    > + Assert(smgr->smgr_create != NULL);
    > + Assert(smgr->smgr_exists != NULL);
    > + Assert(smgr->smgr_unlink != NULL);
    > + Assert(smgr->smgr_extend != NULL);
    > + Assert(smgr->smgr_zeroextend != NULL);
    > + Assert(smgr->smgr_prefetch != NULL);
    > + Assert(smgr->smgr_readv != NULL);
    > + Assert(smgr->smgr_writev != NULL);
    > + Assert(smgr->smgr_writeback != NULL);
    > + Assert(smgr->smgr_nblocks != NULL);
    > + Assert(smgr->smgr_truncate != NULL);
    > + Assert(smgr->smgr_immedsync != NULL);
    
    Are we sure we need to force extension authors to implement prefetch?
    Also, do we intentionally skip Assert on smgr_registersync and
    smgr_init here? I am not questioning smgr_shutdown here, as I can see
    it is NULL for md implementation.
    
    
    0002:
    should we merge this with 0001?
    
    
    0003: Looks mature, no comments.
    
    0004:
    It's a bit strange to place fsync_checker under contrib, huh? Like,
    you will never use it in production. Maybe src/test/modules is a
    better place?
    
    0005:
    We are missing rationale for this change in the commit message.
    
    I didn't look at the 0006 modifications. Later, I'll try to take another look.
    
    [0] https://cirrus-ci.com/task/6466113875214336
    -- 
    Best regards,
    Kirill Reshke
    
    
    
    
  17. Re: Extensible storage manager API - SMGR hook Redux

    vignesh C <vignesh21@gmail.com> — 2025-03-17T06:52:27Z

    On Fri, 7 Mar 2025 at 17:22, Andreas Karlsson <andreas@proxel.se> wrote:
    >
    > Hi,
    >
    > Here is a rebased version of it to make the CI happy. I plan to work
    > more on this next week but am happy with any feedback on what is already
    > there.
    
    I noticed that Kirill's comments from [1] are not yet addressed, I
    have changed the commitfest entry status to "Waiting on Author. Please
    address them and change the status to "Needs Review".
    [1] - https://www.postgresql.org/message-id/CALdSSPimrJWeex1RbvVXoGCROLiC6VgKUdEE0pUcib=GNYo58g@mail.gmail.com
    
    Regards,
    Vignesh