Thread

  1. clog segment truncation

    Grigory Smolkin <smallkeen@gmail.com> — 2025-11-10T12:29:32Z

    Hello, hackers!
    
    I`m currently investigating the case of one of our PostgreSQL instance
    having started issuing error messages about missing clog segment:
    ERROR:  could not access status of transaction 1550558894
    DETAIL:  Could not open file "pg_xact/05C6": No such file or directory.
    
    
    After examining the WAL records just prior the first sighing of these
    errors, it became apparent, that clog segment was truncated by autovacuum
    rmgr: CLOG        len (rec/tot):     38/    38, tx:          0, lsn:
    81A/87C899E0, prev 81A/87C895A0, desc: TRUNCATE page 48264; oldestXact
    1581542039
    
    
    After looking closely at the affected heap pages (there are scores of them)
    with tuples, created by 1550558894 transaction, and comparing them with
    pg_visibility state and pg_class.relfrozenxig, it became apparent, that
    there is may be something awry with either setting all_frozen bit in _vm,
    or oldestXact calculation by vacuum.
    
    According to pg_visibility, all_frozen bit is set:
    wms-svc-inspecting=# select * from pg_visibility('affected_relation',
    5456224)
    wms-svc-inspecting-# ;
     all_visible | all_frozen | pd_all_visible
    -------------+------------+----------------
     t           | t          | f
    (1 row)
    
    So all tuples on this page should be marked with HEAP_XMIN_FROZEN, but that
    is not the case
    
    SELECT lp, lp_off, lp_flags, lp_len, t_xmin, t_xmax, t_field3, t_ctid,
    t_infomask2, t_infomask, t_hoff, t_bits,
    heap_tuple_infomask_flags(t_infomask, t_infomask2) FROM
    heap_page_items(get_raw_page('affected_relation', 5456224)) ;
    
    -[ RECORD 1
    ]-------------+-----------------------------------------------------------------------------
    lp                        | 1
    lp_off                    | 7976
    lp_flags                  | 1
    lp_len                    | 216
    t_xmin                    | 1550558894
    t_xmax                    | 0
    t_field3                  | 0
    t_ctid                    | (5456224,1)
    t_infomask2               | 12
    t_infomask                | 2051
    t_hoff                    | 32
    t_bits                    | 1111111011100000
    heap_tuple_infomask_flags |
    ("{HEAP_HASNULL,HEAP_HASVARWIDTH,HEAP_XMAX_INVALID}",{})
    -[ RECORD 2
    ]-------------+-----------------------------------------------------------------------------
    lp                        | 2
    lp_off                    | 7768
    lp_flags                  | 1
    lp_len                    | 208
    t_xmin                    | 1550558894
    t_xmax                    | 0
    t_field3                  | 0
    t_ctid                    | (5456224,2)
    t_infomask2               | 12
    t_infomask                | 2307
    t_hoff                    | 32
    t_bits                    | 1111111111100000
    heap_tuple_infomask_flags |
    ("{HEAP_HASNULL,HEAP_HASVARWIDTH,HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID}",{})
    -[ RECORD 3
    ]-------------+-----------------------------------------------------------------------------
    lp                        | 3
    lp_off                    | 7520
    lp_flags                  | 1
    lp_len                    | 248
    t_xmin                    | 1550558894
    t_xmax                    | 0
    t_field3                  | 0
    t_ctid                    | (5456224,3)
    t_infomask2               | 12
    t_infomask                | 2051
    t_hoff                    | 32
    t_bits                    | 1111111111100000
    heap_tuple_infomask_flags |
    ("{HEAP_HASNULL,HEAP_HASVARWIDTH,HEAP_XMAX_INVALID}",{})
    -[ RECORD 4
    ]-------------+-----------------------------------------------------------------------------
    ....
    -[ RECORD 23
    ]------------+-----------------------------------------------------------------------------
    lp                        | 23
    lp_off                    | 272
    lp_flags                  | 1
    lp_len                    | 312
    t_xmin                    | 1550558894
    t_xmax                    | 0
    t_field3                  | 0
    t_ctid                    | (5456224,23)
    t_infomask2               | 12
    t_infomask                | 2051
    t_hoff                    | 32
    t_bits                    | 1111111111100000
    heap_tuple_infomask_flags |
    ("{HEAP_HASNULL,HEAP_HASVARWIDTH,HEAP_XMAX_INVALID}",{})
    
    
    Also pg_class.relfrozenxig contains xid which exceed 1550558894:
    # select relname, relfrozenxid from pg_class where oid =
    'affected_relation'::regclass::oid;
    -[ RECORD 1 ]+-----------
    relname      | affected_relation
    relfrozenxid | 1609822074
    
    So is it possible that either all_frozen bit was set incorrectly or wansn't
    unset when tuples were created?
    I will be happy to provide any additional information if required.
    PostgreSQL version: 15.6, data_checksums: on
    
  2. Re: clog segment truncation

    Matheus Alcantara <matheusssilv97@gmail.com> — 2025-11-10T13:05:26Z

    On Mon Nov 10, 2025 at 9:29 AM -03, Smolkin Grigory wrote:
    > Hello, hackers!
    >
    > I`m currently investigating the case of one of our PostgreSQL instance
    > having started issuing error messages about missing clog segment:
    > ERROR:  could not access status of transaction 1550558894
    > DETAIL:  Could not open file "pg_xact/05C6": No such file or directory.
    >
    Is this error happening with the LISTEN/NOTIFY? There is a discussion
    going on [1] that aims to fix this issue for the LISTEN/NOTIFY.
    
    [1] https://www.postgresql.org/message-id/CAK98qZ3wZLE-RZJN_Y%2BTFjiTRPPFPBwNBpBi5K5CU8hUHkzDpw%40mail.gmail.com
    
    -- 
    Matheus Alcantara
    EDB: http://www.enterprisedb.com
    
    
    
    
    
  3. Re: clog segment truncation

    Matheus Alcantara <matheusssilv97@gmail.com> — 2025-11-10T13:43:47Z

    On Mon Nov 10, 2025 at 9:29 AM -03, Smolkin Grigory wrote:
    > So is it possible that either all_frozen bit was set incorrectly or wansn't
    > unset when tuples were created?
    > I will be happy to provide any additional information if required.
    > PostgreSQL version: 15.6, data_checksums: on
    >
    Actually the 15.6 is quite old. I would also recommend to upgrade to the
    latest 15 release which is 15.14.
    
    I didn't find any specific release note about clog truncation but you
    are missing some important fixes though.
    
    There is this fix [1] on 15.8 but I'm not sure if it's fully related
    with your issue faced here.
    
    [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=c809a2b2d
    
    -- 
    Matheus Alcantara
    EDB: http://www.enterprisedb.com
    
    
    
    
    
  4. Re: clog segment truncation

    Grigory Smolkin <smallkeen@gmail.com> — 2025-11-10T14:35:02Z

    > On Mon Nov 10, 2025 at 9:29 AM -03, Smolkin Grigory wrote:
    > > Hello, hackers!
    > >
    > > I`m currently investigating the case of one of our PostgreSQL instance
    > > having started issuing error messages about missing clog segment:
    > > ERROR:  could not access status of transaction 1550558894
    > > DETAIL:  Could not open file "pg_xact/05C6": No such file or directory.
    > >
    > Is this error happening with the LISTEN/NOTIFY? There is a discussion
    > going on [1] that aims to fix this issue for the LISTEN/NOTIFY.
    
    
    No, LISTEN/NOTIFY are not used.
    
    So do you have a step by step to reproduce the issue?
    
    
    Unfortunately, no.
    
    
    > Or at least some more information about what the server was running that
    > caused the
    > error?
    
    
    By the looks of it, nothing special, really, just plain inserts and
    occasional index-scan reads (tuple (5456224,2) looks like it was read by
    index-scan, because it has HEAP_XMIN_COMMITTED stamped on).
    Instance wasn't subjected to power outages, kernel crashes
    or hardware issues.
    
    On Mon, Nov 10, 2025 at 4:43 PM Matheus Alcantara <matheusssilv97@gmail.com>
    wrote:
    
    > On Mon Nov 10, 2025 at 9:29 AM -03, Smolkin Grigory wrote:
    > > So is it possible that either all_frozen bit was set incorrectly or
    > wansn't
    > > unset when tuples were created?
    > > I will be happy to provide any additional information if required.
    > > PostgreSQL version: 15.6, data_checksums: on
    > >
    > Actually the 15.6 is quite old.
    >
    
    It is, but I didn't see any fixes that can possibly be related to this
    case, but maybe I've missed something.
    
    
    >
    > I didn't find any specific release note about clog truncation but you
    > are missing some important fixes though.
    >
    > There is this fix [1] on 15.8 but I'm not sure if it's fully related
    > with your issue faced here.
    >
    > [1]
    > https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=c809a2b2d
    >
    > Looks like it is more about multi-xact truncation, so I'm also not sure.