Thread

  1. Re: Potential "AIO / io workers" inter-worker locking issue in PG18?

    Thom Brown <thom@linux.com> — 2025-10-05T10:51:19Z

    On Sun, 5 Oct 2025, 10:52 Marco Boeringa, <marco@boeringa.demon.nl> wrote:
    
    > Hi,
    >
    > I currently run PG18 + PostGIS 3.6.0 on an Ubuntu 24.04 VM guest as
    > Windows 10 Hyper-V virtual machine.
    >
    > The machine is a dedicated refurbished HP Z840 local workstation with
    > 2x22 cores (E5-2699 v4) with 512 GB RAM and a 10 TB NVMe raid-0, with
    > the Ubuntu guest having 400 GB RAM available.
    >
    > On this machine, which is dedicated to just one custom written
    > geoprocessing workflow involving OpenStreetMap data, I have successfully
    > processed up to global OpenStreetMap Facebook Daylight distribution
    > data, with up to > 2.4 B record Polygon table for all Facebook Daylight
    > buildings. So this has proven a very capable system.
    >
    > However, after upgrading to PG18 and the switch to the "io_method =
    > worker" setting (tested with 3, 5, 16 and 22 workers), I am seeing an
    > issue where it appears there may be a major issue with io workers
    > potentially getting into some sort of locking conflict, that takes hours
    > to resolve.
    >
    > The custom written geoprocessing workflow uses Python multi-threading
    > based on the Python 'concurrent.futures' framework in combination with
    > either pyodbc or psycopg2 as database connector to implement a powerful
    > parallel processing solution to speed up some of the computationally
    > intensive tasks (which use UPDATEs), which I generally use with up to 44
    > threads to fully saturate the dual CPU 44 core system. The custom code
    > creates a pool of jobs to process for the threads, with the code being
    > designed to minimize inter-thread locking issues by taking into account
    > PostgreSQL page locality (although the actual records to process are not
    > assigned by pages but by unique IDs in the tables). Basically, the code
    > is designed such that different threads never attempt to access the same
    > database pages, as each thread gets it own unique pages assigned, thus
    > avoiding inter-thread locking conflicts. This has worked really well in
    > the past, with system usage maximized over all cores and significantly
    > speeding up processing. Jobs are implemented as database VIEWs, that
    > point to the records to process via the unique ID of each. These views
    > must of course be read by each thread, which is probably where the PG18
    > io workers kick-in.
    >
    > This has worked really well in previous versions of PostgreSQL (tested
    > up to PG17). However, in PG18, during the multi-threaded processing, I
    > see some of my submitted jobs that in this case were run against a small
    > OpenStreetMap Italy extract of Geofabrik, all of a sudden take > 1 hour
    > to finish (up to 6 hours for this small extract), even though similar
    > jobs from the same processing step, finish in less than 10 seconds (and
    > the other jobs should as well). This seems to happen kind of "random".
    > Many multi-threading tasks before and after the affected processing
    > steps, do finish normally.
    >
    > When this happens, I observe the following things:
    >
    > - High processor activity, even though the jobs that should finish in
    > seconds, take hours, all the while showing the high core usage.
    >
    > - PgAdmin shows all sessions created by the Python threads as 'active',
    > with *no* wait events attached.
    >
    > - The pg_locks table does not show locking conflicts, all locks are
    > granted. I did notice however, that the relation / table locks were not
    > "fastpath" locks, but ordinary ones. All other locks taken, e.g. on
    > indexes related to the same table, were fastpath. I don't know if this
    > has any relevance though, as from what I read about the difference, this
    > shouldn't cause such a big difference, not seconds to hours.
    >
    > - Please note that the processing DOES eventually proceed, so it is not
    > an infinite dead-lock or something where I need to kill my Python code.
    > It just takes hours to resolve.
    >
    > - Switching to "io_method = sync" seems to resolve this issue, and I do
    > not observe some jobs of the same batch getting "stuck". This is the
    > behavior I was used to seeing in <=PG17.
    >
    > I am not to familiar with all the internals of PostgreSQL and the new
    > AIO framework and its "io workers". However, it seems there may be some
    > sort of locking issue between io workers that can occasionally happen in
    > PG18 with "io_method = worker"? Is there anyone else observing similar
    > issues in high multi-threaded processing worklflows?
    >
    
    So, to confirm, you get the issue with as little as 3 io_workers?
    
    Also, what is pg_aios telling you during this time?
    
    Thom
    
    >