Thread

  1. File locks for data directory lockfile in the context of Linux namespaces

    Dmitry Dolgov <9erthalion6@gmail.com> — 2025-12-19T14:27:40Z

    Hi,
    
    TL;DR This is a proposal to use file locking with a data directory lockfile at
    startup, which helps to avoid potential Linux PID namespace visibility issues.
    
    Recently I've stumbled upon a quite annoying problem, which will require a bit
    of explanation. Currently at startup, if the data directory lockfile exists,
    postgres inspects it and assumes that if it contains the same PID as the
    current process, it must be a stale file left over from before a system
    reboot, after which the same PID was assigned again. But it seems there is
    another possible scenario: two postgres instances are running concurrently
    inside different PID namespaces, they don't see each other and have the same
    PID assigned within their respective namespaces.
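
    To make that assumption concrete, here is a simplified model of the check
    in Python. It is a sketch based only on the description above, not the
    actual PostgreSQL code; the function name and the treatment of EPERM are
    assumptions made for illustration.

```python
import os


def lockfile_blocks_startup(recorded_pid, my_pid=None):
    """Hypothetical model: True if the PID found in an existing lockfile
    is treated as a live, conflicting server."""
    if my_pid is None:
        my_pid = os.getpid()

    if recorded_pid == my_pid:
        # Same PID as ours: assumed to be a stale file from before a
        # reboot, after which the PID happened to be assigned again.
        return False

    try:
        # Signal 0 delivers nothing; it only checks whether the PID
        # exists, as seen from the caller's own PID namespace.
        os.kill(recorded_pid, 0)
    except ProcessLookupError:
        # No such process visible from here: treated as stale.
        return False
    except PermissionError:
        # Assumption: a process we may not signal belongs to another
        # user and so is not counted as a conflicting server.
        return False
    return True
```

    Both "stale" branches can be taken while another server is alive: a
    process in a sibling PID namespace is invisible to the existence check,
    and it may well carry the same PID as the new instance.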
    
    It's relatively easy to use pid/ipc/net namespaces to construct a situation
    in which two postgres instances run in parallel on the same data directory and
    do not notice a thing, something like this:
    
        sudo unshare --ipc --net --pid --fork --mount-proc \
            bash -c 'sudo -u postgres postgres -D data'
    
    This of course can lead to all sorts of nasty issues, but looks very artificial
    at first -- obviously whoever is responsible for namespace management must also
    take care of data access isolation.
    
    But it turns out situations like this can indeed happen in practice when it
    comes to container orchestration, mostly due to lack of knowledge or
    misunderstanding of the documentation. Kubernetes has one particular access
    mode for volumes, ReadWriteOnce (RWO) [1], which is often assumed to be good
    enough -- but it guarantees only a single mount per node, not per pod.
    Kubernetes also allows a forced pod termination [2], which removes the pod
    from the API, but still gives some grace period for the pod to finish. All of
    this can lead to an unfortunate sequence of events:
    
    * A postgres pod is forcefully terminated and removed from the API right away.
    * A new postgres pod is started instead (there is nothing in the API, so
      why not), while the old one is still terminating.
    * If the volume uses RWO mode, the new pod will immediately get the data
      volume and can access it while the old pod is still terminating.
    
    In the end we get a situation similar to what I've described above, and
    strangely enough it looks like this indeed happens in the field.
    
    It's fair to say that it's a Kubernetes issue (there are warnings about
    that in the documentation), and PostgreSQL doesn't have anything to do
    with that. But taking into account the general possibility of confusing
    PostgreSQL with Linux namespaces, it looks to me like one of those "shoot
    yourself in the foot" situations, and I became curious whether there is
    any easy way to improve things.
    
    The root of the problem is the lack of any time-related information that
    PostgreSQL could use to distinguish between two scenarios: a single container
    was killed and started again later, or two containers run at the same time.
    After some experimenting, it looks like the only plausible answer is file
    locking on the data directory lockfile.
    
    This approach has been discussed many times on hackers, and from what I see
    there are a few arguments against using file locking as the main mechanism for
    protecting the data directory content:
    
    * Portability. There are two types of file locks: advisory record locks (POSIX)
      and open file description locks (originally non-POSIX). The former have a
      set of flaws, but the most important one for this discussion is that
      advisory record locks are associated with a process and thus affected by PID
      namespace isolation. The latter are associated with open file descriptions
      and are a suitable solution to the problem. Open file description locks
      seem to have become part of POSIX.1-2024 (see F_OFD_SETLK) [3].
    
    * Issues with NFS. It turns out NFSv3 does not support open file description
      locks and converts them into advisory record locks. For our purposes it
      means that the approach will not change anything for NFSv3. NFSv4 uses some
      sort of lease system for locking, and I haven't found anything claiming that
      locks will be converted to advisory record locks.
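
    The difference in lock ownership between the two lock types can be seen
    with a small Python experiment that opens the same file twice within one
    process. This is only an illustration: the helper names are made up, and
    the struct flock layout is an assumption that holds for 64-bit
    Linux/glibc (fcntl.F_OFD_SETLK needs Python 3.9+ on Linux).

```python
import errno
import fcntl
import os
import struct
import tempfile


def whole_file_write_lock():
    # Assumed layout of struct flock on 64-bit Linux/glibc:
    #   short l_type; short l_whence; off_t l_start; off_t l_len; pid_t l_pid
    # l_len == 0 means "to end of file"; l_pid must be 0 for OFD locks.
    return struct.pack("hhqqi4x", fcntl.F_WRLCK, os.SEEK_SET, 0, 0, 0)


def try_lock(fd, cmd):
    """Non-blocking attempt to write-lock the whole file using cmd."""
    try:
        fcntl.fcntl(fd, cmd, whole_file_write_lock())
        return True
    except OSError as exc:
        if exc.errno in (errno.EAGAIN, errno.EACCES):
            return False  # somebody else owns a conflicting lock
        raise


def lock_twice(cmd):
    """Lock one file through two independent open() calls in the same
    process; return (first_succeeded, second_succeeded)."""
    with tempfile.NamedTemporaryFile() as tmp:
        fd1 = os.open(tmp.name, os.O_RDWR)
        fd2 = os.open(tmp.name, os.O_RDWR)
        try:
            return try_lock(fd1, cmd), try_lock(fd2, cmd)
        finally:
            os.close(fd2)
            os.close(fd1)
```

    With fcntl.F_SETLK the second attempt is expected to succeed, because
    the owner of a record lock is the process. With fcntl.F_OFD_SETLK the
    second attempt is expected to fail, because the owner is the open file
    description created by each open() call, regardless of which process
    holds it.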
    
    With this in mind, it seems to me that adding file locking to the data
    directory lockfile as a "best effort" approach (i.e. if it doesn't work, we
    continue as before) on top of the already existing mechanism will improve
    most cases, while keeping the status quo for the others. I've attached a
    quick sketch of what the patch might look like.
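
    The "best effort" behaviour described above could be modelled roughly
    as follows. This is not the attached patch, only a Python sketch of the
    idea; the function name, the status strings and the assumed struct flock
    layout (64-bit Linux/glibc) are all hypothetical.

```python
import errno
import fcntl
import os
import struct
import tempfile

# Whole-file write lock; l_pid (last field) must be 0 for OFD locks.
# Assumed struct flock layout for 64-bit Linux/glibc.
WHOLE_FILE_WRLCK = struct.pack("hhqqi4x", fcntl.F_WRLCK, os.SEEK_SET, 0, 0, 0)


def try_lock_lockfile(path):
    """Hypothetical helper. Returns (status, fd) where status is:
       "locked"      - we own the lock; keep fd open while the server runs
       "held"        - another open file description owns it; do not start
       "unsupported" - locking did not work; rely on the existing checks
    """
    fd = os.open(path, os.O_RDWR)
    cmd = getattr(fcntl, "F_OFD_SETLK", None)
    if cmd is None:
        return "unsupported", fd
    try:
        fcntl.fcntl(fd, cmd, WHOLE_FILE_WRLCK)
    except OSError as exc:
        if exc.errno in (errno.EAGAIN, errno.EACCES):
            os.close(fd)
            return "held", None
        # Anything else (e.g. EINVAL, ENOLCK): continue as before.
        return "unsupported", fd
    return "locked", fd


def demo():
    """Two 'instances' against one lockfile, then a restart after the
    first one has gone away."""
    with tempfile.NamedTemporaryFile() as tmp:
        first, fd1 = try_lock_lockfile(tmp.name)
        second, fd2 = try_lock_lockfile(tmp.name)
        os.close(fd1)  # first instance exits, which releases its lock
        third, fd3 = try_lock_lockfile(tmp.name)
        for fd in (fd2, fd3):
            if fd is not None:
                os.close(fd)
        return [first, second, third]
```

    The descriptor has to stay open for the lifetime of the server, since
    closing the last descriptor that refers to the open file description
    drops the lock. That also gives the time-related signal mentioned
    earlier: the lock disappears when its holder does, so a restart after a
    crash is not blocked, while a concurrently running instance is.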
    
    Any thoughts or comments on that?
    
    [1]: https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes
    [2]: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination-forced
    [3]: https://pubs.opengroup.org/onlinepubs/9799919799/functions/fcntl.html