Thread

File locks for data directory lockfile in the context of Linux namespaces

Dmitry Dolgov <9erthalion6@gmail.com> — 2025-12-19T14:27:40Z

Hi,

TL;DR This is a proposal to use file locking with a data directory lockfile at
startup, which helps to avoid potential Linux PID namespace visibility issues.

Recently I've stumbled upon a quite annoying problem, which will require a bit
of explanation. Currently at startup, if the data directory lockfile exists,
postgres inspects it and assumes that if it contains the same PID as the
current process, it must be a stale file left over from before a system
reboot, with the same PID happening to be assigned again. But it seems there
is another possible scenario: two postgres instances are running concurrently
inside different PID namespaces, they don't see each other, and they have the
same PID assigned within their respective namespaces.
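
To make the failure mode concrete, the check described above can be
paraphrased roughly as follows. This is only a simplified sketch, not the
actual PostgreSQL source: the function name is hypothetical, and the real
startup code does more than this.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, a simplified paraphrase of the startup check.
 * Returns 1 if the PID found in the lockfile seems to belong to another
 * live process, 0 if the lockfile is treated as stale.
 */
static int
lockfile_owner_looks_alive(pid_t lockfile_pid, pid_t my_pid)
{
    /* Same PID as ours: assumed to be a leftover from before a reboot. */
    if (lockfile_pid == my_pid)
        return 0;

    /* Signal 0 only probes for existence, and only within our own PID
     * namespace; a process in another namespace is invisible here. */
    if (kill(lockfile_pid, 0) == 0)
        return 1;

    /* ESRCH: no such process. EPERM: it exists but belongs to another
     * user. Any other error is treated conservatively as "alive". */
    return (errno != ESRCH && errno != EPERM);
}
```

With two instances in separate PID namespaces, either the first branch is
taken (both got the same PID) or kill() fails with ESRCH (the other PID is not
visible), so in both cases the lockfile looks stale.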

It's relatively easy to use pid/ipc/net namespaces to construct a situation
in which two postgres instances run in parallel on the same data directory and
do not notice a thing, something like this:

sudo unshare --ipc --net --pid --fork --mount-proc \
bash -c 'sudo -u postgres postgres -D data'

This of course can lead to all sorts of nasty issues, but looks very artificial
at first -- obviously whoever is responsible for namespace management must also
take care of data access isolation.

But it turns out situations like this can indeed happen in practice when it
comes to container orchestration, mostly due to lack of knowledge or
misunderstanding of the documentation. Kubernetes has one particular access
mode for volumes, ReadWriteOnce [1], which is often assumed to be good enough
-- but it guarantees only a single mount per node, not per pod. Kubernetes
also allows forced pod termination [2], which removes the pod from the API,
but still gives the pod some grace period to finish. All of this can lead to
an unfortunate sequence of events:

* A postgres pod is forcefully terminated and removed from the API right away.
* A new postgres pod is started instead (there is nothing in the API, so
  why not), while the old one is still terminating.
* If they are using RWO mode and land on the same node, the new pod
  immediately gets the data volume and can access it while the old pod is
  still terminating.

In the end we get a situation similar to what I've described above, and
strangely enough it looks like this indeed happens in the field.

It's fair to say that it's a Kubernetes issue (there are warnings about
that in the documentation), and PostgreSQL doesn't have anything to do
with that. But taking into account the general possibility of confusing
PostgreSQL with Linux namespaces, it looks to me like one of those "shoot
yourself in the foot" situations, and I became curious whether there is
any easy way to improve things.

The root of the problem is the lack of any time-related information that
PostgreSQL could use to distinguish between two scenarios: a single container
was killed and started again later, or two containers are running at the same
time. After some experimenting, it looks like the only plausible answer could
be file locking for the data directory lockfile.

This approach has been discussed many times on hackers, and from what I see
there are a few arguments against using file locking as the main mechanism for
protecting the data directory content:

* Portability. There are two types of file locks, advisory record locks (POSIX)
  and open file description locks. The former have a set of flaws, but the
  most important one for this discussion is that advisory record locks are
  associated with a process and thus affected by PID namespace isolation. The
  latter are associated with open file descriptions and are a suitable
  solution to fix the problem. Originally open file description locks were
  non-POSIX, but it looks like they have become a part of POSIX.1-2024 (see
  F_OFD_SETLK) [3].

* Issues with NFS. It turns out NFSv3 does not support open file description
  locks and converts them into advisory record locks. For our purposes it
  means that the approach will not change anything for NFSv3. Regarding NFSv4,
  it uses some sort of lease system for locking, and I haven't found anything
  claiming that locks will be converted to advisory record locks.

With this in mind, it seems to me that adding file locking to the data
directory lockfile as a "best effort" approach (i.e. if it doesn't work, we
continue as before) on top of the already existing mechanism will improve
things in most cases, while keeping the status quo in the others. I've
attached a quick sketch of what the patch might look like.
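
Independently of the attached patch, the "best effort" idea can be illustrated
with a small standalone sketch. The helper name and the result codes below are
hypothetical and chosen only for this example; it assumes Linux with glibc,
where F_OFD_SETLK is exposed under _GNU_SOURCE.

```c
#define _GNU_SOURCE             /* glibc exposes F_OFD_SETLK under this macro */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical result codes for this sketch. */
#define LOCKFILE_LOCKED        1    /* we now hold the lock */
#define LOCKFILE_BUSY          0    /* another open file description holds it */
#define LOCKFILE_UNSUPPORTED (-1)   /* continue with the existing checks only */

/* Try to take an exclusive, non-blocking OFD lock on the whole lockfile. */
static int
try_lock_lockfile(int fd)
{
#ifdef F_OFD_SETLK
    struct flock lk;

    assert(fd >= 0);
    memset(&lk, 0, sizeof(lk));     /* l_pid must be 0 for OFD locks */
    lk.l_type = F_WRLCK;
    lk.l_whence = SEEK_SET;
    lk.l_start = 0;
    lk.l_len = 0;                   /* 0 means "until end of file" */

    if (fcntl(fd, F_OFD_SETLK, &lk) == 0)
        return LOCKFILE_LOCKED;
    if (errno == EAGAIN || errno == EACCES)
        return LOCKFILE_BUSY;
    return LOCKFILE_UNSUPPORTED;    /* e.g. EINVAL on a kernel without OFD locks */
#else
    (void) fd;
    return LOCKFILE_UNSUPPORTED;
#endif
}
```

In such a scheme only LOCKFILE_BUSY would make startup fail; LOCKFILE_UNSUPPORTED
would leave the behaviour as it is today. The lock belongs to the open file
description, so it is held for as long as some descriptor referring to that
description stays open, regardless of which PID namespace the holder lives in.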

Any thoughts / comments on that?

[1]: https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes
[2]: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination-forced
[3]: https://pubs.opengroup.org/onlinepubs/9799919799/functions/fcntl.html