Thread
-
File locks for data directory lockfile in the context of Linux namespaces
Dmitry Dolgov <9erthalion6@gmail.com> — 2025-12-19T14:27:40Z
Hi,

TL;DR This is a proposal to use file locking with a data directory
lockfile at startup, which helps to avoid potential Linux PID namespace
visibility issues.

Recently I've stumbled upon a quite annoying problem, which will require
a bit of explanation. Currently at startup, if the data directory
lockfile exists, postgres inspects it and assumes that if it contains
the same PID as the current process, it must be a stale file left over
after a system reboot, with the same PID assigned again. But it seems
there is another possible scenario: two postgres instances are running
concurrently inside different PID namespaces, they don't see each other
and have the same PID assigned within their respective namespaces.

It's relatively easy to use pid/ipc/net namespaces to construct a
situation where two postgres instances run in parallel on the same data
directory and do not notice a thing, something like this:

    sudo unshare --ipc --net --pid --fork --mount-proc \
        bash -c 'sudo -u postgres postgres -D data'

This of course can lead to all sorts of nasty issues, but looks very
artificial at first -- obviously whoever is responsible for namespace
management must also take care of data access isolation. But it turns
out situations like this can indeed happen in practice when it comes to
container orchestration, mostly due to lack of knowledge or
misunderstanding of the documentation.

Kubernetes has one particular access mode for volumes, ReadWriteOnce
(RWO) [1], which is often assumed to be good enough -- but it guarantees
only a single mount per node, not per pod. Kubernetes also allows forced
pod termination [2], which removes the pod from the API, but still gives
the pod some grace period to finish. All of this can lead to an
unfortunate sequence of events:

* A postgres pod gets forcefully terminated and removed from the API
  right away.

* A new postgres pod is started instead (there is nothing in the API, so
  why not), while the old one is still terminating.

* If they were using RWO mode, the new pod will immediately get the data
  volume and can access it while the old pod is still terminating.

In the end we get a situation similar to what I've described above, and
strangely enough it looks like this indeed happens in the field.

It's fair to say that it's a Kubernetes issue (there are warnings about
that in the documentation), and PostgreSQL doesn't have anything to do
with that. But taking into account the general possibility of confusing
PostgreSQL with Linux namespaces, it looks to me like one of those
"shoot yourself in the foot" situations, and I became curious whether
there is any easy way to improve things.

The root of the problem is the lack of any time-related information
that PostgreSQL could use to distinguish between two scenarios: when a
single container was killed and started again later; and when two
containers run at the same time. After some experimenting it looks like
the only plausible answer could be file locking for the data directory
lockfile. This approach was discussed many times on hackers, and from
what I see there are a few arguments against using file locking as the
main mechanism for protecting the data directory content:

* Portability. There are two types of file locks, advisory record locks
  (POSIX) and open file description locks (formerly non-POSIX). The
  former have a set of flaws, but the most important one for this
  discussion is that advisory record locks are associated with a process
  and thus affected by PID namespace isolation. The latter are
  associated with open file descriptions and are a suitable solution to
  fix the problem. Originally open file description locks were
  non-POSIX, but it looks like they have become a part of POSIX.1-2024
  (see F_OFD_SETLK) [3].

* Issues with NFS. It turns out NFSv3 does not support open file
  description locks and converts them into advisory record locks. For
  our purposes it means that the approach will not change anything for
  NFSv3.
  Regarding NFSv4, it uses some sort of lease system for locking, and I
  haven't found anything claiming that locks will be converted to
  advisory record locks.

With this in mind, it seems to me that adding file locking to the data
directory lockfile as a "best effort" approach (i.e. if it doesn't work,
we continue as before) on top of the already existing mechanism will
improve things in most cases, while keeping the status quo in the
others.

I've attached a quick sketch of what the patch might look like. Any
thoughts / comments on that?

[1]: https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes
[2]: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination-forced
[3]: https://pubs.opengroup.org/onlinepubs/9799919799/functions/fcntl.html
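
For illustration only, and not taken from the attached patch: the
difference in lock ownership mentioned under "Portability" can be
sketched in a few lines of C. The sketch assumes Linux with glibc; the
helper names try_lock() and second_opener_is_refused() are hypothetical
and do not exist in PostgreSQL. It opens the same file twice within a
single process, which only stands in for "a second opener" and says
nothing about namespaces by itself. A process-associated record lock
(F_SETLK) does not conflict with the same process's own lock, while an
open file description lock (F_OFD_SETLK) does.

```c
#define _GNU_SOURCE             /* exposes F_OFD_SETLK on glibc */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Assumption: fallback for old libc headers; 37 is the Linux uapi value. */
#if defined(__linux__) && !defined(F_OFD_SETLK)
#define F_OFD_SETLK 37
#endif

/*
 * Hypothetical helper: try to take an exclusive, non-blocking lock on the
 * whole file behind fd.  cmd is F_SETLK (lock owned by the process) or
 * F_OFD_SETLK (lock owned by the open file description).
 * Returns 0 on success, -1 with errno set otherwise.
 */
int
try_lock(int fd, int cmd)
{
    struct flock fl;

    memset(&fl, 0, sizeof(fl));     /* l_pid must be 0 for OFD locks */
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;                   /* 0 means "up to end of file" */
    return fcntl(fd, cmd, &fl);
}

/*
 * Open path twice, lock through the first descriptor and then attempt the
 * same lock through the second one.
 * Returns 1 if the second attempt was refused, 0 if it was granted, and
 * -1 on any other error (e.g. EINVAL where the lock type is unsupported).
 */
int
second_opener_is_refused(const char *path, int cmd)
{
    int fd1, fd2;
    int result = -1;

    assert(path != NULL);
    fd1 = open(path, O_RDWR | O_CREAT, 0600);
    fd2 = open(path, O_RDWR);

    if (fd1 >= 0 && fd2 >= 0 && try_lock(fd1, cmd) == 0)
    {
        if (try_lock(fd2, cmd) == 0)
            result = 0;
        else if (errno == EAGAIN || errno == EACCES)
            result = 1;
    }

    if (fd2 >= 0)
        close(fd2);
    if (fd1 >= 0)
        close(fd1);
    return result;
}
```

On a Linux kernel with open file description lock support,
second_opener_is_refused(path, F_SETLK) should return 0 and
second_opener_is_refused(path, F_OFD_SETLK) should return 1. The -1
return value corresponds to the "best effort" case: if the lock type is
not available, a caller could simply fall back to the existing PID-based
check.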