Re: Heads Up: cirrus-ci is shutting down June 1st

From: Nazir Bilal Yavuz <byavuz81@gmail.com>
To: Andres Freund <andres@anarazel.de>
Cc: Jelte Fennema-Nio <postgres@jeltef.nl>, Thomas Munro <thomas.munro@gmail.com>, pgsql-hackers@postgresql.org, Zsolt Parragi <zsolt.parragi@percona.com>, Peter Eisentraut <peter@eisentraut.org>
Date: 2026-05-28T17:06:22Z
Lists: pgsql-hackers
Attachments: v3-0001-Add-GitHub-Actions-yaml-file.patch (text/x-patch)

Hi,

Thank you for looking into this!

On Wed, 27 May 2026 at 21:10, Andres Freund <andres@anarazel.de> wrote:
>
> > Here is the v2, I took Jelte's patch and reviewed & merged it with my
> > patch. Updates and questions are:
> >
> > 1- I continued to use Jelte's container method (Linux tasks only for
> > now, BSD tasks will be included in the future) because I think that is
> > the future-proof way since we might want to generate our container
> > images in the future. Also, up-to-date Debian images can be tested
> > with this way; otherwise we would need to use Ubuntu 24.04.
>
> Good.
>
>
> > 2- io_uring tests work on the Linux Meson task.
>
> Is there a reason to not just do that for all the tasks?

I might have worded it incorrectly. I meant that the Linux meson tests use:

PG_TEST_INITDB_EXTRA_OPTS: >-
  -c io_method=io_uring

and that wasn't working before; now it works. I guess we have this
only on the Linux Meson task because we wanted to test io_method=worker
in the other tasks.


> > 3- I didn't put commands to helper scripts for now. I think it is a
> > good thing to have a helper script but it would be better to have this
> > helper script after the first version is committed since it can extend
> > the timeline. Also, I found that having all commands in one file makes
> > debugging easier.
>
> Hm. I'm a bit worried about this getting pretty unmaintainable, due to the
> repetition.  I think at least we need to use yaml anchors to deduplicate some
> steps.

GitHub Actions added support for YAML anchors last year, but
unfortunately it doesn't support merge keys. Related information: [1].
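
To illustrate the difference, here is a minimal sketch (not taken from
the patch; the job names are made up). A plain alias that reuses a whole
step unchanged works, while overriding a single key of an anchored step
would need a merge key:

```yaml
on: push
jobs:
  first-job:            # hypothetical job name
    runs-on: ubuntu-latest
    steps:
      # Define the step once and give it an anchor.
      - &checkout
        uses: actions/checkout@v6
        with:
          fetch-depth: 1
  second-job:           # hypothetical job name
    runs-on: ubuntu-latest
    steps:
      # Reusing the step unchanged through an alias is fine.
      - *checkout
      # For steps that differ only slightly we would need a merge key,
      # which per [1] is not supported:
      # - <<: *checkout
      #   with:
      #     fetch-depth: 0
```

So anchors only help for steps that are identical across jobs.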


> > 4- FreeBSD task has these options:
> >
> >       PG_TEST_INITDB_EXTRA_OPTS: >-
> >         -c debug_copy_parse_plan_trees=on
> >         -c debug_write_read_parse_plan_trees=on
> >         -c debug_raw_expression_coverage_test=on
> >         -c debug_parallel_query=regress
> >
> > Since we won't have FreeBSD for the first version. I put these options
> > to the MacOS task but I couldn't decide where to put
> > 'PG_TEST_PG_UPGRADE_MODE: --link'.
>
> Makes sense.
>
>
> > Also, I am planning to work on back patches when we agree on the
> > upstream one. Does that sound good?
>
> Yep.
>
>
>
> > diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
> > new file mode 100644
> > index 00000000000..6d20068727c
> > --- /dev/null
> > +++ b/.github/workflows/ci.yml
> > @@ -0,0 +1,1125 @@
> > +# GitHub Actions CI configuration for PostgreSQL
> > +
> > +name: Github Actions CI
> > +
> > +on:
> > +  push:
> > +    branches: [ "*" ]
> > +
> > +# Default to the minimum privilege the jobs need (just reading the repo
> > +# contents during checkout). Individual jobs override this when they need
> > +# more, e.g. `cancel-previous` needs `actions: write` to cancel runs.
> > +permissions:
> > +  contents: read
>
> I'm not sure I like that we ever need more than that. I'd expect that
> postgresql-cfbot will explicitly disable write permissions for runs.

Done. Updated the comment and removed the 'Cancel previous runs' step.


> > +# NB: intentionally NO workflow-level `concurrency:` block. The native
> > +# concurrency mechanism makes a new run wait for the previous one to fully
> > +# cancel before it starts — which can take a while. Instead the
> > +# `cancel-previous` job below fires a cancel API call asynchronously,
> > +# so the new run gets going immediately. On master the cancel job is skipped,
> > +# so every push runs to completion.
>
> Is this really worth having our own code? Seems like it'd not be that frequent
> to push if there are already running runs?  What kind of delays are we talking
> about?

Jelte already answered this in [2]. The 'Cancel previous runs' step
has been removed and the native concurrency setting is used instead.
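
For reference, a native concurrency block generally has the shape
below. This is only a sketch: the group name and the exception for
master are assumptions for illustration, so please check the attached
patch for the actual block.

```yaml
# Workflow-level setting: runs in the same group supersede each other.
concurrency:
  # One group per workflow and ref (assumed grouping).
  group: ${{ github.workflow }}-${{ github.ref }}
  # Cancel the older run, except on master where every push should
  # run to completion (assumed policy).
  cancel-in-progress: ${{ github.ref_name != 'master' }}
```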


> > +  # To avoid unnecessarily spinning up a lot of VMs / containers for entirely
> > +  # broken commits, have a minimal task that all others depend on.
> > +  #
> > +  # SPECIAL:
> > +  # - Builds with --auto-features=disabled and thus almost no enabled
> > +  #   dependencies
> > +  sanity-check:
> > +    name: SanityCheck
> > +    needs: setup
> > +    if: needs.setup.outputs.sanitycheck == 'true'
> > +    runs-on: ubuntu-latest
> > +    timeout-minutes: 15
> > +    container:
> > +      image: ${{ needs.setup.outputs.linux_ci_image }}
> > +    env:
> > +      BUILD_JOBS: 8
> > +      TEST_JOBS: 8
> > +      CCACHE_DIR: ${{ github.workspace }}/ccache_dir
> > +      # no options enabled, should be small
> > +      CCACHE_MAXSIZE: "150M"
> > +    steps:
> > +      - uses: actions/checkout@v6
> > +        with:
> > +          fetch-depth: ${{ env.CLONE_DEPTH }}
> > +
> > +      - name: Restore ccache
> > +        uses: actions/cache@v5
>
> Seems like this is used by every task. Can we move this into a yaml anchor or
> such, by using a variable representing the job name?

GitHub Actions doesn't support merge keys, so we can't really
deduplicate these steps. I used a YAML anchor for the checkout step
since it is exactly the same for all jobs.


> > +        with:
> > +          path: ${{ env.CCACHE_DIR }}
> > +          key: ccache-sanitycheck-${{ github.run_id }}
> > +          restore-keys: ccache-sanitycheck-
>
> Why is the key here the run id? Doesn't that mean that we will never have a
> precise cache match and that we will keep multiple versions of the cache
> around? That seems like a waste of cache space?
>
> For efficiency, particularly on cfbot, it seems like it could be useful to
> populate the cache of branches with the cache of the master branch. For that
> we'd need the branch name in the key. Which I think would also good for
> postgres/postgres, as we currently have a lot of interference between runs on
> the main and the REL_XY_STABLE branches.

I think that is the usual way of doing it. If there is an exact hit on
the key, actions/cache doesn't save the cache again at the end of the
job. So, having ${{ github.run_id }} in the key makes sure we never get
an exact hit and the cache is always refreshed. This sounds bad but
that is what I understood :(

I can implement something like this:

      - name: Restore ccache
        uses: actions/cache/restore@v5
        with:
          path: ${{ env.CCACHE_DIR }}
          key: ccache-sanitycheck-master
          restore-keys: |
            ccache-sanitycheck-${{ github.ref_name }}
            ccache-sanitycheck-

      - name: Save ccache
        if: always()
        uses: actions/cache/save@v5
        with:
          path: ${{ env.CCACHE_DIR }}
          key: ccache-sanitycheck-${{ github.ref_name }}-${{ github.run_id }}

So, it will first look for master's cache, then the current branch's
cache, and lastly whatever cache is available. Do you prefer that?


> > +      - name: Prepare workspace
> > +        run: |
> > +          whoami
> > +          useradd -m postgres
> > +          chown -R postgres:postgres .
> > +          mkdir -p "$CCACHE_DIR"
> > +          chown -R postgres:postgres "$CCACHE_DIR"
> > +          # Can't change the container's kernel.core_pattern; the postgres
> > +          # user can't write to / normally. Make / writable.
> > +          chown root:postgres /
> > +          chmod g+rwx /
>
> Why not just always use a privileged container?

Done.


> > +      - name: Configure
> > +        run: |
> > +          su postgres <<-'EOF'
> > +            set -e
> > +            meson setup \
> > +              --buildtype=debug \
> > +              --auto-features=disabled \
> > +              -Ddefault_library=shared \
> > +              -Dtap_tests=enabled \
> > +              build
> > +          EOF
> > +
> > +      - name: Build
> > +        run: |
> > +          su postgres <<EOF
> > +            set -e
> > +            ninja -C build -j${BUILD_JOBS} ${MBUILD_TARGET}
> > +          EOF
>
> Should we have an explicit cache upload step here? Or are upload steps run
> unconditionally?

Like I explained above, that is done by having ${{ github.run_id }} in
the cache key.


> > +      # Run a minimal set of tests. The main regression tests take too long
> > +      # for this purpose. For now this is a random quick pg_regress style
> > +      # test, and a tap test that exercises both a frontend binary and the
> > +      # backend.
> > +      - name: Test
> > +        run: |
> > +          su postgres <<EOF
> > +            set -e
> > +            ulimit -c unlimited
> > +            meson test ${MTEST_ARGS} --suite setup
> > +            meson test ${MTEST_ARGS} --num-processes ${TEST_JOBS} \
> > +              cube/regress pg_ctl/001_start_stop
> > +          EOF
> > +
> > +      - name: Core backtraces
> > +        if: failure()
> > +        run: |
> > +          mkdir -m 770 /tmp/cores
> > +          find / -maxdepth 1 -type f -name 'core*' -exec mv '{}' /tmp/cores/ \;
> > +          src/tools/ci/cores_backtrace.sh linux /tmp/cores
> > +
> > +      - name: Upload logs
> > +        if: failure()
> > +        uses: actions/upload-artifact@v7
> > +        with:
> > +          name: sanitycheck-logs-${{ github.run_id }}
> > +          path: |
> > +            build*/testrun/**/*.log
> > +            build*/testrun/**/*.diffs
> > +            build*/testrun/**/regress_log_*
> > +            build*/meson-logs/*.txt
> > +          if-no-files-found: ignore
>
> I think this really should be in a yaml anchor, we have a few somewhat
> different versions of this now.

Same thing: we can't put the whole step into a YAML anchor because
merge keys are not supported.  I created this variable instead:

_LOG_PATHS: &log_paths |
  build*/testrun/**/*.log
  build*/testrun/**/*.diffs
  build*/testrun/**/regress_log_*
  build*/meson-logs/*.txt

and used it in the path of the 'Upload logs' steps.
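
Roughly, the wiring looks like the sketch below. Placing the variable
under a workflow-level env block is an assumption for illustration; the
important part is that a plain scalar alias needs no merge key:

```yaml
on: push

env:
  # Assumed location of the variable. The anchor captures the block
  # scalar so that it can be reused as-is further down.
  _LOG_PATHS: &log_paths |
    build*/testrun/**/*.log
    build*/testrun/**/*.diffs
    build*/testrun/**/regress_log_*
    build*/meson-logs/*.txt

jobs:
  sanity-check:
    runs-on: ubuntu-latest
    steps:
      - name: Upload logs
        if: failure()
        uses: actions/upload-artifact@v7
        with:
          name: sanitycheck-logs-${{ github.run_id }}
          # The alias expands to the same multi-line string as above.
          path: *log_paths
          if-no-files-found: ignore
```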


> It's pretty annoying that the output of the failures isn't visible in the UI.
> Maybe we ought to print a few of the failures out or something?

We already have '--print-errorlogs', do you mean something different?


> > +
> > +  # SPECIAL:
> > +  # - Uses address sanitizer (sanitizer failures are typically printed in
> > +  #   the server log)
> > +  # - Configures postgres with a small segment size
> > +  #
> > +  # Enable a reasonable set of sanitizers. Use the linux task for that, as
> > +  # it's one of the fastest tasks (without sanitizers). Also several of the
> > +  # sanitizers work best on linux.
> > +  #
> > +  # The overhead of alignment sanitizer is low, undefined behaviour has
> > +  # moderate overhead. Test alignment sanitizer in the meson task, as it
> > +  # does both 32 and 64 bit builds and is thus more likely to expose
> > +  # alignment bugs.
> > +  #
> > +  # Address sanitizer in contrast is somewhat expensive. Enable it in the
> > +  # autoconf task, as the meson task tests both 32 and 64bit.
>
> I wonder if we should split the meson task into two, one for 32bit and one for
> 64bit. The concurrency limits for public repos are high enough for that to
> seem like a reasonable tradeoff? There's no work, other than the repo
> checkout, shared between them.

Done.


> > +  # disable_coredump=0, abort_on_error=1: for useful backtraces in case of crashes
> > +  # print_stacktraces=1,verbosity=2, duh
> > +  # detect_leaks=0: too many uninteresting leak errors in short-lived binaries
> > +  linux-autoconf:
> > +    name: Linux - Debian Trixie - Autoconf
> > +    needs: [setup, sanity-check]
> > +    if: |
> > +      !cancelled() &&
> > +      needs.setup.outputs.linux == 'true' &&
> > +      needs.sanity-check.result != 'failure'
> > +    runs-on: ubuntu-latest
> > +    timeout-minutes: 60
> > +    container:
> > +      image: ${{ needs.setup.outputs.linux_ci_image }}
> > +      # Share the host PID + IPC namespaces. 017_shm.pl rapidly creates,
> > +      # kill9's, and restarts postgres; with the container's small PID
> > +      # space a new postgres can recycle the dead postmaster's PID before
> > +      # pg_ctl's postmaster.pid check notices, producing spurious "node X
> > +      # is already running" failures. SysV shm in the test also relies on
> > +      # host-like IPC behavior.
> > +      #
> > +      # --ulimit raises memlock and core dump size. Memlock is needed for
> > +      # running the AIO tests.
> > +      #
> > +      # --privileged is needed so the prepare step can write to sysctls
> > +      # under /proc/sys (it's mounted read-only without it). We use it to
> > +      # set kernel.core_pattern.
> > +      options: --pid=host --ipc=host --ulimit memlock=-1:-1 --privileged
> > +    env:
> > +      BUILD_JOBS: 4
> > +      TEST_JOBS: 8
> > +      CCACHE_DIR: /tmp/ccache_dir
> > +      DEBUGINFOD_URLS: "https://debuginfod.debian.net"
> > +
> > +      SANITIZER_FLAGS: -fsanitize=address
> > +      UBSAN_OPTIONS: print_stacktrace=1:disable_coredump=0:abort_on_error=1:verbosity=2
> > +      ASAN_OPTIONS: print_stacktrace=1:disable_coredump=0:abort_on_error=1:detect_leaks=0:detect_stack_use_after_return=0
> > +      CFLAGS: -Og -ggdb -fno-sanitize-recover=all -fsanitize=address
> > +      CXXFLAGS: -Og -ggdb -fno-sanitize-recover=all -fsanitize=address
> > +      LDFLAGS: -fsanitize=address
> > +      CC: ccache gcc
> > +      CXX: ccache g++
>
> There's a fair bit of stuff shared between the meson/autoconf linux
> tasks. Previously they used a matrix to reduce that a *bit*. But now it's
> entirely duplicated, including stuff that doesn't apply to the current job
> (e.g. UBSAN_OPTIONS/ASAN_OPTIONS).  And blocks like the following:
>
>
> > +      - name: Prepare workspace
> > +        run: |
> > +          useradd -m postgres
> > +          chown -R postgres:postgres .
> > +          mkdir -p "$CCACHE_DIR"
> > +          chown -R postgres:postgres "$CCACHE_DIR"
> > +          mkdir -m 770 /tmp/cores
> > +          chown root:postgres /tmp/cores
> > +          sysctl kernel.core_pattern='/tmp/cores/%e-%s-%p.core'
> > +
> > +          # Hosts for the load balance test
> > +          cat >> /etc/hosts <<-EOF
> > +            127.0.0.1 pg-loadbalancetest
> > +            127.0.0.2 pg-loadbalancetest
> > +            127.0.0.3 pg-loadbalancetest
> > +          EOF


I found that we can use matrices, and merged all the Linux tasks. I am
not sure that is better, since it is a bit harder to read now.
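
To give an idea of the shape, here is a simplified sketch of a matrix
job. The matrix keys (name, build_system, sanitizer_flags) and their
values are placeholders for illustration, not the exact ones in v3:

```yaml
on: push
jobs:
  linux:
    name: Linux - Debian Trixie - ${{ matrix.name }}
    runs-on: ubuntu-latest
    strategy:
      # Let the other variants finish when one of them fails.
      fail-fast: false
      matrix:
        include:
          # Placeholder entries; each one becomes a separate job.
          - name: Autoconf
            build_system: autoconf
            sanitizer_flags: -fsanitize=address
          - name: Meson
            build_system: meson
            sanitizer_flags: -fsanitize=alignment,undefined
    env:
      # Shared settings are written once; per-variant values come from
      # the matrix.
      SANITIZER_FLAGS: ${{ matrix.sanitizer_flags }}
    steps:
      - uses: actions/checkout@v6
      - name: Configure (autoconf)
        if: matrix.build_system == 'autoconf'
        run: ./configure --enable-debug
      - name: Configure (meson)
        if: matrix.build_system == 'meson'
        run: meson setup --buildtype=debug build
```

The conditional steps are what make the merged job harder to read than
separate jobs.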


> > +      # Install dependencies via Homebrew rather than Macports. On stock
> > +      # GH runners macports requires a heavy bootstrap, and the relevant
> > +      # Postgres deps are all available in brew.
>
> What does "heavy bootstrap" mean?

I used MacPorts in my first version, and it took ~10 minutes to
download MacPorts. I think it makes sense to use MacPorts if we can
cache it like we did on Cirrus. I will spend some time on that.

And after spending some time on it, I was able to make it work. Now the
dependency installation takes ~10 minutes on the first run, since there
is no MacPorts cache yet, but only ~5 seconds on subsequent runs.
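
The general idea can be sketched as below. The MACPORTS_CACHE variable,
the cache key, and the ci/install_macports.sh script are hypothetical
names used for illustration; the attached patch has the real steps:

```yaml
on: push
jobs:
  macos:
    runs-on: macos-latest
    env:
      # Hypothetical cache directory, kept in a user-writable location.
      MACPORTS_CACHE: ${{ github.workspace }}/macports-cache
    steps:
      - uses: actions/checkout@v6

      # Restores the directory if a matching cache exists and saves it
      # at the end of the job otherwise.
      - name: Restore MacPorts cache
        uses: actions/cache@v5
        with:
          path: ${{ env.MACPORTS_CACHE }}
          # Reinstall when the (hypothetical) install script changes.
          key: macports-${{ runner.os }}-${{ hashFiles('ci/install_macports.sh') }}

      - name: Install dependencies
        run: |
          # Hypothetical script: unpacks $MACPORTS_CACHE when it is
          # populated, otherwise installs MacPorts and the packages
          # from scratch and fills the cache directory.
          sh ci/install_macports.sh
```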


> > +      - name: Install dependencies
> > +        run: |
> > +          brew update
> > +          brew install \
> > +            ccache meson openldap python@3.12 tcl-tk
> > +          # IPC::Run via cpanm (system perl)
> > +          sudo cpan -T -i IPC::Run IO::Tty
>
> We do spend ~95s on this every run, that's not nothing. And it puts a bunch of
> load onto the brew's mirrors to do that every run.

You are right. MacPorts is used now.


> > +      - name: Test world
> > +        run: |
> > +          ulimit -c unlimited
> > +          ulimit -n 1024
> > +          meson test ${MTEST_ARGS} --num-processes ${TEST_JOBS}
>
> I'd re-add the comments that were in .cirrus.yml about this.

Done.


> > +  windows-vs:
> > +    name: Windows - Server 2022, VS 2022 - Meson & ninja
> > +    needs: [setup, sanity-check]
> > +    if: |
> > +      !cancelled() &&
> > +      needs.setup.outputs.windows == 'true' &&
> > +      needs.sanity-check.result != 'failure'
> > +    runs-on: windows-2022
> > +    timeout-minutes: 60
> > +    env:
> > +      TEST_JOBS: 8
> > +      # Avoid port conflicts between concurrent tap tests
> > +      PG_TEST_USE_UNIX_SOCKETS: 1
> > +      PG_REGRESS_SOCK_DIR: 'c:\pgsock\'
>
> At least my editor gets confused by the \', thinking it's escaping the '. As
> everything just works without the trailing \, I'd go that way.

Done.


> > +      # The TAP tests build an initdb template under build/tmp_install and
> > +      # then `robocopy` it into per-test data directories. Robocopy with the
> > +      # default /COPY:DAT flag doesn't copy ACLs — destinations inherit from
> > +      # their parent dir. On GitHub-hosted Windows runners the workspace's
> > +      # inherited ACL grants Administrators:(F) and Users:(RX) but does NOT
> > +      # grant the runner user (runneradmin) directly. That matters because
> > +      # pg_ctl on Windows uses CreateRestrictedProcess to drop admin
> > +      # privileges from postmaster, so the postmaster process has the user
> > +      # SID in its token but no longer the Administrators group — leaving it
> > +      # with only "Users:(RX)" on pg_control and friends, which causes
> > +      # "PANIC: could not open file global/pg_control: Permission denied".
> > +      #
> > +      # Fix it once on the workspace dir with (OI)(CI) inheritance flags so
> > +      # every file/dir created underneath gets an explicit grant for the
> > +      # current user.
> > +      - name: Grant workspace ACL to runner user
> > +        shell: pwsh
> > +        run: |
> > +          icacls "${{ github.workspace }}" /grant "${env:USERNAME}:(OI)(CI)F" /Q | Out-Null
> > +          Write-Host "Granted Full Control to $env:USERNAME on ${{ github.workspace }}"
>
> Perhaps this would be better to fix by changing the robocopy flags?

I couldn't fix this by changing the robocopy flags. I tried /COPYALL
and /SECFIX together, but that didn't work.


> > +      # postgres' plpython3u loads python3.dll (the stable-ABI forwarder)
> > +      # which in turn loads whichever python3NN.dll the Windows loader finds
> > +      # first on PATH. On windows-2022 `C:\Program Files\Mercurial\` ships
> > +      # its own python3.dll + python39.dll and appears on PATH *before* the
> > +      # hostedtoolcache Python 3.12 — so without intervention the backend
> > +      # ends up running Python 3.9 while postgres' stdlib search uses 3.12,
> > +      # producing `ImportError: cannot import name 'text_encoding' from
> > +      # 'io'` (the 3.12 `io.py` calling into 3.9's `_io`).
> > +      #
> > +      # Pin PYTHONHOME to the Python 3.12 prefix, and prepend that prefix
> > +      # to PATH so its python3.dll wins the DLL search.
> > +      - name: Pin Python prefix on PATH and PYTHONHOME
> > +        shell: pwsh
> > +        run: |
> > +          $prefix = (python -c "import sys; print(sys.prefix)").Trim()
> > +          Add-Content $env:GITHUB_ENV "PYTHONHOME=$prefix"
> > +          Add-Content $env:GITHUB_PATH $prefix
> > +          Write-Host "PYTHONHOME=$prefix"
> > +          Write-Host "Prepended $prefix to PATH"
>
> GRJGJKLJKJDFJKDF.

I re-checked this since Jelte wasn't completely sure about this [2]
but this is unfortunately correct :(


> > +      - name: Install dependencies
> > +        shell: pwsh
> > +        run: |
> > +          choco install -y --no-progress --limitoutput diffutils winflexbison
> > +          # meson + ninja aren't preinstalled on windows-2022. Install via pip
> > +          python -m pip install --upgrade meson ninja
> > +
> > +          # OpenSSL 1.1 via the slproweb installer (pinned to match the
> > +          # version used elsewhere in postgres CI).
> > +          curl.exe -fsSL -o openssl-setup.exe https://slproweb.com/download/Win64OpenSSL-1_1_1w.exe
> > +          Start-Process -Wait -FilePath ./openssl-setup.exe `
> > +            -ArgumentList '/DIR=c:\openssl\1.1\ /VERYSILENT /SP- /SUPPRESSMSGBOXES'
> > +          # The slproweb installer puts libcrypto-1_1-x64.dll / libssl-1_1-x64.dll
> > +          # in c:\openssl\1.1\bin\ and updates the system PATH. GH Actions
> > +          # snapshots PATH at job start though, so the running job won't
> > +          # see those DLLs and initdb.exe would crash silently at runtime.
> > +          # Push the bin dir onto GITHUB_PATH so it persists for later steps.
> > +          Add-Content $env:GITHUB_PATH "c:\openssl\1.1\bin"
>
> I don't like that much, but I'm not sure we have a better alternative
> short-term.

Using Chocolatey would be a nice alternative, but you already said that
Chocolatey sometimes takes too much time. I am planning to spend time
on it unless we are planning to use our own Windows containers.


> > +  windows-mingw:
> > +    name: Windows - Server 2022, MinGW64 - Meson
> > +    needs: [setup, sanity-check]
> > +    if: |
> > +      !cancelled() &&
> > +      needs.setup.outputs.mingw == 'true' &&
> > +      needs.sanity-check.result != 'failure'
> > +    runs-on: windows-2022
> > +    timeout-minutes: 60
> > +    env:
> > +      TEST_JOBS: 4  # higher concurrency causes occasional failures
> > +      PG_TEST_USE_UNIX_SOCKETS: 1
> > +      PG_REGRESS_SOCK_DIR: 'c:\pgsock\'
> > +      TAR: "c:/windows/system32/tar.exe"
> > +      # for mingw plpython to find its installation
> > +      PYTHONHOME: D:/a/_temp/msys64/ucrt64
> > +
> > +      MSYS: winjitdebug
> > +      CHERE_INVOKING: 1
> > +      MESON_FEATURES: >-
> > +        -Dnls=disabled
>
> Missing comments from .cirrus.tasks.yml

Done.

v3 is attached. Just a quick note: v3 includes Zsolt's [3] and Peter's
[4] reviews & feedback too. I will reply to them after sending this.

GA run after v3 is applied:
https://github.com/nbyavuz/postgres/actions/runs/26587973538


[1]
https://github.com/actions/runner/issues/1182
https://github.com/orgs/community/discussions/185877
[2] https://postgr.es/m/CAGECzQQBCF%3DHSk4eCc1fEYTpCt59rgpcwWp47%2B6M-CDMYEaM2A%40mail.gmail.com
[3] https://postgr.es/m/CAN4CZFO4usEzFQoYzEywvOgoagW%3DU4yhpB4Oq-a7bUCR53djHA%40mail.gmail.com
[4] https://postgr.es/m/3daa29a4-6a08-41c1-8a6a-53ba8cd3c7fb%40eisentraut.org


--
Regards,
Nazir Bilal Yavuz
Microsoft