Re: Heads Up: cirrus-ci is shutting down June 1st
Nazir Bilal Yavuz <byavuz81@gmail.com> — 2026-05-28T17:06:22Z
Hi,

Thank you for looking into this!

On Wed, 27 May 2026 at 21:10, Andres Freund <andres@anarazel.de> wrote:
>
> > Here is the v2, I took Jelte's patch and reviewed & merged it with my
> > patch. Updates and questions are:
> >
> > 1- I continued to use Jelte's container method (Linux tasks only for
> > now, BSD tasks will be included in the future) because I think that is
> > the future-proof way since we might want to generate our container
> > images in the future. Also, up-to-date Debian images can be tested
> > with this way; otherwise we would need to use Ubuntu 24.04.
>
> Good.
>
>
> > 2- io_uring tests work on the Linux Meson task.
>
> Is there a reason to not just do that for all the tasks?

I might have worded it incorrectly. I meant that the Linux meson tests use:

    PG_TEST_INITDB_EXTRA_OPTS: >-
      -c io_method=io_uring

and that wasn't working before; now it works. I guess we have this only
on Linux because we wanted to test io_method=worker in the other tasks.

> > 3- I didn't put commands to helper scripts for now. I think it is a
> > good thing to have a helper script but it would be better to have this
> > helper script after the first version is committed since it can extend
> > the timeline. Also, I found that having all commands in one file makes
> > debugging easier.
>
> Hm. I'm a bit worried about this getting pretty unmaintainable, due to the
> repetition. I think at least we need to use yaml anchors to deduplicate some
> steps.

GitHub Actions added support for YAML anchors last year, but
unfortunately it doesn't support merge keys. Related information: [1].

> > 4- FreeBSD task has these options:
> >
> > PG_TEST_INITDB_EXTRA_OPTS: >-
> > -c debug_copy_parse_plan_trees=on
> > -c debug_write_read_parse_plan_trees=on
> > -c debug_raw_expression_coverage_test=on
> > -c debug_parallel_query=regress
> >
> > Since we won't have FreeBSD for the first version. I put these options
> > to the MacOS task but I couldn't decide where to put
> > 'PG_TEST_PG_UPGRADE_MODE: --link'.
>
> Makes sense.
>
>
> > Also, I am planning to work on back patches when we agree on the
> > upstream one. Does that sound good?
>
> Yep.
>
>
>
> > diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
> > new file mode 100644
> > index 00000000000..6d20068727c
> > --- /dev/null
> > +++ b/.github/workflows/ci.yml
> > @@ -0,0 +1,1125 @@
> > +# GitHub Actions CI configuration for PostgreSQL
> > +
> > +name: Github Actions CI
> > +
> > +on:
> > +  push:
> > +    branches: [ "*" ]
> > +
> > +# Default to the minimum privilege the jobs need (just reading the repo
> > +# contents during checkout). Individual jobs override this when they need
> > +# more, e.g. `cancel-previous` needs `actions: write` to cancel runs.
> > +permissions:
> > +  contents: read
>
> I'm not sure I like that we ever need more than that. I'd expect that
> postgresql-cfbot will explicitly disable write permissions for runs.

Done. Updated the comment and removed the 'Cancel previous runs' step.

> > +# NB: intentionally NO workflow-level `concurrency:` block. The native
> > +# concurrency mechanism makes a new run wait for the previous one to fully
> > +# cancel before it starts — which can take a while. Instead the
> > +# `cancel-previous` job below fires a cancel API call asynchronously,
> > +# so the new run gets going immediately. On master the cancel job is skipped,
> > +# so every push runs to completion.
>
> Is this really worth having our own code? Seems like it'd not be that frequent
> to push if there are already running runs? What kind of delays are we talking
> about?

Jelte already answered this in [2]. The 'Cancel previous runs' step is
removed and concurrency is used instead.

> > +  # To avoid unnecessarily spinning up a lot of VMs / containers for entirely
> > +  # broken commits, have a minimal task that all others depend on.
> > +  #
> > +  # SPECIAL:
> > +  # - Builds with --auto-features=disabled and thus almost no enabled
> > +  #   dependencies
> > +  sanity-check:
> > +    name: SanityCheck
> > +    needs: setup
> > +    if: needs.setup.outputs.sanitycheck == 'true'
> > +    runs-on: ubuntu-latest
> > +    timeout-minutes: 15
> > +    container:
> > +      image: ${{ needs.setup.outputs.linux_ci_image }}
> > +    env:
> > +      BUILD_JOBS: 8
> > +      TEST_JOBS: 8
> > +      CCACHE_DIR: ${{ github.workspace }}/ccache_dir
> > +      # no options enabled, should be small
> > +      CCACHE_MAXSIZE: "150M"
> > +    steps:
> > +      - uses: actions/checkout@v6
> > +        with:
> > +          fetch-depth: ${{ env.CLONE_DEPTH }}
> > +
> > +      - name: Restore ccache
> > +        uses: actions/cache@v5
>
> Seems like this is used by every task. Can we move this into a yaml anchor or
> such, by using a variable representing the job name?

GitHub Actions doesn't support merge keys, so we can't really
deduplicate them. I used YAML anchors for the checkout step since it is
exactly the same for all jobs.

> > +        with:
> > +          path: ${{ env.CCACHE_DIR }}
> > +          key: ccache-sanitycheck-${{ github.run_id }}
> > +          restore-keys: ccache-sanitycheck-
>
> Why is the key here the run id? Doesn't that mean that we will never have a
> precise cache match and that we will keep multiple versions of the cache
> around? That seems like a waste of cache space?
>
> For efficiency, particularly on cfbot, it seems like it could be useful to
> populate the cache of branches with the cache of the master branch. For that
> we'd need the branch name in the key. Which I think would also good for
> postgres/postgres, as we currently have a lot of interference between runs on
> the main and the REL_XY_STABLE branches.

I think that is the default way. If there is an exact cache hit, the
cache isn't refreshed. So, having ${{ github.run_id }} in the key makes
sure we never have an exact hit and the cache is always refreshed.
This sounds bad but that is what I understood :( I can implement
something like this:

    - name: Restore ccache
      uses: actions/cache/restore@v5
      with:
        path: ${{ env.CCACHE_DIR }}
        key: ccache-sanitycheck-master
        restore-keys: |
          ccache-sanitycheck-${{ github.ref_name }}
          ccache-sanitycheck-

    - name: Save ccache
      if: always()
      uses: actions/cache/save@v5
      with:
        path: ${{ env.CCACHE_DIR }}
        key: ccache-sanitycheck-${{ github.ref_name }}-${{ github.run_id }}

So, it will first look for master's cache, then the current branch's
cache and lastly whatever cache is available. Do you prefer that?

> > +      - name: Prepare workspace
> > +        run: |
> > +          whoami
> > +          useradd -m postgres
> > +          chown -R postgres:postgres .
> > +          mkdir -p "$CCACHE_DIR"
> > +          chown -R postgres:postgres "$CCACHE_DIR"
> > +          # Can't change the container's kernel.core_pattern; the postgres
> > +          # user can't write to / normally. Make / writable.
> > +          chown root:postgres /
> > +          chmod g+rwx /
>
> Why not just always use a privileged container?

Done.

> > +      - name: Configure
> > +        run: |
> > +          su postgres <<-'EOF'
> > +            set -e
> > +            meson setup \
> > +              --buildtype=debug \
> > +              --auto-features=disabled \
> > +              -Ddefault_library=shared \
> > +              -Dtap_tests=enabled \
> > +              build
> > +          EOF
> > +
> > +      - name: Build
> > +        run: |
> > +          su postgres <<EOF
> > +            set -e
> > +            ninja -C build -j${BUILD_JOBS} ${MBUILD_TARGET}
> > +          EOF
>
> Should we have an explicit cache upload step here? Or are upload steps run
> unconditionally?

As I explained above, that is done by having ${{ github.run_id }} in the
cache key.

> > +      # Run a minimal set of tests. The main regression tests take too long
> > +      # for this purpose. For now this is a random quick pg_regress style
> > +      # test, and a tap test that exercises both a frontend binary and the
> > +      # backend.
> > +      - name: Test
> > +        run: |
> > +          su postgres <<EOF
> > +            set -e
> > +            ulimit -c unlimited
> > +            meson test ${MTEST_ARGS} --suite setup
> > +            meson test ${MTEST_ARGS} --num-processes ${TEST_JOBS} \
> > +              cube/regress pg_ctl/001_start_stop
> > +          EOF
> > +
> > +      - name: Core backtraces
> > +        if: failure()
> > +        run: |
> > +          mkdir -m 770 /tmp/cores
> > +          find / -maxdepth 1 -type f -name 'core*' -exec mv '{}' /tmp/cores/ \;
> > +          src/tools/ci/cores_backtrace.sh linux /tmp/cores
> > +
> > +      - name: Upload logs
> > +        if: failure()
> > +        uses: actions/upload-artifact@v7
> > +        with:
> > +          name: sanitycheck-logs-${{ github.run_id }}
> > +          path: |
> > +            build*/testrun/**/*.log
> > +            build*/testrun/**/*.diffs
> > +            build*/testrun/**/regress_log_*
> > +            build*/meson-logs/*.txt
> > +          if-no-files-found: ignore
>
> I think this really should be in a yaml anchor, we have a few somewhat
> different versions of this now.

Same thing, we can't use a YAML anchor for the whole step because merge
keys are not supported. Instead, I created this variable:

    _LOG_PATHS: &log_paths |
      build*/testrun/**/*.log
      build*/testrun/**/*.diffs
      build*/testrun/**/regress_log_*
      build*/meson-logs/*.txt

and used it as the path in the 'Upload logs' steps.

> It's pretty annoying that the output of the failures isn't visible in the UI.
> Maybe we ought to print a few of the failures out or something?

We already have '--print-errorlogs', do you mean something different?

> > +
> > +  # SPECIAL:
> > +  # - Uses address sanitizer (sanitizer failures are typically printed in
> > +  #   the server log)
> > +  # - Configures postgres with a small segment size
> > +  #
> > +  # Enable a reasonable set of sanitizers. Use the linux task for that, as
> > +  # it's one of the fastest tasks (without sanitizers). Also several of the
> > +  # sanitizers work best on linux.
> > +  #
> > +  # The overhead of alignment sanitizer is low, undefined behaviour has
> > +  # moderate overhead. Test alignment sanitizer in the meson task, as it
> > +  # does both 32 and 64 bit builds and is thus more likely to expose
> > +  # alignment bugs.
> > +  #
> > +  # Address sanitizer in contrast is somewhat expensive. Enable it in the
> > +  # autoconf task, as the meson task tests both 32 and 64bit.
>
> I wonder if we should split the meson task into two, one for 32bit and one for
> 64bit. The concurrency limits for public repos are high enough for that to
> seem like a reasonable tradeoff? There's no work, other than the repo
> checkout, shared between them.

Done.

> > +  # disable_coredump=0, abort_on_error=1: for useful backtraces in case of crashes
> > +  # print_stacktraces=1,verbosity=2, duh
> > +  # detect_leaks=0: too many uninteresting leak errors in short-lived binaries
> > +  linux-autoconf:
> > +    name: Linux - Debian Trixie - Autoconf
> > +    needs: [setup, sanity-check]
> > +    if: |
> > +      !cancelled() &&
> > +      needs.setup.outputs.linux == 'true' &&
> > +      needs.sanity-check.result != 'failure'
> > +    runs-on: ubuntu-latest
> > +    timeout-minutes: 60
> > +    container:
> > +      image: ${{ needs.setup.outputs.linux_ci_image }}
> > +      # Share the host PID + IPC namespaces. 017_shm.pl rapidly creates,
> > +      # kill9's, and restarts postgres; with the container's small PID
> > +      # space a new postgres can recycle the dead postmaster's PID before
> > +      # pg_ctl's postmaster.pid check notices, producing spurious "node X
> > +      # is already running" failures. SysV shm in the test also relies on
> > +      # host-like IPC behavior.
> > +      #
> > +      # --ulimit raises memlock and core dump size. Memlock is needed for
> > +      # running the AIO tests.
> > +      #
> > +      # --privileged is needed so the prepare step can write to sysctls
> > +      # under /proc/sys (it's mounted read-only without it). We use it to
> > +      # set kernel.core_pattern.
> > +      options: --pid=host --ipc=host --ulimit memlock=-1:-1 --privileged
> > +    env:
> > +      BUILD_JOBS: 4
> > +      TEST_JOBS: 8
> > +      CCACHE_DIR: /tmp/ccache_dir
> > +      DEBUGINFOD_URLS: "https://debuginfod.debian.net"
> > +
> > +      SANITIZER_FLAGS: -fsanitize=address
> > +      UBSAN_OPTIONS: print_stacktrace=1:disable_coredump=0:abort_on_error=1:verbosity=2
> > +      ASAN_OPTIONS: print_stacktrace=1:disable_coredump=0:abort_on_error=1:detect_leaks=0:detect_stack_use_after_return=0
> > +      CFLAGS: -Og -ggdb -fno-sanitize-recover=all -fsanitize=address
> > +      CXXFLAGS: -Og -ggdb -fno-sanitize-recover=all -fsanitize=address
> > +      LDFLAGS: -fsanitize=address
> > +      CC: ccache gcc
> > +      CXX: ccache g++
>
> There's a fair bit of stuff shared between the meson/autoconf linux
> tasks. Previously they used a matrix to reduce that a *bit*. But now it's
> entirely duplicated, including stuff that doesn't apply to the current job
> (e.g. UBSAN_OPTIONS/ASAN_OPTIONS). And blocks like the following:
>
>
> > +      - name: Prepare workspace
> > +        run: |
> > +          useradd -m postgres
> > +          chown -R postgres:postgres .
> > +          mkdir -p "$CCACHE_DIR"
> > +          chown -R postgres:postgres "$CCACHE_DIR"
> > +          mkdir -m 770 /tmp/cores
> > +          chown root:postgres /tmp/cores
> > +          sysctl kernel.core_pattern='/tmp/cores/%e-%s-%p.core'
> > +
> > +          # Hosts for the load balance test
> > +          cat >> /etc/hosts <<-EOF
> > +            127.0.0.1 pg-loadbalancetest
> > +            127.0.0.2 pg-loadbalancetest
> > +            127.0.0.3 pg-loadbalancetest
> > +          EOF

I found that we can use matrices, and merged all the Linux tasks. I am
not sure that is better since it is a bit harder to read now.

> > +      # Install dependencies via Homebrew rather than Macports. On stock
> > +      # GH runners macports requires a heavy bootstrap, and the relevant
> > +      # Postgres deps are all available in brew.
>
> What does "heavy bootstrap" mean?

I used MacPorts in my first version. It took ~10 minutes to download
MacPorts.
I think that if we could use caching like we did on Cirrus, it makes
sense to use MacPorts. I will spend some time on that.

And after spending some time, I was able to make it work. Now the first
run's dependency install takes ~10 minutes since there is no MacPorts
cache, but subsequent runs' installs only take ~5 seconds.

> > +      - name: Install dependencies
> > +        run: |
> > +          brew update
> > +          brew install \
> > +            ccache meson openldap python@3.12 tcl-tk
> > +          # IPC::Run via cpanm (system perl)
> > +          sudo cpan -T -i IPC::Run IO::Tty
>
> We do spend ~95s on this every run, that's not nothing. And it puts a bunch of
> load onto the brew's mirrors to do that every run.

You are right. MacPorts is used now.

> > +      - name: Test world
> > +        run: |
> > +          ulimit -c unlimited
> > +          ulimit -n 1024
> > +          meson test ${MTEST_ARGS} --num-processes ${TEST_JOBS}
>
> I'd re-add the comments that were in .cirrus.yml about this.

Done.

> > +  windows-vs:
> > +    name: Windows - Server 2022, VS 2022 - Meson & ninja
> > +    needs: [setup, sanity-check]
> > +    if: |
> > +      !cancelled() &&
> > +      needs.setup.outputs.windows == 'true' &&
> > +      needs.sanity-check.result != 'failure'
> > +    runs-on: windows-2022
> > +    timeout-minutes: 60
> > +    env:
> > +      TEST_JOBS: 8
> > +      # Avoid port conflicts between concurrent tap tests
> > +      PG_TEST_USE_UNIX_SOCKETS: 1
> > +      PG_REGRESS_SOCK_DIR: 'c:\pgsock\'
>
> At least my editor gets confused by the \', thinking it's escaping the '. As
> everything just works without the trailing \, I'd go that way.

Done.

> > +      # The TAP tests build an initdb template under build/tmp_install and
> > +      # then `robocopy` it into per-test data directories. Robocopy with the
> > +      # default /COPY:DAT flag doesn't copy ACLs — destinations inherit from
> > +      # their parent dir. On GitHub-hosted Windows runners the workspace's
> > +      # inherited ACL grants Administrators:(F) and Users:(RX) but does NOT
> > +      # grant the runner user (runneradmin) directly. That matters because
> > +      # pg_ctl on Windows uses CreateRestrictedProcess to drop admin
> > +      # privileges from postmaster, so the postmaster process has the user
> > +      # SID in its token but no longer the Administrators group — leaving it
> > +      # with only "Users:(RX)" on pg_control and friends, which causes
> > +      # "PANIC: could not open file global/pg_control: Permission denied".
> > +      #
> > +      # Fix it once on the workspace dir with (OI)(CI) inheritance flags so
> > +      # every file/dir created underneath gets an explicit grant for the
> > +      # current user.
> > +      - name: Grant workspace ACL to runner user
> > +        shell: pwsh
> > +        run: |
> > +          icacls "${{ github.workspace }}" /grant "${env:USERNAME}:(OI)(CI)F" /Q | Out-Null
> > +          Write-Host "Granted Full Control to $env:USERNAME on ${{ github.workspace }}"
>
> Perhaps this would be better to fix by changing the robocopy flags?

I couldn't fix this by using robocopy flags. I tried /COPYALL and
/SECFIX together but they didn't work.

> > +      # postgres' plpython3u loads python3.dll (the stable-ABI forwarder)
> > +      # which in turn loads whichever python3NN.dll the Windows loader finds
> > +      # first on PATH. On windows-2022 `C:\Program Files\Mercurial\` ships
> > +      # its own python3.dll + python39.dll and appears on PATH *before* the
> > +      # hostedtoolcache Python 3.12 — so without intervention the backend
> > +      # ends up running Python 3.9 while postgres' stdlib search uses 3.12,
> > +      # producing `ImportError: cannot import name 'text_encoding' from
> > +      # 'io'` (the 3.12 `io.py` calling into 3.9's `_io`).
> > +      #
> > +      # Pin PYTHONHOME to the Python 3.12 prefix, and prepend that prefix
> > +      # to PATH so its python3.dll wins the DLL search.
> > +      - name: Pin Python prefix on PATH and PYTHONHOME
> > +        shell: pwsh
> > +        run: |
> > +          $prefix = (python -c "import sys; print(sys.prefix)").Trim()
> > +          Add-Content $env:GITHUB_ENV "PYTHONHOME=$prefix"
> > +          Add-Content $env:GITHUB_PATH $prefix
> > +          Write-Host "PYTHONHOME=$prefix"
> > +          Write-Host "Prepended $prefix to PATH"
>
> GRJGJKLJKJDFJKDF.

I re-checked this since Jelte wasn't completely sure about this [2] but
this is unfortunately correct :(

> > +      - name: Install dependencies
> > +        shell: pwsh
> > +        run: |
> > +          choco install -y --no-progress --limitoutput diffutils winflexbison
> > +          # meson + ninja aren't preinstalled on windows-2022. Install via pip
> > +          python -m pip install --upgrade meson ninja
> > +
> > +          # OpenSSL 1.1 via the slproweb installer (pinned to match the
> > +          # version used elsewhere in postgres CI).
> > +          curl.exe -fsSL -o openssl-setup.exe https://slproweb.com/download/Win64OpenSSL-1_1_1w.exe
> > +          Start-Process -Wait -FilePath ./openssl-setup.exe `
> > +            -ArgumentList '/DIR=c:\openssl\1.1\ /VERYSILENT /SP- /SUPPRESSMSGBOXES'
> > +          # The slproweb installer puts libcrypto-1_1-x64.dll / libssl-1_1-x64.dll
> > +          # in c:\openssl\1.1\bin\ and updates the system PATH. GH Actions
> > +          # snapshots PATH at job start though, so the running job won't
> > +          # see those DLLs and initdb.exe would crash silently at runtime.
> > +          # Push the bin dir onto GITHUB_PATH so it persists for later steps.
> > +          Add-Content $env:GITHUB_PATH "c:\openssl\1.1\bin"
>
> I don't like that much, but I'm not sure we have a better alternative
> short-term.

Making chocolatey would be a nice alternative. You already said
sometimes chocolatey takes too much time. I am planning to spend time on
it unless we are planning to use our own Windows containers.
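
As a side note on the GITHUB_PATH comment in the quoted step: below is a
minimal sketch of the behavior it relies on. It is not taken from the
patch; the workflow name, job name and step names are made up for
illustration, and only the directory is borrowed from the step above. A
directory appended to the file named by GITHUB_PATH is added to PATH for
the following steps of the same job, not for the step that writes it.

```yaml
# Hypothetical stand-alone workflow, only to illustrate GITHUB_PATH.
name: github-path-demo
on: workflow_dispatch

jobs:
  path-demo:
    runs-on: windows-2022
    steps:
      - name: Queue a directory for PATH
        shell: pwsh
        run: |
          # Takes effect for later steps only; PATH of this step is unchanged.
          Add-Content $env:GITHUB_PATH "c:\openssl\1.1\bin"

      - name: Check PATH in a later step
        shell: pwsh
        run: |
          # Prints the PATH entry if the directory was added by the step above.
          $env:PATH -split ';' | Select-String -SimpleMatch 'c:\openssl\1.1\bin'
```

The directory does not need to exist for the PATH entry to show up, so
this only demonstrates the ordering, not that the DLLs are found.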
> > +  windows-mingw:
> > +    name: Windows - Server 2022, MinGW64 - Meson
> > +    needs: [setup, sanity-check]
> > +    if: |
> > +      !cancelled() &&
> > +      needs.setup.outputs.mingw == 'true' &&
> > +      needs.sanity-check.result != 'failure'
> > +    runs-on: windows-2022
> > +    timeout-minutes: 60
> > +    env:
> > +      TEST_JOBS: 4 # higher concurrency causes occasional failures
> > +      PG_TEST_USE_UNIX_SOCKETS: 1
> > +      PG_REGRESS_SOCK_DIR: 'c:\pgsock\'
> > +      TAR: "c:/windows/system32/tar.exe"
> > +      # for mingw plpython to find its installation
> > +      PYTHONHOME: D:/a/_temp/msys64/ucrt64
> > +
> > +      MSYS: winjitdebug
> > +      CHERE_INVOKING: 1
> > +      MESON_FEATURES: >-
> > +        -Dnls=disabled
>
> Missing comments from .cirrus.tasks.yml

Done.

v3 is attached. Just a quick note: v3 also includes Zsolt's [3] and
Peter's [4] reviews and feedback. I will reply to them after sending
this.

GA run after v3 is applied:
https://github.com/nbyavuz/postgres/actions/runs/26587973538

[1]
https://github.com/actions/runner/issues/1182
https://github.com/orgs/community/discussions/185877

[2] https://postgr.es/m/CAGECzQQBCF%3DHSk4eCc1fEYTpCt59rgpcwWp47%2B6M-CDMYEaM2A%40mail.gmail.com

[3] https://postgr.es/m/CAN4CZFO4usEzFQoYzEywvOgoagW%3DU4yhpB4Oq-a7bUCR53djHA%40mail.gmail.com

[4] https://postgr.es/m/3daa29a4-6a08-41c1-8a6a-53ba8cd3c7fb%40eisentraut.org

--
Regards,
Nazir Bilal Yavuz
Microsoft
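
P.S. To make the anchors vs. merge keys point above more concrete, here
is a minimal sketch. It is not an excerpt from v3: the workflow and job
names are hypothetical, and putting _LOG_PATHS under the workflow-level
env is an assumption made for this example. It shows the two things that
work without merge keys (reusing a scalar and reusing a whole step
unchanged) and, in a comment, the form that would be needed to reuse a
step while overriding one field.

```yaml
# Hypothetical stand-alone workflow, only to illustrate YAML anchors.
name: anchor-demo
on: workflow_dispatch

env:
  # Scalar anchor: the exact same text can be reused via *log_paths.
  _LOG_PATHS: &log_paths |
    build*/testrun/**/*.log
    build*/meson-logs/*.txt

jobs:
  first:
    runs-on: ubuntu-latest
    steps:
      # Whole-step anchor: it can only be reused exactly as written.
      - &checkout
        uses: actions/checkout@v6
      - run: echo "first job"
      - name: Upload logs
        if: failure()
        uses: actions/upload-artifact@v7
        with:
          name: first-logs
          path: *log_paths

  second:
    runs-on: ubuntu-latest
    steps:
      - *checkout
      - run: echo "second job"
      # Reusing a step but changing one field (for example a per-job cache
      # key) would need a merge key, which is not supported:
      #   - <<: *restore_ccache
      #     with:
      #       key: ccache-second-...
```

This is why only steps that are identical in every job, like the
checkout, can be deduplicated with anchors.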