Re: Suggestion to add --continue-client-on-abort option to pgbench

Yugo Nagata <nagata@sraoss.co.jp>

From: Yugo Nagata <nagata@sraoss.co.jp>

To: Chao Li <li.evan.chao@gmail.com>

Cc: Fujii Masao <masao.fujii@gmail.com>, Rintaro Ikeda <ikedarintarof@oss.nttdata.com>, Jakub Wartak <jakub.wartak@enterprisedb.com>, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>, "slpmcf@gmail.com" <slpmcf@gmail.com>, "boekewurm+postgres@gmail.com" <boekewurm+postgres@gmail.com>, "pgsql-hackers@postgresql.org" <pgsql-hackers@postgresql.org>, Srinath Reddy Sadipiralla <srinath2133@gmail.com>, Dilip Kumar <dilipbalaut@gmail.com>

Date: 2025-11-13T05:50:33Z

Lists: pgsql-hackers

On Thu, 13 Nov 2025 13:14:37 +0800
Chao Li <li.evan.chao@gmail.com> wrote:

> 
> 
> > On Nov 13, 2025, at 12:02, Chao Li <li.evan.chao@gmail.com> wrote:
> > 
> > 
> > 
> >> On Nov 13, 2025, at 11:47, Fujii Masao <masao.fujii@gmail.com> wrote:
> >> 
> >> On Thu, Nov 13, 2025 at 11:21 AM Chao Li <li.evan.chao@gmail.com> wrote:
> >>> I debugged further this morning, and I think I have found the root cause. Ultimately, the problem is not with discardUntilSync(); instead, discardAvailableResults() mistakenly consumes PGRES_PIPELINE_SYNC.
> >> 
> >> Thanks for debugging!
> >> 
> >> Yes, discardAvailableResults() can discard PGRES_PIPELINE_SYNC,
> >> but do you mean that's the root cause of the assertion failure
> >> Nagata-san reported?
> >> Since that failure can occur even in older branches, I was thinking that
> >> newer code like discardAvailableResults() in master isn't the root cause...
> >> 
> > 
> > I haven’t debugged the old code, but it also discards non-NULL results:
> > 
> > ```
> > - do
> > - {
> > -     res = PQgetResult(st->con);
> > -     PQclear(res);
> > - } while (res);
> > + discardAvailableResults(st);
> > ```
> > 
> > That loop may also discard the sync message, though that is only my guess. I can also debug the old code this afternoon.
> > 
> 
> I just tried the old code but it didn’t trigger the assert with Yugo’s deadlock scripts.

To trigger a deadlock error, the tables need to have enough rows so that the scan takes some
time. In my environment, about 1,000 rows were enough to cause a deadlock.
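
As a rough illustration of that point, the sketch below writes a hypothetical setup file and two pgbench scripts that lock the same rows in opposite orders. The table name `deadlock_test`, the file contents, and the default database name are assumptions made for this example, not the scripts used earlier in this thread; only the row count of about 1,000 and the pgbench options come from the messages here.

```shell
#!/bin/sh
# Hypothetical reproduction sketch: all table and file contents are illustrative
# assumptions, not the original deadlock.sql / deadlock2.sql from this thread.
set -eu

DBNAME="${1:-postgres}"   # assumed default database name
ROWS=1000                 # about 1,000 rows, as mentioned above

# Setup: a table with enough rows that locking all of them takes some time.
cat > setup.sql <<EOF
DROP TABLE IF EXISTS deadlock_test;
CREATE TABLE deadlock_test (id int PRIMARY KEY, v int);
INSERT INTO deadlock_test SELECT i, 0 FROM generate_series(1, ${ROWS}) AS i;
EOF

# Script 1: lock rows in ascending id order.
cat > deadlock.sql <<'EOF'
BEGIN;
SELECT id FROM deadlock_test ORDER BY id FOR UPDATE;
END;
EOF

# Script 2: lock the same rows in descending id order, so two concurrent
# clients can end up waiting on each other.
cat > deadlock2.sql <<'EOF'
BEGIN;
SELECT id FROM deadlock_test ORDER BY id DESC FOR UPDATE;
END;
EOF

# Print the commands to run by hand; nothing is executed against a server here.
echo "psql -f setup.sql $DBNAME"
echo "pgbench -n --failures-detailed -M extended -j 2 -c 2 -f deadlock.sql -f deadlock2.sql $DBNAME"
```

Whether a deadlock actually occurs still depends on timing, so the row count may need to be raised on faster machines.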

Regards,
Yugo Nagata

> 
> I did "git reset --hard a3ea5330fcf47390c8ab420bbf433a97a54505d6", which is the commit just before "--continue-on-error". Then I ran Yugo’s deadlock scripts, but I didn’t get the assert:
> 
> ```
> % pgbench -n  --failures-detailed  -M extended -j 2 -c 2  -f deadlock.sql -f deadlock2.sql evantest
> pgbench (19devel)
> transaction type: multiple scripts
> scaling factor: 1
> query mode: extended
> number of clients: 2
> number of threads: 2
> maximum number of tries: 1
> number of transactions per client: 10
> number of transactions actually processed: 20/20
> number of failed transactions: 0 (0.000%)
> number of serialization failures: 0 (0.000%)
> number of deadlock failures: 0 (0.000%)
> latency average = 0.341 ms
> initial connection time = 2.637 ms
> tps = 5865.102639 (without initial connection time)
> SQL script 1: deadlock.sql
>  - weight: 1 (targets 50.0% of total)
>  - 12 transactions (60.0% of total)
>  - number of transactions actually processed: 12 (tps = 3519.061584)
>  - number of failed transactions: 0 (0.000%)
>  - number of serialization failures: 0 (0.000%)
>  - number of deadlock failures: 0 (0.000%)
>  - latency average = 0.311 ms
>  - latency stddev = 0.304 ms
> SQL script 2: deadlock2.sql
>  - weight: 1 (targets 50.0% of total)
>  - 8 transactions (40.0% of total)
>  - number of transactions actually processed: 8 (tps = 2346.041056)
>  - number of failed transactions: 0 (0.000%)
>  - number of serialization failures: 0 (0.000%)
>  - number of deadlock failures: 0 (0.000%)
>  - latency average = 0.366 ms
>  - latency stddev = 0.364 ms
> ```
> 
> Best regards,
> --
> Chao Li (Evan)
> HighGo Software Co., Ltd.
> https://www.highgo.com/
> 
> 
> 
> 


-- 
Yugo Nagata <nagata@sraoss.co.jp>