Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12
Tomas Vondra <tomas.vondra@2ndquadrant.com>
Commits
GET /api/v1/messages/:b64id/commits
the thread's linked commits as JSON, with link sources.
API reference →
-
Move into separate file all the SQL queries used in pg_upgrade tests
- 1924d508c335 10.20 landed
- 0e603b75c434 11.15 landed
- b6e525648d72 12.10 landed
- fae5f08e1719 13.6 landed
- b6dac98b0561 14.2 landed
- 0df9641d3905 15.0 landed
-
Add table to regression tests for binary-compatibility checks in pg_upgrade
- a9993416f80f 12.10 landed
- 755f04c72ef1 13.6 landed
- cf3d79aa31f2 14.2 landed
- 835bcba8b8d7 15.0 landed
-
Fix tests of pg_upgrade across different major versions
- afa09e4a9af6 12.9 landed
- 2a8dee6a67cc 13.5 landed
- f4e1c8892b9e 14.1 landed
- fa66b6dee084 15.0 landed
-
Multirange datatypes
- 6df7a9698bb0 14.0 cited
-
Work around cross-version-upgrade issues created by commit 9e38c2bb5.
- 97f73a978fc1 14.0 cited
-
Declare assorted array functions using anycompatible not anyelement.
- 9e38c2bb5093 14.0 cited
-
Remove factorial operators, leaving only the factorial() function.
- 76f412ab3105 14.0 cited
-
Create by default sql/ and expected/ for output directory in pg_regress
- e78900afd217 14.0 cited
-
Add missing include to pg_upgrade/version.c
- bc3a94dc0005 9.4.25 landed
- 984aa0ede1d2 9.5.20 landed
- e09ab32a2205 9.6.16 landed
-
Improve the check for pg_catalog.line data type in pg_upgrade
- 235a52ca0f26 9.4.25 landed
- f57b01dd75ee 9.5.20 landed
- 0a643de08715 9.6.16 landed
- 2218fdca496b 10.11 landed
- a970b6cdebd1 11.6 landed
- ebb4caa9120d 12.1 landed
- 8d48e6a7240c 13.0 landed
-
Improve the check for pg_catalog.unknown data type in pg_upgrade
- e86ece22114d 10.11 landed
- d071a2539ff4 11.6 landed
- a8e49ae0c381 12.1 landed
- a524f50d0fc6 13.0 landed
-
Check for tables with sql_identifier during pg_upgrade
- eaf900e842ab 12.1 landed
- 0ccfc2822366 13.0 landed
-
pg_upgrade: clarify the database names in error files
- 1634d361577a 13.0 cited
-
In the pg_upgrade test suite, don't write to src/test/regress.
- 40b132c1afbb 12.0 cited
-
Allow group access on PGDATA
- c37b3d08ca68 11.0 cited
-
Refactor dir/file permissions
- da9b580d8990 11.0 cited
-
Remove unused functions in regress.c.
- db3af9feb19f 11.0 cited
-
Make WAL segment size configurable at initdb time.
- fc49e24fa69a 11.0 cited
-
Fix bit-rot in pg_upgrade's test.sh, and improve documentation.
- 5bab1985dfc2 10.0 cited
On Wed, Oct 09, 2019 at 03:59:07PM +0200, Tomas Vondra wrote:
>FWIW I can reproduce this - it's enough to do this on the 11 cluster
>
>create table q_tbl_archiv as
>with
>qseason as (
>select table_name,column_name, ordinal_position
>,replace(column_name,'_season','') as col_qualifier
>-- ,'id_'||replace(column_name,'_season','') as id_column
>from information_schema.columns
>order by table_name
>)
>select qs.*,c.column_name as id_column, c.column_default as id_default
>from
> qseason qs
> left join information_schema.columns c on c.table_name=qs.table_name and
>c.column_name like 'id_%';
>
>
>and then
>
> analyze q_tbl_archiv
>
>which produces backtrace like this:
>
>No symbol "stats" in current context.
>(gdb) bt
>#0 0x0000746095262951 in __memmove_avx_unaligned_erms () from /lib64/libc.so.6
>#1 0x0000000000890a8e in varstrfastcmp_locale (a1p=0x17716b4 "per_language\a", len1=<optimized out>, a2p=0x176af28 '\177' <repeats 136 times>, "\021\004", len2=-4, ssup=<optimized out>, ssup=<optimized out>) at varlena.c:2320
>#2 0x0000000000890cb1 in varlenafastcmp_locale (x=24581808, y=24555300, ssup=0x7ffc649463f0) at varlena.c:2219
>#3 0x00000000005b73b4 in ApplySortComparator (ssup=0x7ffc649463f0, isNull2=false, datum2=<optimized out>, isNull1=false, datum1=<optimized out>) at ../../../src/include/utils/sortsupport.h:224
>#4 compare_scalars (a=<optimized out>, b=<optimized out>, arg=0x7ffc649463e0) at analyze.c:2700
>#5 0x00000000008f9953 in qsort_arg (a=a@entry=0x178fdc0, n=<optimized out>, n@entry=2158, es=es@entry=16, cmp=cmp@entry=0x5b7390 <compare_scalars>, arg=arg@entry=0x7ffc649463e0) at qsort_arg.c:140
>#6 0x00000000005b86a6 in compute_scalar_stats (stats=0x176a208, fetchfunc=<optimized out>, samplerows=<optimized out>, totalrows=2158) at analyze.c:2273
>#7 0x00000000005b9d95 in do_analyze_rel (onerel=onerel@entry=0x74608c00d3e8, params=params@entry=0x7ffc64946970, va_cols=va_cols@entry=0x0, acquirefunc=<optimized out>, relpages=22, inh=inh@entry=false, in_outer_xact=false, elevel=13)
> at analyze.c:529
>#8 0x00000000005bb2c9 in analyze_rel (relid=<optimized out>, relation=<optimized out>, params=params@entry=0x7ffc64946970, va_cols=0x0, in_outer_xact=<optimized out>, bstrategy=<optimized out>) at analyze.c:260
>#9 0x000000000062c7b0 in vacuum (relations=0x1727120, params=params@entry=0x7ffc64946970, bstrategy=<optimized out>, bstrategy@entry=0x0, isTopLevel=isTopLevel@entry=true) at vacuum.c:413
>#10 0x000000000062cd49 in ExecVacuum (pstate=pstate@entry=0x16c9518, vacstmt=vacstmt@entry=0x16a82b8, isTopLevel=isTopLevel@entry=true) at vacuum.c:199
>#11 0x00000000007a6d64 in standard_ProcessUtility (pstmt=0x16a8618, queryString=0x16a77a8 "", context=<optimized out>, params=0x0, queryEnv=0x0, dest=0x16a8710, completionTag=0x7ffc64946cb0 "") at utility.c:670
>#12 0x00000000007a4006 in PortalRunUtility (portal=0x170f368, pstmt=0x16a8618, isTopLevel=<optimized out>, setHoldSnapshot=<optimized out>, dest=0x16a8710, completionTag=0x7ffc64946cb0 "") at pquery.c:1175
>#13 0x00000000007a4b61 in PortalRunMulti (portal=portal@entry=0x170f368, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x16a8710, altdest=altdest@entry=0x16a8710,
> completionTag=completionTag@entry=0x7ffc64946cb0 "") at pquery.c:1321
>#14 0x00000000007a5864 in PortalRun (portal=portal@entry=0x170f368, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x16a8710, altdest=altdest@entry=0x16a8710,
> completionTag=0x7ffc64946cb0 "") at pquery.c:796
>#15 0x00000000007a174e in exec_simple_query (query_string=0x16a77a8 "") at postgres.c:1215
>
>Looking at compute_scalar_stats, the "stats" parameter does not seem
>particularly healthy:
>
>(gdb) p *stats
>$3 = {attr = 0x10, attrtypid = 12, attrtypmod = 0, attrtype = 0x1762e00, attrcollid = 356, anl_context = 0x7f7f7f7e00000000, compute_stats = 0x100, minrows = 144, extra_data = 0x1762e00, stats_valid = false, stanullfrac = 0,
> stawidth = 0, stadistinct = 0, stakind = {0, 0, 0, 0, 0}, staop = {0, 0, 0, 0, 0}, stacoll = {0, 0, 0, 0, 0}, numnumbers = {0, 0, 0, 0, 0}, stanumbers = {0x0, 0x0, 0x0, 0x0, 0x0}, numvalues = {0, 0, 0, 0, 2139062142}, stavalues = {
> 0x7f7f7f7f7f7f7f7f, 0x7f7f7f7f7f7f7f7f, 0x7f7f7f7f7f7f7f7f, 0x7f7f7f7f7f7f7f7f, 0x7f7f7f7f7f7f7f7f}, statypid = {2139062143, 2139062143, 2139062143, 2139062143, 2139062143}, statyplen = {32639, 32639, 32639, 32639, 32639},
> statypbyval = {127, 127, 127, 127, 127}, statypalign = "\177\177\177\177\177", tupattnum = 2139062143, rows = 0x7f7f7f7f7f7f7f7f, tupDesc = 0x7f7f7f7f7f7f7f7f, exprvals = 0x8, exprnulls = 0x4, rowstride = 24522240}
>
>Not sure about the root cause yet.
>
OK, a couple more observations - the table schema looks like this:
Table "public.q_tbl_archiv"
Column | Type | Collation | Nullable | Default
------------------+------------------------------------+-----------+----------+---------
table_name | information_schema.sql_identifier | | |
column_name | information_schema.sql_identifier | | |
ordinal_position | information_schema.cardinal_number | | |
col_qualifier | text | | |
id_column | information_schema.sql_identifier | | |
id_default | information_schema.character_data | | |
and I can succesfully do this:
test=# analyze q_tbl_archiv (table_name, column_name, ordinal_position, id_column, id_default);
ANALYZE
but as soon as I include the col_qualifier column, it fails:
test=# analyze q_tbl_archiv (table_name, column_name, ordinal_position, id_column, id_default, col_qualifier);
ERROR: compressed data is corrupted
But it fails differently (with the segfault) when analyzing just the one
column:
test=# analyze q_tbl_archiv (col_qualifier);
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
Moreover, there are some other interesting failures - I can do
select max(table_name) from q_tbl_archiv;
select max(column_name) from q_tbl_archiv;
select max(ordinal_position) from q_tbl_archiv;
but as soon as I try doing that with col_qualifier, it crashes and
burns:
select max(col_qualifier) from q_tbl_archiv;
The backtrace is rather strange in this case (a lot of missing calls,
etc.). However, when called for the next two columns, it still crashes,
but the backtraces look somewhat saner:
select max(id_column) from q_tbl_archiv;
Program received signal SIGSEGV, Segmentation fault.
0x00007db3186c6617 in __strlen_avx2 () from /lib64/libc.so.6
(gdb) bt
#0 0x00007db3186c6617 in __strlen_avx2 () from /lib64/libc.so.6
#1 0x0000000000894ced in cstring_to_text (s=0x7db32ce38935 <error: Cannot access memory at address 0x7db32ce38935>) at varlena.c:173
#2 name_text (fcinfo=<optimized out>) at varlena.c:3573
#3 0x000000000063860d in ExecInterpExpr (state=0x1136900, econtext=0x1135128, isnull=<optimized out>) at execExprInterp.c:649
#4 0x000000000064f699 in ExecEvalExprSwitchContext (isNull=0x7ffcfd8f3b2f, econtext=<optimized out>, state=<optimized out>) at ../../../src/include/executor/executor.h:307
#5 advance_aggregates (aggstate=0x1134ef0, aggstate=0x1134ef0) at nodeAgg.c:679
#6 agg_retrieve_direct (aggstate=0x1134ef0) at nodeAgg.c:1847
#7 ExecAgg (pstate=0x1134ef0) at nodeAgg.c:1572
#8 0x000000000063b58b in ExecProcNode (node=0x1134ef0) at ../../../src/include/executor/executor.h:239
#9 ExecutePlan (execute_once=<optimized out>, dest=0x1144248, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x1134ef0, estate=0x1134c98)
at execMain.c:1646
#10 standard_ExecutorRun (queryDesc=0x1094f18, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:364
#11 0x00000000007a43cc in PortalRunSelect (portal=0x10da368, forward=<optimized out>, count=0, dest=<optimized out>) at pquery.c:929
#12 0x00000000007a5958 in PortalRun (portal=portal@entry=0x10da368, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x1144248, altdest=altdest@entry=0x1144248,
completionTag=0x7ffcfd8f3db0 "") at pquery.c:770
#13 0x00000000007a177e in exec_simple_query (query_string=0x10727a8 "select max(id_column) from q_tbl_archiv ;") at postgres.c:1215
#14 0x00000000007a2f3f in PostgresMain (argc=<optimized out>, argv=argv@entry=0x109e400, dbname=<optimized out>, username=<optimized out>) at postgres.c:4236
#15 0x00000000007237ce in BackendRun (port=0x1097c30, port=0x1097c30) at postmaster.c:4437
#16 BackendStartup (port=0x1097c30) at postmaster.c:4128
#17 ServerLoop () at postmaster.c:1704
#18 0x000000000072458e in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x106c350) at postmaster.c:1377
#19 0x000000000047d101 in main (argc=3, argv=0x106c350) at main.c:228
select max(id_default) from q_tbl_archiv;
Program received signal SIGABRT, Aborted.
0x00007db3185a1e35 in raise () from /lib64/libc.so.6
(gdb) bt
#0 0x00007db3185a1e35 in raise () from /lib64/libc.so.6
#1 0x00007db31858c895 in abort () from /lib64/libc.so.6
#2 0x00000000008b4470 in ExceptionalCondition (conditionName=conditionName@entry=0xabe49e "1", errorType=errorType@entry=0x907128 "unrecognized TOAST vartag", fileName=fileName@entry=0xa4965b "execTuples.c",
lineNumber=lineNumber@entry=971) at assert.c:54
#3 0x00000000006466d3 in slot_deform_heap_tuple (natts=6, offp=0x1135170, tuple=<optimized out>, slot=0x1135128) at execTuples.c:985
#4 tts_buffer_heap_getsomeattrs (slot=0x1135128, natts=<optimized out>) at execTuples.c:676
#5 0x00000000006489fc in slot_getsomeattrs_int (slot=slot@entry=0x1135128, attnum=6) at execTuples.c:1877
#6 0x00000000006379a3 in slot_getsomeattrs (attnum=<optimized out>, slot=0x1135128) at ../../../src/include/executor/tuptable.h:345
#7 ExecInterpExpr (state=0x11364b0, econtext=0x1134cd8, isnull=<optimized out>) at execExprInterp.c:441
#8 0x000000000064f699 in ExecEvalExprSwitchContext (isNull=0x7ffcfd8f3b2f, econtext=<optimized out>, state=<optimized out>) at ../../../src/include/executor/executor.h:307
#9 advance_aggregates (aggstate=0x1134aa0, aggstate=0x1134aa0) at nodeAgg.c:679
#10 agg_retrieve_direct (aggstate=0x1134aa0) at nodeAgg.c:1847
#11 ExecAgg (pstate=0x1134aa0) at nodeAgg.c:1572
#12 0x000000000063b58b in ExecProcNode (node=0x1134aa0) at ../../../src/include/executor/executor.h:239
#13 ExecutePlan (execute_once=<optimized out>, dest=0x11439d8, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x1134aa0, estate=0x1134848)
at execMain.c:1646
#14 standard_ExecutorRun (queryDesc=0x1094f18, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:364
#15 0x00000000007a43cc in PortalRunSelect (portal=0x10da368, forward=<optimized out>, count=0, dest=<optimized out>) at pquery.c:929
#16 0x00000000007a5958 in PortalRun (portal=portal@entry=0x10da368, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x11439d8, altdest=altdest@entry=0x11439d8,
completionTag=0x7ffcfd8f3db0 "") at pquery.c:770
#17 0x00000000007a177e in exec_simple_query (query_string=0x10727a8 "select max(id_default) from q_tbl_archiv ;") at postgres.c:1215
#18 0x00000000007a2f3f in PostgresMain (argc=<optimized out>, argv=argv@entry=0x109e4f0, dbname=<optimized out>, username=<optimized out>) at postgres.c:4236
#19 0x00000000007237ce in BackendRun (port=0x10976f0, port=0x10976f0) at postmaster.c:4437
#20 BackendStartup (port=0x10976f0) at postmaster.c:4128
#21 ServerLoop () at postmaster.c:1704
#22 0x000000000072458e in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x106c350) at postmaster.c:1377
#23 0x000000000047d101 in main (argc=3, argv=0x106c350) at main.c:228
It's quite puzzling, though. If I had to guess, I'd say it's some sort
of memory management issue (either we're corrupting it somehow, or
perhaps using it after pfree).
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services