Thread

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →

Move into separate file all the SQL queries used in pg_upgrade tests
- 1924d508c335 10.20 landed
- 0e603b75c434 11.15 landed
- b6e525648d72 12.10 landed
- fae5f08e1719 13.6 landed
- b6dac98b0561 14.2 landed
- 0df9641d3905 15.0 landed
Add table to regression tests for binary-compatibility checks in pg_upgrade
- a9993416f80f 12.10 landed
- 755f04c72ef1 13.6 landed
- cf3d79aa31f2 14.2 landed
- 835bcba8b8d7 15.0 landed
Fix tests of pg_upgrade across different major versions
- afa09e4a9af6 12.9 landed
- 2a8dee6a67cc 13.5 landed
- f4e1c8892b9e 14.1 landed
- fa66b6dee084 15.0 landed
Multirange datatypes
- 6df7a9698bb0 14.0 cited
Work around cross-version-upgrade issues created by commit 9e38c2bb5.
- 97f73a978fc1 14.0 cited
Declare assorted array functions using anycompatible not anyelement.
- 9e38c2bb5093 14.0 cited
Remove factorial operators, leaving only the factorial() function.
- 76f412ab3105 14.0 cited
Create by default sql/ and expected/ for output directory in pg_regress
- e78900afd217 14.0 cited
Add missing include to pg_upgrade/version.c
- bc3a94dc0005 9.4.25 landed
- 984aa0ede1d2 9.5.20 landed
- e09ab32a2205 9.6.16 landed
Improve the check for pg_catalog.line data type in pg_upgrade
- 235a52ca0f26 9.4.25 landed
- f57b01dd75ee 9.5.20 landed
- 0a643de08715 9.6.16 landed
- 2218fdca496b 10.11 landed
- a970b6cdebd1 11.6 landed
- ebb4caa9120d 12.1 landed
- 8d48e6a7240c 13.0 landed
Improve the check for pg_catalog.unknown data type in pg_upgrade
- e86ece22114d 10.11 landed
- d071a2539ff4 11.6 landed
- a8e49ae0c381 12.1 landed
- a524f50d0fc6 13.0 landed
Check for tables with sql_identifier during pg_upgrade
- eaf900e842ab 12.1 landed
- 0ccfc2822366 13.0 landed
pg_upgrade: clarify the database names in error files
- 1634d361577a 13.0 cited
In the pg_upgrade test suite, don't write to src/test/regress.
- 40b132c1afbb 12.0 cited
Allow group access on PGDATA
- c37b3d08ca68 11.0 cited
Refactor dir/file permissions
- da9b580d8990 11.0 cited
Remove unused functions in regress.c.
- db3af9feb19f 11.0 cited
Make WAL segment size configurable at initdb time.
- fc49e24fa69a 11.0 cited
Fix bit-rot in pg_upgrade's test.sh, and improve documentation.
- 5bab1985dfc2 10.0 cited

BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

PG Bug reporting form <noreply@postgresql.org> — 2019-10-08T17:08:53Z

The following bug has been logged on the website:

Bug reference:      16045
Logged by:          Hans Buschmann
Email address:      buschmann@nidsa.net
PostgreSQL version: 12.0
Operating system:   Windows 10 64bit
Description:        

I just did a pg_upgrade from pg 11.5 to pg 12.0 on my development machine
under Windows 64bit (both distributions from EDB).

cpsdb=# select version ();
                          version
------------------------------------------------------------
 PostgreSQL 12.0, compiled by Visual C++ build 1914, 64-bit
(1 row)

The pg_upgrade with --link went flawlessly, I started (only!) the new server
12.0 and could connect and access individual databases.

As recommended by the resulting analyze_new_cluster.bat I tried a full
vacuumdb with:

"N:/pgsql/bin/vacuumdb" -U postgres --all --analyze-only

which crashed with
vacuumdb: vacuuming database "cpsdb"
vacuumdb: error: vacuuming of table "admin.q_tbl_archiv" in database "cpsdb"
failed: ERROR:  compressed data is corrupted

I connected to the database through pgsql and looked at the table
"admin.q_tbl_archiv"

cpsdb=# \d+ q_tbl_archiv;
                                                   Table
"admin.q_tbl_archiv"
      Column      |                Type                | Collation |
Nullable | Default | Storage  | Stats target | Description
------------------+------------------------------------+-----------+----------+---------+----------+--------------+-------------
 table_name       | information_schema.sql_identifier  |           |        
 |         | plain    |              |
 column_name      | information_schema.sql_identifier  |           |        
 |         | plain    |              |
 ordinal_position | information_schema.cardinal_number |           |        
 |         | plain    |              |
 col_qualifier    | text                               |           |        
 |         | extended |              |
 id_column        | information_schema.sql_identifier  |           |        
 |         | plain    |              |
 id_default       | information_schema.character_data  |           |        
 |         | extended |              |
Access method: heap

When trying to select * from q_tbl_archiv I got:

cpsdb=# select * from q_tbl_archiv;
ERROR:  invalid memory alloc request size 18446744073709551613

This table was created a long time back under 9.5 or 9.6 with the (here
truncated) following command:


create table q_tbl_archiv as
with
qseason as (
select table_name,column_name, ordinal_position 
,replace(column_name,'_season','') as col_qualifier
-- ,'id_'||replace(column_name,'_season','') as id_column
from information_schema.columns 
where 
column_name like '%_season'
and ordinal_position < 10
and table_name in (
 'table1'
,'table2'
-- here truncated:
-- ... (here where all table of mine having columns like xxx_season)
-- to reproduce, change to your own tablenames in a test database
)
order by table_name
)
select qs.*,c.column_name as id_column, c.column_default as id_default
from 
	qseason qs
	left join information_schema.columns c on c.table_name=qs.table_name and
c.column_name like 'id_%'
;

Until now this table was always restored without error by migrating to a new
major version through pg_dump/initdb/pr_restore.

To verify the integrity of the table I restored the dump taken under pg_dump
from pg 11.5 just before the pg_upgrade to another machine.

The restore and analyze went OK and select * from q_tbl_archiv showed all
tuples, eg (edited):

cpsdb_dev=# select * from q_tbl_archiv;
        table_name        | column_name  | ordinal_position | col_qualifier
| id_column |                        id_default
--------------------------+--------------+------------------+---------------+-----------+----------------------------------------------------------
 table1                   | chm_season   |                2 | chm          
|           |
 table2                   | cs_season    |                2 | cs           
| id_cs     | nextval('table2_id_cs_seq'::regclass)
...

In conclusion, this seems to me like an error/omission of pg_upgrade.

It seems to handle these specially derived tables from information_schema
not correctly, resulting in failures of the upgraded database.

For me, this error is not so crucial, because this table is only used for
administrative purposes and can easily be restored from backup.

But I want to share my findings for the sake of other users of pg_upgrade.

Thanks for investigating!

Hans Buschmann

Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Tomas Vondra <tomas.vondra@2ndquadrant.com> — 2019-10-09T13:24:36Z

On Tue, Oct 08, 2019 at 05:08:53PM +0000, PG Bug reporting form wrote:
>The following bug has been logged on the website:
>
>Bug reference:      16045
>Logged by:          Hans Buschmann
>Email address:      buschmann@nidsa.net
>PostgreSQL version: 12.0
>Operating system:   Windows 10 64bit
>Description:
>
>I just did a pg_upgrade from pg 11.5 to pg 12.0 on my development machine
>under Windows 64bit (both distributions from EDB).
>
>cpsdb=# select version ();
>                          version
>------------------------------------------------------------
> PostgreSQL 12.0, compiled by Visual C++ build 1914, 64-bit
>(1 row)
>
>The pg_upgrade with --link went flawlessly, I started (only!) the new server
>12.0 and could connect and access individual databases.
>
>As recommended by the resulting analyze_new_cluster.bat I tried a full
>vacuumdb with:
>
>"N:/pgsql/bin/vacuumdb" -U postgres --all --analyze-only
>
>which crashed with
>vacuumdb: vacuuming database "cpsdb"
>vacuumdb: error: vacuuming of table "admin.q_tbl_archiv" in database "cpsdb"
>failed: ERROR:  compressed data is corrupted
>
>I connected to the database through pgsql and looked at the table
>"admin.q_tbl_archiv"
>
>cpsdb=# \d+ q_tbl_archiv;
>                                                   Table
>"admin.q_tbl_archiv"
>      Column      |                Type                | Collation |
>Nullable | Default | Storage  | Stats target | Description
>------------------+------------------------------------+-----------+----------+---------+----------+--------------+-------------
> table_name       | information_schema.sql_identifier  |           |
> |         | plain    |              |
> column_name      | information_schema.sql_identifier  |           |
> |         | plain    |              |
> ordinal_position | information_schema.cardinal_number |           |
> |         | plain    |              |
> col_qualifier    | text                               |           |
> |         | extended |              |
> id_column        | information_schema.sql_identifier  |           |
> |         | plain    |              |
> id_default       | information_schema.character_data  |           |
> |         | extended |              |
>Access method: heap
>
>When trying to select * from q_tbl_archiv I got:
>
>cpsdb=# select * from q_tbl_archiv;
>ERROR:  invalid memory alloc request size 18446744073709551613
>
>This table was created a long time back under 9.5 or 9.6 with the (here
>truncated) following command:
>
>
>create table q_tbl_archiv as
>with
>qseason as (
>select table_name,column_name, ordinal_position
>,replace(column_name,'_season','') as col_qualifier
>-- ,'id_'||replace(column_name,'_season','') as id_column
>from information_schema.columns
>where
>column_name like '%_season'
>and ordinal_position < 10
>and table_name in (
> 'table1'
>,'table2'
>-- here truncated:
>-- ... (here where all table of mine having columns like xxx_season)
>-- to reproduce, change to your own tablenames in a test database
>)
>order by table_name
>)
>select qs.*,c.column_name as id_column, c.column_default as id_default
>from
>	qseason qs
>	left join information_schema.columns c on c.table_name=qs.table_name and
>c.column_name like 'id_%'
>;
>
>Until now this table was always restored without error by migrating to a new
>major version through pg_dump/initdb/pr_restore.
>
>To verify the integrity of the table I restored the dump taken under pg_dump
>from pg 11.5 just before the pg_upgrade to another machine.
>
>The restore and analyze went OK and select * from q_tbl_archiv showed all
>tuples, eg (edited):
>
>cpsdb_dev=# select * from q_tbl_archiv;
>        table_name        | column_name  | ordinal_position | col_qualifier
>| id_column |                        id_default
>--------------------------+--------------+------------------+---------------+-----------+----------------------------------------------------------
> table1                   | chm_season   |                2 | chm
>|           |
> table2                   | cs_season    |                2 | cs
>| id_cs     | nextval('table2_id_cs_seq'::regclass)
>...
>
>In conclusion, this seems to me like an error/omission of pg_upgrade.
>

There's clearly something bad happening. It's a bit strange, though. Had
this been a data corruption issue, I'd expect the pg_dump to fail too,
but it succeeds.

>It seems to handle these specially derived tables from information_schema
>not correctly, resulting in failures of the upgraded database.
>

Well, I don't see how that should make any difference. It's a CTAS and
that should create a regular table, that's not an issue. I wonder if
there were some changes to the data types involved, but that would be
essentially a break in on-disk format and we're careful about not doing
that ...

>For me, this error is not so crucial, because this table is only used for
>administrative purposes and can easily be restored from backup.
>
>But I want to share my findings for the sake of other users of pg_upgrade.
>

OK, thanks. Could you maybe set

 log_error_verbosity = verbose

before invoking the vacuum (you can set that in that session)? That
should give us more details about where exactly the error is triggered.
Even better, if you could attach a debugger to the session, set
breakpoints on locations triggering 'invalid memory alloc request size'
and then show the backtrace (obviously, that's more complicated).


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Tomas Vondra <tomas.vondra@2ndquadrant.com> — 2019-10-09T13:59:07Z

FWIW I can reproduce this - it's enough to do this on the 11 cluster

create table q_tbl_archiv as
with
qseason as (
select table_name,column_name, ordinal_position
,replace(column_name,'_season','') as col_qualifier
-- ,'id_'||replace(column_name,'_season','') as id_column
from information_schema.columns
order by table_name
)
select qs.*,c.column_name as id_column, c.column_default as id_default
from
        qseason qs
        left join information_schema.columns c on c.table_name=qs.table_name and
c.column_name like 'id_%';


and then

    analyze q_tbl_archiv

which produces backtrace like this:

No symbol "stats" in current context.
(gdb) bt
#0  0x0000746095262951 in __memmove_avx_unaligned_erms () from /lib64/libc.so.6
#1  0x0000000000890a8e in varstrfastcmp_locale (a1p=0x17716b4 "per_language\a", len1=<optimized out>, a2p=0x176af28 '\177' <repeats 136 times>, "\021\004", len2=-4, ssup=<optimized out>, ssup=<optimized out>) at varlena.c:2320
#2  0x0000000000890cb1 in varlenafastcmp_locale (x=24581808, y=24555300, ssup=0x7ffc649463f0) at varlena.c:2219
#3  0x00000000005b73b4 in ApplySortComparator (ssup=0x7ffc649463f0, isNull2=false, datum2=<optimized out>, isNull1=false, datum1=<optimized out>) at ../../../src/include/utils/sortsupport.h:224
#4  compare_scalars (a=<optimized out>, b=<optimized out>, arg=0x7ffc649463e0) at analyze.c:2700
#5  0x00000000008f9953 in qsort_arg (a=a@entry=0x178fdc0, n=<optimized out>, n@entry=2158, es=es@entry=16, cmp=cmp@entry=0x5b7390 <compare_scalars>, arg=arg@entry=0x7ffc649463e0) at qsort_arg.c:140
#6  0x00000000005b86a6 in compute_scalar_stats (stats=0x176a208, fetchfunc=<optimized out>, samplerows=<optimized out>, totalrows=2158) at analyze.c:2273
#7  0x00000000005b9d95 in do_analyze_rel (onerel=onerel@entry=0x74608c00d3e8, params=params@entry=0x7ffc64946970, va_cols=va_cols@entry=0x0, acquirefunc=<optimized out>, relpages=22, inh=inh@entry=false, in_outer_xact=false, elevel=13)
    at analyze.c:529
#8  0x00000000005bb2c9 in analyze_rel (relid=<optimized out>, relation=<optimized out>, params=params@entry=0x7ffc64946970, va_cols=0x0, in_outer_xact=<optimized out>, bstrategy=<optimized out>) at analyze.c:260
#9  0x000000000062c7b0 in vacuum (relations=0x1727120, params=params@entry=0x7ffc64946970, bstrategy=<optimized out>, bstrategy@entry=0x0, isTopLevel=isTopLevel@entry=true) at vacuum.c:413
#10 0x000000000062cd49 in ExecVacuum (pstate=pstate@entry=0x16c9518, vacstmt=vacstmt@entry=0x16a82b8, isTopLevel=isTopLevel@entry=true) at vacuum.c:199
#11 0x00000000007a6d64 in standard_ProcessUtility (pstmt=0x16a8618, queryString=0x16a77a8 "", context=<optimized out>, params=0x0, queryEnv=0x0, dest=0x16a8710, completionTag=0x7ffc64946cb0 "") at utility.c:670
#12 0x00000000007a4006 in PortalRunUtility (portal=0x170f368, pstmt=0x16a8618, isTopLevel=<optimized out>, setHoldSnapshot=<optimized out>, dest=0x16a8710, completionTag=0x7ffc64946cb0 "") at pquery.c:1175
#13 0x00000000007a4b61 in PortalRunMulti (portal=portal@entry=0x170f368, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x16a8710, altdest=altdest@entry=0x16a8710,
    completionTag=completionTag@entry=0x7ffc64946cb0 "") at pquery.c:1321
#14 0x00000000007a5864 in PortalRun (portal=portal@entry=0x170f368, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x16a8710, altdest=altdest@entry=0x16a8710,
    completionTag=0x7ffc64946cb0 "") at pquery.c:796
#15 0x00000000007a174e in exec_simple_query (query_string=0x16a77a8 "") at postgres.c:1215

Looking at compute_scalar_stats, the "stats" parameter does not seem
particularly healthy:

(gdb) p *stats
$3 = {attr = 0x10, attrtypid = 12, attrtypmod = 0, attrtype = 0x1762e00, attrcollid = 356, anl_context = 0x7f7f7f7e00000000, compute_stats = 0x100, minrows = 144, extra_data = 0x1762e00, stats_valid = false, stanullfrac = 0,
  stawidth = 0, stadistinct = 0, stakind = {0, 0, 0, 0, 0}, staop = {0, 0, 0, 0, 0}, stacoll = {0, 0, 0, 0, 0}, numnumbers = {0, 0, 0, 0, 0}, stanumbers = {0x0, 0x0, 0x0, 0x0, 0x0}, numvalues = {0, 0, 0, 0, 2139062142}, stavalues = {
    0x7f7f7f7f7f7f7f7f, 0x7f7f7f7f7f7f7f7f, 0x7f7f7f7f7f7f7f7f, 0x7f7f7f7f7f7f7f7f, 0x7f7f7f7f7f7f7f7f}, statypid = {2139062143, 2139062143, 2139062143, 2139062143, 2139062143}, statyplen = {32639, 32639, 32639, 32639, 32639},
  statypbyval = {127, 127, 127, 127, 127}, statypalign = "\177\177\177\177\177", tupattnum = 2139062143, rows = 0x7f7f7f7f7f7f7f7f, tupDesc = 0x7f7f7f7f7f7f7f7f, exprvals = 0x8, exprnulls = 0x4, rowstride = 24522240}

Not sure about the root cause yet.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Tom Lane <tgl@sss.pgh.pa.us> — 2019-10-09T14:07:01Z

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> FWIW I can reproduce this - it's enough to do this on the 11 cluster

I failed to reproduce any problem from your example, but I was trying
in C locale on a Linux machine.  What environment are you testing?

			regards, tom lane

Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Tomas Vondra <tomas.vondra@2ndquadrant.com> — 2019-10-09T14:18:41Z

On Wed, Oct 09, 2019 at 03:59:07PM +0200, Tomas Vondra wrote:
>FWIW I can reproduce this - it's enough to do this on the 11 cluster
>
>create table q_tbl_archiv as
>with
>qseason as (
>select table_name,column_name, ordinal_position
>,replace(column_name,'_season','') as col_qualifier
>-- ,'id_'||replace(column_name,'_season','') as id_column
>from information_schema.columns
>order by table_name
>)
>select qs.*,c.column_name as id_column, c.column_default as id_default
>from
>       qseason qs
>       left join information_schema.columns c on c.table_name=qs.table_name and
>c.column_name like 'id_%';
>
>
>and then
>
>   analyze q_tbl_archiv
>
>which produces backtrace like this:
>
>No symbol "stats" in current context.
>(gdb) bt
>#0  0x0000746095262951 in __memmove_avx_unaligned_erms () from /lib64/libc.so.6
>#1  0x0000000000890a8e in varstrfastcmp_locale (a1p=0x17716b4 "per_language\a", len1=<optimized out>, a2p=0x176af28 '\177' <repeats 136 times>, "\021\004", len2=-4, ssup=<optimized out>, ssup=<optimized out>) at varlena.c:2320
>#2  0x0000000000890cb1 in varlenafastcmp_locale (x=24581808, y=24555300, ssup=0x7ffc649463f0) at varlena.c:2219
>#3  0x00000000005b73b4 in ApplySortComparator (ssup=0x7ffc649463f0, isNull2=false, datum2=<optimized out>, isNull1=false, datum1=<optimized out>) at ../../../src/include/utils/sortsupport.h:224
>#4  compare_scalars (a=<optimized out>, b=<optimized out>, arg=0x7ffc649463e0) at analyze.c:2700
>#5  0x00000000008f9953 in qsort_arg (a=a@entry=0x178fdc0, n=<optimized out>, n@entry=2158, es=es@entry=16, cmp=cmp@entry=0x5b7390 <compare_scalars>, arg=arg@entry=0x7ffc649463e0) at qsort_arg.c:140
>#6  0x00000000005b86a6 in compute_scalar_stats (stats=0x176a208, fetchfunc=<optimized out>, samplerows=<optimized out>, totalrows=2158) at analyze.c:2273
>#7  0x00000000005b9d95 in do_analyze_rel (onerel=onerel@entry=0x74608c00d3e8, params=params@entry=0x7ffc64946970, va_cols=va_cols@entry=0x0, acquirefunc=<optimized out>, relpages=22, inh=inh@entry=false, in_outer_xact=false, elevel=13)
>   at analyze.c:529
>#8  0x00000000005bb2c9 in analyze_rel (relid=<optimized out>, relation=<optimized out>, params=params@entry=0x7ffc64946970, va_cols=0x0, in_outer_xact=<optimized out>, bstrategy=<optimized out>) at analyze.c:260
>#9  0x000000000062c7b0 in vacuum (relations=0x1727120, params=params@entry=0x7ffc64946970, bstrategy=<optimized out>, bstrategy@entry=0x0, isTopLevel=isTopLevel@entry=true) at vacuum.c:413
>#10 0x000000000062cd49 in ExecVacuum (pstate=pstate@entry=0x16c9518, vacstmt=vacstmt@entry=0x16a82b8, isTopLevel=isTopLevel@entry=true) at vacuum.c:199
>#11 0x00000000007a6d64 in standard_ProcessUtility (pstmt=0x16a8618, queryString=0x16a77a8 "", context=<optimized out>, params=0x0, queryEnv=0x0, dest=0x16a8710, completionTag=0x7ffc64946cb0 "") at utility.c:670
>#12 0x00000000007a4006 in PortalRunUtility (portal=0x170f368, pstmt=0x16a8618, isTopLevel=<optimized out>, setHoldSnapshot=<optimized out>, dest=0x16a8710, completionTag=0x7ffc64946cb0 "") at pquery.c:1175
>#13 0x00000000007a4b61 in PortalRunMulti (portal=portal@entry=0x170f368, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x16a8710, altdest=altdest@entry=0x16a8710,
>   completionTag=completionTag@entry=0x7ffc64946cb0 "") at pquery.c:1321
>#14 0x00000000007a5864 in PortalRun (portal=portal@entry=0x170f368, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x16a8710, altdest=altdest@entry=0x16a8710,
>   completionTag=0x7ffc64946cb0 "") at pquery.c:796
>#15 0x00000000007a174e in exec_simple_query (query_string=0x16a77a8 "") at postgres.c:1215
>
>Looking at compute_scalar_stats, the "stats" parameter does not seem
>particularly healthy:
>
>(gdb) p *stats
>$3 = {attr = 0x10, attrtypid = 12, attrtypmod = 0, attrtype = 0x1762e00, attrcollid = 356, anl_context = 0x7f7f7f7e00000000, compute_stats = 0x100, minrows = 144, extra_data = 0x1762e00, stats_valid = false, stanullfrac = 0,
> stawidth = 0, stadistinct = 0, stakind = {0, 0, 0, 0, 0}, staop = {0, 0, 0, 0, 0}, stacoll = {0, 0, 0, 0, 0}, numnumbers = {0, 0, 0, 0, 0}, stanumbers = {0x0, 0x0, 0x0, 0x0, 0x0}, numvalues = {0, 0, 0, 0, 2139062142}, stavalues = {
>   0x7f7f7f7f7f7f7f7f, 0x7f7f7f7f7f7f7f7f, 0x7f7f7f7f7f7f7f7f, 0x7f7f7f7f7f7f7f7f, 0x7f7f7f7f7f7f7f7f}, statypid = {2139062143, 2139062143, 2139062143, 2139062143, 2139062143}, statyplen = {32639, 32639, 32639, 32639, 32639},
> statypbyval = {127, 127, 127, 127, 127}, statypalign = "\177\177\177\177\177", tupattnum = 2139062143, rows = 0x7f7f7f7f7f7f7f7f, tupDesc = 0x7f7f7f7f7f7f7f7f, exprvals = 0x8, exprnulls = 0x4, rowstride = 24522240}
>
>Not sure about the root cause yet.
>

OK, a couple more observations - the table schema looks like this:

                              Table "public.q_tbl_archiv"
      Column      |                Type                | Collation | Nullable | Default 
------------------+------------------------------------+-----------+----------+---------
 table_name       | information_schema.sql_identifier  |           |          | 
 column_name      | information_schema.sql_identifier  |           |          | 
 ordinal_position | information_schema.cardinal_number |           |          | 
 col_qualifier    | text                               |           |          | 
 id_column        | information_schema.sql_identifier  |           |          | 
 id_default       | information_schema.character_data  |           |          | 

and I can succesfully do this:

  test=# analyze q_tbl_archiv (table_name, column_name, ordinal_position, id_column, id_default);
  ANALYZE

but as soon as I include the col_qualifier column, it fails:

  test=# analyze q_tbl_archiv (table_name, column_name, ordinal_position, id_column, id_default, col_qualifier);
  ERROR:  compressed data is corrupted

But it fails differently (with the segfault) when analyzing just the one
column:

  test=# analyze q_tbl_archiv (col_qualifier);
  server closed the connection unexpectedly
  	This probably means the server terminated abnormally
  	before or while processing the request.
  The connection to the server was lost. Attempting reset: Succeeded.

Moreover, there are some other interesting failures - I can do

  select max(table_name) from q_tbl_archiv;
  select max(column_name) from q_tbl_archiv;
  select max(ordinal_position) from q_tbl_archiv;

but as soon as I try doing that with col_qualifier, it crashes and
burns:

  select max(col_qualifier) from q_tbl_archiv;

The backtrace is rather strange in this case (a lot of missing calls,
etc.). However, when called for the next two columns, it still crashes,
but the backtraces look somewhat saner:

  select max(id_column) from q_tbl_archiv;

Program received signal SIGSEGV, Segmentation fault.
0x00007db3186c6617 in __strlen_avx2 () from /lib64/libc.so.6
(gdb) bt
#0  0x00007db3186c6617 in __strlen_avx2 () from /lib64/libc.so.6
#1  0x0000000000894ced in cstring_to_text (s=0x7db32ce38935 <error: Cannot access memory at address 0x7db32ce38935>) at varlena.c:173
#2  name_text (fcinfo=<optimized out>) at varlena.c:3573
#3  0x000000000063860d in ExecInterpExpr (state=0x1136900, econtext=0x1135128, isnull=<optimized out>) at execExprInterp.c:649
#4  0x000000000064f699 in ExecEvalExprSwitchContext (isNull=0x7ffcfd8f3b2f, econtext=<optimized out>, state=<optimized out>) at ../../../src/include/executor/executor.h:307
#5  advance_aggregates (aggstate=0x1134ef0, aggstate=0x1134ef0) at nodeAgg.c:679
#6  agg_retrieve_direct (aggstate=0x1134ef0) at nodeAgg.c:1847
#7  ExecAgg (pstate=0x1134ef0) at nodeAgg.c:1572
#8  0x000000000063b58b in ExecProcNode (node=0x1134ef0) at ../../../src/include/executor/executor.h:239
#9  ExecutePlan (execute_once=<optimized out>, dest=0x1144248, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x1134ef0, estate=0x1134c98)
    at execMain.c:1646
#10 standard_ExecutorRun (queryDesc=0x1094f18, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:364
#11 0x00000000007a43cc in PortalRunSelect (portal=0x10da368, forward=<optimized out>, count=0, dest=<optimized out>) at pquery.c:929
#12 0x00000000007a5958 in PortalRun (portal=portal@entry=0x10da368, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x1144248, altdest=altdest@entry=0x1144248,
    completionTag=0x7ffcfd8f3db0 "") at pquery.c:770
#13 0x00000000007a177e in exec_simple_query (query_string=0x10727a8 "select max(id_column) from q_tbl_archiv ;") at postgres.c:1215
#14 0x00000000007a2f3f in PostgresMain (argc=<optimized out>, argv=argv@entry=0x109e400, dbname=<optimized out>, username=<optimized out>) at postgres.c:4236
#15 0x00000000007237ce in BackendRun (port=0x1097c30, port=0x1097c30) at postmaster.c:4437
#16 BackendStartup (port=0x1097c30) at postmaster.c:4128
#17 ServerLoop () at postmaster.c:1704
#18 0x000000000072458e in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x106c350) at postmaster.c:1377
#19 0x000000000047d101 in main (argc=3, argv=0x106c350) at main.c:228

  select max(id_default) from q_tbl_archiv;

Program received signal SIGABRT, Aborted.
0x00007db3185a1e35 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007db3185a1e35 in raise () from /lib64/libc.so.6
#1  0x00007db31858c895 in abort () from /lib64/libc.so.6
#2  0x00000000008b4470 in ExceptionalCondition (conditionName=conditionName@entry=0xabe49e "1", errorType=errorType@entry=0x907128 "unrecognized TOAST vartag", fileName=fileName@entry=0xa4965b "execTuples.c",
    lineNumber=lineNumber@entry=971) at assert.c:54
#3  0x00000000006466d3 in slot_deform_heap_tuple (natts=6, offp=0x1135170, tuple=<optimized out>, slot=0x1135128) at execTuples.c:985
#4  tts_buffer_heap_getsomeattrs (slot=0x1135128, natts=<optimized out>) at execTuples.c:676
#5  0x00000000006489fc in slot_getsomeattrs_int (slot=slot@entry=0x1135128, attnum=6) at execTuples.c:1877
#6  0x00000000006379a3 in slot_getsomeattrs (attnum=<optimized out>, slot=0x1135128) at ../../../src/include/executor/tuptable.h:345
#7  ExecInterpExpr (state=0x11364b0, econtext=0x1134cd8, isnull=<optimized out>) at execExprInterp.c:441
#8  0x000000000064f699 in ExecEvalExprSwitchContext (isNull=0x7ffcfd8f3b2f, econtext=<optimized out>, state=<optimized out>) at ../../../src/include/executor/executor.h:307
#9  advance_aggregates (aggstate=0x1134aa0, aggstate=0x1134aa0) at nodeAgg.c:679
#10 agg_retrieve_direct (aggstate=0x1134aa0) at nodeAgg.c:1847
#11 ExecAgg (pstate=0x1134aa0) at nodeAgg.c:1572
#12 0x000000000063b58b in ExecProcNode (node=0x1134aa0) at ../../../src/include/executor/executor.h:239
#13 ExecutePlan (execute_once=<optimized out>, dest=0x11439d8, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x1134aa0, estate=0x1134848)
    at execMain.c:1646
#14 standard_ExecutorRun (queryDesc=0x1094f18, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:364
#15 0x00000000007a43cc in PortalRunSelect (portal=0x10da368, forward=<optimized out>, count=0, dest=<optimized out>) at pquery.c:929
#16 0x00000000007a5958 in PortalRun (portal=portal@entry=0x10da368, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x11439d8, altdest=altdest@entry=0x11439d8,
    completionTag=0x7ffcfd8f3db0 "") at pquery.c:770
#17 0x00000000007a177e in exec_simple_query (query_string=0x10727a8 "select max(id_default) from q_tbl_archiv ;") at postgres.c:1215
#18 0x00000000007a2f3f in PostgresMain (argc=<optimized out>, argv=argv@entry=0x109e4f0, dbname=<optimized out>, username=<optimized out>) at postgres.c:4236
#19 0x00000000007237ce in BackendRun (port=0x10976f0, port=0x10976f0) at postmaster.c:4437
#20 BackendStartup (port=0x10976f0) at postmaster.c:4128
#21 ServerLoop () at postmaster.c:1704
#22 0x000000000072458e in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x106c350) at postmaster.c:1377
#23 0x000000000047d101 in main (argc=3, argv=0x106c350) at main.c:228


It's quite puzzling, though. If I had to guess, I'd say it's some sort
of memory management issue (either we're corrupting it somehow, or
perhaps using it after pfree).

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Tomas Vondra <tomas.vondra@2ndquadrant.com> — 2019-10-09T14:28:27Z

On Wed, Oct 09, 2019 at 10:07:01AM -0400, Tom Lane wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> FWIW I can reproduce this - it's enough to do this on the 11 cluster
>
>I failed to reproduce any problem from your example, but I was trying
>in C locale on a Linux machine.  What environment are you testing?
>
>			regards, tom lane

test=# show lc_collate ;
 lc_collate 
------------
 C.UTF-8
(1 row)


I can reproduce this pretty easily like this:

1) build 11

git checkout REL_11_STABLE
./configure --prefix=/home/user/pg-11 --enable-debug --enable-cassert && make -s clean && make -s -j4 install

2) build 12

git checkout REL_12_STABLE
./configure --prefix=/home/user/pg-12 --enable-debug --enable-cassert && make -s clean && make -s -j4 install

3) create the 11 cluster

/home/user/pg-11/bin/pg_ctl -D /tmp/data-11 init
/home/user/pg-11/bin/pg_ctl -D /tmp/data-11 -l /tmp/pg-11.log start
/home/user/pg-11/bin/createdb test
/home/user/pg-11/bin/psql test

4) create the table

    create table q_tbl_archiv as
    with
    qseason as (
    select table_name,column_name, ordinal_position
    ,replace(column_name,'_season','') as col_qualifier
    -- ,'id_'||replace(column_name,'_season','') as id_column
    from information_schema.columns
    order by table_name
    )
    select qs.*,c.column_name as id_column, c.column_default as id_default
    from
            qseason qs
            left join information_schema.columns c on c.table_name=qs.table_name and
    c.column_name like 'id_%';

5) shutdown the 11 cluster

  /home/user/pg-11/bin/pg_ctl -D /tmp/data-11 stop

6) init 12 cluster

  /home/user/pg-12/bin/pg_ctl -D /tmp/data-12 init

7) do the pg_upgrade thing

  /home/user/pg-12/bin/pg_upgrade -b /home/user/pg-11/bin -B /home/user/pg-12/bin -d /tmp/data-11 -D /tmp/data-12 -k

8) start 12 cluster

  /home/user/pg-12/bin/pg_ctl -D /tmp/data-12 -l /tmp/pg-12.log start

9) kabooom

  /home/user/pg-12/bin/psql test -c "analyze q_tbl_archiv"


On my system (Fedora 30 in x86_64) this reliably results a crash (and
various other crashes as demonstrated in my previous message).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

AW: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Hans Buschmann <buschmann@nidsa.net> — 2019-10-09T14:50:52Z

Hi Tomas,

Nice that you could reproduce it.
This was just the way I followed.

For your Info, here are my no-standard config params:

                name                |         current_setting         
------------------------------------+---------------------------------
 application_name                   | psql                            
 auto_explain.log_analyze           | on                              
 auto_explain.log_min_duration      | 0                               
 auto_explain.log_nested_statements | on                              
 client_encoding                    | WIN1252                         
 cluster_name                       | HB_DEV                          
 data_checksums                     | on                              
 DateStyle                          | ISO, DMY                        
 default_text_search_config         | pg_catalog.german               
 dynamic_shared_memory_type         | windows                         
 effective_cache_size               | 8GB                             
 lc_collate                         | C                               
 lc_ctype                           | German_Germany.1252             
 lc_messages                        | C                               
 lc_monetary                        | German_Germany.1252             
 lc_numeric                         | German_Germany.1252             
 lc_time                            | German_Germany.1252             
 log_destination                    | stderr                          
 log_directory                      | N:/ZZ_log/pg_log_hbdev          
 log_error_verbosity                | verbose                         
 log_file_mode                      | 0640                            
 log_line_prefix                    | WHB %a %t %i %e %2l:>           
 log_statement                      | mod                             
 log_temp_files                     | 0                               
 log_timezone                       | CET                             
 logging_collector                  | on                              
 maintenance_work_mem               | 128MB                           
 max_connections                    | 100                             
 max_stack_depth                    | 2MB                             
 max_wal_size                       | 1GB                             
 min_wal_size                       | 80MB                            
 pg_stat_statements.max             | 5000                            
 pg_stat_statements.track           | all                             
 random_page_cost                   | 1                               
 search_path                        | public, archiv, ablage, admin   
 server_encoding                    | UTF8                            
 server_version                     | 12.0                            
 shared_buffers                     | 1GB                             
 shared_preload_libraries           | auto_explain,pg_stat_statements 
 temp_buffers                       | 32MB                            
 TimeZone                           | CET                             
 transaction_deferrable             | off                             
 transaction_isolation              | read committed                  
 transaction_read_only              | off                             
 update_process_title               | off                             
 wal_buffers                        | 16MB                            
 wal_segment_size                   | 16MB                            
 work_mem                           | 32MB                            
(48 rows)

Indeed, the database has UTF8 Encoding.

The Extended error-log (i have set auto_explain):



WHB psql 2019-10-09 15:45:03 CEST  XX000  7:> ERROR:  XX000: invalid memory alloc request size 18446744073709551613
WHB psql 2019-10-09 15:45:03 CEST  XX000  8:> LOCATION:  palloc, d:\pginstaller_12.auto\postgres.windows-x64\src\backend\utils\mmgr\mcxt.c:934
WHB psql 2019-10-09 15:45:03 CEST  XX000  9:> STATEMENT:  select * from q_tbl_archiv;
WHB vacuumdb 2019-10-09 15:46:42 CEST  00000  1:> LOG:  00000: duration: 0.022 ms  plan:
	Query Text: SELECT pg_catalog.set_config('search_path', '', false);
	Result  (cost=0.00..0.01 rows=1 width=32) (actual time=0.014..0.015 rows=1 loops=1)
WHB vacuumdb 2019-10-09 15:46:42 CEST  00000  2:> LOCATION:  explain_ExecutorEnd, d:\pginstaller_12.auto\postgres.windows-x64\contrib\auto_explain\auto_explain.c:415
WHB vacuumdb 2019-10-09 15:46:42 CEST  00000  3:> LOG:  00000: duration: 0.072 ms  plan:
	Query Text: SELECT datname FROM pg_database WHERE datallowconn ORDER BY 1;
	Sort  (cost=1.16..1.16 rows=1 width=64) (actual time=0.063..0.064 rows=14 loops=1)
	  Sort Key: datname
	  Sort Method: quicksort  Memory: 26kB
	  ->  Seq Scan on pg_database  (cost=0.00..1.15 rows=1 width=64) (actual time=0.018..0.022 rows=14 loops=1)
	        Filter: datallowconn
	        Rows Removed by Filter: 1
WHB vacuumdb 2019-10-09 15:46:42 CEST  00000  4:> LOCATION:  explain_ExecutorEnd, d:\pginstaller_12.auto\postgres.windows-x64\contrib\auto_explain\auto_explain.c:415
WHB vacuumdb 2019-10-09 15:46:43 CEST  00000  1:> LOG:  00000: duration: 0.027 ms  plan:
	Query Text: SELECT pg_catalog.set_config('search_path', '', false);
	Result  (cost=0.00..0.01 rows=1 width=32) (actual time=0.012..0.013 rows=1 loops=1)
WHB vacuumdb 2019-10-09 15:46:43 CEST  00000  2:> LOCATION:  explain_ExecutorEnd, d:\pginstaller_12.auto\postgres.windows-x64\contrib\auto_explain\auto_explain.c:415
WHB vacuumdb 2019-10-09 15:46:43 CEST  00000  3:> LOG:  00000: duration: 1.036 ms  plan:
	Query Text: SELECT c.relname, ns.nspname FROM pg_catalog.pg_class c
	 JOIN pg_catalog.pg_namespace ns ON c.relnamespace OPERATOR(pg_catalog.=) ns.oid
	 LEFT JOIN pg_catalog.pg_class t ON c.reltoastrelid OPERATOR(pg_catalog.=) t.oid
	 WHERE c.relkind OPERATOR(pg_catalog.=) ANY (array['r', 'm'])
	 ORDER BY c.relpages DESC;
	Sort  (cost=56.56..56.59 rows=13 width=132) (actual time=0.843..0.854 rows=320 loops=1)
	  Sort Key: c.relpages DESC
	  Sort Method: quicksort  Memory: 110kB
	  ->  Hash Join  (cost=1.23..56.32 rows=13 width=132) (actual time=0.082..0.649 rows=320 loops=1)
	        Hash Cond: (c.relnamespace = ns.oid)
	        ->  Seq Scan on pg_class c  (cost=0.00..55.05 rows=13 width=76) (actual time=0.034..0.545 rows=320 loops=1)
	              Filter: ((relkind)::text = ANY ('{r,m}'::text[]))
	              Rows Removed by Filter: 950
	        ->  Hash  (cost=1.10..1.10 rows=10 width=68) (actual time=0.022..0.022 rows=10 loops=1)
	              Buckets: 1024  Batches: 1  Memory Usage: 9kB
	              ->  Seq Scan on pg_namespace ns  (cost=0.00..1.10 rows=10 width=68) (actual time=0.010..0.011 rows=10 loops=1)
WHB vacuumdb 2019-10-09 15:46:43 CEST  00000  4:> LOCATION:  explain_ExecutorEnd, d:\pginstaller_12.auto\postgres.windows-x64\contrib\auto_explain\auto_explain.c:415
WHB vacuumdb 2019-10-09 15:46:43 CEST  00000  5:> LOG:  00000: duration: 0.011 ms  plan:
	Query Text: SELECT pg_catalog.set_config('search_path', '', false);
	Result  (cost=0.00..0.01 rows=1 width=32) (actual time=0.008..0.008 rows=1 loops=1)
WHB vacuumdb 2019-10-09 15:46:43 CEST  00000  6:> LOCATION:  explain_ExecutorEnd, d:\pginstaller_12.auto\postgres.windows-x64\contrib\auto_explain\auto_explain.c:415
WHB  2019-10-09 15:47:01 CEST  00000 22:> LOG:  00000: server process (PID 4708) was terminated by exception 0xC0000005
WHB  2019-10-09 15:47:01 CEST  00000 23:> DETAIL:  Failed process was running: ANALYZE admin.q_tbl_archiv;
WHB  2019-10-09 15:47:01 CEST  00000 24:> HINT:  See C include file "ntstatus.h" for a description of the hexadecimal value.
WHB  2019-10-09 15:47:01 CEST  00000 25:> LOCATION:  LogChildExit, d:\pginstaller_12.auto\postgres.windows-x64\src\backend\postmaster\postmaster.c:3670
WHB  2019-10-09 15:47:01 CEST  00000 26:> LOG:  00000: terminating any other active server processes
WHB  2019-10-09 15:47:01 CEST  00000 27:> LOCATION:  HandleChildCrash, d:\pginstaller_12.auto\postgres.windows-x64\src\backend\postmaster\postmaster.c:3400
WHB psql 2019-10-09 15:47:01 CEST  57P02 10:> WARNING:  57P02: terminating connection because of crash of another server process
WHB psql 2019-10-09 15:47:01 CEST  57P02 11:> DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
WHB psql 2019-10-09 15:47:01 CEST  57P02 12:> HINT:  In a moment you should be able to reconnect to the database and repeat your command.
WHB psql 2019-10-09 15:47:01 CEST  57P02 13:> LOCATION:  quickdie, d:\pginstaller_12.auto\postgres.windows-x64\src\backend\tcop\postgres.c:2717
WHB  2019-10-09 15:47:02 CEST  57P02  3:> WARNING:  57P02: terminating connection because of crash of another server process
WHB  2019-10-09 15:47:02 CEST  57P02  4:> DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
WHB  2019-10-09 15:47:02 CEST  57P02  5:> HINT:  In a moment you should be able to reconnect to the database and repeat your command.
WHB  2019-10-09 15:47:02 CEST  57P02  6:> LOCATION:  quickdie, d:\pginstaller_12.auto\postgres.windows-x64\src\backend\tcop\postgres.c:2717
WHB  2019-10-09 15:47:02 CEST  00000 28:> LOG:  00000: all server processes terminated; reinitializing
WHB  2019-10-09 15:47:02 CEST  00000 29:> LOCATION:  PostmasterStateMachine, d:\pginstaller_12.auto\postgres.windows-x64\src\backend\postmaster\postmaster.c:3912
WHB  2019-10-09 15:47:02 CEST  00000  1:> LOG:  00000: database system was interrupted; last known up at 2019-10-09 15:46:03 CEST
WHB  2019-10-09 15:47:02 CEST  00000  2:> LOCATION:  StartupXLOG, d:\pginstaller_12.auto\postgres.windows-x64\src\backend\access\transam\xlog.c:6277


The table was imported successively by pg_dump/pg_restore from the previous versions into pg11.

This was the same what I did on the other machine (pg 11.5). On this test machine I could successfully Export the table with pg_dump -t.

On the erroneous PG12 Cluster I succeeded to recreate a similar  table with the original create table Statements: no Errors.

Under PG12 upgraded, I tried to select only the first column (select table_name from q_tbl_archiv) and got erroneaus results (shown first 2 entries):

cpsdb=# select table_name from q_tbl_archiv;
                 table_name
---------------------------------------------
 \x11chemmat\x17chm_season
 !collectionsheet\x15cs_season

It seems that the length Bytes are present in the Output.

Hope this Information helps.

Hans Buschmann

Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Tomas Vondra <tomas.vondra@2ndquadrant.com> — 2019-10-09T23:07:59Z

Well, I think I found the root cause. It's because of 7c15cef86d, which
changed the definition of sql_identifier so that it's a domain over name
instead of varchar. So we now have this:

  SELECT typname, typlen FROM pg_type WHERE typname = 'sql_identifier':

  -[ RECORD 1 ]--+---------------
  typname        | sql_identifier
  typlen         | -1

instead of this

  -[ RECORD 1 ]--+---------------
  typname        | sql_identifier
  typlen         | 64

Unfortunately, that seems very much like a break of on-disk format, and
after pg_upgrade any table containing sql_identifier columns is pretty
much guaranteed to be badly mangled. For example, the first row from the
table used in the original report looks like this on PostgreSQL 11:

  test=# select ctid, * from q_tbl_archiv limit 1;
  -[ RECORD 1 ]----+--------------------------
  ctid             | (0,1)
  table_name       | _pg_foreign_data_wrappers
  column_name      | foreign_data_wrapper_name
  ordinal_position | 5
  col_qualifier    | foreign_data_wrapper_name
  id_column        | 
  id_default       | 

while on PostgreSQL 12 after pg_upgrade it looks like this

  test=# select ctid, table_name, column_name, ordinal_position from q_tbl_archiv limit 1;:
  -[ RECORD 1 ]----+---------------------------------------------------------
  ctid             | (0,1)
  table_name       | 5_pg_foreign_data_wrappers5foreign_data_wrapper_name\x05
  column_name      | _data_wrapper_name
  ordinal_position | 0

Not sure what to do about this :-(


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Tom Lane <tgl@sss.pgh.pa.us> — 2019-10-09T23:18:45Z

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> Well, I think I found the root cause. It's because of 7c15cef86d, which
> changed the definition of sql_identifier so that it's a domain over name
> instead of varchar.

Ah...

> Not sure what to do about this :-(

Fortunately, there should be close to zero people with user tables
depending on sql_identifier.  I think we should just add a test in
pg_upgrade that refuses to upgrade if there are any such columns.
It won't be the first such restriction.

			regards, tom lane

Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Tomas Vondra <tomas.vondra@2ndquadrant.com> — 2019-10-09T23:28:36Z

On Wed, Oct 09, 2019 at 07:18:45PM -0400, Tom Lane wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> Well, I think I found the root cause. It's because of 7c15cef86d, which
>> changed the definition of sql_identifier so that it's a domain over name
>> instead of varchar.
>
>Ah...
>
>> Not sure what to do about this :-(
>
>Fortunately, there should be close to zero people with user tables
>depending on sql_identifier.  I think we should just add a test in
>pg_upgrade that refuses to upgrade if there are any such columns.
>It won't be the first such restriction.
>

Hmmm, yeah.  I agree the number of people using sql_identifier in user
tables is low, but OTOH we got this report within a week after release,
so maybe it's higher than we think.

Another option would be to teach pg_upgrade to switch the columns to
'text' or 'varchar', not sure if that's possible or how much work would
that be.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Tom Lane <tgl@sss.pgh.pa.us> — 2019-10-09T23:41:54Z

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On Wed, Oct 09, 2019 at 07:18:45PM -0400, Tom Lane wrote:
>> Fortunately, there should be close to zero people with user tables
>> depending on sql_identifier.  I think we should just add a test in
>> pg_upgrade that refuses to upgrade if there are any such columns.
>> It won't be the first such restriction.

> Hmmm, yeah.  I agree the number of people using sql_identifier in user
> tables is low, but OTOH we got this report within a week after release,
> so maybe it's higher than we think.

True.

> Another option would be to teach pg_upgrade to switch the columns to
> 'text' or 'varchar', not sure if that's possible or how much work would
> that be.

I think it'd be a mess --- the actual hacking would have to happen in
pg_dump, I think, and it'd be a kluge because pg_dump doesn't normally
understand what server version its output is going to.  So we'd more
or less have to control it through a new pg_dump switch that pg_upgrade
would use.  Ick.

Also, even if we did try to silently convert such columns that way,
I bet we'd get other bug reports about "why'd my columns suddenly
change type?".  So I'd rather force the user to be involved.

			regards, tom lane

Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Andres Freund <andres@anarazel.de> — 2019-10-10T01:48:13Z

On 2019-10-09 19:41:54 -0400, Tom Lane wrote:
> Also, even if we did try to silently convert such columns that way,
> I bet we'd get other bug reports about "why'd my columns suddenly
> change type?".  So I'd rather force the user to be involved.

+1

Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Tomas Vondra <tomas.vondra@2ndquadrant.com> — 2019-10-10T13:43:03Z

On Wed, Oct 09, 2019 at 06:48:13PM -0700, Andres Freund wrote:
>On 2019-10-09 19:41:54 -0400, Tom Lane wrote:
>> Also, even if we did try to silently convert such columns that way,
>> I bet we'd get other bug reports about "why'd my columns suddenly
>> change type?".  So I'd rather force the user to be involved.
>
>+1

Fair enough, attached is a patch doing that, I think. Maybe the file
should be named differently, as it contains other objects than just
tables.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Tom Lane <tgl@sss.pgh.pa.us> — 2019-10-10T14:19:12Z

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> Fair enough, attached is a patch doing that, I think. Maybe the file
> should be named differently, as it contains other objects than just
> tables.

Seems about right, though I notice it will not detect domains over
sql_identifier.  How much do we care about that?

To identify such domains, I think we'd need something like
WHERE attypid IN (recursive-WITH-query), which makes me nervous.
We did support those starting with 8.4, which is as far back as
pg_upgrade will go, so in theory it should work.  But I think we
had bugs with such cases in old releases.  Do we want to assume
that the source server has been updated enough to avoid any such
bugs?  The expense of such a query might be daunting, too.

			regards, tom lane

Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Tomas Vondra <tomas.vondra@2ndquadrant.com> — 2019-10-10T16:33:50Z

On Thu, Oct 10, 2019 at 10:19:12AM -0400, Tom Lane wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> Fair enough, attached is a patch doing that, I think. Maybe the file
>> should be named differently, as it contains other objects than just
>> tables.
>
>Seems about right, though I notice it will not detect domains over
>sql_identifier.  How much do we care about that?
>
>To identify such domains, I think we'd need something like
>WHERE attypid IN (recursive-WITH-query), which makes me nervous.
>We did support those starting with 8.4, which is as far back as
>pg_upgrade will go, so in theory it should work.  But I think we
>had bugs with such cases in old releases.  Do we want to assume
>that the source server has been updated enough to avoid any such
>bugs?  The expense of such a query might be daunting, too.
>

Not sure.

Regarding bugs, I think it's fine to assume the users are running
sufficiently recent version - they may not, but then they're probably
subject to various other bugs (data corruption, queries). If they're
not, then they'll either get false positives (in which case they'll be
forced to update) or false negatives (which is just as if we did
nothing).

For the query cost, I think we can assume the domain hierarchies are not
particularly deep (in practice I'd expect just domains directly on the
sql_identifier type). And I doubt people are using that very widely,
it's probably more like this report - ad-hoc CTAS, so just a couple of
items. So I wouldn't expect it to be a huge deal in most cases. But even
if it takes a second or two, it's a one-time cost.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Tom Lane <tgl@sss.pgh.pa.us> — 2019-10-10T20:14:20Z

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On Thu, Oct 10, 2019 at 10:19:12AM -0400, Tom Lane wrote:
>> To identify such domains, I think we'd need something like
>> WHERE attypid IN (recursive-WITH-query), which makes me nervous.
>> We did support those starting with 8.4, which is as far back as
>> pg_upgrade will go, so in theory it should work.  But I think we
>> had bugs with such cases in old releases.  Do we want to assume
>> that the source server has been updated enough to avoid any such
>> bugs?  The expense of such a query might be daunting, too.

> For the query cost, I think we can assume the domain hierarchies are not
> particularly deep (in practice I'd expect just domains directly on the
> sql_identifier type). And I doubt people are using that very widely,
> it's probably more like this report - ad-hoc CTAS, so just a couple of
> items. So I wouldn't expect it to be a huge deal in most cases. But even
> if it takes a second or two, it's a one-time cost.

What I was worried about was the planner possibly trying to apply the
atttypid restriction as a scan qual using a subplan, which might be rather
awful.  But it doesn't look like that happens.  I get a hash semijoin to
the CTE output, in all branches back to 8.4, on this trial query:

explain
with recursive sqlidoids(toid) as (
select 'information_schema.sql_identifier'::pg_catalog.regtype as toid
union
select oid from pg_catalog.pg_type, sqlidoids
  where typtype = 'd' and typbasetype = sqlidoids.toid
)                       
SELECT n.nspname, c.relname, a.attname 
FROM    pg_catalog.pg_class c, 
        pg_catalog.pg_namespace n, 
        pg_catalog.pg_attribute a 
WHERE   c.oid = a.attrelid AND 
        NOT a.attisdropped AND 
        a.atttypid in (select toid from sqlidoids) AND
        c.relkind IN ('r','v','i') and
        c.relnamespace = n.oid AND 
        n.nspname !~ '^pg_temp_' AND 
        n.nspname !~ '^pg_toast_temp_' AND 
        n.nspname NOT IN ('pg_catalog', 'information_schema');

			regards, tom lane

Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Tomas Vondra <tomas.vondra@2ndquadrant.com> — 2019-10-10T20:40:22Z

On Thu, Oct 10, 2019 at 04:14:20PM -0400, Tom Lane wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> On Thu, Oct 10, 2019 at 10:19:12AM -0400, Tom Lane wrote:
>>> To identify such domains, I think we'd need something like
>>> WHERE attypid IN (recursive-WITH-query), which makes me nervous.
>>> We did support those starting with 8.4, which is as far back as
>>> pg_upgrade will go, so in theory it should work.  But I think we
>>> had bugs with such cases in old releases.  Do we want to assume
>>> that the source server has been updated enough to avoid any such
>>> bugs?  The expense of such a query might be daunting, too.
>
>> For the query cost, I think we can assume the domain hierarchies are not
>> particularly deep (in practice I'd expect just domains directly on the
>> sql_identifier type). And I doubt people are using that very widely,
>> it's probably more like this report - ad-hoc CTAS, so just a couple of
>> items. So I wouldn't expect it to be a huge deal in most cases. But even
>> if it takes a second or two, it's a one-time cost.
>
>What I was worried about was the planner possibly trying to apply the
>atttypid restriction as a scan qual using a subplan, which might be rather
>awful.  But it doesn't look like that happens. 

OK.

> I get a hash semijoin to
>the CTE output, in all branches back to 8.4, on this trial query:
>
>explain
>with recursive sqlidoids(toid) as (
>select 'information_schema.sql_identifier'::pg_catalog.regtype as toid
>union
>select oid from pg_catalog.pg_type, sqlidoids
>  where typtype = 'd' and typbasetype = sqlidoids.toid
>)
>SELECT n.nspname, c.relname, a.attname
>FROM    pg_catalog.pg_class c,
>        pg_catalog.pg_namespace n,
>        pg_catalog.pg_attribute a
>WHERE   c.oid = a.attrelid AND
>        NOT a.attisdropped AND
>        a.atttypid in (select toid from sqlidoids) AND
>        c.relkind IN ('r','v','i') and
>        c.relnamespace = n.oid AND
>        n.nspname !~ '^pg_temp_' AND
>        n.nspname !~ '^pg_toast_temp_' AND
>        n.nspname NOT IN ('pg_catalog', 'information_schema');
>

I think that's not quite sufficient - the problem is that we can have
domains and composite types on sql_identifier, in some arbitrary order.
And the recursive CTE won't handle that the way it's written - it will
miss domains on composite types containing sql_identifier. And we have
quite a few of them in the information schema, so maybe someone created
a domain on one of those (however unlikely it may seem).

I think this recursive CTE does it correctly:

WITH RECURSIVE oids AS (
    -- type itself
    SELECT 'information_schema.sql_identifier'::regtype AS oid
    UNION ALL
    SELECT * FROM (
        -- domains on the type
        WITH x AS (SELECT oid FROM oids)
        SELECT t.oid FROM pg_catalog.pg_type t, x WHERE typbasetype = x.oid AND typtype = 'd'
        UNION
        -- composite types containing the type
        SELECT t.oid FROM pg_catalog.pg_type t, pg_catalog.pg_class c, pg_catalog.pg_attribute a, x
        WHERE t.typtype = 'c' AND
              t.oid = c.reltype AND
              c.oid = a.attrelid AND
              a.atttypid = x.oid
    ) foo
) 

I had to use CTE within CTE, because the 'oids' can be referenced only
once, but we have two subqueries there. Maybe there's a better solution.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Tomas Vondra <tomas.vondra@2ndquadrant.com> — 2019-10-13T00:10:32Z

OK,

here is an updated patch, with the recursive CTE. I've done a fair
amount of testing on it on older versions (up to 9.4), and it seems to
work just fine.

Another thing that I noticed is that the query does not need to look at
RELKIND_COMPOSITE_TYPE, because we only really care about cases when
sql_identifier is stored on-disk. Composite type alone does not do that,
and the CTE includes OIDs of composite types that we then check against
relations and matviews.

Barring objections, I'll push this early next week.


BTW the query (including the RELKIND_COMPSITE_TYPE) was essentially just
a lightly-massaged copy of old_9_6_check_for_unknown_data_type_usage, so
that seems wrong too. The comment explicitly says:

 * Also check composite types, in case they are used for table columns.

but even a simple "create type c as (a unknown, b int)" without any
table using it enough to trigger the failure. But maybe it's
intentional, not sure.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Tom Lane <tgl@sss.pgh.pa.us> — 2019-10-13T18:26:48Z

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> here is an updated patch, with the recursive CTE. I've done a fair
> amount of testing on it on older versions (up to 9.4), and it seems to
> work just fine.

Might be a good idea to exclude attisdropped columns in the part of the
recursive query that's looking for sql_identifier columns of composite
types.  I'm not sure if composites can have dropped columns today,
but even if they can't it seems like a wise bit of future-proofing.
(We'll no doubt have occasion to use this logic again...)

Looks good other than that nit.

> BTW the query (including the RELKIND_COMPSITE_TYPE) was essentially just
> a lightly-massaged copy of old_9_6_check_for_unknown_data_type_usage, so
> that seems wrong too.

Yeah, we should back-port this logic into that check too, IMO.

			regards, tom lane

Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Tomas Vondra <tomas.vondra@2ndquadrant.com> — 2019-10-13T20:38:48Z

On Sun, Oct 13, 2019 at 02:26:48PM -0400, Tom Lane wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> here is an updated patch, with the recursive CTE. I've done a fair
>> amount of testing on it on older versions (up to 9.4), and it seems to
>> work just fine.
>
>Might be a good idea to exclude attisdropped columns in the part of the
>recursive query that's looking for sql_identifier columns of composite
>types.  I'm not sure if composites can have dropped columns today,
>but even if they can't it seems like a wise bit of future-proofing.
>(We'll no doubt have occasion to use this logic again...)
>

Hmm? How could that be safe? Let's say we have a composite type with a
sql_identifier column, it's used in a table with data, and we drop the
column. We need the pg_type information to parse the existing, so how
could we skip attisdropped columns?

>Looks good other than that nit.
>
>> BTW the query (including the RELKIND_COMPSITE_TYPE) was essentially just
>> a lightly-massaged copy of old_9_6_check_for_unknown_data_type_usage, so
>> that seems wrong too.
>
>Yeah, we should back-port this logic into that check too, IMO.
>

You mean the recursive CTE, removal of RELKIND_COMPOSITE_TYPE or the
proposed change w.r.t. dropped columns?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Tom Lane <tgl@sss.pgh.pa.us> — 2019-10-14T14:16:40Z

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On Sun, Oct 13, 2019 at 02:26:48PM -0400, Tom Lane wrote:
>> Might be a good idea to exclude attisdropped columns in the part of the
>> recursive query that's looking for sql_identifier columns of composite
>> types.  I'm not sure if composites can have dropped columns today,

[ I checked this, they can ]

>> but even if they can't it seems like a wise bit of future-proofing.
>> (We'll no doubt have occasion to use this logic again...)

> Hmm? How could that be safe? Let's say we have a composite type with a
> sql_identifier column, it's used in a table with data, and we drop the
> column. We need the pg_type information to parse the existing, so how
> could we skip attisdropped columns?

It works exactly like it does for table rowtypes.

regression=# create type cfoo as (f1 int, f2 int, f3 int);
CREATE TYPE
regression=# alter type cfoo drop attribute f2;
ALTER TYPE
regression=# select attname,atttypid,attisdropped,attlen,attalign from pg_attribute where attrelid = 'cfoo'::regclass;
           attname            | atttypid | attisdropped | attlen | attalign 
------------------------------+----------+--------------+--------+----------
 f1                           |       23 | f            |      4 | i
 ........pg.dropped.2........ |        0 | t            |      4 | i
 f3                           |       23 | f            |      4 | i
(3 rows)

All we need to skip over the dead data is attlen/attalign, which are
preserved in pg_attribute even if the pg_type row is gone.

As this example shows, you don't really *have* to check attisdropped
because atttypid will be set to zero.  But the latter is just a
defense measure in case somebody forgets to check attisdropped;
you're not supposed to forget that.

			regards, tom lane

Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Tomas Vondra <tomas.vondra@2ndquadrant.com> — 2019-10-14T16:35:38Z

On Mon, Oct 14, 2019 at 10:16:40AM -0400, Tom Lane wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> On Sun, Oct 13, 2019 at 02:26:48PM -0400, Tom Lane wrote:
>>> Might be a good idea to exclude attisdropped columns in the part of the
>>> recursive query that's looking for sql_identifier columns of composite
>>> types.  I'm not sure if composites can have dropped columns today,
>
>[ I checked this, they can ]
>
>>> but even if they can't it seems like a wise bit of future-proofing.
>>> (We'll no doubt have occasion to use this logic again...)
>
>> Hmm? How could that be safe? Let's say we have a composite type with a
>> sql_identifier column, it's used in a table with data, and we drop the
>> column. We need the pg_type information to parse the existing, so how
>> could we skip attisdropped columns?
>
>It works exactly like it does for table rowtypes.
>
>regression=# create type cfoo as (f1 int, f2 int, f3 int);
>CREATE TYPE
>regression=# alter type cfoo drop attribute f2;
>ALTER TYPE
>regression=# select attname,atttypid,attisdropped,attlen,attalign from pg_attribute where attrelid = 'cfoo'::regclass;
>           attname            | atttypid | attisdropped | attlen | attalign
>------------------------------+----------+--------------+--------+----------
> f1                           |       23 | f            |      4 | i
> ........pg.dropped.2........ |        0 | t            |      4 | i
> f3                           |       23 | f            |      4 | i
>(3 rows)
>
>All we need to skip over the dead data is attlen/attalign, which are
>preserved in pg_attribute even if the pg_type row is gone.
>
>As this example shows, you don't really *have* to check attisdropped
>because atttypid will be set to zero.  But the latter is just a
>defense measure in case somebody forgets to check attisdropped;
>you're not supposed to forget that.
>

Aha! I forgot we copy the necessary stuff into pg_attribute. Thanks for
clarifying, I'll polish and push the fix shortly.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Tomas Vondra <tomas.vondra@2ndquadrant.com> — 2019-10-15T00:18:17Z

On Mon, Oct 14, 2019 at 06:35:38PM +0200, Tomas Vondra wrote:
> ...
>
>Aha! I forgot we copy the necessary stuff into pg_attribute. Thanks for
>clarifying, I'll polish and push the fix shortly.
>

I've pushed and backpatched the fix. Attached are similar fixes for the
existing pg_upgrade checks for pg_catalog.line and pg_catalog.unknown
types, which have the same issues with composite types and domains.

There are some additional details & examples in the commit messages.

I've kept this in two patches primarily because of backpatching - the
line fix should go back up to 9.4, the unknown is for 10.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Justin Pryzby <pryzby@telsasoft.com> — 2019-10-15T04:41:18Z

On Tue, Oct 15, 2019 at 02:18:17AM +0200, Tomas Vondra wrote:
> On Mon, Oct 14, 2019 at 06:35:38PM +0200, Tomas Vondra wrote:
> >...
> >
> >Aha! I forgot we copy the necessary stuff into pg_attribute. Thanks for
> >clarifying, I'll polish and push the fix shortly.

Perhaps it'd be worth creating a test for on-disk format ?

Like a table with a column for each core type, which is either SELECTed from
after pg_upgrade, or pg_dump output compared before and after.

Justin

Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Tomas Vondra <tomas.vondra@2ndquadrant.com> — 2019-10-15T07:07:25Z

On Mon, Oct 14, 2019 at 11:41:18PM -0500, Justin Pryzby wrote:
>On Tue, Oct 15, 2019 at 02:18:17AM +0200, Tomas Vondra wrote:
>> On Mon, Oct 14, 2019 at 06:35:38PM +0200, Tomas Vondra wrote:
>> >...
>> >
>> >Aha! I forgot we copy the necessary stuff into pg_attribute. Thanks for
>> >clarifying, I'll polish and push the fix shortly.
>
>Perhaps it'd be worth creating a test for on-disk format ?
>
>Like a table with a column for each core type, which is either SELECTed from
>after pg_upgrade, or pg_dump output compared before and after.
>

IMO that would be useful - we now have a couple of these checks for
different data types (line, unknown, sql_identifier), with a couple of
combinations each. And I've been looking if we do similar pg_upgrade
tests, but I haven't found anything. I mean, we do pg_upgrade the
cluster used for regression tests, but here we need to test a number of
cases that are meant to abort the pg_upgrade. So we'd need a number of
pg_upgrade runs, to test that.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Tomas Vondra <tomas.vondra@2ndquadrant.com> — 2019-10-16T11:33:44Z

On Tue, Oct 15, 2019 at 02:18:17AM +0200, Tomas Vondra wrote:
>On Mon, Oct 14, 2019 at 06:35:38PM +0200, Tomas Vondra wrote:
>>...
>>
>>Aha! I forgot we copy the necessary stuff into pg_attribute. Thanks for
>>clarifying, I'll polish and push the fix shortly.
>>
>
>I've pushed and backpatched the fix. Attached are similar fixes for the
>existing pg_upgrade checks for pg_catalog.line and pg_catalog.unknown
>types, which have the same issues with composite types and domains.
>
>There are some additional details & examples in the commit messages.
>
>I've kept this in two patches primarily because of backpatching - the
>line fix should go back up to 9.4, the unknown is for 10.
>

I've just committed and pushed both fixes after some minor corrections.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Tom Lane <tgl@sss.pgh.pa.us> — 2019-10-16T12:27:57Z

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> I've just committed and pushed both fixes after some minor corrections.

Not quite right in 9.6 and before, according to crake.  Looks like
some issue with the CppAsString2'd constants?  Did we even have
CppAsString2 that far back?

			regards, tom lane

Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Tom Lane <tgl@sss.pgh.pa.us> — 2019-10-16T13:26:42Z

I wrote:
> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> I've just committed and pushed both fixes after some minor corrections.

> Not quite right in 9.6 and before, according to crake.  Looks like
> some issue with the CppAsString2'd constants?  Did we even have
> CppAsString2 that far back?

Yeah, we did.  On closer inspection I suspect that we need to #include
some other file to get the RELKIND_ constants in the old branches.

			regards, tom lane

Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Tomas Vondra <tomas.vondra@2ndquadrant.com> — 2019-10-16T13:41:17Z

On Wed, Oct 16, 2019 at 03:26:42PM +0200, Tom Lane wrote:
>I wrote:
>> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>>> I've just committed and pushed both fixes after some minor corrections.
>
>> Not quite right in 9.6 and before, according to crake.  Looks like
>> some issue with the CppAsString2'd constants?  Did we even have
>> CppAsString2 that far back?
>
>Yeah, we did.  On closer inspection I suspect that we need to #include
>some other file to get the RELKIND_ constants in the old branches.
>

Oh! Looking.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Tomas Vondra <tomas.vondra@2ndquadrant.com> — 2019-10-16T14:33:43Z

On Wed, Oct 16, 2019 at 03:26:42PM +0200, Tom Lane wrote:
>I wrote:
>> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>>> I've just committed and pushed both fixes after some minor corrections.
>
>> Not quite right in 9.6 and before, according to crake.  Looks like
>> some issue with the CppAsString2'd constants?  Did we even have
>> CppAsString2 that far back?
>
>Yeah, we did.  On closer inspection I suspect that we need to #include
>some other file to get the RELKIND_ constants in the old branches.
>

Yeah, the pg_class.h catalog was missing on pre-10 relases. It compiled
just fine, so I haven't noticed that during the backpatching :-(

Fixed, let's see if the buildfarm is happy with that.

regads

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

Bruce Momjian <bruce@momjian.us> — 2019-10-23T22:08:21Z

On Tue, Oct 15, 2019 at 02:18:17AM +0200, Tomas Vondra wrote:
> On Mon, Oct 14, 2019 at 06:35:38PM +0200, Tomas Vondra wrote:
> > ...
> > 
> > Aha! I forgot we copy the necessary stuff into pg_attribute. Thanks for
> > clarifying, I'll polish and push the fix shortly.
> > 
> 
> I've pushed and backpatched the fix. Attached are similar fixes for the
> existing pg_upgrade checks for pg_catalog.line and pg_catalog.unknown
> types, which have the same issues with composite types and domains.

This comit added old_11_check_for_sql_identifier_data_type_usage(), but
it did not use the clearer database error list format added to the
master branch in commit 1634d36157.  Attached is a patch to fix this,
which I have committed.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +

pg_upgrade test for binary compatibility of core data types

Justin Pryzby <pryzby@telsasoft.com> — 2020-12-06T18:02:48Z

I'm finally returning to this 14 month old thread:
(was: Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12)

On Tue, Oct 15, 2019 at 09:07:25AM +0200, Tomas Vondra wrote:
> On Mon, Oct 14, 2019 at 11:41:18PM -0500, Justin Pryzby wrote:
> > 
> > Perhaps it'd be worth creating a test for on-disk format ?
> > 
> > Like a table with a column for each core type, which is either SELECTed from
> > after pg_upgrade, or pg_dump output compared before and after.
> 
> IMO that would be useful - we now have a couple of these checks for
> different data types (line, unknown, sql_identifier), with a couple of
> combinations each. And I've been looking if we do similar pg_upgrade
> tests, but I haven't found anything. I mean, we do pg_upgrade the
> cluster used for regression tests, but here we need to test a number of
> cases that are meant to abort the pg_upgrade. So we'd need a number of
> pg_upgrade runs, to test that.

I meant to notice if the binary format is accidentally changed again, which was
what happened here:
7c15cef86 Base information_schema.sql_identifier domain on name, not varchar.

I added a table to the regression tests so it's processed by pg_upgrade tests,
run like:

| time make -C src/bin/pg_upgrade check oldsrc=`pwd`/11 oldbindir=`pwd`/11/tmp_install/usr/local/pgsql/bin

I checked that if I cherry-pick 0002 to v11, and comment out
old_11_check_for_sql_identifier_data_type_usage(), then pg_upgrade/test.sh
detects the original problem:
pg_dump: error: Error message from server: ERROR:  invalid memory alloc request size 18446744073709551613

I understand the buildfarm has its own cross-version-upgrade test, which I
think would catch this on its own.

These all seem to complicate use of pg_upgrade/test.sh, so 0001 is needed to
allow testing upgrade from older releases.

e78900afd217fa3eaa77c51e23a94c1466af421c Create by default sql/ and expected/ for output directory in pg_regress
40b132c1afbb4b1494aa8e48cc35ec98d2b90777 In the pg_upgrade test suite, don't write to src/test/regress.
fc49e24fa69a15efacd5b8958115ed9c43c48f9a Make WAL segment size configurable at initdb time.
c37b3d08ca6873f9d4eaf24c72a90a550970cbb8 Allow group access on PGDATA
da9b580d89903fee871cf54845ffa2b26bda2e11 Refactor dir/file permissions

-- 
Justin

Re: pg_upgrade test for binary compatibility of core data types

Justin Pryzby <pryzby@telsasoft.com> — 2020-12-16T17:22:23Z

On Sun, Dec 06, 2020 at 12:02:48PM -0600, Justin Pryzby wrote:
> I meant to notice if the binary format is accidentally changed again, which was
> what happened here:
> 7c15cef86 Base information_schema.sql_identifier domain on name, not varchar.
> 
> I added a table to the regression tests so it's processed by pg_upgrade tests,
> run like:
> 
> | time make -C src/bin/pg_upgrade check oldsrc=`pwd`/11 oldbindir=`pwd`/11/tmp_install/usr/local/pgsql/bin

Per cfbot, this avoids testing ::xml (support for which may not be enabled)
And also now tests oid types.

I think the per-version hacks should be grouped by logical change, rather than
by version.  Which I've started doing here.

-- 
Justin

Re: pg_upgrade test for binary compatibility of core data types

Justin Pryzby <pryzby@telsasoft.com> — 2020-12-27T19:07:29Z

On Wed, Dec 16, 2020 at 11:22:23AM -0600, Justin Pryzby wrote:
> On Sun, Dec 06, 2020 at 12:02:48PM -0600, Justin Pryzby wrote:
> > I meant to notice if the binary format is accidentally changed again, which was
> > what happened here:
> > 7c15cef86 Base information_schema.sql_identifier domain on name, not varchar.
> > 
> > I added a table to the regression tests so it's processed by pg_upgrade tests,
> > run like:
> > 
> > | time make -C src/bin/pg_upgrade check oldsrc=`pwd`/11 oldbindir=`pwd`/11/tmp_install/usr/local/pgsql/bin
> 
> Per cfbot, this avoids testing ::xml (support for which may not be enabled)
> And also now tests oid types.
> 
> I think the per-version hacks should be grouped by logical change, rather than
> by version.  Which I've started doing here.

rebased on 6df7a9698bb036610c1e8c6d375e1be38cb26d5f

-- 
Justin

Re: pg_upgrade test for binary compatibility of core data types

Peter Eisentraut <peter.eisentraut@enterprisedb.com> — 2021-01-11T14:28:08Z

On 2020-12-27 20:07, Justin Pryzby wrote:
> On Wed, Dec 16, 2020 at 11:22:23AM -0600, Justin Pryzby wrote:
>> On Sun, Dec 06, 2020 at 12:02:48PM -0600, Justin Pryzby wrote:
>>> I meant to notice if the binary format is accidentally changed again, which was
>>> what happened here:
>>> 7c15cef86 Base information_schema.sql_identifier domain on name, not varchar.
>>>
>>> I added a table to the regression tests so it's processed by pg_upgrade tests,
>>> run like:
>>>
>>> | time make -C src/bin/pg_upgrade check oldsrc=`pwd`/11 oldbindir=`pwd`/11/tmp_install/usr/local/pgsql/bin
>>
>> Per cfbot, this avoids testing ::xml (support for which may not be enabled)
>> And also now tests oid types.
>>
>> I think the per-version hacks should be grouped by logical change, rather than
>> by version.  Which I've started doing here.
> 
> rebased on 6df7a9698bb036610c1e8c6d375e1be38cb26d5f

I think these patches could use some in-place documentation of what they 
are trying to achieve and how they do it.  The required information is 
spread over a lengthy thread.  No one wants to read that.  Add commit 
messages to the patches.

Re: pg_upgrade test for binary compatibility of core data types

Bruce Momjian <bruce@momjian.us> — 2021-01-11T15:21:36Z

On Mon, Jan 11, 2021 at 03:28:08PM +0100, Peter Eisentraut wrote:
> On 2020-12-27 20:07, Justin Pryzby wrote:
> > rebased on 6df7a9698bb036610c1e8c6d375e1be38cb26d5f
> 
> I think these patches could use some in-place documentation of what they are
> trying to achieve and how they do it.  The required information is spread
> over a lengthy thread.  No one wants to read that.  Add commit messages to
> the patches.

Oh, I see that now, and agree that you need to explain each item with a
comment.  pg_upgrade is doing some odd things, so documenting everything
it does is a big win.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee

Re: pg_upgrade test for binary compatibility of core data types

Justin Pryzby <pryzby@telsasoft.com> — 2021-01-12T04:13:52Z

On Mon, Jan 11, 2021 at 03:28:08PM +0100, Peter Eisentraut wrote:
> On 2020-12-27 20:07, Justin Pryzby wrote:
> > On Wed, Dec 16, 2020 at 11:22:23AM -0600, Justin Pryzby wrote:
> > > On Sun, Dec 06, 2020 at 12:02:48PM -0600, Justin Pryzby wrote:
> > > > I meant to notice if the binary format is accidentally changed again, which was
> > > > what happened here:
> > > > 7c15cef86 Base information_schema.sql_identifier domain on name, not varchar.
> > > > 
> > > > I added a table to the regression tests so it's processed by pg_upgrade tests,
> > > > run like:
> > > > 
> > > > | time make -C src/bin/pg_upgrade check oldsrc=`pwd`/11 oldbindir=`pwd`/11/tmp_install/usr/local/pgsql/bin
> > > 
> > > Per cfbot, this avoids testing ::xml (support for which may not be enabled)
> > > And also now tests oid types.
> > > 
> > > I think the per-version hacks should be grouped by logical change, rather than
> > > by version.  Which I've started doing here.
> > 
> > rebased on 6df7a9698bb036610c1e8c6d375e1be38cb26d5f
> 
> I think these patches could use some in-place documentation of what they are
> trying to achieve and how they do it.  The required information is spread
> over a lengthy thread.  No one wants to read that.  Add commit messages to
> the patches.

0001 patch fixes pg_upgrade/test.sh, which was disfunctional.
Portions of the first patch were independently handled by commits 52202bb39,
fa744697c, 091866724.  So this is rebased on those.
I guess updating this script should be a part of a beta-checklist somewhere,
since I guess nobody will want to backpatch changes for testing older releases.

0002 allows detecting the information_schema problem that was introduced at:
7c15cef86 Base information_schema.sql_identifier domain on name, not varchar.

+-- Create a table with different data types, to exercise binary compatibility
+-- during pg_upgrade test

If binary compatibility is changed I expect this will error, crash, at least
return wrong data, and thereby fail tests.

-- 
Justin

On Sun, Dec 06, 2020 at 12:02:48PM -0600, Justin Pryzby wrote:
> I checked that if I cherry-pick 0002 to v11, and comment out
> old_11_check_for_sql_identifier_data_type_usage(), then pg_upgrade/test.sh
> detects the original problem:
> pg_dump: error: Error message from server: ERROR:  invalid memory alloc request size 18446744073709551613
> 
> I understand the buildfarm has its own cross-version-upgrade test, which I
> think would catch this on its own.
> 
> These all seem to complicate use of pg_upgrade/test.sh, so 0001 is needed to
> allow testing upgrade from older releases.
> 
> e78900afd217fa3eaa77c51e23a94c1466af421c Create by default sql/ and expected/ for output directory in pg_regress
> 40b132c1afbb4b1494aa8e48cc35ec98d2b90777 In the pg_upgrade test suite, don't write to src/test/regress.
> fc49e24fa69a15efacd5b8958115ed9c43c48f9a Make WAL segment size configurable at initdb time.
> c37b3d08ca6873f9d4eaf24c72a90a550970cbb8 Allow group access on PGDATA
> da9b580d89903fee871cf54845ffa2b26bda2e11 Refactor dir/file permissions

Re: pg_upgrade test for binary compatibility of core data types

Bruce Momjian <bruce@momjian.us> — 2021-01-12T17:15:59Z

On Mon, Jan 11, 2021 at 10:13:52PM -0600, Justin Pryzby wrote:
> On Mon, Jan 11, 2021 at 03:28:08PM +0100, Peter Eisentraut wrote:
> > I think these patches could use some in-place documentation of what they are
> > trying to achieve and how they do it.  The required information is spread
> > over a lengthy thread.  No one wants to read that.  Add commit messages to
> > the patches.
> 
> 0001 patch fixes pg_upgrade/test.sh, which was disfunctional.
> Portions of the first patch were independently handled by commits 52202bb39,
> fa744697c, 091866724.  So this is rebased on those.
> I guess updating this script should be a part of a beta-checklist somewhere,
> since I guess nobody will want to backpatch changes for testing older releases.

Uh, what exactly is missing from the beta checklist?  I read the patch
and commit message but don't understand it.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee

Re: pg_upgrade test for binary compatibility of core data types

Justin Pryzby <pryzby@telsasoft.com> — 2021-01-12T17:27:53Z

On Tue, Jan 12, 2021 at 12:15:59PM -0500, Bruce Momjian wrote:
> On Mon, Jan 11, 2021 at 10:13:52PM -0600, Justin Pryzby wrote:
> > On Mon, Jan 11, 2021 at 03:28:08PM +0100, Peter Eisentraut wrote:
> > > I think these patches could use some in-place documentation of what they are
> > > trying to achieve and how they do it.  The required information is spread
> > > over a lengthy thread.  No one wants to read that.  Add commit messages to
> > > the patches.
> > 
> > 0001 patch fixes pg_upgrade/test.sh, which was disfunctional.
> > Portions of the first patch were independently handled by commits 52202bb39,
> > fa744697c, 091866724.  So this is rebased on those.
> > I guess updating this script should be a part of a beta-checklist somewhere,
> > since I guess nobody will want to backpatch changes for testing older releases.
> 
> Uh, what exactly is missing from the beta checklist?  I read the patch
> and commit message but don't understand it.

Did you try to use test.sh to upgrade from a prior release ?

Evidently it's frequently forgotten, as evidenced by all the "deferred
maintenance" I had to do to allow testing the main patch (currently 0003).

See also:

commit 5bab1985dfc25eecf4b098145789955c0b246160
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Thu Jun 8 13:48:27 2017 -0400

    Fix bit-rot in pg_upgrade's test.sh, and improve documentation.
    
    Doing a cross-version upgrade test with test.sh evidently hasn't been
    tested since circa 9.2, because the script lacked case branches for
    old-version servers newer than 9.1.  Future-proof that a bit, and
    clean up breakage induced by our recent drop of V0 function call
    protocol (namely that oldstyle_length() isn't in the regression
    suite anymore).

-- 
Justin

Re: pg_upgrade test for binary compatibility of core data types

Bruce Momjian <bruce@momjian.us> — 2021-01-12T17:53:56Z

On Tue, Jan 12, 2021 at 11:27:53AM -0600, Justin Pryzby wrote:
> On Tue, Jan 12, 2021 at 12:15:59PM -0500, Bruce Momjian wrote:
> > Uh, what exactly is missing from the beta checklist?  I read the patch
> > and commit message but don't understand it.
> 
> Did you try to use test.sh to upgrade from a prior release ?
> 
> Evidently it's frequently forgotten, as evidenced by all the "deferred
> maintenance" I had to do to allow testing the main patch (currently 0003).
> 
> See also:
> 
> commit 5bab1985dfc25eecf4b098145789955c0b246160
> Author: Tom Lane <tgl@sss.pgh.pa.us>
> Date:   Thu Jun 8 13:48:27 2017 -0400
> 
>     Fix bit-rot in pg_upgrade's test.sh, and improve documentation.
>     
>     Doing a cross-version upgrade test with test.sh evidently hasn't been
>     tested since circa 9.2, because the script lacked case branches for
>     old-version servers newer than 9.1.  Future-proof that a bit, and
>     clean up breakage induced by our recent drop of V0 function call
>     protocol (namely that oldstyle_length() isn't in the regression
>     suite anymore).

Oh, that is odd.  I thought that was regularly run.  I have my own test
infrastructure that I run for every major release so I never have run
the built-in one, except for make check-world.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee

Re: pg_upgrade test for binary compatibility of core data types

Andrew Dunstan <andrew@dunslane.net> — 2021-01-12T21:44:28Z

On 1/12/21 12:53 PM, Bruce Momjian wrote:
> On Tue, Jan 12, 2021 at 11:27:53AM -0600, Justin Pryzby wrote:
>> On Tue, Jan 12, 2021 at 12:15:59PM -0500, Bruce Momjian wrote:
>>> Uh, what exactly is missing from the beta checklist?  I read the patch
>>> and commit message but don't understand it.
>> Did you try to use test.sh to upgrade from a prior release ?
>>
>> Evidently it's frequently forgotten, as evidenced by all the "deferred
>> maintenance" I had to do to allow testing the main patch (currently 0003).
>>
>> See also:
>>
>> commit 5bab1985dfc25eecf4b098145789955c0b246160
>> Author: Tom Lane <tgl@sss.pgh.pa.us>
>> Date:   Thu Jun 8 13:48:27 2017 -0400
>>
>>     Fix bit-rot in pg_upgrade's test.sh, and improve documentation.
>>     
>>     Doing a cross-version upgrade test with test.sh evidently hasn't been
>>     tested since circa 9.2, because the script lacked case branches for
>>     old-version servers newer than 9.1.  Future-proof that a bit, and
>>     clean up breakage induced by our recent drop of V0 function call
>>     protocol (namely that oldstyle_length() isn't in the regression
>>     suite anymore).
> Oh, that is odd.  I thought that was regularly run.  I have my own test
> infrastructure that I run for every major release so I never have run
> the built-in one, except for make check-world.
>

Cross version pg_upgrade is tested regularly in the buildfarm, but not
using test.sh. Instead it uses the saved data repository from a previous
run of the buildfarm client for the source branch, and tries to upgrade
that to the target branch.


cheers


andrew



--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: pg_upgrade test for binary compatibility of core data types

Peter Eisentraut <peter.eisentraut@enterprisedb.com> — 2021-01-15T08:00:31Z

On 2021-01-12 22:44, Andrew Dunstan wrote:
> Cross version pg_upgrade is tested regularly in the buildfarm, but not
> using test.sh. Instead it uses the saved data repository from a previous
> run of the buildfarm client for the source branch, and tries to upgrade
> that to the target branch.

Does it maintain a set of fixups similar to what is in test.sh?  Are 
those two sets the same?

Re: pg_upgrade test for binary compatibility of core data types

Tom Lane <tgl@sss.pgh.pa.us> — 2021-03-06T20:01:43Z

Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
> On 2021-01-12 22:44, Andrew Dunstan wrote:
>> Cross version pg_upgrade is tested regularly in the buildfarm, but not
>> using test.sh. Instead it uses the saved data repository from a previous
>> run of the buildfarm client for the source branch, and tries to upgrade
>> that to the target branch.

> Does it maintain a set of fixups similar to what is in test.sh?  Are 
> those two sets the same?

Responding to Peter: the first answer is yes, the second is I didn't
check, but certainly Justin's patch makes them closer.

I spent some time poking through this set of patches.  I agree that
there's problem(s) here that we need to solve, but it feels like this
isn't a great way to solve them.  What I see in the patchset is:

v4-0001 mostly teaches test.sh about specific changes that have to be
made to historic versions of the regression database to allow them
to be reloaded into current servers.  As already discussed, this is
really duplicative of knowledge that's been embedded into the buildfarm
client over time.  It'd be better if we could refactor that so that
the buildfarm shares a common database of these actions with test.sh.
And said database ought to be in our git tree, so committers could
fix problems without having to get Andrew involved every time.
I think this could be represented as a psql script, at least in
versions that have psql \if (but that came in in v10, so maybe
we're there already).

(Taking a step back, maybe the regression database isn't an ideal
testbed for this in the first place.  But it does have the advantage of
not being a narrow-minded test that is going to miss things we haven't
explicitly thought of.)

v4-0002 is a bunch of random changes that mostly seem to revert hacky
adjustments previously made to improve test coverage.  I don't really
agree with any of these, nor see why they're necessary.  If they
are necessary then we need to restore the coverage somewhere else.
Admittedly, the previous changes were a bit hacky, but deleting them
(without even bothering to adjust the relevant comments) isn't the
answer.

v4-0003 is really the heart of the matter: it adds a table with some
previously-not-covered datatypes plus a query that purports to make sure
that we are covering all types of interest.  But I'm not sure I believe
that query.  It's got hard-wired assumptions about which typtype values
need to be covered.  Why is it okay to exclude range and multirange?
Are we sure that all composites are okay to exclude?  Likewise, the
restriction to pg_catalog and information_schema schemas seems likely to
bite us someday.  There are some very random exclusions based on name
patterns, which seem unsafe (let's list the specific type OIDs), and
again the nearby comments don't match the code.  But the biggest issue
is that this can only cover core datatypes, not any contrib stuff.

I don't know what we could do about contrib types.  Maybe we should
figure that covering core types is already a step forward, and be
happy with getting that done.

			regards, tom lane

Re: pg_upgrade test for binary compatibility of core data types

Justin Pryzby <pryzby@telsasoft.com> — 2021-04-30T18:33:48Z

On Sat, Mar 06, 2021 at 03:01:43PM -0500, Tom Lane wrote:
> Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
> > On 2021-01-12 22:44, Andrew Dunstan wrote:
> >> Cross version pg_upgrade is tested regularly in the buildfarm, but not
> >> using test.sh. Instead it uses the saved data repository from a previous
> >> run of the buildfarm client for the source branch, and tries to upgrade
> >> that to the target branch.
> 
> > Does it maintain a set of fixups similar to what is in test.sh?  Are 
> > those two sets the same?
> 
> Responding to Peter: the first answer is yes, the second is I didn't
> check, but certainly Justin's patch makes them closer.

Right - I had meant to send this.

https://github.com/PGBuildFarm/client-code/blob/master/PGBuild/Modules/TestUpgradeXversion.pm

                $opsql = 'drop operator if exists public.=> (bigint, NONE)';
..
                my $missing_funcs = q{drop function if exists public.boxarea(box);
                              drop function if exists public.funny_dup17();
..
                my $prstmt = join(';',
                        'drop operator if exists #@# (bigint,NONE)',
                        'drop operator if exists #%# (bigint,NONE)',
                        'drop operator if exists !=- (bigint,NONE)',
..
                        $prstmt = join(';',
                                'drop operator @#@ (NONE, bigint)',
..
                                'drop aggregate if exists public.array_cat_accum(anyarray)',

> I spent some time poking through this set of patches.  I agree that
> there's problem(s) here that we need to solve, but it feels like this
> isn't a great way to solve them.  What I see in the patchset is:

For starters, is there a "release beta checklist" ?
Testing test.sh should be on it.
So should fuzz testing.

> v4-0001 mostly teaches test.sh about specific changes that have to be
> made to historic versions of the regression database to allow them
> to be reloaded into current servers.  As already discussed, this is
> really duplicative of knowledge that's been embedded into the buildfarm
> client over time.  It'd be better if we could refactor that so that
> the buildfarm shares a common database of these actions with test.sh.
> And said database ought to be in our git tree, so committers could
> fix problems without having to get Andrew involved every time.
> I think this could be represented as a psql script, at least in
> versions that have psql \if (but that came in in v10, so maybe
> we're there already).

I started this.  I don't know if it's compatible with the buildfarm client, but
I think any issues maybe can be avoided by using "IF EXISTS".

> v4-0002 is a bunch of random changes that mostly seem to revert hacky
> adjustments previously made to improve test coverage.  I don't really
> agree with any of these, nor see why they're necessary.  If they
> are necessary then we need to restore the coverage somewhere else.
> Admittedly, the previous changes were a bit hacky, but deleting them
> (without even bothering to adjust the relevant comments) isn't the
> answer.

It was necessary to avoid --wal-segsize and -g to allow testing upgrades from
versions which don't support those options.  I think test.sh should be portable
back to all supported versions.

When those options were added, it broke test.sh upgrading from old versions.
I changed this to a shell conditional for the "new" features:
| "$1" -N -A trust ${oldsrc:+--wal-segsize 1 -g}
Ideally it would check the version.

> v4-0003 is really the heart of the matter: it adds a table with some
> previously-not-covered datatypes plus a query that purports to make sure
> that we are covering all types of interest.

Actually the 'manytypes' table intends to include *all* core datatypes itself,
not just those that aren't included somewhere else.  I think "included
somewhere else" depends on the order of the regression these, and type_sanity
runs early, so the table might need to include many types that are created
later, to avoid "false positives" in the associated test.

> But I'm not sure I believe
> that query.  It's got hard-wired assumptions about which typtype values
> need to be covered.  Why is it okay to exclude range and multirange?
> Are we sure that all composites are okay to exclude?  Likewise, the
> restriction to pg_catalog and information_schema schemas seems likely to
> bite us someday.  There are some very random exclusions based on name
> patterns, which seem unsafe (let's list the specific type OIDs), and
> again the nearby comments don't match the code.  But the biggest issue
> is that this can only cover core datatypes, not any contrib stuff.

I changed to use regtype/OIDs, included range/multirange and stopped including
only pg_catalog/information_schema.  But didn't yet handle composites.

> I don't know what we could do about contrib types.  Maybe we should
> figure that covering core types is already a step forward, and be
> happy with getting that done.

Right .. this is meant to at least handle the lowest hanging fruit.

-- 
Justin

Re: pg_upgrade test for binary compatibility of core data types

Jacob Champion <pchampion@vmware.com> — 2021-07-16T16:21:07Z

On Fri, 2021-04-30 at 13:33 -0500, Justin Pryzby wrote:
> On Sat, Mar 06, 2021 at 03:01:43PM -0500, Tom Lane wrote:
> > v4-0001 mostly teaches test.sh about specific changes that have to be
> > made to historic versions of the regression database to allow them
> > to be reloaded into current servers.  As already discussed, this is
> > really duplicative of knowledge that's been embedded into the buildfarm
> > client over time.  It'd be better if we could refactor that so that
> > the buildfarm shares a common database of these actions with test.sh.
> > And said database ought to be in our git tree, so committers could
> > fix problems without having to get Andrew involved every time.
> > I think this could be represented as a psql script, at least in
> > versions that have psql \if (but that came in in v10, so maybe
> > we're there already).
> 
> I started this.  I don't know if it's compatible with the buildfarm client, but
> I think any issues maybe can be avoided by using "IF EXISTS".

I'm going to try pulling this into a psql script today and see how far
I get.

> > But I'm not sure I believe
> > that query.  It's got hard-wired assumptions about which typtype values
> > need to be covered.  Why is it okay to exclude range and multirange?
> > Are we sure that all composites are okay to exclude?  Likewise, the
> > restriction to pg_catalog and information_schema schemas seems likely to
> > bite us someday.  There are some very random exclusions based on name
> > patterns, which seem unsafe (let's list the specific type OIDs), and
> > again the nearby comments don't match the code.  But the biggest issue
> > is that this can only cover core datatypes, not any contrib stuff.
> 
> I changed to use regtype/OIDs, included range/multirange and stopped including
> only pg_catalog/information_schema.  But didn't yet handle composites.

Per cfbot, this test needs to be taught about the new
pg_brin_bloom_summary and pg_brin_minmax_multi_summary types.

--Jacob

Re: pg_upgrade test for binary compatibility of core data types

Jacob Champion <pchampion@vmware.com> — 2021-07-16T16:40:34Z

On Fri, 2021-07-16 at 16:21 +0000, Jacob Champion wrote:
> On Fri, 2021-04-30 at 13:33 -0500, Justin Pryzby wrote:
> > On Sat, Mar 06, 2021 at 03:01:43PM -0500, Tom Lane wrote:
> > > v4-0001 mostly teaches test.sh about specific changes that have to be
> > > made to historic versions of the regression database to allow them
> > > to be reloaded into current servers.  As already discussed, this is
> > > really duplicative of knowledge that's been embedded into the buildfarm
> > > client over time.  It'd be better if we could refactor that so that
> > > the buildfarm shares a common database of these actions with test.sh.
> > > And said database ought to be in our git tree, so committers could
> > > fix problems without having to get Andrew involved every time.
> > > I think this could be represented as a psql script, at least in
> > > versions that have psql \if (but that came in in v10, so maybe
> > > we're there already).
> > 
> > I started this.  I don't know if it's compatible with the buildfarm client, but
> > I think any issues maybe can be avoided by using "IF EXISTS".
> 
> I'm going to try pulling this into a psql script today and see how far
> I get.

I completely misread this exchange -- you already did this in 0004.
Sorry for the noise.

--Jacob

Re: pg_upgrade test for binary compatibility of core data types

Jacob Champion <pchampion@vmware.com> — 2021-07-16T18:02:18Z

On Fri, 2021-04-30 at 13:33 -0500, Justin Pryzby wrote:
> On Sat, Mar 06, 2021 at 03:01:43PM -0500, Tom Lane wrote:
> > v4-0001 mostly teaches test.sh about specific changes that have to be
> > made to historic versions of the regression database to allow them
> > to be reloaded into current servers.  As already discussed, this is
> > really duplicative of knowledge that's been embedded into the buildfarm
> > client over time.  It'd be better if we could refactor that so that
> > the buildfarm shares a common database of these actions with test.sh.
> > And said database ought to be in our git tree, so committers could
> > fix problems without having to get Andrew involved every time.
> > I think this could be represented as a psql script, at least in
> > versions that have psql \if (but that came in in v10, so maybe
> > we're there already).
> 
> I started this.  I don't know if it's compatible with the buildfarm client, but
> I think any issues maybe can be avoided by using "IF EXISTS".

Here are the differences I see on a first pass (without putting too
much thought into how significant the differences are). Buildfarm code
I'm comparing against is at [1].

- Both versions drop @#@ and array_cat_accum, but the buildfarm
additionally replaces them with a new operator and aggregate,
respectively.

- The buildfarm's dropping of table OIDs is probably more resilient,
since it loops over pg_class looking for relhasoids.

- The buildfarm handles (or drops) several contrib databases in
addition to the core regression DB.

- The psql script drops the first_el_agg_any aggregate and a `TRANSFORM
FOR integer`; I don't see any corresponding code in the buildfarm.

- Some version ranges are different between the two. For example,
abstime_/reltime_/tinterval_tbl are dropped by the buildfarm if the old
version is < 9.3, while the psql script drops them for old versions <=
10.

- The buildfarm drops the public.=> operator for much older versions of
Postgres. I assume we don't need that here.

- The buildfarm adjusts pg_proc for the location of regress.so; I see
there's a commented placeholder for this at the end of the psql script
but it's not yet implemented.

As an aside, I think the "fromv10" naming scheme for the "old version
<= 10" condition is unintuitive. If the old version is e.g. 9.6, we're
not upgrading "from 10".

--Jacob

[1] https://github.com/PGBuildFarm/client-code/blob/main/PGBuild/Modules/TestUpgradeXversion.pm

Re: pg_upgrade test for binary compatibility of core data types

Tom Lane <tgl@sss.pgh.pa.us> — 2021-09-11T18:19:04Z

Jacob Champion <pchampion@vmware.com> writes:
> On Fri, 2021-04-30 at 13:33 -0500, Justin Pryzby wrote:
>> I started this.  I don't know if it's compatible with the buildfarm client, but
>> I think any issues maybe can be avoided by using "IF EXISTS".

> Here are the differences I see on a first pass (without putting too
> much thought into how significant the differences are). Buildfarm code
> I'm comparing against is at [1].

I switched the CF entry for this to "Waiting on Author".  It's
been failing in the cfbot for a couple of months, and Jacob's
provided some review-ish comments here, so I think there's
plenty of reason to deem the ball to be in Justin's court.

			regards, tom lane

Re: pg_upgrade test for binary compatibility of core data types

Justin Pryzby <pryzby@telsasoft.com> — 2021-09-12T00:51:16Z

On Fri, Jul 16, 2021 at 06:02:18PM +0000, Jacob Champion wrote:
> On Fri, 2021-04-30 at 13:33 -0500, Justin Pryzby wrote:
> > On Sat, Mar 06, 2021 at 03:01:43PM -0500, Tom Lane wrote:
> > > v4-0001 mostly teaches test.sh about specific changes that have to be
> > > made to historic versions of the regression database to allow them
> > > to be reloaded into current servers.  As already discussed, this is
> > > really duplicative of knowledge that's been embedded into the buildfarm
> > > client over time.  It'd be better if we could refactor that so that
> > > the buildfarm shares a common database of these actions with test.sh.
> > > And said database ought to be in our git tree, so committers could
> > > fix problems without having to get Andrew involved every time.
> > > I think this could be represented as a psql script, at least in
> > > versions that have psql \if (but that came in in v10, so maybe
> > > we're there already).
> > 
> > I started this.  I don't know if it's compatible with the buildfarm client, but
> > I think any issues maybe can be avoided by using "IF EXISTS".
> 
> Here are the differences I see on a first pass (without putting too
> much thought into how significant the differences are). Buildfarm code
> I'm comparing against is at [1].
> 
> - Both versions drop @#@ and array_cat_accum, but the buildfarm
> additionally replaces them with a new operator and aggregate,
> respectively.
> 
> - The buildfarm's dropping of table OIDs is probably more resilient,
> since it loops over pg_class looking for relhasoids.

These are all "translated" from test.sh, so follow its logic.
Maybe it should be improved, but that's separate from this patch - which is
already doing a few unrelated things.

> - The buildfarm adjusts pg_proc for the location of regress.so; I see
> there's a commented placeholder for this at the end of the psql script
> but it's not yet implemented.

I didn't understand why this was done here, but it turns out it has to be done
*after* calling pg_dump.  So it has to stay where it is.

> - Some version ranges are different between the two. For example,
> abstime_/reltime_/tinterval_tbl are dropped by the buildfarm if the old
> version is < 9.3, while the psql script drops them for old versions <=
> 10.

This was an error.  Thanks.

> - The buildfarm drops the public.=> operator for much older versions of
> Postgres. I assume we don't need that here.

> As an aside, I think the "fromv10" naming scheme for the "old version
> <= 10" condition is unintuitive. If the old version is e.g. 9.6, we're
> not upgrading "from 10".

I renamed the version vars - feel free to suggest something better.

I'll solicit suggestions what else to do to progresss these.

@Andrew: did you have any comment on this part ?

|Subject: buildfarm xversion diff
|Forking https://www.postgresql.org/message-id/20210328231433.GI15100@telsasoft.com
|
|I gave suggestion how to reduce the "lines of diff" metric almost to nothing,
|allowing a very small "fudge factor", and which I think makes this a pretty
|good metric rather than a passable one.

-- 
Justin

Re: pg_upgrade test for binary compatibility of core data types

Andrew Dunstan <andrew@dunslane.net> — 2021-09-12T18:41:23Z

On 9/11/21 8:51 PM, Justin Pryzby wrote:
>
> @Andrew: did you have any comment on this part ?
>
> |Subject: buildfarm xversion diff
> |Forking https://www.postgresql.org/message-id/20210328231433.GI15100@telsasoft.com
> |
> |I gave suggestion how to reduce the "lines of diff" metric almost to nothing,
> |allowing a very small "fudge factor", and which I think makes this a pretty
> |good metric rather than a passable one.
>

Somehow I missed that. Looks like some good suggestions. I'll
experiment. (Note: we can't assume the presence of sed, especially on
Windows).


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: pg_upgrade test for binary compatibility of core data types

Andrew Dunstan <andrew@dunslane.net> — 2021-09-13T13:20:37Z

On 9/12/21 2:41 PM, Andrew Dunstan wrote:
> On 9/11/21 8:51 PM, Justin Pryzby wrote:
>> @Andrew: did you have any comment on this part ?
>>
>> |Subject: buildfarm xversion diff
>> |Forking https://www.postgresql.org/message-id/20210328231433.GI15100@telsasoft.com
>> |
>> |I gave suggestion how to reduce the "lines of diff" metric almost to nothing,
>> |allowing a very small "fudge factor", and which I think makes this a pretty
>> |good metric rather than a passable one.
>>
> Somehow I missed that. Looks like some good suggestions. I'll
> experiment. (Note: we can't assume the presence of sed, especially on
> Windows).
>
>

I tried with the attached patch on crake, which tests back as far as
9.2. Here are the diff counts from HEAD:


andrew@emma:HEAD $ grep -c '^[+-]' dumpdiff-REL9_* dumpdiff-REL_1*
dumpdiff-HEAD
dumpdiff-REL9_2_STABLE:514
dumpdiff-REL9_3_STABLE:169
dumpdiff-REL9_4_STABLE:185
dumpdiff-REL9_5_STABLE:221
dumpdiff-REL9_6_STABLE:11
dumpdiff-REL_10_STABLE:11
dumpdiff-REL_11_STABLE:73
dumpdiff-REL_12_STABLE:73
dumpdiff-REL_13_STABLE:73
dumpdiff-REL_14_STABLE:0
dumpdiff-HEAD:0


I've also attached those non-empty dumpdiff files for information, since
they are quite small.


There is still work to do, but this is promising. Next step: try it on
Windows.


cheers


andrew


-- 

Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: pg_upgrade test for binary compatibility of core data types

Andrew Dunstan <andrew@dunslane.net> — 2021-09-15T19:28:54Z

On 9/13/21 9:20 AM, Andrew Dunstan wrote:
> On 9/12/21 2:41 PM, Andrew Dunstan wrote:
>> On 9/11/21 8:51 PM, Justin Pryzby wrote:
>>> @Andrew: did you have any comment on this part ?
>>>
>>> |Subject: buildfarm xversion diff
>>> |Forking https://www.postgresql.org/message-id/20210328231433.GI15100@telsasoft.com
>>> |
>>> |I gave suggestion how to reduce the "lines of diff" metric almost to nothing,
>>> |allowing a very small "fudge factor", and which I think makes this a pretty
>>> |good metric rather than a passable one.
>>>
>> Somehow I missed that. Looks like some good suggestions. I'll
>> experiment. (Note: we can't assume the presence of sed, especially on
>> Windows).
>>
>>
> I tried with the attached patch on crake, which tests back as far as
> 9.2. Here are the diff counts from HEAD:
>
>
> andrew@emma:HEAD $ grep -c '^[+-]' dumpdiff-REL9_* dumpdiff-REL_1*
> dumpdiff-HEAD
> dumpdiff-REL9_2_STABLE:514
> dumpdiff-REL9_3_STABLE:169
> dumpdiff-REL9_4_STABLE:185
> dumpdiff-REL9_5_STABLE:221
> dumpdiff-REL9_6_STABLE:11
> dumpdiff-REL_10_STABLE:11
> dumpdiff-REL_11_STABLE:73
> dumpdiff-REL_12_STABLE:73
> dumpdiff-REL_13_STABLE:73
> dumpdiff-REL_14_STABLE:0
> dumpdiff-HEAD:0
>
>
> I've also attached those non-empty dumpdiff files for information, since
> they are quite small.
>
>
> There is still work to do, but this is promising. Next step: try it on
> Windows.
>
>

It appears to do the right thing on Windows. yay!


We probably need to get smarter about the heuristics, though, e.g. by
taking into account the buildfarm options and the platform. It would
also help a lot if we could make vcregress.pl honor USE_MODULE_DB.
That's on my TODO list, but it just got a lot higher priority.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: pg_upgrade test for binary compatibility of core data types

Andrew Dunstan <andrew@dunslane.net> — 2021-09-24T14:58:52Z

On 9/15/21 3:28 PM, Andrew Dunstan wrote:
> On 9/13/21 9:20 AM, Andrew Dunstan wrote:
>> On 9/12/21 2:41 PM, Andrew Dunstan wrote:
>>> On 9/11/21 8:51 PM, Justin Pryzby wrote:
>>>> @Andrew: did you have any comment on this part ?
>>>>
>>>> |Subject: buildfarm xversion diff
>>>> |Forking https://www.postgresql.org/message-id/20210328231433.GI15100@telsasoft.com
>>>> |
>>>> |I gave suggestion how to reduce the "lines of diff" metric almost to nothing,
>>>> |allowing a very small "fudge factor", and which I think makes this a pretty
>>>> |good metric rather than a passable one.
>>>>
>>> Somehow I missed that. Looks like some good suggestions. I'll
>>> experiment. (Note: we can't assume the presence of sed, especially on
>>> Windows).
>>>
>>>
>> I tried with the attached patch on crake, which tests back as far as
>> 9.2. Here are the diff counts from HEAD:
>>
>>
>> andrew@emma:HEAD $ grep -c '^[+-]' dumpdiff-REL9_* dumpdiff-REL_1*
>> dumpdiff-HEAD
>> dumpdiff-REL9_2_STABLE:514
>> dumpdiff-REL9_3_STABLE:169
>> dumpdiff-REL9_4_STABLE:185
>> dumpdiff-REL9_5_STABLE:221
>> dumpdiff-REL9_6_STABLE:11
>> dumpdiff-REL_10_STABLE:11
>> dumpdiff-REL_11_STABLE:73
>> dumpdiff-REL_12_STABLE:73
>> dumpdiff-REL_13_STABLE:73
>> dumpdiff-REL_14_STABLE:0
>> dumpdiff-HEAD:0
>>
>>
>> I've also attached those non-empty dumpdiff files for information, since
>> they are quite small.
>>
>>
>> There is still work to do, but this is promising. Next step: try it on
>> Windows.
>>
>>
> It appears to do the right thing on Windows. yay!
>
>
> We probably need to get smarter about the heuristics, though, e.g. by
> taking into account the buildfarm options and the platform. It would
> also help a lot if we could make vcregress.pl honor USE_MODULE_DB.
> That's on my TODO list, but it just got a lot higher priority.
>
>


Here's what I've committed:
<https://github.com/PGBuildFarm/client-code/commit/6317d82c0e897a29dabd57ed8159d13920401f96>


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: pg_upgrade test for binary compatibility of core data types

Michael Paquier <michael@paquier.xyz> — 2021-10-01T07:58:41Z

On Sat, Sep 11, 2021 at 07:51:16PM -0500, Justin Pryzby wrote:
> These are all "translated" from test.sh, so follow its logic.
> Maybe it should be improved, but that's separate from this patch - which is
> already doing a few unrelated things.

I was looking at this CF entry, and what you are doing in 0004 to move
the tweaks from pg_upgrade's test.sh to a separate SQL script that
uses psql's meta-commands like \if to check which version we are on is
really interesting.  The patch does not apply anymore, so this needs a
rebase.  The entry has been switched as waiting on author by Tom, but
you did not update it after sending the new versions in [1].  I am
wondering if we could have something cleaner than just a set booleans
as you do here for each check, as that does not help with the
readability of the tests.

[1]: https://www.postgresql.org/message-id/20210912005116.GF26465@telsasoft.com
--
Michael

Re: pg_upgrade test for binary compatibility of core data types

Michael Paquier <michael@paquier.xyz> — 2021-10-11T05:38:12Z

On Fri, Oct 01, 2021 at 04:58:41PM +0900, Michael Paquier wrote:
> I was looking at this CF entry, and what you are doing in 0004 to move
> the tweaks from pg_upgrade's test.sh to a separate SQL script that
> uses psql's meta-commands like \if to check which version we are on is
> really interesting.  The patch does not apply anymore, so this needs a
> rebase.  The entry has been switched as waiting on author by Tom, but
> you did not update it after sending the new versions in [1].  I am
> wondering if we could have something cleaner than just a set booleans
> as you do here for each check, as that does not help with the
> readability of the tests.

And so, I am back at this thread, looking at the set of patches
proposed from 0001 to 0004.  The patches are rather messy and mix many
things and concepts, but there are basically four things that stand
out:
- test.sh is completely broken when using PG >= 14 as new version
because of the removal of the test tablespace.  Older versions of
pg_regress don't support --make-tablespacedir so I am fine to stick a
couple of extra mkdirs for testtablespace/, expected/ and sql/ to
allow the script to work properly for major upgrades as a workaround,
but only if we use an old version.  We need to do something here for
HEAD and REL_14_STABLE.
- The script would fail when using PG <= 11 as old version because of
WITH OIDS relations.  We need to do something down to REL_12_STABLE.
I did not like much the approach taken to stick 4 ALTER TABLE queries
though (the patch was actually failing here for me), so instead I have
borrowed what the buildfarm has been doing with a DO block.  That
works fine, and that's more portable.
- Not using --extra-float-digits with PG <= 11 as older version causes
a bunch of diffs in the dumps, making the whole unreadable.  The patch
was doing that unconditionally for *all version*, which is not good.
We should only do that on the versions that need it, and we know the
old version number before taking any dumps so that's easy to check.
- The addition of --wal-segsize and --allow-group-access breaks the
script when using PG < 10 at initdb time as these got added in 11.
With 10 getting EOL'd next year and per the lack of complaints, I am
not excited to do anything here and I'd rather leave this out so as we
keep coverage for those options across *all* major versions upgraded
from 11~.  The buildfarm has tests down to 9.2, but for devs my take
is that this is enough for now.

This is for the basics in terms of fixing test.sh and what should be
backpatched.  In this aspect patches 0001 and 0002 were a bit
incorrect.  I am not sure that 0003 is that interesting as designed as
we would miss any new core types introduced.

0004 is something I'd like to get done on HEAD to ease the move of the
pg_upgrade tests to TAP, but it could be made a bit easier to read by
not having all those oldpgversion_XX_YY flags grouped together for
one.  So I am going to rewrite portions of it once done with the
above.

For now, attached is a patch to address the issues with test.sh that I
am planning to backpatch.  This fixes the facility on HEAD, while
minimizing the diffs between the dumps.  We could do more, like a
s/PROCEDURE/FUNCTION/ but that does not make the diffs really
unreadable either.  I have only tested that on HEAD as new version
down to 11 as the oldest version per the business with --wal-segsize.
This still needs tests with 12~ as new version though, which is boring
but not complicated at all :)
--
Michael

Re: pg_upgrade test for binary compatibility of core data types

Michael Paquier <michael@paquier.xyz> — 2021-10-13T01:36:08Z

On Mon, Oct 11, 2021 at 02:38:12PM +0900, Michael Paquier wrote:
> For now, attached is a patch to address the issues with test.sh that I
> am planning to backpatch.  This fixes the facility on HEAD, while
> minimizing the diffs between the dumps.  We could do more, like a
> s/PROCEDURE/FUNCTION/ but that does not make the diffs really
> unreadable either.  I have only tested that on HEAD as new version
> down to 11 as the oldest version per the business with --wal-segsize.
> This still needs tests with 12~ as new version though, which is boring
> but not complicated at all :)

Okay, tested and done as of fa66b6d.
--
Michael

Re: pg_upgrade test for binary compatibility of core data types

Justin Pryzby <pryzby@telsasoft.com> — 2021-11-07T19:22:00Z

On Mon, Oct 11, 2021 at 02:38:12PM +0900, Michael Paquier wrote:
> On Fri, Oct 01, 2021 at 04:58:41PM +0900, Michael Paquier wrote:
> > I was looking at this CF entry, and what you are doing in 0004 to move
> > the tweaks from pg_upgrade's test.sh to a separate SQL script that
> > uses psql's meta-commands like \if to check which version we are on is
> > really interesting.  The patch does not apply anymore, so this needs a
> > rebase.  The entry has been switched as waiting on author by Tom, but
> > you did not update it after sending the new versions in [1].  I am
> > wondering if we could have something cleaner than just a set booleans
> > as you do here for each check, as that does not help with the
> > readability of the tests.
> 
> And so, I am back at this thread, looking at the set of patches
> proposed from 0001 to 0004.  The patches are rather messy and mix many
> things and concepts, but there are basically four things that stand
> out:
> - test.sh is completely broken when using PG >= 14 as new version
> because of the removal of the test tablespace.  Older versions of
> pg_regress don't support --make-tablespacedir so I am fine to stick a
> couple of extra mkdirs for testtablespace/, expected/ and sql/ to
> allow the script to work properly for major upgrades as a workaround,
> but only if we use an old version.  We need to do something here for
> HEAD and REL_14_STABLE.
> - The script would fail when using PG <= 11 as old version because of
> WITH OIDS relations.  We need to do something down to REL_12_STABLE.
> I did not like much the approach taken to stick 4 ALTER TABLE queries
> though (the patch was actually failing here for me), so instead I have
> borrowed what the buildfarm has been doing with a DO block.  That
> works fine, and that's more portable.
> - Not using --extra-float-digits with PG <= 11 as older version causes
> a bunch of diffs in the dumps, making the whole unreadable.  The patch
> was doing that unconditionally for *all version*, which is not good.
> We should only do that on the versions that need it, and we know the
> old version number before taking any dumps so that's easy to check.
> - The addition of --wal-segsize and --allow-group-access breaks the
> script when using PG < 10 at initdb time as these got added in 11.
> With 10 getting EOL'd next year and per the lack of complaints, I am
> not excited to do anything here and I'd rather leave this out so as we
> keep coverage for those options across *all* major versions upgraded
> from 11~.  The buildfarm has tests down to 9.2, but for devs my take
> is that this is enough for now.

Michael handled those in fa66b6d.
Note that the patch assumes that the "old version" being pg_upgraded has
commit 97f73a978: "Work around cross-version-upgrade issues created by commit 9e38c2bb5."

That may be good enough for test.sh, but if the kludges were moved to a .sql
script which was also run by the buildfarm (in stead of its hardcoded kludges), then
it might be necessary to handle the additional stuff my patch did, like:

+                                       DROP TRANSFORM FOR integer LANGUAGE sql CASCADE;"
+                                       DROP FUNCTION boxarea(box);"
+                                       DROP FUNCTION funny_dup17();"
+                                       DROP TABLE abstime_tbl;"
+                                       DROP TABLE reltime_tbl;"
+                                       DROP TABLE tinterval_tbl;"
+                                       DROP AGGREGATE first_el_agg_any(anyelement);"
+                                       DROP AGGREGATE array_cat_accum(anyarray);"
+                                       DROP OPERATOR @#@(NONE,bigint);"

Or, maybe it's guaranteed that the animals all run latest version of old
branches, in which case I think some of the BF's existing logic could be
dropped, which would help to reconcile these two scripts:

                my $missing_funcs = q{drop function if exists public.boxarea(box);                                                                                                                             
                              drop function if exists public.funny_dup17();                                                                                                                                    
..                                                                                                                                                                                                             
                        $prstmt = join(';',                                                                                                                                                                    
                                'drop operator @#@ (NONE, bigint)',                                                                                                                                            
..                                                                                                                                                                                                             
                                'drop aggregate if exists public.array_cat_accum(anyarray)',                                                                                                                   

> This is for the basics in terms of fixing test.sh and what should be
> backpatched.  In this aspect patches 0001 and 0002 were a bit
> incorrect.  I am not sure that 0003 is that interesting as designed as
> we would miss any new core types introduced.

We wouldn't miss new core types, because of the 2nd part of type_sanity which
tests that each core type was included in the "manytypes" table.

+-- And now a test on the previous test, checking that all core types are                                                                                                                                      
+-- included in this table                                                                                                                                                                                     
+-- XXX or some other non-catalog table processed by pg_upgrade                                                                                                                                                
+SELECT oid, typname, typtype, typelem, typarray, typarray FROM pg_type t                                                                                                                                      
+WHERE typtype NOT IN ('p', 'c')                                                                                                                                                                               
+-- reg* which cannot be pg_upgraded                                                                                                                                                                           
+AND oid != ALL(ARRAY['regproc', 'regprocedure', 'regoper', 'regoperator', 'regconfig', 'regdictionary', 'regnamespace', 'regcollation']::regtype[])                                                           
+-- XML might be disabled at compile-time                                                                                                                                                                      
+AND oid != ALL(ARRAY['xml', 'gtsvector', 'pg_node_tree', 'pg_ndistinct', 'pg_dependencies', 'pg_mcv_list', 'pg_brin_bloom_summary', 'pg_brin_minmax_multi_summary']::regtype[])                               
+AND NOT EXISTS (SELECT 1 FROM pg_type u WHERE u.typarray=t.oid) -- exclude arrays                                                                                                                             
+AND NOT EXISTS (SELECT 1 FROM pg_attribute a WHERE a.atttypid=t.oid AND a.attnum>0 AND a.attrelid='manytypes'::regclass);                                                                                     

> 0004 is something I'd like to get done on HEAD to ease the move of the
> pg_upgrade tests to TAP, but it could be made a bit easier to read by
> not having all those oldpgversion_XX_YY flags grouped together for
> one.  So I am going to rewrite portions of it once done with the
> above.


-- 
Justin

Re: pg_upgrade test for binary compatibility of core data types

Michael Paquier <michael@paquier.xyz> — 2021-11-08T03:53:58Z

On Sun, Nov 07, 2021 at 01:22:00PM -0600, Justin Pryzby wrote:
> That may be good enough for test.sh, but if the kludges were moved to a .sql
> script which was also run by the buildfarm (in stead of its hardcoded kludges), then
> it might be necessary to handle the additional stuff my patch did, like:

> [...]
>
> Or, maybe it's guaranteed that the animals all run latest version of old
> branches, in which case I think some of the BF's existing logic could be
> dropped, which would help to reconcile these two scripts:

I am pretty sure that it is safe to assume that all buildfarm animals
run the top of the stable branch they are testing, at least on the
community side.  An advantage of moving all those SQLs to a script
that can be process with psql thanks to the \if metacommands you have
added is that buildfarm clients are not required to immediately update
their code to work properly.  Considering as well that we should
minimize the amount of duplication between all those things, I'd like
to think that we'd better apply 0002 and consider a backpatch to allow
the buildfarm to catch up on it.  It should at least allow to remove a
good chunk of the object cleanup done directly by the buildfarm.

>> This is for the basics in terms of fixing test.sh and what should be
>> backpatched.  In this aspect patches 0001 and 0002 were a bit
>> incorrect.  I am not sure that 0003 is that interesting as designed as
>> we would miss any new core types introduced.
> 
> We wouldn't miss new core types, because of the 2nd part of type_sanity which
> tests that each core type was included in the "manytypes" table.

+-- XML might be disabled at compile-time
+AND oid != ALL(ARRAY['xml', 'gtsvector', 'pg_node_tree',
'pg_ndistinct', 'pg_dependencies', 'pg_mcv_list',
'pg_brin_bloom_summary', 'pg_brin_minmax_multi_summary']::regtype[])

I believe that this comment is incomplete, applying only to the first
element listed in this array.  I guess that this had better document
why those catalogs are part of the list?  Good to see that adding a
reg* in core would immediately be noticed though, as far as I
understand this SQL.
--
Michael

Re: pg_upgrade test for binary compatibility of core data types

Michael Paquier <michael@paquier.xyz> — 2021-11-17T07:01:19Z

On Sun, Nov 07, 2021 at 01:22:00PM -0600, Justin Pryzby wrote:
> That may be good enough for test.sh, but if the kludges were moved to a .sql
> script which was also run by the buildfarm (in stead of its hardcoded kludges), then
> it might be necessary to handle the additional stuff my patch did, like:
>
> +                                       DROP TRANSFORM FOR integer LANGUAGE sql CASCADE;"
> +                                       DROP FUNCTION boxarea(box);"
> +                                       DROP FUNCTION funny_dup17();"

These apply for an old version <= v10.

> +                                       DROP TABLE abstime_tbl;"
> +                                       DROP TABLE reltime_tbl;"
> +                                       DROP TABLE tinterval_tbl;"

old version <= 9.3.

> +                                       DROP AGGREGATE first_el_agg_any(anyelement);"

Not sure about this one.

> +                                       DROP AGGREGATE array_cat_accum(anyarray);"
> +                                       DROP OPERATOR @#@(NONE,bigint);"

These are on 9.4.  It is worth noting that TestUpgradeXversion.pm
recreates those objects.  I'd agree to close the gap completely rather
than just moving what test.sh does to wipe out a maximum client code
for the buildfarm.

> Or, maybe it's guaranteed that the animals all run latest version of old
> branches, in which case I think some of the BF's existing logic could be
> dropped, which would help to reconcile these two scripts:

That seems like a worthy goal to reduce the amount of duplication with
the buildfarm code, while allowing tests from upgrades with older
versions (the WAL segment size and group permission issue in test.sh
had better be addressed in a better way, perhaps once the pg_upgrade
tests are moved to TAP).  There are also things specific to contrib/
modules with older versions, but that may be too specific for this
exercise.

+\if :oldpgversion_le84
+DROP FUNCTION public.myfunc(integer);
+\endif

The oldest version tested by the buildfarm is 9.2, so we could ignore
this part I guess?

Andrew, what do you think about this part?  Based on my read of this
thread, there is an agreement that this approach makes the buildfarm
code more manageable so as committers would not need to patch the
buildfarm code if their test fail.  I agree with this conclusion, but
I wanted to double-check with you first.  This would need a backpatch
down to 10 so as we could clean up a maximum of code in
TestUpgradeXversion.pm without waiting for an extra 5 years.  Please
note that I am fine to send a patch for the buildfarm client.

> We wouldn't miss new core types, because of the 2nd part of type_sanity which
> tests that each core type was included in the "manytypes" table.

Thanks, I see your point now after a closer read.

There is still a pending question for contrib modules, but I think
that we need to think larger here with a better integration of
contrib/ modules in the upgrade testing process.  Making that cheap
would require running the set of regression tests on the instance
to-be-upgraded first.  I think that one step in this direction would
be to have unique databases for each contrib/ modules, so as there is
no overlap with objects dropped?

Having some checks with code types looks fine as a first step, so
let's do that.  I have reviewed 0001, rewrote a couple of comments.
All the comments from upthread seem to be covered with that.  So I'd
like to get that applied on HEAD.  We could as well be less
conservative and backpatch that down to 12 to follow on 7c15cef so we
would be more careful with 15~ already (a backpatch down to 14 would
be enough for this purpose, actually thanks to the 14->15 upgrade
path).
--
Michael

Re: pg_upgrade test for binary compatibility of core data types

Andrew Dunstan <andrew@dunslane.net> — 2021-11-17T15:07:17Z

On 11/17/21 02:01, Michael Paquier wrote:
>
> The oldest version tested by the buildfarm is 9.2, so we could ignore
> this part I guess?
>
> Andrew, what do you think about this part?  Based on my read of this
> thread, there is an agreement that this approach makes the buildfarm
> code more manageable so as committers would not need to patch the
> buildfarm code if their test fail.  I agree with this conclusion, but
> I wanted to double-check with you first.  This would need a backpatch
> down to 10 so as we could clean up a maximum of code in
> TestUpgradeXversion.pm without waiting for an extra 5 years.  Please
> note that I am fine to send a patch for the buildfarm client.
>
>

In general I'm in agreement with the direction here. If we can have a
script that applies to back branches to make them suitable for upgrade
testing instead of embedding this in the buildfarm client, so much the
better.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: pg_upgrade test for binary compatibility of core data types

Michael Paquier <michael@paquier.xyz> — 2021-11-18T04:36:50Z

On Wed, Nov 17, 2021 at 10:07:17AM -0500, Andrew Dunstan wrote:
> In general I'm in agreement with the direction here. If we can have a
> script that applies to back branches to make them suitable for upgrade
> testing instead of embedding this in the buildfarm client, so much the
> better.

Okay.  I have worked on 0001 to add the table to check after the
binary compatibilities and applied it.  What remains on this thread is
0002 to move all the SQL queries into a psql-able file with the set of
\if clauses to control which query is run depending on the backend
version.  Justin, could you send a rebased version of that with all
the changes from the buildfarm client included?
--
Michael

Re: pg_upgrade test for binary compatibility of core data types

Justin Pryzby <pryzby@telsasoft.com> — 2021-11-18T04:47:28Z

On Wed, Nov 17, 2021 at 04:01:19PM +0900, Michael Paquier wrote:
> On Sun, Nov 07, 2021 at 01:22:00PM -0600, Justin Pryzby wrote:
> > That may be good enough for test.sh, but if the kludges were moved to a .sql
> > script which was also run by the buildfarm (in stead of its hardcoded kludges), then
> > it might be necessary to handle the additional stuff my patch did, like:
> >
> > +                                       DROP TRANSFORM FOR integer LANGUAGE sql CASCADE;"
> > +                                       DROP FUNCTION boxarea(box);"
> > +                                       DROP FUNCTION funny_dup17();"
> 
> These apply for an old version <= v10.
> 
> > +                                       DROP TABLE abstime_tbl;"
> > +                                       DROP TABLE reltime_tbl;"
> > +                                       DROP TABLE tinterval_tbl;"
> 
> old version <= 9.3.
> 
> > +                                       DROP AGGREGATE first_el_agg_any(anyelement);"
> 
> Not sure about this one.

See 97f73a978fc1aca59c6ad765548ce0096d95a923

> These are on 9.4.  It is worth noting that TestUpgradeXversion.pm
> recreates those objects.  I'd agree to close the gap completely rather
> than just moving what test.sh does to wipe out a maximum client code
> for the buildfarm.

>>Or, maybe it's guaranteed that the animals all run latest version of old
>>branches, in which case I think some of the BF's existing logic could be
>>dropped, which would help to reconcile these two scripts:
>>
>>                my $missing_funcs = q{drop function if exists public.boxarea(box);
>>                              drop function if exists public.funny_dup17();
>>..
>>                        $prstmt = join(';',
>>                                'drop operator @#@ (NONE, bigint)',
>>..
>>                                'drop aggregate if exists public.array_cat_accum(anyarray)',
>>

I'm not sure if everything the buildfarm does is needed anymore, or if any of
it could be removed now, rather than being implemented in test.sh.

boxarea, funny_dup - see also db3af9feb19f39827e916145f88fa5eca3130cb2
https://github.com/PGBuildFarm/client-code/commit/9ca42ac1783a8cf99c73b4f7c52bd05a6024669d

array_larger_accum/array_cat_accum - see also 97f73a978fc1aca59c6ad765548ce0096d95a923
https://github.com/PGBuildFarm/client-code/commit/a55c89869f30db894ab823df472e739cee2e8c91

@#@ 76f412ab310554acb970a0b73c8d1f37f35548c6 ??
https://github.com/PGBuildFarm/client-code/commit/b3fdb743d89dc91fcea47bd9651776c503f774ff
https://github.com/PGBuildFarm/client-code/commit/b44e9390e2d8d904ff8cabd906a2d4b5c8bd300a
https://github.com/PGBuildFarm/client-code/commit/3844503c8fde134f7cc29b3fb147d590b6d2fcc1

abstime:
https://github.com/PGBuildFarm/client-code/commit/f027d991d197036028ffa9070f4c9193076ed5ed

putenv
https://github.com/PGBuildFarm/client-code/commit/fa86d0b1bc7a8d7b9f15b1da8b8e43f4d3a08e2b

Re: pg_upgrade test for binary compatibility of core data types

Tom Lane <tgl@sss.pgh.pa.us> — 2021-11-18T04:57:51Z

Michael Paquier <michael@paquier.xyz> writes:
> Okay.  I have worked on 0001 to add the table to check after the
> binary compatibilities and applied it.

Something funny about that on prion:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2021-11-18%2001%3A55%3A38

@@ -747,6 +747,8 @@
   '{(2020-01-02 03:04:05, 2021-02-03 06:07:08)}'::tstzmultirange,
   arrayrange(ARRAY[1,2], ARRAY[2,1]),
   arraymultirange(arrayrange(ARRAY[1,2], ARRAY[2,1]));
+ERROR:  unrecognized key word: "ec2"
+HINT:  ACL key word must be "group" or "user".
 -- Sanity check on the previous table, checking that all core types are
 -- included in this table.
 SELECT oid, typname, typtype, typelem, typarray, typarray

Not sure what's going on there.

			regards, tom lane

Re: pg_upgrade test for binary compatibility of core data types

Michael Paquier <michael@paquier.xyz> — 2021-11-18T05:49:35Z

On Wed, Nov 17, 2021 at 11:57:51PM -0500, Tom Lane wrote:
> Something funny about that on prion:
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2021-11-18%2001%3A55%3A38
> Not sure what's going on there.

Yes, that was just some missing quoting in the aclitem of this new
table.  prion uses a specific user name, "ec2-user", that caused the
failure.
--
Michael

Re: pg_upgrade test for binary compatibility of core data types

Michael Paquier <michael@paquier.xyz> — 2021-11-18T06:58:18Z

On Wed, Nov 17, 2021 at 10:47:28PM -0600, Justin Pryzby wrote:
> I'm not sure if everything the buildfarm does is needed anymore, or if any of
> it could be removed now, rather than being implemented in test.sh.

+-- This file has a bunch of kludges needed for testing upgrades
across major versions
+-- It supports testing the most recent version of an old release (not
any arbitrary minor version).

This could be better-worded.  Here is an idea:
--
-- SQL queries for major upgrade tests
--
-- This file includes a set of SQL queries to make a cluster to-be-upgraded
-- compatible with the version this file is on.  This requires psql,
-- as per-version queries are controlled with a set of \if clauses.

+\if :oldpgversion_le84
+DROP FUNCTION public.myfunc(integer);
+\endif
We could retire this part for <= 8.4.  The oldest version tested by
the buildfarm is 9.2.

+       psql -X -d regression -f "test-upgrade.sql" || psql_fix_sql_status=$?
Shouldn't we use an absolute path here?  I was testing a VPATH build
and that was not working properly.

+-- commit 9e38c2bb5 and 97f73a978
+-- DROP AGGREGATE array_larger_accum(anyarray);
+DROP AGGREGATE array_cat_accum(anyarray);
+
+-- commit 76f412ab3
+-- DROP OPERATOR @#@(bigint,NONE);
+DROP OPERATOR @#@(NONE,bigint);
+\endif
The buildfarm does "CREATE OPERATOR @#@" and "CREATE AGGREGATE
array_larger_accum" when dealing with an old version between 9.5 and
13.  Shouldn't we do the same and create those objects rather than a
plain DROP?  What you are doing is not wrong, and it should allow
upgrades to work, but that's a bit inconsistent with the buildfarm in
terms of coverage.

+   ver >= 905 AND ver <= 1300 AS oldpgversion_95_13,
+   ver >= 906 AND ver <= 1300 AS oldpgversion_96_13,
+   ver >= 906 AND ver <= 1000 AS oldpgversion_96_10,
So here, we have the choice between conditions that play with version
ranges or we could make those checks simpler but compensate with a set
of IF EXISTS queries.  I think that your choice is right.  The
buildfarm mixes both styles to compensate with the cases where the
objects are created after a drop.

The list of objects and the version ranges look correct to me.
--
Michael

Re: pg_upgrade test for binary compatibility of core data types

Michael Paquier <michael@paquier.xyz> — 2021-12-01T07:19:44Z

On Thu, Nov 18, 2021 at 03:58:18PM +0900, Michael Paquier wrote:
> +   ver >= 905 AND ver <= 1300 AS oldpgversion_95_13,
> +   ver >= 906 AND ver <= 1300 AS oldpgversion_96_13,
> +   ver >= 906 AND ver <= 1000 AS oldpgversion_96_10,
> So here, we have the choice between conditions that play with version
> ranges or we could make those checks simpler but compensate with a set
> of IF EXISTS queries.  I think that your choice is right.  The
> buildfarm mixes both styles to compensate with the cases where the
> objects are created after a drop.

So, I have come back to this part of the patch set, that moves the SQL
queries doing the pre-upgrade cleanups in the old version we upgrade
from, and decided to go with what looks like the simplest approach,
relying on some IFEs depending on the object types if they don't
exist for some cases.

While checking the whole thing, I have noticed that some of the
operations were not really necessary.  The result is rather clean now,
with a linear organization of the version logic, so as it is a
no-brainer to get that done in back-branches per the
backward-compatibility argument.

I'll get that done down to 10 to maximize its influence, then I'll
move on with the buildfarm code and send a patch to plug this and
reduce the dependencies between core and the buildfarm code.
--
Michael

Re: pg_upgrade test for binary compatibility of core data types

Michael Paquier <michael@paquier.xyz> — 2021-12-02T01:49:22Z

On Wed, Dec 01, 2021 at 04:19:44PM +0900, Michael Paquier wrote:
> I'll get that done down to 10 to maximize its influence, then I'll
> move on with the buildfarm code and send a patch to plug this and
> reduce the dependencies between core and the buildfarm code.

Okay, I have checked this one this morning, and applied the split down
to 10, so as we have a way to fix objects from the main regression
test suite.  The buildfarm client gets a bit cleaned up after that (I
have a patch for that, but I am not 100% sure that it is right).

Still, the global picture is larger than that because there is still
nothing done for contrib/ modules included in cross-version checks of
pg_upgrade by the buildfarm.  The core code tests don't do this much,
but if we were to do the same things as the buildfarm, then we would
need to run installcheck-world (roughly) on a deployed instance, then
pg_upgrade it.  That's not going to be cheap, for sure.

One thing that we could do is to use unique names for the databases of
the contrib/ modules when running an installcheck, so as these are
preserved for upgrades (the buildfarm client does that).  This has as
effect to increase the number of databases for an instance
installcheck'ed, so this had better be optional, at least.
--
Michael