Re: In-placre persistance change of a relation

Kyotaro Horiguchi <horikyota.ntt@gmail.com>

From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>

To: sfrost@snowman.net

Cc: osumi.takamichi@fujitsu.com, masao.fujii@oss.nttdata.com, ashutosh.bapat.oss@gmail.com, tsunakawa.takay@fujitsu.com, pgsql-hackers@lists.postgresql.org

Date: 2020-11-13T04:22:01Z

Lists: pgsql-hackers

Commits

Same data as JSON: GET /api/v1/messages/:b64id/commits the thread's linked commits as JSON, with link sources. API reference →

pg_dump: Refactor getIndexes()
- e2c52beecdea 15.0 cited
Optimize DropRelFileNodesAllBuffers() for recovery.
- bea449c635c0 14.0 cited

Attachments

v3-0001-In-place-table-persistence-change.patch (text/x-patch)

Hello. Before posting the next version, I'd like to explain what this
patch is.

1. The Issue

Bulk data loading is a long-time taking, I/O consuming task. Many
DBAs want that task is faster, even at the cost of increasing risk of
data-loss. wal_level=minimal is an answer to such a
request. Data-loading onto a table that is created in the current
transaction omits WAL-logging and synced at commit.

However, the optimization doesn't benefit the case where the
data-loading is performed onto existing tables. There are quite a few
cases where data is loaded into tables that already contains a lot of
data. Those cases don't take benefit of the optimization.

Another possible solution for bulk data-loading is UNLOGGED
tables. But when we switch LOGGED/UNLOGGED of a table, all the table
content is copied to a newly created heap, which is costly.

2. Proposed Solutions.

There are two proposed solutions are discussed on this mailing
list. One is wal_level = none (*1), which omits WAL-logging almost at
all. Another is extending the existing optimization to the ALTER TABLE
SET LOGGED/UNLOGGED cases, which is to be discussed in this new
thread.

3. In-place Persistence Change

So the attached is a PoC patch of the "another" solution. When we
want to change table persistence in-place, basically we need to do the
following steps.

(the talbe is exclusively locked)

(1) Flip BM_PERMANENT flag of all shared buffer blocks for the heap.

(2) Create or delete the init fork for existing heap.

(3) Flush all buffers of the relation to file system.

(4) Sync heap files.

(5) Make catalog changes.

4. Transactionality

The 1, 2 and 5 above need to be abort-able. 5 is rolled back by
existing infrastructure, and rolling-back of 1 and 2 are achieved by
piggybacking on the pendingDeletes mechanism.

5. Replication

Furthermore, that changes ought to be replicable to standbys. Catalog
changes are replicated as usual.

On-the-fly creation of the init fork leads to recovery mess. Even
though it is removed at abort, if the server crashed before
transaction end, the file is left alone and corrupts database in the
next recovery. I sought a way to create the init fork in
smgrPendingDelete but that needs relcache and relcache is not
available at that late of commit. Finally, I introduced the fifth fork
kind "INITTMP"(_itmp) only to signal that the init file is not
committed. I don't like that way but it seems working fine...

6. SQL Command

The second file in the patchset adds a syntax that changes persistence
of all tables in a tablespace.

ALTER TABLE ALL IN TABLESPACE <tsp> SET LOGGED/UNLOGGED [ NOWAIT ];

7. Testing

I tried to write TAP test for this, but IPC::Run::harness (or
interactive_psql) doesn't seem to work for me. I'm not sure what
exactly is happening but pty redirection doesn't work.

$in = "ls\n"; $out = ""; run ["/usr/bin/bash"], \$in, \$out; print $out;

works but

$in = "ls\n"; $out = ""; run ["/usr/bin/bash"], '<pty<', \$in, '>pty>', \$out; print $out;

doesn't respond.

The patch is attached.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center