Thread

Re: Race conditions, race conditions!

Tatsuo Ishii <t-ishii@sra.co.jp> — 2005-05-08T02:31:49Z
Are we going to put the fixes into 8.0.3 and so on? Or will it be
included in 8.0.4?
--
Tatsuo Ishii

> I promised an analysis of the problems Jan Wieck uncovered yesterday,
> so here it is.
> 
> Jan had developed a testbed (which I hope he will post) that essentially
> has a bunch of client threads all doing instances of the same
> transaction in READ COMMITTED mode.  The intended database invariant is
> that the number of rows in table t2 with a particular ID value is equal
> to the "cnt" field of the single row in t1 with that ID value:
> 
> 	begin;
> 	select cnt from t1 where id = K for update;
> 	delete from t2 where id = K;
> 	-- check returned rowcount to see that exactly CNT rows were deleted
> 	insert into t2 values (K);  -- repeat this N times
> 	update t1 set cnt = N where id = K;
> 	commit;
> 
> K is randomly chosen for each execution from the known set of t1 keys,
> and N is randomly chosen each time as a small positive integer.  This
> should maintain the invariant, since at any time only one transaction
> can be holding the SELECT FOR UPDATE lock on a particular t1 row.
> 
> The trick is to run this under astronomical load; 100 or so client
> threads on a garden-variety PC is about right.
> 
> What Jan was seeing was that every so often, a client thread would
> error out, reporting that its DELETE had deleted zero rows rather
> than the expected number.  But subsequent examination showed the t2
> rows as being there.
> 
> Investigation showed that the connected backend had properly acquired
> FOR UPDATE lock on the t1 row, but the snapshot it was using for the
> subsequent DELETE showed the inserter of the t1 row as still running.
> This should be impossible, since the SELECT FOR UPDATE cannot lock an
> uncommitted row, and in READ COMMITTED mode we certainly will take a
> new snapshot for the DELETE.
> 
> However, there is a race condition here.  During transaction commit,
> xact.c first marks the transaction committed in pg_clog, and then
> clears its XID from PGPROC.  This means there is a narrow window
> in which both TransactionIdDidCommit and TransactionIdIsInProgress
> will return true.  (We cannot do it the other way around, because
> if neither one is returning true, onlookers are entitled to assume
> that the transaction has crashed.)  However, the tqual.c routines
> will allow a row to be seen as committed as soon as
> TransactionIdDidCommit(xmin) returns true.  So the scenario is:
> 
> 1. Backend A does RecordTransactionCommit to mark itself committed
> in pg_clog, but then loses the CPU in the narrow window between
> doing that and clearing its PGPROC entry.  Because of the ridiculous
> load, it doesn't get control back for awhile.
> 
> 2. Backend B comes along to run the test transaction for the same K
> value.  It inspects the t1 row, concludes it's committed, marks the
> row as locked FOR UPDATE, and returns the results to the client.
> 
> 3. The client now issues the DELETE command.  B takes a new snapshot,
> but because A is still not cleared out of PGPROC, A's transaction
> is shown as still running in the snapshot.
> 
> 4. Now the DELETE will delete no rows, because it doesn't consider the
> t2 rows it should delete to be committed.
> 
> 
> AFAICS this race condition has always been there; certainly at
> least since Vadim put in MVCC, and it looks likely that the
> original Berkeley code had a form of the problem.
> 
> The correct fix is that the tqual.c routines should check commit status
> in the following way:
> 
> 	if (TransactionIdIsInProgress(xid))
> 	   // still in progress, don't touch tuple
> 	else if (TransactionIdDidCommit(xid))
> 	   // committed, mark accordingly
> 	else
> 	   // must be aborted or crashed, mark accordingly
> 
> rather than what they have traditionally done:
> 
> 	if (TransactionIdDidCommit(xid))
> 	   // committed, mark accordingly
> 	else if (TransactionIdDidAbort(xid))
> 	   // aborted, mark accordingly
> 	else
> 	   // assume still in progress, don't touch tuple
> 
> Vadim recognized that the former logic was necessary for VACUUM to use
> in deciding if it could clean up dead tuples, but apparently he never
> made the extrapolation that it should be used *everywhere* transaction
> status is tested.
> 
> 
> The other interesting thing we saw was an Assertion failure.  The
> postmaster log showed
> 
> WARNING:  relation "t1" page 196 is uninitialized --- fixing
> TRAP: FailedAssertion("!((((PageHeader) ((PageHeader) pageHeader))->pd_upper == 0))", File: "hio.c", Line: 263)
> LOG:  server process (PID 11296) was terminated by signal 6
> 
> The WARNING could only have come from VACUUM.  (Jan's testbed does
> launch a VACUUM every so often.)  The Assert failure is in
> RelationGetBufferForTuple where it is adding a new page to a table.
> 
> I interpret this as the guy doing RelationGetBufferForTuple added a
> page, but before he could initialize it, he lost control for long enough
> for a VACUUM to scan through the entire table, see the zeroed page, and
> "fix" it.  Then when the first guy got control again, his Assert saying
> the page was zeroes failed.  The window for this exists because bufmgr/smgr
> do physically extend the file, but when the new buffer is returned
> to the caller, it is only pinned, not locked.  So it is possible for
> VACUUM to acquire lock on the new page before RelationGetBufferForTuple
> does.  (This is exceedingly improbable, because VACUUM would have to
> scan through the entire relation first --- it wouldn't examine the
> new page at all unless it was included in RelationGetNumberOfBlocks
> at the start of VACUUM's scan.  We have not been able to reproduce the
> crash, but we did see it once.)
> 
> At first I thought this didn't have any conseqences worse than an
> Assert, since after all both VACUUM and the original extender are just
> trying to set up the page as empty.  But consider this scenario:
> 
> 1. Process A adds a zero page and gets blocked ahead of initializing it.
> 
> 2. Process B runs VACUUM ... sees the zero page ... fixes it ... puts it
>    in FSM.
> 
> 3. Process C takes the page from FSM and puts some new tuples in it.
> 
> 4. Process A initializes the page, thus discarding C's tuples.
> 
> This could not happen if Asserts are enabled, because A would Assert
> before re-initializing the page (and that's inside a lock so it is
> sure to happen that way).  But with Asserts off, it's within the realm
> of possibility under sufficiently heavy load.
> 
> This risk has been there since lazy VACUUM was created (it cannot
> happen with VACUUM FULL because that takes an exclusive lock on the
> whole table).  I believe the best fix is:
> 
> 1. Change RelationGetBufferForTuple to hold the "relation extension"
> lock until after it's exclusive-locked the new buffer.
> 
> 2. Adjust VACUUM to not try to "fix" a page without getting the relation
> extension lock.  If the page is still uninitialized after obtaining
> that lock, then it must be a really lost page, not one that someone
> is still in process of setting up.
> 
> There is a similar risk in btree indexes, with a similar fix.
> 
> 
> We are currently testing candidate patches for these problems,
> and so far haven't seen any further failures.
> 
> 			regards, tom lane
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
> 
>                http://www.postgresql.org/docs/faq
>