Thread

  1. Re: Race conditions, race conditions!

    Tatsuo Ishii <t-ishii@sra.co.jp> — 2005-05-08T02:31:49Z

    Are we going to put the fixes into 8.0.3 and so on? Or will it be
    included in 8.0.4?
    --
    Tatsuo Ishii
    
    > I promised an analysis of the problems Jan Wieck uncovered yesterday,
    > so here it is.
    > 
    > Jan had developed a testbed (which I hope he will post) that essentially
    > has a bunch of client threads all doing instances of the same
    > transaction in READ COMMITTED mode.  The intended database invariant is
    > that the number of rows in table t2 with a particular ID value is equal
    > to the "cnt" field of the single row in t1 with that ID value:
    > 
    > 	begin;
    > 	select cnt from t1 where id = K for update;
    > 	delete from t2 where id = K;
    > 	-- check returned rowcount to see that exactly CNT rows were deleted
    > 	insert into t2 values (K);  -- repeat this N times
    > 	update t1 set cnt = N where id = K;
    > 	commit;
    > 
    > K is randomly chosen for each execution from the known set of t1 keys,
    > and N is randomly chosen each time as a small positive integer.  This
    > should maintain the invariant, since at any time only one transaction
    > can be holding the SELECT FOR UPDATE lock on a particular t1 row.
    > 
    > The trick is to run this under astronomical load; 100 or so client
    > threads on a garden-variety PC is about right.
    > 
    > What Jan was seeing was that every so often, a client thread would
    > error out, reporting that its DELETE had deleted zero rows rather
    > than the expected number.  But subsequent examination showed the t2
    > rows as being there.
    > 
    > Investigation showed that the connected backend had properly acquired
    > FOR UPDATE lock on the t1 row, but the snapshot it was using for the
    > subsequent DELETE showed the inserter of the t1 row as still running.
    > This should be impossible, since the SELECT FOR UPDATE cannot lock an
    > uncommitted row, and in READ COMMITTED mode we certainly will take a
    > new snapshot for the DELETE.
    > 
    > However, there is a race condition here.  During transaction commit,
    > xact.c first marks the transaction committed in pg_clog, and then
    > clears its XID from PGPROC.  This means there is a narrow window
    > in which both TransactionIdDidCommit and TransactionIdIsInProgress
    > will return true.  (We cannot do it the other way around, because
    > if neither one is returning true, onlookers are entitled to assume
    > that the transaction has crashed.)  However, the tqual.c routines
    > will allow a row to be seen as committed as soon as
    > TransactionIdDidCommit(xmin) returns true.  So the scenario is:
    > 
    > 1. Backend A does RecordTransactionCommit to mark itself committed
    > in pg_clog, but then loses the CPU in the narrow window between
    > doing that and clearing its PGPROC entry.  Because of the ridiculous
    > load, it doesn't get control back for awhile.
    > 
    > 2. Backend B comes along to run the test transaction for the same K
    > value.  It inspects the t1 row, concludes it's committed, marks the
    > row as locked FOR UPDATE, and returns the results to the client.
    > 
    > 3. The client now issues the DELETE command.  B takes a new snapshot,
    > but because A is still not cleared out of PGPROC, A's transaction
    > is shown as still running in the snapshot.
    > 
    > 4. Now the DELETE will delete no rows, because it doesn't consider the
    > t2 rows it should delete to be committed.
    > 
    > 
    > AFAICS this race condition has always been there; certainly at
    > least since Vadim put in MVCC, and it looks likely that the
    > original Berkeley code had a form of the problem.
    > 
    > The correct fix is that the tqual.c routines should check commit status
    > in the following way:
    > 
    > 	if (TransactionIdIsInProgress(xid))
    > 	   // still in progress, don't touch tuple
    > 	else if (TransactionIdDidCommit(xid))
    > 	   // committed, mark accordingly
    > 	else
    > 	   // must be aborted or crashed, mark accordingly
    > 
    > rather than what they have traditionally done:
    > 
    > 	if (TransactionIdDidCommit(xid))
    > 	   // committed, mark accordingly
    > 	else if (TransactionIdDidAbort(xid))
    > 	   // aborted, mark accordingly
    > 	else
    > 	   // assume still in progress, don't touch tuple
    > 
    > Vadim recognized that the former logic was necessary for VACUUM to use
    > in deciding if it could clean up dead tuples, but apparently he never
    > made the extrapolation that it should be used *everywhere* transaction
    > status is tested.
    > 
    > 
    > The other interesting thing we saw was an Assertion failure.  The
    > postmaster log showed
    > 
    > WARNING:  relation "t1" page 196 is uninitialized --- fixing
    > TRAP: FailedAssertion("!((((PageHeader) ((PageHeader) pageHeader))->pd_upper == 0))", File: "hio.c", Line: 263)
    > LOG:  server process (PID 11296) was terminated by signal 6
    > 
    > The WARNING could only have come from VACUUM.  (Jan's testbed does
    > launch a VACUUM every so often.)  The Assert failure is in
    > RelationGetBufferForTuple where it is adding a new page to a table.
    > 
    > I interpret this as the guy doing RelationGetBufferForTuple added a
    > page, but before he could initialize it, he lost control for long enough
    > for a VACUUM to scan through the entire table, see the zeroed page, and
    > "fix" it.  Then when the first guy got control again, his Assert saying
    > the page was zeroes failed.  The window for this exists because bufmgr/smgr
    > do physically extend the file, but when the new buffer is returned
    > to the caller, it is only pinned, not locked.  So it is possible for
    > VACUUM to acquire lock on the new page before RelationGetBufferForTuple
    > does.  (This is exceedingly improbable, because VACUUM would have to
    > scan through the entire relation first --- it wouldn't examine the
    > new page at all unless it was included in RelationGetNumberOfBlocks
    > at the start of VACUUM's scan.  We have not been able to reproduce the
    > crash, but we did see it once.)
    > 
    > At first I thought this didn't have any conseqences worse than an
    > Assert, since after all both VACUUM and the original extender are just
    > trying to set up the page as empty.  But consider this scenario:
    > 
    > 1. Process A adds a zero page and gets blocked ahead of initializing it.
    > 
    > 2. Process B runs VACUUM ... sees the zero page ... fixes it ... puts it
    >    in FSM.
    > 
    > 3. Process C takes the page from FSM and puts some new tuples in it.
    > 
    > 4. Process A initializes the page, thus discarding C's tuples.
    > 
    > This could not happen if Asserts are enabled, because A would Assert
    > before re-initializing the page (and that's inside a lock so it is
    > sure to happen that way).  But with Asserts off, it's within the realm
    > of possibility under sufficiently heavy load.
    > 
    > This risk has been there since lazy VACUUM was created (it cannot
    > happen with VACUUM FULL because that takes an exclusive lock on the
    > whole table).  I believe the best fix is:
    > 
    > 1. Change RelationGetBufferForTuple to hold the "relation extension"
    > lock until after it's exclusive-locked the new buffer.
    > 
    > 2. Adjust VACUUM to not try to "fix" a page without getting the relation
    > extension lock.  If the page is still uninitialized after obtaining
    > that lock, then it must be a really lost page, not one that someone
    > is still in process of setting up.
    > 
    > There is a similar risk in btree indexes, with a similar fix.
    > 
    > 
    > We are currently testing candidate patches for these problems,
    > and so far haven't seen any further failures.
    > 
    > 			regards, tom lane
    > 
    > ---------------------------(end of broadcast)---------------------------
    > TIP 5: Have you checked our extensive FAQ?
    > 
    >                http://www.postgresql.org/docs/faq
    >