Thread

  1. Re: PoC: VALGRIND_MAKE_MEM_NOACCESS for dynamic shared memory

    Tomas Vondra <tomas@vondra.me> — 2026-05-04T14:21:55Z

    
    On 5/4/26 15:56, Andres Freund wrote:
    > Hi,
    > 
    > On 2026-05-04 10:18:31 +0200, Tomas Vondra wrote:
    >> A couple of days ago we got a report regarding an incorrect shmem size
    >> calculation in btree [1], leading to a buffer overflow / memory
    >> corruption. Which became a much less common issue in our code, thanks to
    >> valgrind and similar tools. But it took me a while to realize valgrind
    >> won't catch this because we only instrument private memory (palloc et
    >> al), while shmem is left alone.
    >>
    >> I was wondering if it's feasible to improve this. Attached is a trivial
    >> patch that adjusts shm_toc.c to add a couple NOACCESS bytes after each
    >> entry in the segment. It seems to do the trick - with this we get a
    >> reasonable report (for the reproducer provided in the bug report, before
    >> it got fixed by 748d871b7c) from valgrind, with invalid accesses. See
    >> the attached .log for an example. It's much better than the confusing
    >> crashes due to corrupted state.
    >>
    >> There's an issue, though. It seems the valgrind memory markings are not
    >> shared between processes. The leader sets the shm_toc up, marks the
    >> ranges as NOACCESS, and then checks it while accessing the memory. But
    >> the parallel workers don't seem to see this, and so will produce no
    >> reports. I'm assuming this is the case, because all the reports come
    >> from the leader, never from the workers. Maybe there's a different
    >> explanation (e.g. maybe it's just the leader touching the memory?).
    > 
    > I assume the issue is just that the workers don't have the NOACCESS markers? I
    > think you'd need to do them in every process using the shm_toc.  Either by
    > doing it in shm_toc_attach() or in shm_toc().
    > 
    
    Right, but it'd also need to know how long the entry is. Which AFAIK it
    does not. We don't want to redo the calculation in every worker, but we
    might add a "len" field to the toc entry.
    
    > 
    > 
    >> An alternative would be to do mprotect(), but unfortunately it requires
    >> page-aligned ranges, which makes it somewhat useless for small overflows
    >> of a couple bytes (like here). It should work cross-process, I think,
    > 
    > It doesn't work across processes. Every process has their own mprotect "view".
    > 
    
    Ah, OK. So I guessed wrong, and it has the same issue as NOACCESS.
    
    > 
    >> FWIW the PoC patch adds a 32-byte chunk, not just a single byte. This is
    >> intentional, because if the state is an array, it's quite possible the
    >> invalid access steps over a fair number of bytes. I'm actually thinking
    >> it should be even larger.
    >>
    >> This modifies just the dynamic shmem, but maybe we should do this even
    >> for the "regular" shmem allocated at start. Issues in that would likely
    >> cause crashes pretty quick (unlike the btree issue, which requires a
    >> somewhat special reproducer), but a nice valgrind report helps.
    > 
    > I can tell you from experience that no, it's not necessarily quickly caught.
    > So yes, I think we should definitely do that.
    > 
    
    Understood.
    
    > 
    > 
    >>  /*
    >>   * Allocate shared memory from a segment managed by a table of contents.
    >>   *
    >> @@ -119,9 +127,19 @@ shm_toc_allocate(shm_toc *toc, Size nbytes)
    >>  	}
    >>  	vtoc->toc_allocated_bytes += nbytes;
    >>  
    >> +#ifdef USE_VALGRIND
    >> +	vtoc->toc_allocated_bytes += NUM_NOACCESS_BYTES;
    >> +#endif
    >> +
    >>  	SpinLockRelease(&toc->toc_mutex);
    >>  
    >> -	return ((char *) toc) + (total_bytes - allocated_bytes - nbytes);
    >> +#ifdef USE_VALGRIND
    >> +	/* make the bytes at the end no-access */
    >> +	VALGRIND_MAKE_MEM_NOACCESS(((char *) toc) + (total_bytes - allocated_bytes - NUM_NOACCESS_BYTES),
    >> +							   NUM_NOACCESS_BYTES);
    >> +#endif
    >> +
    >> +	return ((char *) toc) + (total_bytes - allocated_bytes - nbytes - NUM_NOACCESS_BYTES);
    >>  }
    > 
    > The size is already rounded up by that point:
    > 
    > 	/*
    > 	 * Make sure request is well-aligned.  XXX: MAXALIGN is not enough,
    > 	 * because atomic ops might need a wider alignment.  We don't have a
    > 	 * proper definition for the minimum to make atomic ops safe, but
    > 	 * BUFFERALIGN ought to be enough.
    > 	 */
    > 	nbytes = BUFFERALIGN(nbytes);
    > 
    > Which means that you'll often have a up to 32byte pad at the end of the
    > (ALIGNOF_BUFFER=32) allocation already. I don't care about the waste, but the
    > ALIGNOF_BUFFER padding will often prevent detecting smaller out-of-bounds
    > accesses.
    > 
    
    Good point. I forgot about this alignment rounding. I don't care about
    the waste either (it'd be a bit silly in a valgrind build anyway), but
    not catching smaller mistakes would not be great.
    
    
    regards
    
    -- 
    Tomas Vondra