17 Nov 2017 • Deadlock

In a bounded lock-free multi-producer queue, pushing an element takes two steps. First you need to acquire a node for writing to avoid races with other producers, then you need to flag the node as fully written once you're done with it. Then to pop from the queue you check if the head node has been fully written, then try to acquire it if you're multi-consumer too.

(Look here or here for the details.)

One issue to note is that dequeues can fail, even though the queue is non-empty, in the sense that items have been successfully enqueued and not yet dequeued. Specifically this can happen:

It makes sense to guard such a queue with a semaphore. An obvious way to do that is to make the semaphore count how many elements are in the queue. You can't do that with a queue like this! If you write something like:

// producer
if( q.push( x ) )
	sem_post( &sem );

// consumer
while( true ) {
	sem_wait( &sem );
	T y;
	if( q.pop( &y ) )
		// do things with y
}

but that's incorrect because you decrement sem even when you haven't dequeued anything, and the item pushed by P2 gets lost. So instead you might try:

// producer
if( q.push( x ) )
	sem_post( &sem );

// consumer
while( true ) {
	T y;
	if( q.pop( &y ) ) {
		// do things with y
		sem_wait( &sem );
	}
}

but then you're spinning while you wait for P1 to finish.

We ran into a particularly nasty instance of the first case of this at work. We have a queue which accepts various commands, and one of them is a "flush" command where the sender goes to sleep and the consumer wakes them up again (with an Event) once the flush is done. So something like this happened:

The fix is to make queueSem count the (negative) number of consumer threads that are asleep. The change is very simple:

// consumer
while( true ) {
	T y;
	if( q.pop( &y ) )
		// do things with y
	else
		sem_wait( &sem );
}

Which avoids the deadlock thusly:

I'm not totally satisfied with that solution because I feel like I've misunderstood something more fundamental, something that will stop me running into similar problems in the future. Please email me if you know!

All in all writing an MPSC lock-free queue has been an enormous waste of time courtesy of shit like this. We need to enqueue things from a signal handler, which means locks and malloc are gone. Even so, since we're single-consumer and we only push in signal handlers so I believe we could have used a mutex to avoid races between producers and kept the consumer lock-free. I wasn't sure if you could get signalled while still inside a signal handler. FYI the answer is no, unless you set SA_NODEFER. I also don't know if pthread_mutex_lock etc are signal-safe, and of course if you try to Google it you just get pages of trash about "you can deadlock if the thread holding the lock gets signalled!!!!!". Presumably they are but I didn't want to risk it.