Practical Multithreading: Part 2
Live, from St. Louis, it's Dale Wilson!
Today's OCI C++ lunch features Dale Wilson on multithreading. I missed his talk last month on syncrhonization. This month Dale focuses on data sharing.
First person is Dale now.
The problem
The simple function:
int sequenceNumber() {
static int seed;
return seed++;
}
is thoroughly non-thread-safe.
Fix Number One
One fix for the problem is
int sequenceNumber() {
static int seed;
return atomic_increment(seed);
}
where atomic_increment() is implemented by using a hardware instruction that does the right thing.
Fix Number Two
On single processor machines this solution
class Disabler {
Disabler() { cli(); }
~Disabler() { sti(); }
};
int sequenceNumber() {
Disabler safe;
static int seed = 0;
return seed++;
}
works. Here I'm using the RAII (resource allocation is initialization) style to clear the interrupt flag on entry to the function and set the interrupt flag on exit from the function. This is very fast and runs with very low overhead. But it cannot be used on multi-processor system unless the processor affinity of the process is set to a single processor.
The Real Solution
The real solution, of course, is:
mutex seedMutex;
int sequenceNumber() {
mutex::guard(seedMutex);
static int seed = 0;
return seed++;
}
(Starting a thread in constructors is a bad thing.)
Mutex Fundamentals
To protect data with a mutex:
- Identify small sets of coherent shared data. (As small as possible, but no smaller.)
- Protect each set with a mutex.
- Use RAII guard to lock/unlock the mutex.
If you lump three or four sets together, that's fine sometimes. However, the sets should not intersect, or the mutex approach will not work.
Mutexes should be locked for a short period of time. As short as possible, but no shorter: If you need to look at the shared data and make a decision, lock the mutex until you've made the decision and committed to it.
Do not perform operations on the outside world while a mutex is locked. Making CORBA calls while holding a mutex is a bad practice.
Minimize shared data.
Advantages of mutexes:
- Rock solid
- Well understood
- Easy to use
Disadvantages of mutexes:
- Deadlock
- Priority inversion
- Thread starvation
- More powerful than necessary
Recursive mutexes should be avoided, particularly near condition codes. (The designer of the pthreads API put recursive mutex in as a joke.)
If you think you need recursive mutex, then your design is wrong.
ReadWrite Mutexes
There are performance penalties associated with mutexes. Excluding multiple readers can cause major performance problems.
Think the subject index in the UCLA library. It's read many more times than written.
To solve this problem, a ReadWrite mutex can be used. But ReadWrite mutexes is tricky to implement. For example, Boost recalled its ReadWrite mutex.
ReadWrite mutexes that lets you promote a read mutex to a read/write mutex is always suspicious.
Lock-free, wait-free data structures and algorithms
Why can't we use atomic operations all the time?
Answer: Lock-free and wait-free data structure and algorithms.
(Used to be called non-blocking. Later renamed lock-free/wait-free.)
Given a data structure and a set of operations to be applied to it, a lock-free algorithm guarantees that at least one operation will succeed. It avoids some of the mutex overheads. Starvation is still possible.
A wait-free algorithm guarantees that every operation on the data structure will complete within a predictable bounded number of steps. Wait-free algorithms are usually complex and slow. But for certain real-time systems, the timing guarantee requires them. And they do work.
Solving the consensus problem.
Various atomic primitives have different consensus numbers---the number of threads that can simutaneously accessing the feature that can be "sorted out." You want your consensus number to be infinite.
We'll look at three atomic operations:
- Compare and swap (CAS)
- Double compare and swap (DCAS)
- Load Locked/Store Conditional (LLSC)
Compare and swap
Here's what CAS does:
templatebool CAS ( volatile IntegralType &target, IntegralType expected, IntegralType replacement) { ATOMIC: if (target != expected) return false target = replacement; return true; }
One advantage of CAS is that it is available on a variety of platforms. The disadvantages of CAS are that it is barely strong enough to solve the problem, and that it doesn't solve the A-B-A problem.
Double compare and swap
DCAS does a single compare, double swap:
templatebool CAS ( volatile IntegralType &target, IntegralType expected1, IntegralType replacement1, IntegralType expected2, IntegralType replacement2) { ATOMIC: if (target != expected) return false target1 = replacement1; target2 = replacement2; return true; }
The problem with DCAS is that it doesn't exist. There are many papers written that assume the existence of such an operation. There results are beautiful but not practical.
Load Locked/Store Conditional
Here's what Load Locked/Store Conditional looks like:
templateIntegralType LL(IntegralType &target) { ATOMIC: dohicky.lock(); return target; } void SomethingDangerousHappens() { dohicky.unlock(); } ...
Dohicky is a technical term here. LLSC is more powerful than CAS and DCAS. And it actually exists on some CPUs (i.e. Xeon). The disadvantages of LLSC is that "something dangerous" is poorly defined; and there is often only one Dohicky per CPU.
So CAS works and is available on many platforms.
General purpose CAS-based Lock-free algorithm
- All access to the data structure goes through a CAS controled pointer.
- Writer allocate new structure and swap it it./li>
- Need lock-free allocator and lock-free deallocator (a garbage collector)
And any platform that relies on a garbage collector is fundamentally wrong. (Steve Totten: This talk is not only practical multithreading. It's opinionated multithreading.)
Special purpose CAS-based Lock-free algorithms
Practical (but complex!) lock-free implementations exist for:
- Linked lists
- Queues
- Deques
- Maps
The CASContainer
I wrote CASContainer not to fit a traditional data structure into a lock-free algorithm, but try to find the natural data structures for a lock-free algorithm.
I added it to the the Thread Support Library.
As long as you limit yourself to the capabilities it offers, you are fine.
It's essentially a linked list:
- push uses compare and swap until it succeeds.
- swap_list puts a new list into the container and get back what used to be in the container.
- Pop-all swaps an empty list in.
- push_list pushes a list into the container.
I so much wanted to do Pop One. It can be done by pop_all, remove the desired entry, and then push_list the remainder back. This will destroy the ordering of entries in the container if ordering is important. (Other threads might have pushed stuff into the container while you have the list checked out.)
However for a common use case---a single consumer/multiple supplier queue, order is preserved.
A Hybrid approach
An unresolved problem: Lock free algoriths do not synchronize threads.
A hybrid approach has to be used for the dispatching problem from last month's lecture.
The performance numbers are very interesting.
First person is me now.
Dale shows several charts comparing the various approaches on different arhcitecture/operating system. I couldn't catch all the details. But I did hear something alone the lines of "Did I tell you I like the AMD chips." "But my Xeon is more expensive than your AMD."