<< August 28, 2006 | Home | August 30, 2006 >>

Practical Multithreading: Part 2

Live, from St. Louis, it's Dale Wilson!

Today's OCI C++ lunch features Dale Wilson on multithreading. I missed his talk last month on syncrhonization. This month Dale focuses on data sharing.

First person is Dale now.

The problem

The simple function:

int sequenceNumber() {
  static int seed;
  return seed++;
}
is thoroughly non-thread-safe.

Fix Number One

One fix for the problem is

int sequenceNumber() {
  static int seed;
  return atomic_increment(seed);
}
where atomic_increment() is implemented by using a hardware instruction that does the right thing.

Fix Number Two

On single processor machines this solution

class Disabler {
  Disabler() { cli(); }
  ~Disabler() { sti(); }
};

int sequenceNumber() {
  Disabler safe;
  static int seed = 0;
  return seed++;
}
works. Here I'm using the RAII (resource allocation is initialization) style to clear the interrupt flag on entry to the function and set the interrupt flag on exit from the function. This is very fast and runs with very low overhead. But it cannot be used on multi-processor system unless the processor affinity of the process is set to a single processor.

The Real Solution

The real solution, of course, is:

mutex seedMutex;

int sequenceNumber() {
  mutex::guard(seedMutex);
  static int seed = 0;
  return seed++;
}

(Starting a thread in constructors is a bad thing.)

Mutex Fundamentals

To protect data with a mutex:

  • Identify small sets of coherent shared data. (As small as possible, but no smaller.)
  • Protect each set with a mutex.
  • Use RAII guard to lock/unlock the mutex.

If you lump three or four sets together, that's fine sometimes. However, the sets should not intersect, or the mutex approach will not work.

Mutexes should be locked for a short period of time. As short as possible, but no shorter: If you need to look at the shared data and make a decision, lock the mutex until you've made the decision and committed to it.

Do not perform operations on the outside world while a mutex is locked. Making CORBA calls while holding a mutex is a bad practice.

Minimize shared data.

Advantages of mutexes:

  • Rock solid
  • Well understood
  • Easy to use

Disadvantages of mutexes:

  • Deadlock
  • Priority inversion
  • Thread starvation
  • More powerful than necessary

Recursive mutexes should be avoided, particularly near condition codes. (The designer of the pthreads API put recursive mutex in as a joke.)

If you think you need recursive mutex, then your design is wrong.

ReadWrite Mutexes

There are performance penalties associated with mutexes. Excluding multiple readers can cause major performance problems.

Think the subject index in the UCLA library. It's read many more times than written.

To solve this problem, a ReadWrite mutex can be used. But ReadWrite mutexes is tricky to implement. For example, Boost recalled its ReadWrite mutex.

ReadWrite mutexes that lets you promote a read mutex to a read/write mutex is always suspicious.

Lock-free, wait-free data structures and algorithms

Why can't we use atomic operations all the time?

Answer: Lock-free and wait-free data structure and algorithms.

(Used to be called non-blocking. Later renamed lock-free/wait-free.)

Given a data structure and a set of operations to be applied to it, a lock-free algorithm guarantees that at least one operation will succeed. It avoids some of the mutex overheads. Starvation is still possible.

A wait-free algorithm guarantees that every operation on the data structure will complete within a predictable bounded number of steps. Wait-free algorithms are usually complex and slow. But for certain real-time systems, the timing guarantee requires them. And they do work.

Solving the consensus problem.

Various atomic primitives have different consensus numbers---the number of threads that can simutaneously accessing the feature that can be "sorted out." You want your consensus number to be infinite.

We'll look at three atomic operations:

  • Compare and swap (CAS)
  • Double compare and swap (DCAS)
  • Load Locked/Store Conditional (LLSC)

Compare and swap

Here's what CAS does:

template 
bool CAS (
  volatile IntegralType &target,
  IntegralType expected,
  IntegralType replacement)
{
  ATOMIC:
  if (target != expected) return false
  target = replacement;
  return true;
}

One advantage of CAS is that it is available on a variety of platforms. The disadvantages of CAS are that it is barely strong enough to solve the problem, and that it doesn't solve the A-B-A problem.

Double compare and swap

DCAS does a single compare, double swap:

template 
bool CAS (
  volatile IntegralType &target,
  IntegralType expected1,
  IntegralType replacement1,
  IntegralType expected2,
  IntegralType replacement2)
{
  ATOMIC:
  if (target != expected) return false
  target1 = replacement1;
  target2 = replacement2;
  return true;
}

The problem with DCAS is that it doesn't exist. There are many papers written that assume the existence of such an operation. There results are beautiful but not practical.

Load Locked/Store Conditional

Here's what Load Locked/Store Conditional looks like:

template 
IntegralType LL(IntegralType &target)
{
  ATOMIC:
  dohicky.lock();
  return target;
}

void SomethingDangerousHappens() {
  dohicky.unlock();
}

...

Dohicky is a technical term here. LLSC is more powerful than CAS and DCAS. And it actually exists on some CPUs (i.e. Xeon). The disadvantages of LLSC is that "something dangerous" is poorly defined; and there is often only one Dohicky per CPU.

So CAS works and is available on many platforms.

General purpose CAS-based Lock-free algorithm

  • All access to the data structure goes through a CAS controled pointer.
  • Writer allocate new structure and swap it it./li>
  • Need lock-free allocator and lock-free deallocator (a garbage collector)

And any platform that relies on a garbage collector is fundamentally wrong. (Steve Totten: This talk is not only practical multithreading. It's opinionated multithreading.)

Special purpose CAS-based Lock-free algorithms

Practical (but complex!) lock-free implementations exist for:

  • Linked lists
  • Queues
  • Deques
  • Maps

The CASContainer

I wrote CASContainer not to fit a traditional data structure into a lock-free algorithm, but try to find the natural data structures for a lock-free algorithm.

I added it to the the Thread Support Library.

As long as you limit yourself to the capabilities it offers, you are fine.

It's essentially a linked list:

  • push uses compare and swap until it succeeds.
  • swap_list puts a new list into the container and get back what used to be in the container.
  • Pop-all swaps an empty list in.
  • push_list pushes a list into the container.

I so much wanted to do Pop One. It can be done by pop_all, remove the desired entry, and then push_list the remainder back. This will destroy the ordering of entries in the container if ordering is important. (Other threads might have pushed stuff into the container while you have the list checked out.)

However for a common use case---a single consumer/multiple supplier queue, order is preserved.

A Hybrid approach

An unresolved problem: Lock free algoriths do not synchronize threads.

A hybrid approach has to be used for the dispatching problem from last month's lecture.

The performance numbers are very interesting.

First person is me now.

Dale shows several charts comparing the various approaches on different arhcitecture/operating system. I couldn't catch all the details. But I did hear something alone the lines of "Did I tell you I like the AMD chips." "But my Xeon is more expensive than your AMD."

Tags :