Why AutoResetEvent is slow and how to synchronize threads in .NET
The problem is generally very simple: I have one book but two people are trying to read/write from it – how to solve the problem so the overall performance is faster than with a single guy (many processors, one shared memory).
There are many different ways to synchronize these workers and some of them are really efficient and others are just terribly slow.
Nehalem
The good news is, that from the first generation of i7 processors (Nehalem) the processor level synchronization primitives became really fast – Intel did a good job, so it’s on average 60% faster than it used to be on the P4 processors.
These synchronization primitives are XCHG and CMPXCHG commands: the first one replaces a value with another one in a single step, the second one replaces the value if it equals to a parameter (a sample implementation later in the article). To make them atomic, the processor supports a LOCK modifier, so these commands (even though they take longer time than a single clock cycle) cannot be interrupted.
Even on high level (C++, C#, Java…) the synchronization logics are built on these processor level primitives, so it is essential to know them if we want to get into multithreaded programming.
Context switching
One of our enemies in multithreaded programming is context switching – this process takes place when a thread passes the execution to the other one. It has to save its state (processor level caches, TLB...) and the other one has to load his old saved one.
It is similar when the above mentioned book is passed from one worker to the other: the previous has to note the page number, pass the book to the new one; the new has to open the book at his own page, put on the reading glasses and start to work. The more frequent the context switching is, the more time we waste with it. If it is very frequent (too many things are running at the same time), the processor can get into a so called ‘trashing’ state, when basically no actual work is done.
So, now we know the basics, let’s see how to synchronize threads in C#. The following examples show strict synchronization, when the two threads are always switching each other (ABAB…) and just pure exclusivity where the two threads are never accessing the resources at the same time but they may repeat before the other can run, not guaranteeing the ABAB… pattern (but the number of runs are somewhat balanced because the operating system tries to schedule them fairly).
Simple spin lock implemented with a Boolean variable. The two threads are in a round-robin share, so the run scheme is ABABAB…
public static void Thread1Spin() { while (run) { while (!aRun) Thread.Sleep(0); aRun = false a++; bRun = true } } | public static void Thread2Spin() { while (run) { while (!bRun) Thread.Sleep(0); bRun = false b++; aRun = true } } |
Synchronizing two threads using AutoResetEvent. The run scheme is ABABAB….
public static void Thread1Ar() { while (run) { arA.WaitOne(); a++; arB.Set(); } } | public static void Thread2Ar() { while (run) { arB.WaitOne(); b++; arA.Set(); } } |
Synchronizing two threads using Monitor. The exclusivity is guaranteed but the run scheme is somewhat random: A…B…B…A…B… The number or A and B runs are roughly equivalent.
public static void Thread2Monitor() { while (run) { Monitor.Enter(monitorobjectA); b++; Monitor.Exit(monitorobjectA); } } | public static void Thread2Monitor() { while (run) { Monitor.Enter(monitorobjectA); b++; Monitor.Exit(monitorobjectA); } } |
Using the CPU level XCHG, which is Interlocked.Exchange in the .NET framework. The run patter is somewhat random, but roughly balanced: A…B…B…A…B…
public static void Thread1Interlocked() { while (run) { while (Interlocked.Exchange( ref interlock, 1) != 0) { Thread.Sleep(0); } a++; interlock = 0; } } | public static void Thread2Interlocked() { while (run) { while (Interlocked.Exchange( ref interlock, 1) != 0) { Thread.Sleep(0); } b++; interlock = 0; } } |
Synchronizing two threads using the CPU level CMPXCHG command, which is Interlocked.CompareExchange in the .NET framework. The run pattern is guaranteed to be ABABAB…
public static void Thread1InterlCompExch() { interlock = 1; while (run) { while (Interlocked.CompareExchange( ref interlock, 0, 1) != 1) { Thread.Sleep(0); } a++; interlock = 2; } } } | public static void Thread2InterlCompExch() { while (run) { while (Interlocked.CompareExchange( ref interlock, 0, 2) != 2) { Thread.Sleep(0); } a++; interlock = 1; } } } |
No synchronization at all, let the two threads run as they prefer, incrementing the two distinct variables. The run patter is somewhat random, but roughly balanced: A…B…B…A…B…
public static void Thread1NoLock() { while (run) { a++; } } | public static void Thread1NoLock() { while (run) { b++; } } |
After running them for 10 seconds on an i7 Sandy Bridge, 4 core, 2.3GHz cpu, the results were quite interesting:
Method | A value | B value | A+B together |
---|---|---|---|
Bool spinlock | 23,836,625 | 23,836,626 | 47,673,251 |
AutoResetEvent | 187,369 | 187,369 | 374,738 |
Monitor | 246,575,356 | 181,121,252 | 427,696,608 |
Interlocked.Exchange | 396,831,918 | 410,724,451 | 807,556,369 |
Interlocked.CompareExchange | 23,570,194 | 23,570,195 | 47,140,389 |
Two threads, no sync | 2,499,541,636 | 2,676,081,957 | 5,175,623,593 |
Single thread, no sync | 5,032,547,986 | 0 | 5,032,547,986 |
There are a couple of very important things to learn from this: under no circumstance should anyone use AutoResetEvent in a performance critical system, as it is 2 magnitudes (100) slower than any other solution.
It is the best to let the system run as it can without enforcing synchronization of any kind, as the operating system scheduler is very good, so if the threads do not wait for each other the performance will increase when can use more cores.
If some kind of common synchronized resource access is required, it is at least a magnitude slower than native performance. If possible, create different resources for different threads and remove synchronization.
Monitor is a very efficient way to synchronize threads, however, writing our own custom Interlocked.Exchange based lock doubles the performance.
If strict round-robin style synchronization is required, it is two magnitudes slower than native performance and does not really matter what implementation we use (interestingly).
the Thread.Sleep(0) gets you 100% CPU, autoreset is still the winner if you need low CPU and a predictable run scheme
ReplyDeleteany idea what autoreset events are doing internally ?
DeleteThis is terrible advice. Any time one thread mutates data that another thread must read it requires synchronization primitives, unless you don't care about the final state of that data. In multi-core systems, a core may read stale data from local cache instead of RAM if no memory barrier is given. You can only avoid this with copy on write (COW), read only, or immutable data handling. The only other case is static field initializers in .NET which are internally guaranteed to be thread safe.
ReplyDelete