Thursday, August 15, 2013

Async-Await in .NET is not for performance – it’s for readability

While I see LINQ as one of the best, if not the best, language features on the .NET platform, I think the async-await idea is more smoke and mirrors than anything else. Actually, it’s even worse: because the underlying operations are not trivial, in some cases we are better off using linear code.

Idea based on desktop apps

The whole async-await idea mostly comes from the cumbersomeness of desktop GUI applications and UI threads: only a single thread can modify the UI, so when a long running process finishes on another thread it cannot update the UI itself; it has to ask the UI thread to display the result. It’s not a language problem; it’s a Windows/framework problem. However, Microsoft decided to “solve” it with a language feature, because:

  • Concurrent programming is a hot topic and Microsoft wanted to “be there”
  • The C# language evolves fast, so adding a new feature is quite easy


While the latter is a great thing (hey Java, can you hear me?), adding a language feature for marketing reasons is just plain wrong.

Myth #1: Await will make the code run faster

Async-await is a unique language feature because it’s implemented more as a code generator than anything else. Microsoft decided to support the async-await keywords with generated classes and a huge, somewhat undocumented framework that just “solves your problems”. The underlying implementation is so hidden that we tend to believe it will just make the code faster somehow.

It will not. It will reorganize your code so you don’t have to write callbacks to complete the rest of the function when a long running process completes. The rewriting of your linear code to emulate the callbacks is massive; it is worth having a look at it using ILSpy. Anyhow, the Task<> that is returned by an async function was there before (since .NET 4.0), nothing new about it: work started with Task.Factory.StartNew is scheduled to run on the ThreadPool, which, again, has been there since 1.0. However, in certain cases (like awaiting inside a loop) the execution will be essentially sequential. No parallelism, no speedup.
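The sequential behaviour of awaiting inside a loop can be made visible with a small sketch. It is only an illustration under stated assumptions: `AwaitLoopDemo`, `Counter` and `WorkAsync` are hypothetical names, and `Task.Delay` stands in for the long running process. Each method returns the highest number of operations that were in flight at the same time.

```csharp
using System;
using System.Threading.Tasks;

public static class AwaitLoopDemo
{
    // Hypothetical helper: tracks how many operations are in flight at once.
    private sealed class Counter
    {
        private readonly object gate = new object();
        private int inFlight;
        public int Peak { get; private set; }

        public void Enter()
        {
            lock (gate)
            {
                inFlight++;
                if (inFlight > Peak) Peak = inFlight;
            }
        }

        public void Exit()
        {
            lock (gate) { inFlight--; }
        }
    }

    // Stand-in for a long running operation; Task.Delay replaces real work.
    private static async Task WorkAsync(Counter counter)
    {
        counter.Enter();
        await Task.Delay(50);
        counter.Exit();
    }

    // Awaits each operation before starting the next one.
    public static async Task<int> AwaitInLoopAsync(int count)
    {
        var counter = new Counter();
        for (int i = 0; i < count; i++)
        {
            await WorkAsync(counter);
        }
        return counter.Peak;
    }

    // Starts every operation first and awaits them together.
    public static async Task<int> StartAllThenAwaitAsync(int count)
    {
        var counter = new Counter();
        var tasks = new Task[count];
        for (int i = 0; i < count; i++)
        {
            tasks[i] = WorkAsync(counter);
        }
        await Task.WhenAll(tasks);
        return counter.Peak;
    }
}
```

With the await inside the loop the peak stays at 1, because the next operation only starts once the previous one has finished. Starting all operations first and awaiting them with Task.WhenAll lets them overlap.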

On the other hand, while it can be really helpful in a desktop application, it makes little or no sense at all in a server environment – like ASP.NET. Unfortunately VS2013 will generate sample ASP.NET projects with async-await all over. It is wrong! Why?

  • Servers are optimized to handle a lot of parallel requests – they will manage the threads for you, no need for extra "optimizations"
  • A single request should never take too long – use queues if that’s the case and refresh the client using JavaScript instead


Unfortunately Microsoft decided to return Task<> in the new ASP.NET API calls, forcing us to either call .Wait() or use async-await. As a typical HTTP request takes around 100-300ms, the whole idea of asynchronously serving all requests in a synchronous environment is somewhat flawed. Remember, the user is actually waiting for your HTTP response.

Myth #2: Creating a thread is expensive

The biggest argument against creating a new thread for your long running task is that thread creation is expensive. But how expensive exactly? I did two quick measurements:

  • Start 1000 threads and wait (Thread.Sleep) in all for 1 second – all threads completed in 1136ms, meaning that the overhead for creating, starting, and stopping a single thread was around 0.14ms (136ms of overhead spread over 1000 threads)
  • Start 1000 threads and do not do any work in them – all threads completed in 95ms, meaning that the overhead was 0.095ms per thread


Let’s say 0.1ms per thread is the overhead, and creating even 1000 threads is not too stressful for any modern machine – including my average spec laptop. Let’s say you create a brand new thread for every request (bad idea) and a request takes 100ms to complete. That’s 0.1% overhead. I wouldn’t really rush to optimize this overhead.

However, if you have more than 1000 requests per second (which is more than 86 million requests per day) you really should use some kind of queuing to limit the number of concurrent threads – but all server environments like ASP.NET do that for you, don’t worry.
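The second measurement and the arithmetic above can be reproduced with a sketch along these lines. `ThreadOverhead` and its method names are hypothetical, and the timing will differ from machine to machine, so treat any number it prints as an indication only.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;

public static class ThreadOverhead
{
    // Starts `count` threads that do no work, joins them all, and returns
    // the elapsed wall-clock time in milliseconds.
    public static double MeasureEmptyThreadsMs(int count)
    {
        var sw = Stopwatch.StartNew();
        var threads = new List<Thread>(count);

        for (int i = 0; i < count; i++)
        {
            var thread = new Thread(() => { });
            threads.Add(thread);
            thread.Start();
        }

        foreach (var thread in threads)
        {
            thread.Join();
        }

        return sw.Elapsed.TotalMilliseconds;
    }

    // Thread overhead as a percentage of the request duration,
    // e.g. 0.1ms of overhead on a 100ms request.
    public static double OverheadPercent(double perThreadMs, double requestMs)
    {
        return perThreadMs / requestMs * 100.0;
    }

    // Requests per day for a constant rate of requests per second.
    public static long RequestsPerDay(long perSecond)
    {
        return perSecond * 60 * 60 * 24;
    }
}
```

Dividing `MeasureEmptyThreadsMs(1000)` by 1000 gives the per-thread overhead on your own machine. `OverheadPercent(0.1, 100)` corresponds to the 0.1% figure, and `RequestsPerDay(1000)` is 86,400,000.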

Myth #3: Concurrent programming can be easy

In an average environment concurrent programming is pretty much unnecessary – there is no real need to squeeze every computing bit out of a CPU, so a standard sequential program will do just fine. For the rare cases when we need to update the UI of a desktop app or do a little background processing, async-await will be more than enough.

However, for any high performance computing project, forget about the features offered by any language or framework. You will need to understand work scheduling, synchronization, hardware bottlenecks, data layout patterns in memory, and in some cases even SIMD instructions and branch prediction. No IDE, programming language or framework will understand that for you. Even in concurrent languages like Google Go you have to understand synchronization quite well to use your resources efficiently.

As currently we operate with a few CPU cores on a large, common, shared memory, concurrent programming is far from trivial – my guess is that until we change our hardware architecture it always will be.

Measurements

Out of curiosity I measured how long it would take to calculate the square root of a large amount of numbers, divided into 1000 work units (each unit covering 0, 1 million, or 3 million numbers), using different approaches (source code below):

  • Single thread – do all calculations on a single thread
  • Lot of threads – try to start 1000 threads to do it (actually around 8 can start because the CPU is too busy running other calculations. This is a surprisingly good self-limiting side effect).
  • Lot of async – async done in a loop (which will essentially run sequentially)
  • Parallel.For – using the built in .NET parallel library
  • Task WhenAll – using the WhenAll() synchronizer and simply creating and running the tasks
  • ThreadPool – Using the default QueueUserWorkItem on the default thread pool




A couple of things to take away from the chart:


  • Parallel.For is slightly faster than other solutions
  • Misusing async-await will yield linear performance without compiler warning
  • Creating a lot of threads to do the work is only slightly slower than the best performing solution (4% slower in this case).

So is async-await bad?

I think it is crucial for everyone to understand the internals of async-await before jumping in, because really surprising side effects can and will pop up - there are just too many questions about weird async-await behavior on Stack Overflow. However, the idea is very useful in situations where callback code would severely undermine the readability of the code, especially in desktop apps. Anywhere else? Not too sure.
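To make the readability point concrete, here is a minimal sketch of the same two-step operation written with callbacks and with await. Everything in it is an assumption made for illustration: `CallbackVsAwait` and `LoadValueAsync` are hypothetical, and `Task.Delay` stands in for the long running work.

```csharp
using System;
using System.Threading.Tasks;

public static class CallbackVsAwait
{
    // Hypothetical long running operation; Task.Delay stands in for real work.
    private static async Task<int> LoadValueAsync()
    {
        await Task.Delay(20);
        return 21;
    }

    // Callback style: the rest of the method lives inside nested continuations.
    public static Task<string> WithCallbacks()
    {
        return LoadValueAsync()
            .ContinueWith(first => LoadValueAsync()
                .ContinueWith(second => "sum = " + (first.Result + second.Result)))
            .Unwrap();
    }

    // Await style: the same steps read like linear code.
    public static async Task<string> WithAwait()
    {
        int first = await LoadValueAsync();
        int second = await LoadValueAsync();
        return "sum = " + (first + second);
    }
}
```

Both methods run the two operations one after the other and produce the same string. The difference is only in how the code reads, which is the one benefit argued for here.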


Code snippets

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

// Field of the class that holds the methods below:
// the number of work units (1000 in the measurements above).
private const int n = 1000;

private static void SingleThread()
{
    Stopwatch sw = Stopwatch.StartNew();
    for (int i = 0; i < n; i++)
    {
        Sq();
    }
    Console.WriteLine(sw.ElapsedMilliseconds);
}

public static void ParallelForeach()
{
    var sw = Stopwatch.StartNew();

    Parallel.For(0, n, i =>
    {
        Sq();
    });

    Console.WriteLine(n + " parallels foreach finished in " + sw.ElapsedMilliseconds);
}

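// Returns Task instead of void so that callers can wait for the measurement
// to finish and observe exceptions.
public static async Task LotofAsync()
{
    var sw = Stopwatch.StartNew();

    for (int i = 0; i < n; i++)
    {
        // Each task is awaited before the next one is started,
        // so the work units run one after another.
        await Task.Factory.StartNew(
        () =>
        {
            Sq();
        });
    }

    Console.WriteLine(n + " awaits started and finished in " + sw.ElapsedMilliseconds);
}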

public static void TaskWhenAll()
{
    var sw = Stopwatch.StartNew();
    var tasks = new List<Task>();

    for (int i = 0; i < n; i++)
    {
        var t = Task.Factory.StartNew(
        () =>
        {
            Sq();
        });
        tasks.Add(t);
    }

    Task.WhenAll(tasks).Wait();

    Console.WriteLine(n + " tasks WhenAll finished in " + sw.ElapsedMilliseconds);
}

public static void Threadpool()
{
    var sw = Stopwatch.StartNew();

    int total = n;
    var re = new ManualResetEvent(false);

    for (int i = 0; i < n; i++)
    {
        ThreadPool.QueueUserWorkItem(new WaitCallback(o =>
        {
            Sq();
            // Test the value returned by Interlocked.Decrement; re-reading 'total'
            // would race with other work items (as pointed out in the comments).
            if (Interlocked.Decrement(ref total) == 0)
            {
                re.Set(); // ThreadPool does not support waiting on tasks by itself.
            }
        }));
    }

    re.WaitOne();

    Console.WriteLine(n + " thread pool work items finished in " + sw.ElapsedMilliseconds);
}

public static void LotofThreads()
{
    var sw = Stopwatch.StartNew();

    var threads = new List<Thread>();

    for (int i = 0; i < n; i++)
    {
        var t = new Thread(new ThreadStart(() =>
        {
            Sq();
        }));

        threads.Add(t);
        t.Start();
    }

    threads.ForEach(t => t.Join());

    Console.WriteLine(n + " threads started and finished in " + sw.ElapsedMilliseconds);
}

private static double Sq()
{
    double sum = 0;
    for (int j = 0; j < 1 * 1000 * 1000; j++)
    {
        sum += Math.Sqrt(j);
    }
    return sum;
}
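
As an aside to the ThreadPool snippet above, the hand-rolled counter plus ManualResetEvent can also be expressed with CountdownEvent from System.Threading. This is not what was measured for the chart, only a sketch of an alternative; `ThreadPoolCountdown` and `RunAndWait` are hypothetical names.

```csharp
using System;
using System.Threading;

public static class ThreadPoolCountdown
{
    // Queues `count` work items on the default thread pool, blocks until all
    // of them have run, and returns how many of them completed.
    public static int RunAndWait(int count, Action work)
    {
        int completed = 0;

        using (var countdown = new CountdownEvent(count))
        {
            for (int i = 0; i < count; i++)
            {
                ThreadPool.QueueUserWorkItem(_ =>
                {
                    work();
                    Interlocked.Increment(ref completed);
                    countdown.Signal(); // one work item is done
                });
            }

            countdown.Wait(); // returns once Signal() has been called `count` times
        }

        return completed;
    }
}
```

A call such as `ThreadPoolCountdown.RunAndWait(n, () => Sq())` would take the place of the loop and the WaitOne() call. Note that an exception thrown by a work item is not handled in this sketch.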

9 comments:

  1. I think you are missing the point of async/await in ASP.NET. Of course it is an extremely bad idea to spin up new threads on a web server. But that is the reason you SHOULD use async/await. A thread that is waiting on an async request will do no work, yet it exists and consumes resources. Using await you could potentially allow that thread to do other work while it is waiting for the async call to complete. And an async call could be a web service request, a DB request, an I/O request, etc.

  2. I'm with Dragomir here.

    If you have a service that is taking requests and immediately calling some sort of expensive I/O, async/await will use IOCP to keep threads from being blocked while waiting for the I/O to return. This, in turn, frees the thread up to continue processing new requests (and/or processing a response after the I/O returns). So in terms of making I/O operations scale and use all of those cores on your server, async/await *is* about performance.

    But you're right to some degree, doing what you're showing above and using async/await for what would otherwise be non-asynchronous, non-I/O actions is a bad idea.

    Replies
    1. Thanks for the feedback Ben. Although I think I disagree. If your server is doing so many I/O requests, you'll be I/O bound much earlier than even exhausting the maximum number of threads you could create if you create a new thread for each request.

      Having tens of thousands of pending I/O requests waiting on your server is as unlikely as it sounds. If it happens, the least of your concerns will be async/await (why not just use queues or thread pools by the way?)

      In theory it's all nice, in practice it's pretty much irrelevant. If it doesn't help in HPC, it won't help in an I/O bound system either. I'm more than happy to be proven wrong by an actual system - with performance measurement.

    2. If your I/O is network I/O, it depends on the clients; if you have a 10GB/s pipe, you will run out of scalability with a thread per request faster than you will run out of bandwidth.

      The well known c10k problem is a good example of this: https://en.wikipedia.org/wiki/C10k_problem

  3. Are you trolling? I'm not sure.

    Performance and scalability are orthogonal concepts; they are related but they don't mean the same thing. Async/await is about scalability and responsiveness rather than direct CPU performance, but it does improve total performance through better CPU use. Certainly not in the examples you've given, though, where you'd want to use Parallel.For, as that's what it is for.

    Your threadpool example has a race condition, which would matter if it was doing anything more interesting: you want to check the return value of Interlocked.Decrement, not the variable, which may have changed in the interim.

    The Task.WhenAll example should be using async/await, which is why you had to use .Wait(), whereas the general use would be: await Task.WhenAll(), perhaps with a ConfigureAwait(false) on the end if you aren't the last point on a UI thread callback. .Wait() is only recommended as a last resort, because you can't await or you've become fed up with changing all the callers to return tasks/async. If you really wanted a sync pause it should be Task.WaitAll().

    This very quickly resembles the threadpool example, but with being able to wait on the threadpool - so now you have the ability to easily write threadpool based code that you can write in a synchronous manner, as shown by the lots of async example (you probably want to add a .ConfigureAwait(false) there too), and equally make use of parallelism, as shown by the Task.WhenAll example.

    But this is all drinking orange juice and complaining it's not apple juice, so I'm not sure if it's a satire post?

    If you want speed-ups in a tight loop which doesn't do anything slow (inc. memory access) then you want to be comparing with RyuJIT ( http://blogs.msdn.com/b/clrcodegeneration/archive/2014/05/12/ryujit-ctp4-now-with-more-simd-types-and-better-os-support.aspx ) and using the Microsoft SIMD-enabled Vector Types ( https://www.nuget.org/packages/Microsoft.Bcl.Simd/1.0.2-beta )

    Replies
    1. Sorry if that came off a bit harsh :-/

  4. Hi Adam,

    Your measurements are done on purely compute-bound code. There is absolutely no reason to run any compute-bound code in more threads than the number of CPUs. If you create 1000 active threads on a machine with 8 CPU cores, you are wasting CPU cycles on context switching. There is no way async can help you.

    Instead, you should measure an I/O-bound example.

    There are 2 ways to do I/O:
    1. Initialize request - get on the waiting queue - yield the quantum - context switch to another thread - receive interrupt - get on the active queue - get scheduled - context switch - read data from the buffer
    2. Initialize request - continue. For the current thread, the job is done. Once the interrupt arrives, another thread from the pool (on the waiting queue) gets on the active queue, gets scheduled and calls a callback.

    An async method in combination with await allows you to do 2, in a very elegant manner. If you don't believe it's elegant, check how to do the same thing using the completion port API directly.

    For most applications blocking system calls are just fine, but threads that are sleeping in the waiting queue consume memory. (On the other hand, they don't consume CPU, so your thread overhead estimation is irrelevant.) Don't create too many, and you will be fine.

    For applications where scalability is critical, non-blocking calls allow more flexibility. Would you write those kinds of apps using C#? That I don't know.

    Cheers,
    Artem

  6. I thought async await should be used only for I/O operations (network calls, disk reads/writes), and that async await does not lock up the thread after initiating the call, thus releasing the thread back to the thread pool to serve any other incoming request. So async await, in my opinion, helps improve the scalability of I/O operations by freeing up the thread between the initiation of the network call and the arrival of the response.
