Sunday, October 26, 2014

Introduction to Multithreaded and Parallel Development - Part 2

Multithreading Basic issues with Compilers

From Wikipedia:
"A compiler is a computer program (or set of programs) that transforms source code written in a programming language (the source language) into another computer language (the target language...)..."

When we compile a line such as

x=x + y

Its first compiled to about 4 IL command:

ldsfld int32 x
ldsfld int32 y
add
stsfld int32 x
 
and about 2 assembly commands:

mov         eax,dword ptr ds:[00650C4Ch] 
add         dword ptr ds:[650C48h],eax 
 
We have no problem with IL code as it's not really executed by the CPU, its only later compiled to machine code.

We might encounter the following problems:

- A thread could be scheduled off after the 1st line of machine code, another thread will start and modify [00650C4Ch]  with a different value, leaving eax with incorrect value. this is called a racing condition, as the threads are racing with each other, the result is usually non-deterministic and unpredictable, but the problem might not show immediately, as the development machine's CPU might be slower or faster, have less or more cores and other differences between executing machines.

- Variables might be cached by the compiler, the compiler doesn't understand that other threads might want to use it, this might actually speed things up on a single thread program but will create many points of stale data, leading to incorrect results, but in .NET it could be solved by using volatile (which creates a performance problem as it requires the bus to lock and force the cores' cache to update).

- Compilers change, with almost every new version of .NET framework, Microsoft work hard to make the compiled code execute faster, these optimizations can cause your code to misbehave as new racing conditions may be created and that's one of the reasons it's important to write correct multithreaded code and not rely on the fact that your code does not crash or produce incorrect results.

Introduction to Threads and Synchronization

 
Critical Sections are sections of code that can only be executed one at a time, if multiple threads are attempting to enter a critical section, the first one will execute, the rest will wait for it to finish and then execute serially one by one.

Atomic operations are blocks of code that can execute without external elements affecting their state/data, so if we have a CPU atomic operation in relation to the example above, the value of X will always be X+Y.

Threads are execution units, they have their own stack and execute a method that was passed to them. When threads are created and started, they are scheduled in the OS scheduler for execution but they do not execute concurrently with every other thread in the system as there are less cores than threads.

What happens is, the OS have a list of threads to execute and by switching threads on and off it creates an illusion of concurrency. the amount of actual concurrency depends on the amount of CPU cores, so if we have 2 cores and 10 threads, only 2 threads are executed concurrently. Which threads are executed depends on the process's priority and the thread's priority, that process of scheduling threads on and off is called context switching, which is not free, so the more threads we have in the system, there is a performance penalty as there is less work actually done and more CPU resources are spent on context switching.

Locks guarantee that the code block locked is serially executed one at a time, locks are implementations of critical sections, there are many types of locks, each suitable for a different purpose.

Interlocked operations are considered to be atomic operations, consider the following, on a single core CPU, a single CPU command is considered atomic, a single line of code could be many CPU commands. on a multiple core CPU, a single command is not considered atomic because other cores could access the same memory location, to prevent that situation the CPU can lock the bus to guarantee that. so assuming the CPU command INC (increment by one) is desired, alone it will work on a single core CPU, but on multiple core CPU, LOCK INC must be executed instead, which first locks the bus, increment by one and unlocks the bus.

The problem most developers find with threads is not the simplicity that threads share memory and execute code, but the timing and how memory is written. When you write sequential code you don't care if the compiler decides to cache a variable or how it writes and decomposes long stream of bytes, When you write to a memory location, you don't expect other bytes to be read and written, only the bytes you want, but these bytes could be stale and not contain the real value, or they could be part of a bigger memory address, like an array, a struct or just a part of a collection of bytes defined under a different data type.
 

- Why we shouldn't lock every variable?

Lock acquisition is slow and even if it were fast, you would lose most of the gains you could have gotten from having multiple threads as the locked block is executed serially.

Locks needs to be thought about, as deadlocks are very easy to create with multiple locks.

If you're locking a single variable and not a process, it's easier to use an Interlocked operation if it applies.
 

- Shared State vs. Shared Memory

While shared memory is what we have in a process, there is no problem writing to a memory location whenever we want, the CPU doesn't make us lock that memory location or anything like that. What we do have to consider is that writing to a memory location could potentially affect other parts of our software, these parts look for variable values to determine their state and what they need to do next. Other than hardware problems memory cannot be corrupted, what gets corrupted is the state our threads see, that state needs to be modified as a whole.

Let's take a simple Dictionary as an example, first, items are divided to buckets, then buckets get full and links to other buckets can be created. Assuming we don't lock our dictionary's STATE, multiple threads can all create their own buckets, each one having only one item, so when they are added to the Dictionary, the end result could be only one bucket with one item or any kind of mutation in the dictionary's state. To make matters worse, we might have a counter that counts how many items we have in the dictionary and that counter can show 5 items while we have only one. Now let's make it even worse by saying we have a process that reorders the buckets after 10 inserts and well, you understand the gravity of our situation.

So that's another reason why locking individual variables is not a good practice.
 

- How to divide the work?

The easiest, most common, is to divide tasks to their own threads. In the example above, one task listens to incoming document, another does the conversion, etc'.

Another way is to divide long loops to shorter loops, each one executes in its own thread.
But, if all threads are updating the same variable, that variable could be a bottleneck and all cores needs to synchronize their cached value from the same memory location, this process of updating the cache is relatively slow and will slow things down if not make the whole process slower than a single thread.

You should take into account that starting threads takes time, so if the thread does not have a long-term task, you should use thread pool implementation, there is one provided by the framework, but if your tasks are depending on each other, you might want to consider implementing multiple TaskScheduler(s).

Multithreaded application usually doesn't linearly scale as there are races to resources, sometimes not even ones that you're aware of, CPU cache is one of them, you'll usually get the best performance if you don't share any modifiable resources between threads, read only resources are ok. at the end of the task, join the data in your calling thread.
 

- What happens if a thread fails?

Exception handling is important in a multithreaded environment, threads, like any other code, can fail, plan for failure, always catch exceptions and know how to handle them. In our example, what happens if the conversion fails? do we retry? how many times? do we mark the document as faulty and try again later? do we give up?

But most importantly, what are the reasons for failure? can we prevent them with better technique? strive to avoid problems as problems might indicate your logic is not fully optimized.

Our threads are just like any other block code, so we can try-catch and report back to the main thread what just happened or in the case of threadpools, BackgroundWorker has RunWorkerCompletedEventArgs which can report errors back.

If all else fails, application wide exception handling could be done with AppDomain.UnhandledException, but it should not be done for error handling, only for error reporting.

In my view, there are two most important reasons to monitor failed threads, the first is releasing locks (the lock syntactic sugar will release the lock, any other type of lock needs to be explicitly unlocked) and the second is to make sure that if another thread waits for the input from the first thread it will wait indefinitely, therefore not doing its task and causing the program to misbehave.
 

- What resource do threads take?

The most obvious one is CPU resources, when you add another thread, the time slice your application gets is being divided by the amount of threads you have, more threads doesn't necessarily mean more work or more time. There is a point in which more threads means less work as context switching takes more CPU than actual work is being done, the best range is the number of CPU cores up to about twice the CPU cores, depending on the ratio between work time and sleep time.

A less obvious one is Memory, each new thread takes about 4KB and reserved 1MB of memory for stack, while this is not much on a modern machine it's still something to take into account. note that if 1MB is not enough, it's usually because the thread misbehaves or uses extreme recursion (which could be replaced with queues).
 

- How to gracefully terminate threads?

To gracefully terminate threads, you should have an exit strategy, the simplest one would be a boolean volatile field which all threads listen to so they can finish their cycle, release locks and other resources (files, close network connections, etc') and end gracefully, you can wait for the threads termination with Join.

You should also have a backup plan, if any of the thread are not terminating in a decent amount of time (for example, 10 seconds), you can Abort them, but keep in mind that locks (other than the lock syntactic sugar) might stay locked, so make sure you release them in finally.
 

- How to have multiple threads access the same resource?

If they are only reading data, it's usually ok, but to know that, you really have to know the internal structure of the objects you're accessing, a misbehaving example would be a LFU caching object keeping track of which item is used the most, if the internal collection is not threadsafe (a threadsafe method is considered to be safe to call from multiple different threads in a defined and predictable manner), the object might fail either gracefully or disastrously.

If they are read/write then don't. In most cases it's simply not worth the time, potential bugs, locks and headache of having the same resource shared by multiple threads, the performance improvements could be significant, but then you wouldn't read an 'introduction to multithreading' but advanced material.

Instead of using read/write shared resources, create a service from these resources, have each thread request and wait for an answer from the service and do all the work for you. It might be easy to implement using queues, as each thread queues a request and wait for an answer, it might be slower, but the time that could be spent on fixing concurrency bugs usually doesn't pay off in business sense.
 

- Threadpool

As threads take time to start and end, their context switching also a cost and you end up with a whole bunch of CPU taken away from you just by using them.

The solution is A threadpool, which is an optimized way of using multiple threads.

The idea is that you have a number of threads about the same number of cores (depending on implementation) which will wait on a queue, since pushing and pulling jobs from a queue is relatively quick, you can push very small jobs and still enjoy the benefit of using multiple threads.
 

- what is this map/reduce thing I've been reading all over? they say its 'web scale'.

I dislike buzzwords, but the basic idea is that machines/processes are responsible for collecting only the information relevant (hence map) to a particular operation (hence reduce), this way you can have large numbers of machines/processes, each one doing a small amount of work and thus gain performance improvements while working on a cumulative large dataset.

Common Problems and Solutions 

Deadlocks

Deadlocks occur as multiple threads are attempting to use different locks which depend upon each other, causing a deadlock which locks all threads involved as none of the threads release their first lock, preventing others from using the resources they need, similar to gridlocks that occur in big cities when drivers are entering an occupied junction.

There are a few ways of resolving deadlocks, smart mechanisms that abort the whole process and retry again later, prevent locks from occurring if the entire list of locks is unavailable or keep track on lock wait time and abort/retry when the lock waits for too long, you can also try to avoid deadlocks by avoiding nested locks and keeping locking order, but these are just guidelines and might not prevent deadlocks in edge cases you might not have thought about.

In my experience, avoiding locks is a lot better and easier than trying to overcome deadlocks, but as system complexity is rising higher and higher avoiding locks might not be possible and deadlock prevention algorithms will be needed.

Locks in .NET involve the following: Monitor, Mutex, SpinLock, ReaderWriterLock, Semaphore, SemaphoreSlim and Interlocked operations, which involve atomic changes to variables, we'll get to all of these later.

For further reading, take a look at Deadlock prevention algorithms.

Livelock

A livelock is similar to deadlock, except a deadlock will basically block a thread from advancing while livelock will do work only to fail at the locked block, the negative side of a livelock is that although it's not changing anything it shouldn't, it is also not doing any work to advance the program. Similar to pressing both the break and the throttle pedals at full power in your car.

Racing Conditions/Data Races

Racing conditions occur as multiple threads are attempting to write to the same memory location making non-deterministic changes, the simplest example would be i=i + 1. You might think that this one line execute in one clock cycle and as one command in the CPU, but it could be compiled to more than one command.
For the sake of simplicity, let's think of it as such:

1. get value from i into register a;
2. add 1 to register a;
3. save value back from register a to variable i;
 
So if we have two threads executing the same line, our results will vary depending on when line 1 was executed, since register a might contain stale value if two threads already executed line 1, but when reaching to line 3, both will have the same value, defeating the purpose of the line i=i + 1.

The easiest way to avoid racing conditions is using locks, but locks make everything run in serial, preventing us from really parallelizing commands, they take a lot of time to lock and for other threads to continue and some of them are even asking the kernel for the actual lock which takes even more time.

The quicker, hardware-based solution is to use Interlocked commands, such as Interlocked.Increment / Interlocked.Add. Keep in mind that interlocked commands also run in serial across cores,  what they do is lock the bus, do their thing and cause a cache miss for other cores, so they need to reload the data from main memory.  If you have many Interlocked commands executed, perhaps its more beneficial to keep a thread local counter and update the main counter at certain intervals.

CPU Execution Order

CPU Out-of-order execution exists in CPUs since the 90's, what it means that the CPU optimize a block of commands to execute in a more efficient manner - memory wise, so if you have a few commands that already have the needed memory and a few commands that don't have the needed memory, the CPU will execute the commands that have the memory and fetch the memory it needs for the commands that don't have their data and then execute them, so less waiting is happening.

This actually cause problems in multithreaded applications since the programmer assumes that the CPU is executing the commands in the order they have written it, to make matter worse, some CPU reordering are actually writing to memory a lot later than the programmer intended, like at the end of the block of commands.

So if you have a few threads that rely on the fact that variable a is assigned before variable b, you might need to use memory barriers, especially in weak-consistency memory model CPUs (ARM/Itanium).

volatile, lock syntactic sugar, Thread.MemoryBarrier and Interlocked commands are your friends in this situation but like all synchronization commands, they have a performance penalty.

Logic Execution Order

Threads take time to start, a threadpool might already be executing other jobs and the whole system might be busy with 100% CPU utilization making everything slow and execute a little bit iffy, never count of one threads' result being ready, use messages, locks, flags etc' to signal the current state.

Again, remember that memory update order could be different than what you wrote, so use memory barriers where its important.

Cached Values  - Cache Coherence

Main memory access is order of magnitude slower than the CPU L1 or L2 cached access, for example, accessing L1 cache takes ~4 cycles, L2 takes ~10 cycles, L3 takes about 40-75 cycles and main memory takes ~100ns. On Nehalem (Intel i7) series of processors, L1 cache size if 64KB, L2 is 256KB and L3 is about 4-12 MB.

Considering the size and time to access these caches provide, it's important to know where your values are coming from and where they are going.

Let's think of the following situation, we have two threads, one is doing a write, the other reads from the same memory location, you might assume that the CPU knows when its cache is getting stale and it does, but only when notifying the CPU that it should note these memory locations, these operations are the Interlocked operations, these cause the CPU to signal a cache change when these values change.

The situation is getting worse when the two threads are intensively write the same memory location, what happens is that each time there's a write, the cache is becoming stale and needs to be refreshed, that process being repeated multiple times is actually slowing down the whole thing instead of speeding it up.

Access time - https://stackoverflow.com/questions/4087280/approximate-cost-to-access-various-caches-and-main-memory

Further reading - https://stackoverflow.com/questions/2538070/atomic-operation-cost

Excessive Parallelization - Excessive Context Switching

Sometimes threading is thought to be the magic solution to all problems, but threading comes with a cost, startup, ending and context switches are part of that cost. Starting 100 threads might not affect the system negatively but starting 10000 threads can cause some systems to slow down significantly, having the CPU spend more of its time on context switches will actually waste CPU rather than use it.

If you're starting 10000 threads, you might be doing something wrong, check if a threadpool can do a better job of scheduling your jobs or perhaps if async programming suit your needs, the idea behind Node.js is actually the solution to excessive context switching. Node.js is using a single thread for processing the JS code, opening  a new connection simply adds another job to the event queue, if that job is waiting indefinitely for a database response, it doesn't hold on to a thread that might or might not be switched into and then switched back, not doing anything but wasting CPU cycles on the context switch itself.

Reactive Extensions for .NET could be a nice read.

Starvation/Lock Contention/Convoying

These are related if not the same cause and effect, When you have multiple threads attempting to access the same resource, two things are actually happening, all threads that don't have a lock can't do their job, so they either wait or spin, the 2nd thing that happens is that there is a race toward getting a lock, it's not first come first serve, it's more like trying to push in a busy train at a 3rd world country, some will get in the first try, some will get in after the 3rd. So basically it means that the waiting threads lock acquisition is limited by the slowest thread lock duration.

To make sure your locks are not killing your performance, do the minimal amount of work inside the locks and remove as much logic as possible from inside the locks, as always Profiling is your friend, so you should check if two lock/release cycles can do a better job than having your logic in one large locked block.

In systems where fairness is important, one way to solve it is to use a ticket lock. At the time of this article, .NET does not implement ticket locks but it should be relatively easy to implement (though I'm not sure what the performance of such lock can be).

Spinning

There are two waiting schemes, one is to wait for notification, the other is to spin, when spinning, a loop waits for a condition to change, like a variable changing value. spinning has both advantages and disadvantages.

The biggest advantage is that it is very fast while the disadvantage is that it takes CPU, imagine 10 threads busy waiting (or spinning) on a variable value, this can take 100% of an 8 core CPU, which is wasteful.

Spinning locks are best used for very short locks, like for changing a variable, or updating a dictionary, never use them for IO locks.

To prevent spinning waits from affecting a system in a very bad way, Microsoft actually implemented them in an adaptive form the first few spins are only CPU, then they add a yield and eventually sleep, this way a programmers' mistake won't affect performance so much.

For spinning waiting schemes, you can use SpinWait.

For notification waiting scheme, you can use WaitHandle/CountdownEvent/AutoResetEvent/ManualResetEvent/ManualResetEventSlim or Monitor's Wait and Pulse/PulseAll.

Branching

While this is not directly related to multithreaded development but performance issues, it's good to know that CPU executes "if" commands by comparing values, if these values match, it jumps to one address, if not, another. There is also a part in the CPU that predicts which way a program will go and fetches the next predicted address. A misprediction takes time and when these mispredictions occur often it can affect the program's performance negatively.

In very tight loops, avoiding multiple IFs and switches can be beneficial as the predictor works by analyzing past results and its about 90% accurate, but if your conditionals coding results in unpredictable results, it will affect the performance.

Programming Errors

There are many programming errors that can occur with multithreading, but you should especially notice resource allocation, which thread is creating which resources and which threads destroy them. To make things simple, keep resources as local as possible, and only the creating thread should destroy these resources, this way your application logic will remain as simple as possible, writing complex code will most probably lead to bugs and time consuming debugging.

You might want to use a resource pool, so each thread is only responsible for releasing the resources while the pool is responsible for destroying them, use try/catch/finally or using statements so your resources are always disposed, even in case of an exception.

Create global resources before the threads are starting if possible or use Lazy with isThreadsafe true.

Lock-reentrancy means that a thread might attempt to use the same lock more than once. Most .NET locks are reentrant, which means you can lock multiple times, what you do have to remember is to release the lock the same amount of times, please read the documentation before relying on this fact as Semaphore is not reentrant and ReaderWriterLockSlim can be both.

No comments:

Post a Comment