Multi-core programming and design: tips and tricks

Multi-core programming is the new trend in software programming, the new big thing that will earn the big consulting companies the big bucks.

Don’t get me wrong: I am not saying it’s a fad or something like that, I am just saying that it’s not something new at all. SMP machines have existed for many years now.
I started working with this kind of machine in 1997, an SGI machine with 16 processors. Old stuff becoming new stuff, again …
Since then, when working for banking or financial companies, the motto has always been the same: let’s do it faster. Add more RAM, more CPUs, more disks.
And, yes please, use them as efficiently as possible: I paid the big bucks for the hardware, the software must use it. (Nowadays the companies willing to spend the most time and money on multi-core programming will probably be the gaming companies.)

So … We have more computing power now and we want to use it efficiently, we want to keep these CPUs busy.
Of course this is not easy: having more than one CPU does not automatically make any of your applications faster. One must choose the right technology in order to get the most efficiency out of these new multi-core CPUs.

Multi-threading is the technology most commonly chosen to try to use the multiple cores available on your $600 desktop or your $10,000+ brand new server.
But multi-threading is not the only technology that exists to maximize your computing efficiency, nor is it the right technology to solve every problem: multi-processing is also a good choice under certain circumstances.

So you want to optimize an existing program, or develop a new one, that will scale in the presence of multiple cores/CPUs. Here are the steps I follow:

First things first: check the operating system the program is supposed to run on.

  • why: not all operating systems are created equal when dealing with multiple CPUs. Even if your preferred OS claims to support multiple cores, it may not be able to use them efficiently, i.e. where are the big-lock contention points in your OS?
  • example: if your new application is a networking app and you think that multiple CPUs will make it faster, you should check whether your OS implements a BGL (big giant lock) around its networking APIs.
  • conclusion: check your OS internals in the areas where you expect multiple cores to speed up your application. I usually use Solaris and Linux; I tend to avoid Mac OS X and other UNIXes like HP-UX and IBM AIX. I do not use Windows boxes, hence I cannot comment on them.

I will not speak of correct time-sharing and scheduling in the presence of multiple cores, since all these OSes have good support for those.

choose between multi-threading and multi-processing.
As I previously said, multi-threading is not always the right choice for your multi-core applications. This is always a shock to the people I talk with, since for them multi-threading is the only answer to multi-core programming.
So here is a description of situations where multiple processes may be a better choice than multiple threads in a single process:

  • robustness: multiple threads share the same address space, hence if one of them generates a fault, the whole application may crash. Threads do not run in isolation from each other, so if certain threads of your program MUST stay alive no matter what happens to the other threads, multi-threading may be a bad choice. If restarting the program upon failure is not an issue, you can safely ignore the robustness requirement.
  • integrity: when designing a multiple-process application, each process has its own address space, so if one process goes crazy it cannot corrupt the other processes’ data, it can only corrupt the data explicitly shared between processes. On the other hand, if a thread goes crazy it can corrupt the other threads’ data, since there is no memory protection between threads. This is particularly true when a parent spawns a child process, the child does whatever computation it needs, and then terminates. If any corruption occurs during the computation, it just goes away when the child dies: a new process is spawned to handle the next computation without being affected by the previous child’s fault (see the sketch after this list). There is no such safety net with threads: corruption made by one thread is program wide, regardless of the thread’s lifetime.
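To make the integrity point concrete, here is a minimal POSIX sketch (C++, Solaris/Linux) of the spawn-per-computation pattern; do_computation is a made-up placeholder for the real work:

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

// Made-up placeholder for the real work.
void do_computation()
{
    // ... number crunching, possibly buggy ...
}

int main()
{
    for (int job = 0; job < 4; ++job)   // one child per computation
    {
        pid_t pid = fork();
        if (pid == 0)
        {
            // Child: any memory corruption stays in this address space
            // and vanishes when the child exits.
            do_computation();
            _exit(EXIT_SUCCESS);
        }
        else if (pid > 0)
        {
            int status = 0;
            waitpid(pid, &status, 0);   // reap the child
            if (!WIFEXITED(status))
                std::fprintf(stderr, "child %d crashed, parent unaffected\n", job);
        }
        else
        {
            std::perror("fork");
            return EXIT_FAILURE;
        }
    }
    return 0;
}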

Now a description of when a multi-threaded application is more desirable than a multi-process one:

  • CPU-bound applications: video/sound/image processing, numerical simulations (finite elements, molecular interactions)
  • high-bandwidth message passing: multiple threads exchange messages for processing at a very high rate, i.e. the processing done on each message is very lightweight
  • cache-friendly code: this is related to CPU-bound applications
  • model driven: multiple threads just make your coding easier, 1 request to be processed by 1 thread, like network connections etc. (yep, this is naive, e.g. see the C10K problem and high-performance server design; a minimal sketch follows this list)
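As an illustration of that last, admittedly naive, model, here is a minimal thread-per-request sketch using C++11 std::thread; the Request type and handle function are made up for the example:

#include <thread>
#include <vector>
#include <cstdio>

// Made-up request type and handler for the example.
struct Request { int id; };

void handle(Request r)
{
    std::printf("processing request %d\n", r.id);
}

int main()
{
    std::vector<std::thread> workers;
    for (int i = 0; i < 8; ++i)      // 1 request -> 1 thread
        workers.emplace_back(handle, Request{i});
    for (auto& t : workers)          // wait for them all
        t.join();
    return 0;
}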

Still not decided between a multi-threaded and a multi-process application?

Measure the time needed for the computations. If this amount of time is really small, i.e. a lightweight computation, compare it to the time taken to increment/decrement a semaphore shared between 2 processes; if they are of the same order of magnitude, you will probably be more satisfied with multi-threading.
Otherwise multi-processing is a very viable choice: it works well when the computation takes an order of magnitude more time than the time spent exchanging data. A rough way to measure this is sketched below.
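Here is a minimal measurement sketch in C++11 with POSIX semaphores (Linux/Solaris); the computation function is a made-up stand-in, and since the semaphore lives on the stack rather than in shared memory, this only approximates the cross-process cost from within a single process:

#include <semaphore.h>
#include <chrono>
#include <cstdio>

// Made-up stand-in for the real computation being evaluated.
void computation()
{
    volatile double x = 1.0;
    for (int i = 0; i < 1000; ++i)
        x *= 1.0000001;
}

int main()
{
    // pshared=1 marks the semaphore as usable across processes; for a
    // true two-process measurement it would have to live in shared
    // memory (e.g. obtained via mmap).
    sem_t sem;
    sem_init(&sem, 1, 0);

    const int N = 100000;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) { sem_post(&sem); sem_wait(&sem); }
    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) computation();
    auto t2 = std::chrono::steady_clock::now();

    using ns = std::chrono::nanoseconds;
    std::printf("semaphore round-trip: ~%ld ns, computation: ~%ld ns\n",
                (long)(std::chrono::duration_cast<ns>(t1 - t0).count() / N),
                (long)(std::chrono::duration_cast<ns>(t2 - t1).count() / N));

    sem_destroy(&sem);
    return 0;
}

If the two numbers are of the same order of magnitude, the data exchange dominates and threads are the better fit; if the computation is 10x or more, processes remain on the table.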

reduce contention
See the DDJ article – contention is what will kill your application (regardless of multi-threading or multi-processing).

  • reduce the amount of code that is protected by mutexes/semaphores, i.e. putting a whole numerical simulation inside a critical section shared between two threads is a bad idea
  • reduce the need for locking; what is the fastest way of sharing data between two threads? No sharing at all!!! If you are in a master/slave processing model, try to have the master make a copy of the data that would otherwise be protected by the mutex/semaphore and feed the copy to the various slaves. Copying the shared data can be much cheaper than competing for a lock (especially when the number of threads increases).
  • try to use dirty reads: do you really need to know the exact value of a shared datum, i.e. the value of a counter? (see the sketch after this list)
    Example: you have to remove items from a cache when they expire. An item expires at 11:00, and the current timestamp value is protected by a mutex. Locking can be avoided, since reading 11:00 or 11:01 (it has been updated by another thread while you were reading it) does not really matter: as soon as the timestamp is greater than 11:00 the deletion can be performed. Your counter/timestamp is quite big? See if Lamport’s algorithm for concurrent reading and writing of clocks applies to you (it’s simple and effective).
  • do not forget that memory is not cheap to access; check some machine architecture docs to educate yourself on these low-level issues
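Here is a minimal sketch of the dirty-read idea from the timestamp example, using a C++11 relaxed atomic (all names are invented for the example). Note that in C++ an unsynchronized read of a plain variable written by another thread is undefined behaviour, so the atomic is still needed, but a relaxed load compiles down to an ordinary load and costs nothing like a mutex:

#include <atomic>
#include <ctime>

// Shared timestamp: written by a single ticker thread, read by many
// worker threads, no mutex involved.
std::atomic<std::time_t> current_time{0};

// Ticker thread: the only writer.
void tick()
{
    current_time.store(std::time(nullptr), std::memory_order_relaxed);
}

// Worker threads: a slightly stale value is acceptable here, reading
// 11:00 instead of 11:01 merely delays the eviction by one tick.
bool expired(std::time_t expiry)
{
    return current_time.load(std::memory_order_relaxed) >= expiry;
}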

If dealing with a CPU-bound application, beware of thrashing: that beast will decrease your performance while keeping your various cores busy.

If dealing with a CPU-bound application, maximize the use of your CPU architecture. This is quite a last-resort action, but it can be very effective.
For example, I had to perform a given computation on a PowerPC processor. The computation involved multiple floating-point operations. The PowerPC is a beautiful beast: you have multiple floating-point engines available for the calculations (in my case 4, if I remember correctly).

The trick was to make the compiler produce code that uses all of them.

The following code and discussion are not about loop unrolling (yes, it is closely related); here we try to avoid stalling the CPU pipeline by using the multiple floating-point units.

original code (simplified):

float* a = new float[1000];
float* b = new float[1000];
float* c = new float[1000];
// most cpu time was spent in there...
for ( int i = 0; i < 1000; ++i )
{
    c[i] = 0;
    for ( int j = 0; j < 1000; ++j )
    {
        c[i] += a[j] * b[j];
    }
}

updated loop:

for ( int i = 0; i < 1000; ++i )
{
    float partial_sum = 0;
    for (int j = 0; j < 1000; j += 4)
    {
        // load data into the various processor registers
        float a1 = a[j];
        float a2 = a[j+1];
        float a3 = a[j+2];
        float a4 = a[j+3];
        float b1 = b[j];
        float b2 = b[j+1];
        float b3 = b[j+2];
        float b4 = b[j+3];

        // compute all sums in isolation, so they can be mapped
        // independently to the various floating point engines
        float s1 = a1 * b1; // FPU engine 1
        float s2 = a2 * b2; // FPU engine 2
        float s3 = a3 * b3; // FPU engine 3
        float s4 = a4 * b4; // FPU engine 4

        s1 += s2; // FPU engine 1
        s3 += s4; // FPU engine 2

        partial_sum += s1 + s3; // FPU engine 1
    }
    c[i] = partial_sum;
}

This is longer but much more efficient, and it is related to multi-threading: you are using the concurrency offered by your processor at the operation level. s1, s2, s3 and s4 are computed in parallel, then s1 += s2 and s3 += s4 proceed in parallel again. This is a very good, and most of the time neglected, source of concurrency for your programs.

The previous code, the updated loop, will probably be faster even on classic x86 chips with only 1 FPU, the reason being that placing data in registers (explicitly) is faster than retrieving it from the cache.

Forget about the fact that this is a PowerPC processor with multiple FPU engines, etc.
What we really did here was to provide the compiler with a lot of context on how the operation should be done and handled. After that the compiler can work on the optimizations, register allocation, etc. Always provide your compiler with as much context as possible on the computation to be performed.
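In the same spirit, another way to hand the compiler more context (assuming GCC or Clang, where __restrict__ is available as an extension; C99 has the standard restrict keyword) is to promise that the arrays never overlap, which frees the compiler to keep values in registers and vectorize. A minimal sketch over the same loop:

// Telling the compiler that a, b and c never alias gives it the
// context it needs to keep values in registers and vectorize.
void dot_rows(const float* __restrict__ a,
              const float* __restrict__ b,
              float*       __restrict__ c,
              int n)
{
    for (int i = 0; i < n; ++i)
    {
        float sum = 0.0f;
        for (int j = 0; j < n; ++j)
            sum += a[j] * b[j];
        c[i] = sum;
    }
}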

You are not alone: seek advice and help
The conclusion of this post is: think twice before writing a multi-core application; check the OS, your design and your implementation against the notes above before writing any line of code. Get some help from old-school SMP programmers: get them to review the design and to give you some tips. Multi-core programming is hard, really hard.

That’s all folks, for now. I’ll probably write a few more posts about multi-core programming; I have yet to decide on the material to write about ;-)

Enjoy and share.

