So you have just finished prototyping the latest and greatest algorithm you invented. It works well, but it could do with a little more speed, so you decide to parallelise it to take advantage of the spiffy eight-core machine you picked up last week on your research budget. This doesn’t seem like it will be too difficult, and within mere hours you have figured out how to divide the work between multiple threads. After only a few more hours, you have something that can run with a variable number of threads, and you are ready to start testing. This is where things start to go awry.
With a single thread, your algorithm manages a respectable 10.8 operations per second (ops). You are hoping for a nice 21.6 ops with two threads, but alas you only get 15.6 ops. This is not a good sign, but you are really keen to test the new machine, so you decide to skip straight to eight threads. This test seems to take longer than you expect, and in the end it reports a dismal 2.1 ops. You are slightly taken aback, but after repeating the test a ridiculous number of times with similar results (just in case), you realise that there is a rather big problem, and you have no idea how to fix it.
The above scenario might not seem realistic, as you would expect things to at worst plateau at 10.8 ops, but something very like it happened to me when I was working on parallelisation. I expected performance to level off at some point, but getting slower with more threads was entirely unexpected. Unfortunately, my work was part of an already large system, so it was incredibly difficult to identify where the problem was.
I’ll skip over the details and just say that, after much searching, the problem was identified as the memory allocator. Basically, the default memory allocator on the systems I tested (Linux and Solaris) is not designed with multi-threaded programs in mind, reminding me that occasionally you really can blame your tools rather than your own code (but not usually). So at this point I thought I’d have to write a small memory allocator myself, something that was rather undesirable.
Fortunately, Hoard came to my rescue. In case you haven’t heard of it before, Hoard is a memory allocator designed especially for multi-threaded programs. Using it is simple: just download the relevant binary and load it via the LD_PRELOAD environment variable, or the equivalent method on your platform. I found that doing this improved the multi-threaded results drastically, and even the single-threaded results noticeably. So if you ever suspect the memory allocator is the issue, consider having a quick look at Hoard. At the very least you will realise the problem lies in your code and you are the only one to blame, as usual.
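On Linux the preload step looks something like the fragment below; both the library path and the program name are hypothetical, so point them at wherever you put the Hoard binary and at your own executable.

```shell
# Preload Hoard so its malloc/free replace the system allocator.
# Path and program name are placeholders for your own setup.
LD_PRELOAD=/path/to/libhoard.so ./my_parallel_program
```

No recompilation or code changes are needed, which is what makes this such a cheap experiment to try.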