Up to this point, we've talked about how processor clock speeds are not advancing as fast as they used to, we may have yet to hit the physical limits of the current generation of chip-making technology but we're not far from it. Dual-core CPUs are quite the norm now, and quad-core is becoming popular. I'm guessing we're going to see 16 or more threads in an end-user-level system in less than 2 years. That means we have to write concurrent software to be able to exploit this fact, or you'll be swept away by the people who do write scalable programs.
On the other hand, the data we need to operate on gets larger by the minute and memory access rates are becoming more and more the bottleneck. This emphasizes the role of cache hierarchies. But as the number of threads goes up, so does the rate of cache misses. This is but one of the problems we face in multi-threading our code.
Also, writing a concurrent program is hard. It's certainly harder that the serial version (most of the time,) both because we are used to (or trained to be used to) thinking or designing serially, and also most of our tools and languages focus on that. I'm not an expert on the theory of complexity, but it seems to me that designing a scalable parallel program (or algorithm) is innately much harder than ordinary program design.
A third source of problems are race conditions, deadlocks, livelocks, priority inversions and a plethora of other pitfalls and hazards. They make analyzing, testing, debugging and making guarantees about a program's behavior much harder.
Another hardship is performance. As it happens, naive ways of achieving concurrency can either lead to a lot of bugs due to lack of proper synchronization, or a performance penalty because of excessive and/or misplaced locking or flawed design. This performance hit (which mainly comes from bad design, but can be the from the overhead of the synchronization primitives) can be so huge that the parallelized version may even be slower than the serial version! Achieving (near) linear performance boost (linear in the number of hardware threads) is no simple task, even in the simplest and best cases (e.g. when data is not shared between the threads.)
Many programmers, faced with the task of parallelizing the design of a program, go the road of task parallelism, which means they divide that tasks a program has to do among the threads. The threads execute different operations on (the same or different) data. Pipelining is a variation of this method. This has several advantages. It's usually simple enough, it usually minimizes (or at least restricts) the data sharing between the threads and it can be fast and it can be generic and applicable to a broad range of situations.
But the approach is not scalable. There are not many situations that you can divide the program into 10 or more meaningful, mostly independent and orthogonal tasks and still keep the benefits of this approach. There's almost no way to divide a program into 1024 tasks!
Another approach is data parallelism and I will cover the meaning in a later post.
One way to evade many of the above problems is to go for a coarser-level of parallelism. That is, divide the program into many programs and not many threads. This can be used (most of the time) whether your design parallelizes tasks or data, but it's not applicable to many high-performance and real-time applications because whilst the design gets simpler and more manageable, the overhead of spawning a new process and interprocess communication can be noticeably higher than the same operations for threads. On some systems (namely Windows,) it's really significant. I will get a little bit more into this in the future.
Sunday, April 01, 2007
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment