Robert Važan Nov 4, 2023

Low-cost parallelization techniques

Sure, code should be efficient, but sometimes you just have to run a computation that is inherently heavy. You have already cut most of the fat and micro-optimizing the rest just isn't cost-effective or even possible. Processor clock is limited, so the only way to execute compute-bound operation quickly is to let it run in parallel. Parallelization can be a dangerous and expensive pet to keep though, but we don't want to just accept that and let it explode complexity of all software or restrict it to a tiny fraction of code and let the rest rot. That leads us to search for safe, cost-effective, and widely applicable parallelization methods.

I frequently work with code that needs parallelization. I don't have much time to tinker with it, so I gave this topic some thought, researched available approaches, and tested a few of them on my code. Here's my list of parallelization techniques sorted by increasing cost (and a bit by my preference):

Instruction-level parallelism: All processors produced in the last two decades are superscalar, which means they execute several instructions in parallel. They slurp several instructions from L1 cache, analyze their dependencies, and execute them in parallel where possible. Code does not need any changes, although some awareness of branch prediction and memory hierarchy is helpful. Buying a recent processor with high single-threaded performance is often the easiest solution. Processor manufacturers are free to boost performance with wider and deepier pipelines. Compilers are free to generate code amenable to parallel execution. The catch is that instruction-level parallelism only gets you so far. The technology is inherently limited by branch prediction and energy efficiency.
Auto-vectorization: All recent processors support SIMD instructions, which execute the same operation (e.g. addition) on a fixed-length vector of primitives. With properly constructed loops, arrays can be processed with high degree of parallelism. Compilers know this and automatically rewrite loops to use SIMD instructions where possible. The catch is that auto-vectorization is brittle. Add a condition here and an offset there and the compiler fails to vectorize. Such sudden, unpredictable drops in performance are a nuisance. Auto-vectorization therefore cannot be relied upon, but it's still a nice free goodie that can delay application of more expensive techniques.
Parallel iterators: This is usually the first explicit parallelization I apply. Many languages come with parallelizable iterators: Stream.parallel() in Java, IEnumerable.AsParallel() in NET, rayon's .par_iter() in Rust. Parallel iterators give you efficient parallelization with a lot of control over what and how is parallelized. If you stick with immutable data structures, parallel iterators are very safe to use. They are usually implemented using work-stealing, which is efficient at least down to 100ns operations and mostly preserves locality of reference of the original sequential code. Hardware manufacturers have plenty of opportunities to exploit multi-threaded code: adding more cores, hyperthreading, throttling single-core performance to fit more threads in the power envelope, multi-processor computers, and heterogeneous cores.
Structured concurrency: Structured concurrency has many variations, but the core idea is that threads form a tree, in which child threads terminate before parent does. Nowadays all such threads are virtual threads or tasks rather than heavy OS threads. They usually rely on the same work-stealing infrastructure that parallel iterators do with similar efficiency benefits. APIs that offer this functionality vary widely: coroutines, async/await, fork-join, even reactive. In all cases, there's a way to fork off a subtask and later wait for its completion, which either yields return value or an exception.
Parallelized libraries: Sometimes a task is implemented by a library that is already parallelized. Often all you have to do is to enable the parallelization. Library is a highly cost-effective solution if you can find one. Unfortunately, internally parallelized libraries are quite rare. If parallelization is implemented, it often lacks certain desirable properties like custom thread pool, metrics, or reproducibility.
Background jobs: While background jobs can be internally parallelized, their defining characteristic is that they run in parallel with foreground interactive tasks. The idea is to exploit the powerful but mostly idle processors in desktop computers. This is a very different kind of parallelism than everything else here, but it's very valuable. It's like having a dedicated cloud server but without having to pay for it and without the network overhead. The main downside is that you have to tinker with process priorities to avoid negative impact on UI responsiveness. Few people have the patience for that and even fewer get it right.
Vector types: Let's first draw a line between vector types and vector intrinsics. Both expose processor's SIMD capabilities in high-level languages, but whereas intrinsics do so directly and without much regard for portability, vector types are abstract constructs used to prove to the compiler that some piece of code can be indeed compiled using SIMD instructions. Many languages have them: Vector<T> in NET, incubating vector API in Java, and nightly simd module in Rust. Vector types guarantee certain minimum level of SIMD parallelism as long as there's hardware support. Compiler's auto-vectorization is still free to improve the code if it can. All code remains portable: vector types compile to whichever instructions the processor supports while unsupported operations gracefully degrade to sequential code. The main gotchas are in limited support of SIMD operations, possible performance degradation due to interference with auto-vectorization, and increased complexity of loops.
Cluster: Coding a custom computer cluster implementation is not as hard as it may seem. Parallelization model is similar to multi-threading with the key difference that subtasks have to be routed to nodes that have the data, because moving data over network is expensive. Although cluster implementation can take some effort, if you parallelize at sufficiently high level, all code running at lower level of abstraction will be implicitly parallelized too. That keeps the code simple and portable. Clusters offer unbounded horizontal scalability, which is a unique advantage among methods presented here.
Vector intrinsics: Low-level alternative to vector types with full support for all SIMD instructions. The loss of portability is usually not worth the extra performance.
Assembly: Even lower level alternative to vector intrinsics. Usually not worth it.
GPU: GPUs really are the king of performance, at least as far as consumer hardware goes. But the performance edge comes at the cost of rivers of developer sweat and pain. Just touching the GPU introduces loads of messy native library dependencies. GPUs have their own weird programming language and weird little operating system. They do not have unified memory architecture, so every data structure and algorithm must be aware of memory regions, especially the distinction between GPU and system memory. Both hardware and drivers have bugs, often to the point of crashing or producing garbage output. There's essentially no portability. Only Nvidia cards really work. Whatever little works on Intel and AMD cards generally does not work on their iGPUs. I do not recommend GPUs unless you desperately need more compute power.

So when deciding on how to parallelize your code, go to the top of this list and eliminate options until you find something that works for you. Hope this helps.