# Simple, Fast, Easy Parallelism in Shell Pipelines
Created: January 12, 2023 9:28 PM
URL: https://catern.com/posts/pipes.html
The typical shell pipeline looks something like this:
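```sh
src | worker | sink
```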
Usually `src` will output lines of data, and `worker` acts as a filter, processing each line and sending some transformed lines to `sink`.
So there is a possibility of parallel processing on the basis of lines: we could have multiple `worker` tasks which each process different lines of data from `src` and output transformed lines to `sink`.
The usual suggestion here is `xargs -P`, which runs multiple instances of a command in parallel, handing each instance pieces of the input as command-line arguments. But the typical data-processing Unix command reads lines of input from stdin, not from its arguments on the command line, so many commands simply don't make sense to use with `xargs`.
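For commands that do take their work as arguments, `xargs` parallelism is a one-liner; a minimal sketch, assuming GNU `xargs` and a hypothetical `worker` that accepts one line's worth of data as a single argument:

```sh
# Run up to 3 worker processes at a time, passing each input line
# to an invocation as a single argument (-d '\n' is GNU-specific).
src | xargs -d '\n' -n 1 -P 3 worker | sink
```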
A technique that allows a pool of worker tasks, executing in parallel, to process incoming lines as they arrive on stdin would be strictly more general.
Writing a parallel pipeline in any shell looks like this:
```sh
src | { worker &
        worker &
        worker & } | sink
```
This will start up three workers in the background, which will all read data from `src` in parallel and write output to `sink`.
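The same pattern extends to any number of workers with a loop; a sketch, assuming a POSIX-compatible shell:

```sh
# Start NWORKERS background workers, all sharing one stdin pipe
# from src and one stdout pipe to sink.
NWORKERS=3
src | {
    i=0
    while [ "$i" -lt "$NWORKERS" ]; do
        worker &
        i=$((i + 1))
    done
    wait  # keep the group alive until every worker exits
} | sink
```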
Since all the worker tasks are reading input at the same time, we might run into a common concurrency issue: reads from the shared pipe are not aligned to line boundaries, so one worker task might get the first part of a line while the rest of the line goes to another worker task.
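Whether this actually bites depends on buffer sizes and scheduling. A sketch for probing it on your system, using `cat` as a stand-in worker and two scratch files, `got` and `want`: if any line is torn between workers, the multiset of output lines no longer matches the input.

```sh
# Push numbered lines through three parallel cat workers, then
# compare the sorted output against the sorted input; any torn or
# merged lines show up as a mismatch.
seq 100000 | { cat & cat & cat & } | sort > got
seq 100000 | sort > want
cmp -s got want && echo "lines intact" || echo "lines torn"
```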
We can also control interleaving on the output side: `stdbuf -oL` forces each worker's output to be written in units of whole lines, and such line-sized writes to a pipe are atomic as long as the lines are shorter than `PIPE_BUF` (at least 512 bytes per POSIX; 4096 on Linux).
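Applied to the pipeline above, a sketch assuming GNU coreutils `stdbuf`:

```sh
# stdbuf -oL makes each worker's stdout line-buffered, so every
# complete line is flushed to the shared pipe in a single write().
src | { stdbuf -oL worker &
        stdbuf -oL worker &
        stdbuf -oL worker & } | sink
```

Note that `stdbuf` only affects programs that leave their buffering to C stdio; a worker that manages its own output buffers will ignore it.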