#!/usr/bin/perl -w

# SPDX-FileCopyrightText: 2021-2024 Ole Tange, http://ole.tange.dk and Free Software Foundation, Inc.
# SPDX-License-Identifier: GFDL-1.3-or-later
# SPDX-License-Identifier: CC-BY-SA-4.0

=encoding utf8

=head1 GNU PARALLEL EXAMPLES

=head2 EXAMPLE: Working as xargs -n1. Argument appending

GNU B<parallel> can work similar to B<xargs -n1>.

To compress all html files using B<gzip> run:

  find . -name '*.html' | parallel gzip --best

If the file names may contain a newline use B<-0>. Substitute FOO BAR with
FUBAR in all files in this dir and subdirs:

  find . -type f -print0 | \
    parallel -q0 perl -i -pe 's/FOO BAR/FUBAR/g'

Note B<-q> is needed because of the space in 'FOO BAR'.


=head2 EXAMPLE: Simple network scanner

B<prips> can generate IP-addresses from CIDR notation. With GNU
B<parallel> you can build a simple network scanner to see which
addresses respond to B<ping>:

  prips 130.229.16.0/20 | \
    parallel --timeout 2 -j0 \
    'ping -c 1 {} >/dev/null && echo {}' 2>/dev/null


=head2 EXAMPLE: Reading arguments from command line

GNU B<parallel> can take the arguments from the command line instead of
stdin (standard input). To compress all html files in the current dir
using B<gzip> run:

  parallel gzip --best ::: *.html

To convert *.wav to *.mp3 using LAME running one process per CPU run:

  parallel lame {} -o {.}.mp3 ::: *.wav


=head2 EXAMPLE: Inserting multiple arguments

When moving a lot of files like this: B<mv *.log destdir> you will
sometimes get the error:

  bash: /bin/mv: Argument list too long

because there are too many files. You can instead do:

  ls | grep -E '\.log$' | parallel mv {} destdir

This will run B<mv> for each file. It can be done faster if B<mv> gets
as many arguments as will fit on the line:

  ls | grep -E '\.log$' | parallel -m mv {} destdir

In many shells you can also use B<printf>:

  printf '%s\0' *.log | parallel -0 -m mv {} destdir


=head2 EXAMPLE: Context replace

To remove the files I<pict0000.jpg> .. I<pict9999.jpg> you could do:

  seq -w 0 9999 | parallel rm pict{}.jpg

You could also do:

  seq -w 0 9999 | perl -pe 's/(.*)/pict$1.jpg/' | parallel -m rm

The first will run B<rm> 10000 times, while the last will only run
B<rm> as many times as needed to keep the command line length short
enough to avoid B<Argument list too long> (it typically runs 1-2 times).

You could also run:

  seq -w 0 9999 | parallel -X rm pict{}.jpg

This will also only run B<rm> as many times as needed to keep the
command line length short enough.


=head2 EXAMPLE: Compute intensive jobs and substitution

If ImageMagick is installed this will generate a thumbnail of a jpg
file:

  convert -geometry 120 foo.jpg thumb_foo.jpg

This will run with number-of-cpus jobs in parallel for all jpg files
in a directory:

  ls *.jpg | parallel convert -geometry 120 {} thumb_{}

To do it recursively use B<find>:

  find . -name '*.jpg' | \
    parallel convert -geometry 120 {} {}_thumb.jpg

Notice how the argument has to start with B<{}> as B<{}> will include the
path (e.g. running B<convert -geometry 120 ./foo/bar.jpg
thumb_./foo/bar.jpg> would clearly be wrong). The command will
generate files like ./foo/bar.jpg_thumb.jpg.

Use B<{.}> to avoid the extra .jpg in the file name.
This command will +make files like ./foo/bar_thumb.jpg: + + find . -name '*.jpg' | \ + parallel convert -geometry 120 {} {.}_thumb.jpg + + +=head2 EXAMPLE: Substitution and redirection + +This will generate an uncompressed version of .gz-files next to the .gz-file: + + parallel zcat {} ">"{.} ::: *.gz + +Quoting of > is necessary to postpone the redirection. Another +solution is to quote the whole command: + + parallel "zcat {} >{.}" ::: *.gz + +Other special shell characters (such as * ; $ > < | >> <<) also need +to be put in quotes, as they may otherwise be interpreted by the shell +and not given to GNU B<parallel>. + + +=head2 EXAMPLE: Composed commands + +A job can consist of several commands. This will print the number of +files in each directory: + + ls | parallel 'echo -n {}" "; ls {}|wc -l' + +To put the output in a file called <name>.dir: + + ls | parallel '(echo -n {}" "; ls {}|wc -l) >{}.dir' + +Even small shell scripts can be run by GNU B<parallel>: + + find . | parallel 'a={}; name=${a##*/};' \ + 'upper=$(echo "$name" | tr "[:lower:]" "[:upper:]");'\ + 'echo "$name - $upper"' + + ls | parallel 'mv {} "$(echo {} | tr "[:upper:]" "[:lower:]")"' + +Given a list of URLs, list all URLs that fail to download. Print the +line number and the URL. + + cat urlfile | parallel "wget {} 2>/dev/null || grep -n {} urlfile" + +Create a mirror directory with the same file names except all files and +symlinks are empty files. + + cp -rs /the/source/dir mirror_dir + find mirror_dir -type l | parallel -m rm {} '&&' touch {} + +Find the files in a list that do not exist + + cat file_list | parallel 'if [ ! -e {} ] ; then echo {}; fi' + + +=head2 EXAMPLE: Composed command with perl replacement string + +You have a bunch of file. You want them sorted into dirs. The dir of +each file should be named the first letter of the file name. + + parallel 'mkdir -p {=s/(.).*/$1/=}; mv {} {=s/(.).*/$1/=}' ::: * + + +=head2 EXAMPLE: Composed command with multiple input sources + +You have a dir with files named as 24 hours in 5 minute intervals: +00:00, 00:05, 00:10 .. 23:55. You want to find the files missing: + + parallel [ -f {1}:{2} ] "||" echo {1}:{2} does not exist \ + ::: {00..23} ::: {00..55..5} + + +=head2 EXAMPLE: Calling Bash functions + +If the composed command is longer than a line, it becomes hard to +read. In Bash you can use functions. Just remember to B<export -f> the +function. + + doit() { + echo Doing it for $1 + sleep 2 + echo Done with $1 + } + export -f doit + parallel doit ::: 1 2 3 + + doubleit() { + echo Doing it for $1 $2 + sleep 2 + echo Done with $1 $2 + } + export -f doubleit + parallel doubleit ::: 1 2 3 ::: a b + +To do this on remote servers you need to transfer the function using +B<--env>: + + parallel --env doit -S server doit ::: 1 2 3 + parallel --env doubleit -S server doubleit ::: 1 2 3 ::: a b + +If your environment (aliases, variables, and functions) is small you +can copy the full environment without having to +B<export -f> anything. See B<env_parallel>. + + +=head2 EXAMPLE: Function tester + +To test a program with different parameters: + + tester() { + if (eval "$@") >&/dev/null; then + perl -e 'printf "\033[30;102m[ OK ]\033[0m @ARGV\n"' "$@" + else + perl -e 'printf "\033[30;101m[FAIL]\033[0m @ARGV\n"' "$@" + fi + } + export -f tester + parallel tester my_program ::: arg1 arg2 + parallel tester exit ::: 1 0 2 0 + +If B<my_program> fails a red FAIL will be printed followed by the failing +command; otherwise a green OK will be printed followed by the command. 
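
To also get a count of the failures, the output of the tester can be
captured and the FAIL markers counted afterwards. This is only a
sketch; I<testlog.txt> is an arbitrary file name:

  parallel tester my_program ::: arg1 arg2 | tee testlog.txt
  # Count the lines marked FAIL
  grep -c FAIL testlog.txt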
+ + +=head2 EXAMPLE: Identify few failing jobs + +B<--bar> works best if jobs have no output. If the failing jobs have +output you can identify the jobs like this: + + job-with-few-failures() { + # Force reproducibility + RANDOM=$1 + # This fails 1% (328 of 32768) + if [ $RANDOM -lt 328 ] ; then + echo Failed $1 + fi + } + export -f job-with-few-failures + seq 1000 | parallel --bar --tag job-with-few-failures + + +=head2 EXAMPLE: Continously show the latest line of output + +It can be useful to monitor the output of running jobs. + +This shows the most recent output line until a job finishes. After +which the output of the job is printed in full: + + parallel '{} | tee >(cat >&3)' ::: 'command 1' 'command 2' \ + 3> >(perl -ne '$|=1;chomp;printf"%.'$COLUMNS's\r",$_." "x100') + + +=head2 EXAMPLE: Log rotate + +Log rotation renames a logfile to an extension with a higher number: +log.1 becomes log.2, log.2 becomes log.3, and so on. The oldest log is +removed. To avoid overwriting files the process starts backwards from +the high number to the low number. This will keep 10 old versions of +the log: + + seq 9 -1 1 | parallel -j1 mv log.{} log.'{= $_++ =}' + mv log log.1 + + +=head2 EXAMPLE: Removing file extension when processing files + +When processing files removing the file extension using B<{.}> is +often useful. + +Create a directory for each zip-file and unzip it in that dir: + + parallel 'mkdir {.}; cd {.}; unzip ../{}' ::: *.zip + +Recompress all .gz files in current directory using B<bzip2> running 1 +job per CPU in parallel: + + parallel "zcat {} | bzip2 >{.}.bz2 && rm {}" ::: *.gz + +Convert all WAV files to MP3 using LAME: + + find sounddir -type f -name '*.wav' | parallel lame {} -o {.}.mp3 + +Put all converted in the same directory: + + find sounddir -type f -name '*.wav' | \ + parallel lame {} -o mydir/{/.}.mp3 + + +=head2 EXAMPLE: Replacing parts of file names + +If you deal with paired end reads, you will have files like +barcode1_R1.fq.gz, barcode1_R2.fq.gz, barcode2_R1.fq.gz, and +barcode2_R2.fq.gz. + +You want barcodeI<N>_R1 to be processed with barcodeI<N>_R2. + + parallel --plus myprocess {} {/_R1.fq.gz/_R2.fq.gz} ::: *_R1.fq.gz + +If the barcode does not contain '_R1', you can do: + + parallel --plus myprocess {} {/_R1/_R2} ::: *_R1.fq.gz + + +=head2 EXAMPLE: Removing strings from the argument + +If you have directory with tar.gz files and want these extracted in +the corresponding dir (e.g foo.tar.gz will be extracted in the dir +foo) you can do: + + parallel --plus 'mkdir {..}; tar -C {..} -xf {}' ::: *.tar.gz + +If you want to remove a different ending, you can use {%string}: + + parallel --plus echo {%_demo} ::: mycode_demo keep_demo_here + +You can also remove a starting string with {#string} + + parallel --plus echo {#demo_} ::: demo_mycode keep_demo_here + +To remove a string anywhere you can use regular expressions with +{/regexp/replacement} and leave the replacement empty: + + parallel --plus echo {/demo_/} ::: demo_mycode remove_demo_here + + +=head2 EXAMPLE: Download 24 images for each of the past 30 days + +Let us assume a website stores images like: + + https://www.example.com/path/to/YYYYMMDD_##.jpg + +where YYYYMMDD is the date and ## is the number 01-24. 
This will +download images for the past 30 days: + + getit() { + date=$(date -d "today -$1 days" +%Y%m%d) + num=$2 + echo wget https://www.example.com/path/to/${date}_${num}.jpg + } + export -f getit + + parallel getit ::: $(seq 30) ::: $(seq -w 24) + +B<$(date -d "today -$1 days" +%Y%m%d)> will give the dates in +YYYYMMDD with B<$1> days subtracted. + + +=head2 EXAMPLE: Download world map from NASA + +NASA provides tiles to download on earthdata.nasa.gov. Download tiles +for Blue Marble world map and create a 10240x20480 map. + + base=https://map1a.vis.earthdata.nasa.gov/wmts-geo/wmts.cgi + service="SERVICE=WMTS&REQUEST=GetTile&VERSION=1.0.0" + layer="LAYER=BlueMarble_ShadedRelief_Bathymetry" + set="STYLE=&TILEMATRIXSET=EPSG4326_500m&TILEMATRIX=5" + tile="TILEROW={1}&TILECOL={2}" + format="FORMAT=image%2Fjpeg" + url="$base?$service&$layer&$set&$tile&$format" + + parallel -j0 -q wget "$url" -O {1}_{2}.jpg ::: {0..19} ::: {0..39} + parallel eval convert +append {}_{0..39}.jpg line{}.jpg ::: {0..19} + convert -append line{0..19}.jpg world.jpg + + +=head2 EXAMPLE: Download Apollo-11 images from NASA using jq + +Search NASA using their API to get JSON for images related to 'apollo +11' and has 'moon landing' in the description. + +The search query returns JSON containing URLs to JSON containing +collections of pictures. One of the pictures in each of these +collection is I<large>. + +B<wget> is used to get the JSON for the search query. B<jq> is then +used to extract the URLs of the collections. B<parallel> then calls +B<wget> to get each collection, which is passed to B<jq> to extract +the URLs of all images. B<grep> filters out the I<large> images, and +B<parallel> finally uses B<wget> to fetch the images. + + base="https://images-api.nasa.gov/search" + q="q=apollo 11" + description="description=moon landing" + media_type="media_type=image" + wget -O - "$base?$q&$description&$media_type" | + jq -r .collection.items[].href | + parallel wget -O - | + jq -r .[] | + grep large | + parallel wget + + +=head2 EXAMPLE: Download video playlist in parallel + +B<youtube-dl> is an excellent tool to download videos. It can, +however, not download videos in parallel. This takes a playlist and +downloads 10 videos in parallel. + + url='youtu.be/watch?v=0wOf2Fgi3DE&list=UU_cznB5YZZmvAmeq7Y3EriQ' + export url + youtube-dl --flat-playlist "https://$url" | + parallel --tagstring {#} --lb -j10 \ + youtube-dl --playlist-start {#} --playlist-end {#} '"https://$url"' + + +=head2 EXAMPLE: Prepend last modified date (ISO8601) to file name + + parallel mv {} '{= $a=pQ($_); $b=$_;' \ + '$_=qx{date -r "$a" +%FT%T}; chomp; $_="$_ $b" =}' ::: * + +B<{=> and B<=}> mark a perl expression. B<pQ> perl-quotes the +string. B<date +%FT%T> is the date in ISO8601 with time. + +=head2 EXAMPLE: Save output in ISO8601 dirs + +Save output from B<ps aux> every second into dirs named +yyyy-mm-ddThh:mm:ss+zz:zz. + + seq 1000 | parallel -N0 -j1 --delay 1 \ + --results '{= $_=`date -Isec`; chomp=}/' ps aux + + +=head2 EXAMPLE: Digital clock with "blinking" : + +The : in a digital clock blinks. To make every other line have a ':' +and the rest a ' ' a perl expression is used to look at the 3rd input +source. 
If the value modulo 2 is 1: Use ":" otherwise use " ": + + parallel -k echo {1}'{=3 $_=$_%2?":":" "=}'{2}{3} \ + ::: {0..12} ::: {0..5} ::: {0..9} + + +=head2 EXAMPLE: Aggregating content of files + +This: + + parallel --header : echo x{X}y{Y}z{Z} \> x{X}y{Y}z{Z} \ + ::: X {1..5} ::: Y {01..10} ::: Z {1..5} + +will generate the files x1y01z1 .. x5y10z5. If you want to aggregate +the output grouping on x and z you can do this: + + parallel eval 'cat {=s/y01/y*/=} > {=s/y01//=}' ::: *y01* + +For all values of x and z it runs commands like: + + cat x1y*z1 > x1z1 + +So you end up with x1z1 .. x5z5 each containing the content of all +values of y. + + +=head2 EXAMPLE: Breadth first parallel web crawler/mirrorer + +This script below will crawl and mirror a URL in parallel. It +downloads first pages that are 1 click down, then 2 clicks down, then +3; instead of the normal depth first, where the first link link on +each page is fetched first. + +Run like this: + + PARALLEL=-j100 ./parallel-crawl http://gatt.org.yeslab.org/ + +Remove the B<wget> part if you only want a web crawler. + +It works by fetching a page from a list of URLs and looking for links +in that page that are within the same starting URL and that have not +already been seen. These links are added to a new queue. When all the +pages from the list is done, the new queue is moved to the list of +URLs and the process is started over until no unseen links are found. + + #!/bin/bash + + # E.g. http://gatt.org.yeslab.org/ + URL=$1 + # Stay inside the start dir + BASEURL=$(echo $URL | perl -pe 's:#.*::; s:(//.*/)[^/]*:$1:') + URLLIST=$(mktemp urllist.XXXX) + URLLIST2=$(mktemp urllist.XXXX) + SEEN=$(mktemp seen.XXXX) + + # Spider to get the URLs + echo $URL >$URLLIST + cp $URLLIST $SEEN + + while [ -s $URLLIST ] ; do + cat $URLLIST | + parallel lynx -listonly -image_links -dump {} \; \ + wget -qm -l1 -Q1 {} \; echo Spidered: {} \>\&2 | + perl -ne 's/#.*//; s/\s+\d+.\s(\S+)$/$1/ and + do { $seen{$1}++ or print }' | + grep -F $BASEURL | + grep -v -x -F -f $SEEN | tee -a $SEEN > $URLLIST2 + mv $URLLIST2 $URLLIST + done + + rm -f $URLLIST $URLLIST2 $SEEN + + +=head2 EXAMPLE: Process files from a tar file while unpacking + +If the files to be processed are in a tar file then unpacking one file +and processing it immediately may be faster than first unpacking all +files. + + tar xvf foo.tgz | perl -ne 'print $l;$l=$_;END{print $l}' | \ + parallel echo + +The Perl one-liner is needed to make sure the file is complete before +handing it to GNU B<parallel>. 
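
As a sketch of more realistic processing, the same one-liner can
compress each file as soon as B<tar> has finished writing it. The
B<test -f> guards against directory names in the B<tar> output;
replace B<gzip> with whatever processing you need:

  tar xvf foo.tgz | perl -ne 'print $l;$l=$_;END{print $l}' | \
    parallel 'test -f {} && gzip --best {}'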
+ + +=head2 EXAMPLE: Rewriting a for-loop and a while-read-loop + +for-loops like this: + + (for x in `cat list` ; do + do_something $x + done) | process_output + +and while-read-loops like this: + + cat list | (while read x ; do + do_something $x + done) | process_output + +can be written like this: + + cat list | parallel do_something | process_output + +For example: Find which host name in a list has IP address 1.2.3 4: + + cat hosts.txt | parallel -P 100 host | grep 1.2.3.4 + +If the processing requires more steps the for-loop like this: + + (for x in `cat list` ; do + no_extension=${x%.*}; + do_step1 $x scale $no_extension.jpg + do_step2 <$x $no_extension + done) | process_output + +and while-loops like this: + + cat list | (while read x ; do + no_extension=${x%.*}; + do_step1 $x scale $no_extension.jpg + do_step2 <$x $no_extension + done) | process_output + +can be written like this: + + cat list | parallel "do_step1 {} scale {.}.jpg ; do_step2 <{} {.}" |\ + process_output + +If the body of the loop is bigger, it improves readability to use a function: + + (for x in `cat list` ; do + do_something $x + [... 100 lines that do something with $x ...] + done) | process_output + + cat list | (while read x ; do + do_something $x + [... 100 lines that do something with $x ...] + done) | process_output + +can both be rewritten as: + + doit() { + x=$1 + do_something $x + [... 100 lines that do something with $x ...] + } + export -f doit + cat list | parallel doit + +=head2 EXAMPLE: Rewriting nested for-loops + +Nested for-loops like this: + + (for x in `cat xlist` ; do + for y in `cat ylist` ; do + do_something $x $y + done + done) | process_output + +can be written like this: + + parallel do_something {1} {2} :::: xlist ylist | process_output + +Nested for-loops like this: + + (for colour in red green blue ; do + for size in S M L XL XXL ; do + echo $colour $size + done + done) | sort + +can be written like this: + + parallel echo {1} {2} ::: red green blue ::: S M L XL XXL | sort + + +=head2 EXAMPLE: Finding the lowest difference between files + +B<diff> is good for finding differences in text files. B<diff | wc -l> +gives an indication of the size of the difference. To find the +differences between all files in the current dir do: + + parallel --tag 'diff {1} {2} | wc -l' ::: * ::: * | sort -nk3 + +This way it is possible to see if some files are closer to other +files. + + +=head2 EXAMPLE: for-loops with column names + +When doing multiple nested for-loops it can be easier to keep track of +the loop variable if is is named instead of just having a number. Use +B<--header :> to let the first argument be an named alias for the +positional replacement string: + + parallel --header : echo {colour} {size} \ + ::: colour red green blue ::: size S M L XL XXL + +This also works if the input file is a file with columns: + + cat addressbook.tsv | \ + parallel --colsep '\t' --header : echo {Name} {E-mail address} + + +=head2 EXAMPLE: All combinations in a list + +GNU B<parallel> makes all combinations when given two lists. + +To make all combinations in a single list with unique values, you +repeat the list and use replacement string B<{choose_k}>: + + parallel --plus echo {choose_k} ::: A B C D ::: A B C D + + parallel --plus echo 2{2choose_k} 1{1choose_k} ::: A B C D ::: A B C D + +B<{choose_k}> works for any number of input sources: + + parallel --plus echo {choose_k} ::: A B C D ::: A B C D ::: A B C D + +Where B<{choose_k}> does not care about order, B<{uniq}> cares about +order. 
It simply skips jobs where values from different input sources +are the same: + + parallel --plus echo {uniq} ::: A B C ::: A B C ::: A B C + parallel --plus echo {1uniq}+{2uniq}+{3uniq} \ + ::: A B C ::: A B C ::: A B C + +The behaviour of B<{choose_k}> is undefined, if the input values of each +source are different. + + +=head2 EXAMPLE: Which git branches are the most similar + +If you have a ton of branches in git, it may be useful to see how +similar the branches are. This gives a rough estimate: + + parallel --trim rl --plus --tag 'git diff {choose_k} | wc -c' \ + :::: <(git branch | grep -v '*') <(git branch | grep -v '*') | + sort -k3n + + +=head2 EXAMPLE: From a to b and b to c + +Assume you have input like: + + aardvark + babble + cab + dab + each + +and want to run combinations like: + + aardvark babble + babble cab + cab dab + dab each + +If the input is in the file in.txt: + + parallel echo {1} - {2} ::::+ <(head -n -1 in.txt) <(tail -n +2 in.txt) + +If the input is in the array $a here are two solutions: + + seq $((${#a[@]}-1)) | \ + env_parallel --env a echo '${a[{=$_--=}]} - ${a[{}]}' + parallel echo {1} - {2} ::: "${a[@]::${#a[@]}-1}" :::+ "${a[@]:1}" + + +=head2 EXAMPLE: Count the differences between all files in a dir + +Using B<--results> the results are saved in /tmp/diffcount*. + + parallel --results /tmp/diffcount "diff -U 0 {1} {2} | \ + tail -n +3 |grep -v '^@'|wc -l" ::: * ::: * + +To see the difference between file A and file B look at the file +'/tmp/diffcount/1/A/2/B'. + + +=head2 EXAMPLE: Speeding up fast jobs + +Starting a job on the local machine takes around 3-10 ms. This can be +a big overhead if the job takes very few ms to run. Often you can +group small jobs together using B<-X> which will make the overhead +less significant. Compare the speed of these: + + seq -w 0 9999 | parallel touch pict{}.jpg + seq -w 0 9999 | parallel -X touch pict{}.jpg + +If your program cannot take multiple arguments, then you can use GNU +B<parallel> to spawn multiple GNU B<parallel>s: + + seq -w 0 9999999 | \ + parallel -j10 -q -I,, --pipe parallel -j0 touch pict{}.jpg + +If B<-j0> normally spawns 252 jobs, then the above will try to spawn +2520 jobs. On a normal GNU/Linux system you can spawn 32000 jobs using +this technique with no problems. To raise the 32000 jobs limit raise +/proc/sys/kernel/pid_max to 4194303. + +If you do not need GNU B<parallel> to have control over each job (so +no need for B<--retries> or B<--joblog> or similar), then it can be +even faster if you can generate the command lines and pipe those to a +shell. So if you can do this: + + mygenerator | sh + +Then that can be parallelized like this: + + mygenerator | parallel --pipe --block 10M sh + +E.g. + + mygenerator() { + seq 10000000 | perl -pe 'print "echo This is fast job number "'; + } + mygenerator | parallel --pipe --block 10M sh + +The overhead is 100000 times smaller namely around 100 nanoseconds per +job. + + +=head2 EXAMPLE: Using shell variables + +When using shell variables you need to quote them correctly as they +may otherwise be interpreted by the shell. + +Notice the difference between: + + ARR=("My brother's 12\" records are worth <\$\$\$>"'!' Foo Bar) + parallel echo ::: ${ARR[@]} # This is probably not what you want + +and: + + ARR=("My brother's 12\" records are worth <\$\$\$>"'!' Foo Bar) + parallel echo ::: "${ARR[@]}" + +When using variables in the actual command that contains special +characters (e.g. 
space) you can quote them using B<'"$VAR"'> or using +"'s and B<-q>: + + VAR="My brother's 12\" records are worth <\$\$\$>" + parallel -q echo "$VAR" ::: '!' + export VAR + parallel echo '"$VAR"' ::: '!' + +If B<$VAR> does not contain ' then B<"'$VAR'"> will also work +(and does not need B<export>): + + VAR="My 12\" records are worth <\$\$\$>" + parallel echo "'$VAR'" ::: '!' + +If you use them in a function you just quote as you normally would do: + + VAR="My brother's 12\" records are worth <\$\$\$>" + export VAR + myfunc() { echo "$VAR" "$1"; } + export -f myfunc + parallel myfunc ::: '!' + + +=head2 EXAMPLE: Group output lines + +When running jobs that output data, you often do not want the output +of multiple jobs to run together. GNU B<parallel> defaults to grouping +the output of each job, so the output is printed when the job +finishes. If you want full lines to be printed while the job is +running you can use B<--line-buffer>. If you want output to be +printed as soon as possible you can use B<-u>. + +Compare the output of: + + parallel wget --progress=dot --limit-rate=100k \ + https://ftpmirror.gnu.org/parallel/parallel-20{}0822.tar.bz2 \ + ::: {12..16} + parallel --line-buffer wget --progress=dot --limit-rate=100k \ + https://ftpmirror.gnu.org/parallel/parallel-20{}0822.tar.bz2 \ + ::: {12..16} + parallel --latest-line wget --progress=dot --limit-rate=100k \ + https://ftpmirror.gnu.org/parallel/parallel-20{}0822.tar.bz2 \ + ::: {12..16} + parallel -u wget --progress=dot --limit-rate=100k \ + https://ftpmirror.gnu.org/parallel/parallel-20{}0822.tar.bz2 \ + ::: {12..16} + +=head2 EXAMPLE: Tag output lines + +GNU B<parallel> groups the output lines, but it can be hard to see +where the different jobs begin. B<--tag> prepends the argument to make +that more visible: + + parallel --tag wget --limit-rate=100k \ + https://ftpmirror.gnu.org/parallel/parallel-20{}0822.tar.bz2 \ + ::: {12..16} + +B<--tag> works with B<--line-buffer> but not with B<-u>: + + parallel --tag --line-buffer wget --limit-rate=100k \ + https://ftpmirror.gnu.org/parallel/parallel-20{}0822.tar.bz2 \ + ::: {12..16} + +Check the uptime of the servers in I<~/.parallel/sshloginfile>: + + parallel --tag -S .. --nonall uptime + + +=head2 EXAMPLE: Colorize output + +Give each job a new color. Most terminals support ANSI colors with the +escape code "\033[30;3Xm" where 0 <= X <= 7: + + seq 10 | \ + parallel --tagstring '\033[30;3{=$_=++$::color%8=}m' seq {} + parallel --rpl '{color} $_="\033[30;3".(++$::color%8)."m"' \ + --tagstring {color} seq {} ::: {1..10} + +To get rid of the initial \t (which comes from B<--tagstring>): + + ... | perl -pe 's/\t//' + + +=head2 EXAMPLE: Keep order of output same as order of input + +Normally the output of a job will be printed as soon as it +completes. Sometimes you want the order of the output to remain the +same as the order of the input. This is often important, if the output +is used as input for another system. B<-k> will make sure the order of +output will be in the same order as input even if later jobs end +before earlier jobs. + +Append a string to every line in a text file: + + cat textfile | parallel -k echo {} append_string + +If you remove B<-k> some of the lines may come out in the wrong order. + +Another example is B<traceroute>: + + parallel traceroute ::: qubes-os.org debian.org freenetproject.org + +will give traceroute of qubes-os.org, debian.org and +freenetproject.org, but it will be sorted according to which job +completed first. 
+ +To keep the order the same as input run: + + parallel -k traceroute ::: qubes-os.org debian.org freenetproject.org + +This will make sure the traceroute to qubes-os.org will be printed +first. + +A bit more complex example is downloading a huge file in chunks in +parallel: Some internet connections will deliver more data if you +download files in parallel. For downloading files in parallel see: +"EXAMPLE: Download 10 images for each of the past 30 days". But if you +are downloading a big file you can download the file in chunks in +parallel. + +To download byte 10000000-19999999 you can use B<curl>: + + curl -r 10000000-19999999 https://example.com/the/big/file >file.part + +To download a 1 GB file we need 100 10MB chunks downloaded and +combined in the correct order. + + seq 0 99 | parallel -k curl -r \ + {}0000000-{}9999999 https://example.com/the/big/file > file + + +=head2 EXAMPLE: Parallel grep + +B<grep -r> greps recursively through directories. GNU B<parallel> can +often speed this up. + + find . -type f | parallel -k -j150% -n 1000 -m grep -H -n STRING {} + +This will run 1.5 job per CPU, and give 1000 arguments to B<grep>. + +There are situations where the above will be slower than B<grep -r>: + +=over 2 + +=item * + +If data is already in RAM. The overhead of starting jobs and buffering +output may outweigh the benefit of running in parallel. + +=item * + +If the files are big. If a file cannot be read in a single seek, the +disk may start thrashing. + +=back + +The speedup is caused by two factors: + +=over 2 + +=item * + +On rotating harddisks small files often require a seek for each +file. By searching for more files in parallel, the arm may pass +another wanted file on its way. + +=item * + +NVMe drives often perform better by having multiple command running in +parallel. + +=back + + +=head2 EXAMPLE: Grepping n lines for m regular expressions. + +The simplest solution to grep a big file for a lot of regexps is: + + grep -f regexps.txt bigfile + +Or if the regexps are fixed strings: + + grep -F -f regexps.txt bigfile + +There are 3 limiting factors: CPU, RAM, and disk I/O. + +RAM is easy to measure: If the B<grep> process takes up most of your +free memory (e.g. when running B<top>), then RAM is a limiting factor. + +CPU is also easy to measure: If the B<grep> takes >90% CPU in B<top>, +then the CPU is a limiting factor, and parallelization will speed this +up. + +It is harder to see if disk I/O is the limiting factor, and depending +on the disk system it may be faster or slower to parallelize. The only +way to know for certain is to test and measure. + + +=head3 Limiting factor: RAM + +The normal B<grep -f regexps.txt bigfile> works no matter the size of +bigfile, but if regexps.txt is so big it cannot fit into memory, then +you need to split this. + +B<grep -F> takes around 100 bytes of RAM and B<grep> takes about 500 +bytes of RAM per 1 byte of regexp. So if regexps.txt is 1% of your +RAM, then it may be too big. + +If you can convert your regexps into fixed strings do that. E.g. if +the lines you are looking for in bigfile all looks like: + + ID1 foo bar baz Identifier1 quux + fubar ID2 foo bar baz Identifier2 + +then your regexps.txt can be converted from: + + ID1.*Identifier1 + ID2.*Identifier2 + +into: + + ID1 foo bar baz Identifier1 + ID2 foo bar baz Identifier2 + +This way you can use B<grep -F> which takes around 80% less memory and +is much faster. 
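
Whether I<regexps.txt> really only contains fixed strings can be
checked by looking for regexp metacharacters. This is a rough sketch,
not a full regexp parser:

  # Print lines that contain likely regexp metacharacters
  grep -n '[][\.*+?^$(){}|]' regexps.txt

If this prints nothing, B<grep -F -f regexps.txt> will match the same
lines as B<grep -f regexps.txt>.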
+ +If it still does not fit in memory you can do this: + + parallel --pipe-part -a regexps.txt --block 1M grep -F -f - -n bigfile | \ + sort -un | perl -pe 's/^\d+://' + +The 1M should be your free memory divided by the number of CPU threads and +divided by 200 for B<grep -F> and by 1000 for normal B<grep>. On +GNU/Linux you can do: + + free=$(awk '/^((Swap)?Cached|MemFree|Buffers):/ { sum += $2 } + END { print sum }' /proc/meminfo) + percpu=$((free / 200 / $(parallel --number-of-threads)))k + + parallel --pipe-part -a regexps.txt --block $percpu --compress \ + grep -F -f - -n bigfile | \ + sort -un | perl -pe 's/^\d+://' + +If you can live with duplicated lines and wrong order, it is faster to do: + + parallel --pipe-part -a regexps.txt --block $percpu --compress \ + grep -F -f - bigfile + +=head3 Limiting factor: CPU + +If the CPU is the limiting factor parallelization should be done on +the regexps: + + cat regexps.txt | parallel --pipe -L1000 --round-robin --compress \ + grep -f - -n bigfile | \ + sort -un | perl -pe 's/^\d+://' + +The command will start one B<grep> per CPU and read I<bigfile> one +time per CPU, but as that is done in parallel, all reads except the +first will be cached in RAM. Depending on the size of I<regexps.txt> it +may be faster to use B<--block 10m> instead of B<-L1000>. + +Some storage systems perform better when reading multiple chunks in +parallel. This is true for some RAID systems and for some network file +systems. To parallelize the reading of I<bigfile>: + + parallel --pipe-part --block 100M -a bigfile -k --compress \ + grep -f regexps.txt + +This will split I<bigfile> into 100MB chunks and run B<grep> on each of +these chunks. To parallelize both reading of I<bigfile> and I<regexps.txt> +combine the two using B<--cat>: + + parallel --pipe-part --block 100M -a bigfile --cat cat regexps.txt \ + \| parallel --pipe -L1000 --round-robin grep -f - {} + +If a line matches multiple regexps, the line may be duplicated. + +=head3 Bigger problem + +If the problem is too big to be solved by this, you are probably ready +for Lucene. + + +=head2 EXAMPLE: Using remote computers + +To run commands on a remote computer SSH needs to be set up and you +must be able to login without entering a password (The commands +B<ssh-copy-id>, B<ssh-agent>, and B<sshpass> may help you do that). + +If you need to login to a whole cluster, you typically do not want to +accept the host key for every host. You want to accept them the first +time and be warned if they are ever changed. To do that: + + # Add the servers to the sshloginfile + (echo servera; echo serverb) > .parallel/my_cluster + # Make sure .ssh/config exist + touch .ssh/config + cp .ssh/config .ssh/config.backup + # Disable StrictHostKeyChecking temporarily + (echo 'Host *'; echo StrictHostKeyChecking no) >> .ssh/config + parallel --slf my_cluster --nonall true + # Remove the disabling of StrictHostKeyChecking + mv .ssh/config.backup .ssh/config + +The servers in B<.parallel/my_cluster> are now added in B<.ssh/known_hosts>. 
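
To verify that every host in I<my_cluster> can now be reached without
a password prompt, a quick check (a sketch reusing the same
sshloginfile) is:

  parallel --nonall --slf my_cluster --tag echo OK

Each reachable server should print its name followed by OK.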
+ +To run B<echo> on B<server.example.com>: + + seq 10 | parallel --sshlogin server.example.com echo + +To run commands on more than one remote computer run: + + seq 10 | parallel --sshlogin s1.example.com,s2.example.net echo + +Or: + + seq 10 | parallel --sshlogin server.example.com \ + --sshlogin server2.example.net echo + +If the login username is I<foo> on I<server2.example.net> use: + + seq 10 | parallel --sshlogin server.example.com \ + --sshlogin foo@server2.example.net echo + +If your list of hosts is I<server1-88.example.net> with login I<foo>: + + seq 10 | parallel -Sfoo@server{1..88}.example.net echo + +To distribute the commands to a list of computers, make a file +I<mycomputers> with all the computers: + + server.example.com + foo@server2.example.com + server3.example.com + +Then run: + + seq 10 | parallel --sshloginfile mycomputers echo + +To include the local computer add the special sshlogin ':' to the list: + + server.example.com + foo@server2.example.com + server3.example.com + : + +GNU B<parallel> will try to determine the number of CPUs on each of +the remote computers, and run one job per CPU - even if the remote +computers do not have the same number of CPUs. + +If the number of CPUs on the remote computers is not identified +correctly the number of CPUs can be added in front. Here the computer +has 8 CPUs. + + seq 10 | parallel --sshlogin 8/server.example.com echo + + +=head2 EXAMPLE: Transferring of files + +To recompress gzipped files with B<bzip2> using a remote computer run: + + find logs/ -name '*.gz' | \ + parallel --sshlogin server.example.com \ + --transfer "zcat {} | bzip2 -9 >{.}.bz2" + +This will list the .gz-files in the I<logs> directory and all +directories below. Then it will transfer the files to +I<server.example.com> to the corresponding directory in +I<$HOME/logs>. On I<server.example.com> the file will be recompressed +using B<zcat> and B<bzip2> resulting in the corresponding file with +I<.gz> replaced with I<.bz2>. + +If you want the resulting bz2-file to be transferred back to the local +computer add I<--return {.}.bz2>: + + find logs/ -name '*.gz' | \ + parallel --sshlogin server.example.com \ + --transfer --return {.}.bz2 "zcat {} | bzip2 -9 >{.}.bz2" + +After the recompressing is done the I<.bz2>-file is transferred back to +the local computer and put next to the original I<.gz>-file. + +If you want to delete the transferred files on the remote computer add +I<--cleanup>. This will remove both the file transferred to the remote +computer and the files transferred from the remote computer: + + find logs/ -name '*.gz' | \ + parallel --sshlogin server.example.com \ + --transfer --return {.}.bz2 --cleanup "zcat {} | bzip2 -9 >{.}.bz2" + +If you want run on several computers add the computers to I<--sshlogin> +either using ',' or multiple I<--sshlogin>: + + find logs/ -name '*.gz' | \ + parallel --sshlogin server.example.com,server2.example.com \ + --sshlogin server3.example.com \ + --transfer --return {.}.bz2 --cleanup "zcat {} | bzip2 -9 >{.}.bz2" + +You can add the local computer using I<--sshlogin :>. This will disable the +removing and transferring for the local computer only: + + find logs/ -name '*.gz' | \ + parallel --sshlogin server.example.com,server2.example.com \ + --sshlogin server3.example.com \ + --sshlogin : \ + --transfer --return {.}.bz2 --cleanup "zcat {} | bzip2 -9 >{.}.bz2" + +Often I<--transfer>, I<--return> and I<--cleanup> are used together. 
They can be +shortened to I<--trc>: + + find logs/ -name '*.gz' | \ + parallel --sshlogin server.example.com,server2.example.com \ + --sshlogin server3.example.com \ + --sshlogin : \ + --trc {.}.bz2 "zcat {} | bzip2 -9 >{.}.bz2" + +With the file I<mycomputers> containing the list of computers it becomes: + + find logs/ -name '*.gz' | parallel --sshloginfile mycomputers \ + --trc {.}.bz2 "zcat {} | bzip2 -9 >{.}.bz2" + +If the file I<~/.parallel/sshloginfile> contains the list of computers +the special short hand I<-S ..> can be used: + + find logs/ -name '*.gz' | parallel -S .. \ + --trc {.}.bz2 "zcat {} | bzip2 -9 >{.}.bz2" + + +=head2 EXAMPLE: Advanced file transfer + +Assume you have files in in/*, want them processed on server, +and transferred back into /other/dir: + + parallel -S server --trc /other/dir/./{/}.out \ + cp {/} {/}.out ::: in/./* + + +=head2 EXAMPLE: Distributing work to local and remote computers + +Convert *.mp3 to *.ogg running one process per CPU on local computer +and server2: + + parallel --trc {.}.ogg -S server2,: \ + 'mpg321 -w - {} | oggenc -q0 - -o {.}.ogg' ::: *.mp3 + + +=head2 EXAMPLE: Running the same command on remote computers + +To run the command B<uptime> on remote computers you can do: + + parallel --tag --nonall -S server1,server2 uptime + +B<--nonall> reads no arguments. If you have a list of jobs you want +to run on each computer you can do: + + parallel --tag --onall -S server1,server2 echo ::: 1 2 3 + +Remove B<--tag> if you do not want the sshlogin added before the +output. + +If you have a lot of hosts use '-j0' to access more hosts in parallel. + + +=head2 EXAMPLE: Running 'sudo' on remote computers + +Put the password into passwordfile then run: + + parallel --ssh 'cat passwordfile | ssh' --nonall \ + -S user@server1,user@server2 sudo -S ls -l /root + + +=head2 EXAMPLE: Using remote computers behind NAT wall + +If the workers are behind a NAT wall, you need some trickery to get to +them. + +If you can B<ssh> to a jumphost, and reach the workers from there, +then the obvious solution would be this, but it B<does not work>: + + parallel --ssh 'ssh jumphost ssh' -S host1 echo ::: DOES NOT WORK + +It does not work because the command is dequoted by B<ssh> twice where +as GNU B<parallel> only expects it to be dequoted once. + +You can use a bash function and have GNU B<parallel> quote the command: + + jumpssh() { ssh -A jumphost ssh $(parallel --shellquote ::: "$@"); } + export -f jumpssh + parallel --ssh jumpssh -S host1 echo ::: this works + +Or you can instead put this in B<~/.ssh/config>: + + Host host1 host2 host3 + ProxyCommand ssh jumphost.domain nc -w 1 %h 22 + +It requires B<nc(netcat)> to be installed on jumphost. With this you +can simply: + + parallel -S host1,host2,host3 echo ::: This does work + +=head3 No jumphost, but port forwards + +If there is no jumphost but each server has port 22 forwarded from the +firewall (e.g. the firewall's port 22001 = port 22 on host1, 22002 = host2, +22003 = host3) then you can use B<~/.ssh/config>: + + Host host1.v + Port 22001 + Host host2.v + Port 22002 + Host host3.v + Port 22003 + Host *.v + Hostname firewall + +And then use host{1..3}.v as normal hosts: + + parallel -S host1.v,host2.v,host3.v echo ::: a b c + +=head3 No jumphost, no port forwards + +If ports cannot be forwarded, you need some sort of VPN to traverse +the NAT-wall. TOR is one options for that, as it is very easy to get +working. + +You need to install TOR and setup a hidden service. 
In B<torrc> put: + + HiddenServiceDir /var/lib/tor/hidden_service/ + HiddenServicePort 22 127.0.0.1:22 + +Then start TOR: B</etc/init.d/tor restart> + +The TOR hostname is now in B</var/lib/tor/hidden_service/hostname> and +is something similar to B<izjafdceobowklhz.onion>. Now you simply +prepend B<torsocks> to B<ssh>: + + parallel --ssh 'torsocks ssh' -S izjafdceobowklhz.onion \ + -S zfcdaeiojoklbwhz.onion,auclucjzobowklhi.onion echo ::: a b c + +If not all hosts are accessible through TOR: + + parallel -S 'torsocks ssh izjafdceobowklhz.onion,host2,host3' \ + echo ::: a b c + +See more B<ssh> tricks on https://en.wikibooks.org/wiki/OpenSSH/Cookbook/Proxies_and_Jump_Hosts + + +=head2 EXAMPLE: Use sshpass with ssh + +If you cannot use passwordless login, you may be able to use B<sshpass>: + + seq 10 | parallel -S user-with-password:MyPassword@server echo + +or: + + export SSHPASS='MyPa$$w0rd' + seq 10 | parallel -S user-with-password:@server echo + + +=head2 EXAMPLE: Use outrun instead of ssh + +B<outrun> lets you run a command on a remote server. B<outrun> sets up +a connection to access files at the source server, and automatically +transfers files. B<outrun> must be installed on the remote system. + +You can use B<outrun> in an sshlogin this way: + + parallel -S 'outrun user@server' command + +or: + + parallel --ssh outrun -S server command + + +=head2 EXAMPLE: Slurm cluster + +The Slurm Workload Manager is used in many clusters. + +Here is a simple example of using GNU B<parallel> to call B<srun>: + + #!/bin/bash + + #SBATCH --time 00:02:00 + #SBATCH --ntasks=4 + #SBATCH --job-name GnuParallelDemo + #SBATCH --output gnuparallel.out + + module purge + module load gnu_parallel + + my_parallel="parallel --delay .2 -j $SLURM_NTASKS" + my_srun="srun --export=all --exclusive -n1" + my_srun="$my_srun --cpus-per-task=1 --cpu-bind=cores" + $my_parallel "$my_srun" echo This is job {} ::: {1..20} + + +=head2 EXAMPLE: Parallelizing rsync + +B<rsync> is a great tool, but sometimes it will not fill up the +available bandwidth. Running multiple B<rsync> in parallel can fix +this. + + cd src-dir + find . -type f | + parallel -j10 -X rsync -zR -Ha ./{} fooserver:/dest-dir/ + +Adjust B<-j10> until you find the optimal number. + +B<rsync -R> will create the needed subdirectories, so all files are +not put into a single dir. The B<./> is needed so the resulting command +looks similar to: + + rsync -zR ././sub/dir/file fooserver:/dest-dir/ + +The B</./> is what B<rsync -R> works on. + +If you are unable to push data, but need to pull them and the files +are called digits.png (e.g. 000000.png) you might be able to do: + + seq -w 0 99 | parallel rsync -Havessh fooserver:src/*{}.png destdir/ + + +=head2 EXAMPLE: Use multiple inputs in one command + +Copy files like foo.es.ext to foo.ext: + + ls *.es.* | perl -pe 'print; s/\.es//' | parallel -N2 cp {1} {2} + +The perl command spits out 2 lines for each input. GNU B<parallel> +takes 2 inputs (using B<-N2>) and replaces {1} and {2} with the inputs. + +Count in binary: + + parallel -k echo ::: 0 1 ::: 0 1 ::: 0 1 ::: 0 1 ::: 0 1 ::: 0 1 + +Print the number on the opposing sides of a six sided die: + + parallel --link -a <(seq 6) -a <(seq 6 -1 1) echo + parallel --link echo :::: <(seq 6) <(seq 6 -1 1) + +Convert files from all subdirs to PNG-files with consecutive numbers +(useful for making input PNG's for B<ffmpeg>): + + parallel --link -a <(find . -type f | sort) \ + -a <(seq $(find . -type f|wc -l)) convert {1} {2}.png + +Alternative version: + + find . 
-type f | sort | parallel convert {} {#}.png + + +=head2 EXAMPLE: Use a table as input + +Content of table_file.tsv: + + foo<TAB>bar + baz <TAB> quux + +To run: + + cmd -o bar -i foo + cmd -o quux -i baz + +you can run: + + parallel -a table_file.tsv --colsep '\t' cmd -o {2} -i {1} + +Note: The default for GNU B<parallel> is to remove the spaces around +the columns. To keep the spaces: + + parallel -a table_file.tsv --trim n --colsep '\t' cmd -o {2} -i {1} + + +=head2 EXAMPLE: Output to database + +GNU B<parallel> can output to a database table and a CSV-file: + + dburl=csv:///%2Ftmp%2Fmydir + dbtableurl=$dburl/mytable.csv + parallel --sqlandworker $dbtableurl seq ::: {1..10} + +It is rather slow and takes up a lot of CPU time because GNU +B<parallel> parses the whole CSV file for each update. + +A better approach is to use an SQLite-base and then convert that to CSV: + + dburl=sqlite3:///%2Ftmp%2Fmy.sqlite + dbtableurl=$dburl/mytable + parallel --sqlandworker $dbtableurl seq ::: {1..10} + sql $dburl '.headers on' '.mode csv' 'SELECT * FROM mytable;' + +This takes around a second per job. + +If you have access to a real database system, such as PostgreSQL, it +is even faster: + + dburl=pg://user:pass@host/mydb + dbtableurl=$dburl/mytable + parallel --sqlandworker $dbtableurl seq ::: {1..10} + sql $dburl \ + "COPY (SELECT * FROM mytable) TO stdout DELIMITER ',' CSV HEADER;" + +Or MySQL: + + dburl=mysql://user:pass@host/mydb + dbtableurl=$dburl/mytable + parallel --sqlandworker $dbtableurl seq ::: {1..10} + sql -p -B $dburl "SELECT * FROM mytable;" > mytable.tsv + perl -pe 's/"/""/g; s/\t/","/g; s/^/"/; s/$/"/; + %s=("\\" => "\\", "t" => "\t", "n" => "\n"); + s/\\([\\tn])/$s{$1}/g;' mytable.tsv + + +=head2 EXAMPLE: Output to CSV-file for R + +If you have no need for the advanced job distribution control that a +database provides, but you simply want output into a CSV file that you +can read into R or LibreCalc, then you can use B<--results>: + + parallel --results my.csv seq ::: 10 20 30 + R + > mydf <- read.csv("my.csv"); + > print(mydf[2,]) + > write(as.character(mydf[2,c("Stdout")]),'') + + +=head2 EXAMPLE: Use XML as input + +The show Aflyttet on Radio 24syv publishes an RSS feed with their audio +podcasts on: http://arkiv.radio24syv.dk/audiopodcast/channel/4466232 + +Using B<xpath> you can extract the URLs for 2019 and download them +using GNU B<parallel>: + + wget -O - http://arkiv.radio24syv.dk/audiopodcast/channel/4466232 | \ + xpath -e "//pubDate[contains(text(),'2019')]/../enclosure/@url" | \ + parallel -u wget '{= s/ url="//; s/"//; =}' + + +=head2 EXAMPLE: Run the same command 10 times + +If you want to run the same command with the same arguments 10 times +in parallel you can do: + + seq 10 | parallel -n0 my_command my_args + + +=head2 EXAMPLE: Working as cat | sh. Resource inexpensive jobs and evaluation + +GNU B<parallel> can work similar to B<cat | sh>. + +A resource inexpensive job is a job that takes very little CPU, disk +I/O and network I/O. Ping is an example of a resource inexpensive +job. wget is too - if the webpages are small. + +The content of the file jobs_to_run: + + ping -c 1 10.0.0.1 + wget http://example.com/status.cgi?ip=10.0.0.1 + ping -c 1 10.0.0.2 + wget http://example.com/status.cgi?ip=10.0.0.2 + ... + ping -c 1 10.0.0.255 + wget http://example.com/status.cgi?ip=10.0.0.255 + +To run 100 processes simultaneously do: + + parallel -j 100 < jobs_to_run + +As there is not a I<command> the jobs will be evaluated by the shell. 
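
A file like I<jobs_to_run> can itself be generated with a small
loop. A sketch (the IP range and URL match the example above):

  for i in $(seq 1 255); do
    echo "ping -c 1 10.0.0.$i"
    echo "wget http://example.com/status.cgi?ip=10.0.0.$i"
  done > jobs_to_run
  parallel -j 100 < jobs_to_run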
+ + +=head2 EXAMPLE: Call program with FASTA sequence + +FASTA files have the format: + + >Sequence name1 + sequence + sequence continued + >Sequence name2 + sequence + sequence continued + more sequence + +To call B<myprog> with the sequence as argument run: + + cat file.fasta | + parallel --pipe -N1 --recstart '>' --rrs \ + 'read a; echo Name: "$a"; myprog $(tr -d "\n")' + + +=head2 EXAMPLE: Call program with interleaved FASTQ records + +FASTQ files have the format: + + @M10991:61:000000000-A7EML:1:1101:14011:1001 1:N:0:28 + CTCCTAGGTCGGCATGATGGGGGAAGGAGAGCATGGGAAGAAATGAGAGAGTAGCAAGG + + + #8BCCGGGGGFEFECFGGGGGGGGG@;FFGGGEG@FF<EE<@FFC,CEGCCGGFF<FGF + +Interleaved FASTQ starts with a line like these: + + @HWUSI-EAS100R:6:73:941:1973#0/1 + @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG + @EAS139:136:FC706VJ:2:2104:15343:197393 1:N:18:1 + +where '/1' and ' 1:' determines this is read 1. + +This will cut big.fq into one chunk per CPU thread and pass it on +stdin (standard input) to the program fastq-reader: + + parallel --pipe-part -a big.fq --block -1 --regexp \ + --recend '\n' --recstart '@.*(/1| 1:.*)\n[A-Za-z\n\.~]' \ + fastq-reader + + +=head2 EXAMPLE: Processing a big file using more CPUs + +To process a big file or some output you can use B<--pipe> to split up +the data into blocks and pipe the blocks into the processing program. + +If the program is B<gzip -9> you can do: + + cat bigfile | parallel --pipe --recend '' -k gzip -9 > bigfile.gz + +This will split B<bigfile> into blocks of 1 MB and pass that to B<gzip +-9> in parallel. One B<gzip> will be run per CPU. The output of B<gzip +-9> will be kept in order and saved to B<bigfile.gz> + +B<gzip> works fine if the output is appended, but some processing does +not work like that - for example sorting. For this GNU B<parallel> can +put the output of each command into a file. This will sort a big file +in parallel: + + cat bigfile | parallel --pipe --files sort |\ + parallel -Xj1 sort -m {} ';' rm {} >bigfile.sort + +Here B<bigfile> is split into blocks of around 1MB, each block ending +in '\n' (which is the default for B<--recend>). Each block is passed +to B<sort> and the output from B<sort> is saved into files. These +files are passed to the second B<parallel> that runs B<sort -m> on the +files before it removes the files. The output is saved to +B<bigfile.sort>. + +GNU B<parallel>'s B<--pipe> maxes out at around 100 MB/s because every +byte has to be copied through GNU B<parallel>. But if B<bigfile> is a +real (seekable) file GNU B<parallel> can by-pass the copying and send +the parts directly to the program: + + parallel --pipe-part --block 100m -a bigfile --files sort |\ + parallel -Xj1 sort -m {} ';' rm {} >bigfile.sort + + +=head2 EXAMPLE: Grouping input lines + +When processing with B<--pipe> you may have lines grouped by a +value. Here is I<my.csv>: + + Transaction Customer Item + 1 a 53 + 2 b 65 + 3 b 82 + 4 c 96 + 5 c 67 + 6 c 13 + 7 d 90 + 8 d 43 + 9 d 91 + 10 d 84 + 11 e 72 + 12 e 102 + 13 e 63 + 14 e 56 + 15 e 74 + +Let us assume you want GNU B<parallel> to process each customer. In +other words: You want all the transactions for a single customer to be +treated as a single record. + +To do this we preprocess the data with a program that inserts a record +separator before each customer (column 2 = $F[1]). 
Here we first make +a 50 character random string, which we then use as the separator: + + sep=`perl -e 'print map { ("a".."z","A".."Z")[rand(52)] } (1..50);'` + cat my.csv | \ + perl -ape '$F[1] ne $l and print "'$sep'"; $l = $F[1]' | \ + parallel --recend $sep --rrs --pipe -N1 wc + +If your program can process multiple customers replace B<-N1> with a +reasonable B<--blocksize>. + + +=head2 EXAMPLE: Running more than 250 jobs workaround + +If you need to run a massive amount of jobs in parallel, then you will +likely hit the filehandle limit which is often around 250 jobs. If you +are super user you can raise the limit in /etc/security/limits.conf +but you can also use this workaround. The filehandle limit is per +process. That means that if you just spawn more GNU B<parallel>s then +each of them can run 250 jobs. This will spawn up to 2500 jobs: + + cat myinput |\ + parallel --pipe -N 50 --round-robin -j50 parallel -j50 your_prg + +This will spawn up to 62500 jobs (use with caution - you need 64 GB +RAM to do this, and you may need to increase /proc/sys/kernel/pid_max): + + cat myinput |\ + parallel --pipe -N 250 --round-robin -j250 parallel -j250 your_prg + + +=head2 EXAMPLE: Working as mutex and counting semaphore + +The command B<sem> is an alias for B<parallel --semaphore>. + +A counting semaphore will allow a given number of jobs to be started +in the background. When the number of jobs are running in the +background, GNU B<sem> will wait for one of these to complete before +starting another command. B<sem --wait> will wait for all jobs to +complete. + +Run 10 jobs concurrently in the background: + + for i in *.log ; do + echo $i + sem -j10 gzip $i ";" echo done + done + sem --wait + +A mutex is a counting semaphore allowing only one job to run. This +will edit the file I<myfile> and prepends the file with lines with the +numbers 1 to 3. + + seq 3 | parallel sem sed -i -e '1i{}' myfile + +As I<myfile> can be very big it is important only one process edits +the file at the same time. + +Name the semaphore to have multiple different semaphores active at the +same time: + + seq 3 | parallel sem --id mymutex sed -i -e '1i{}' myfile + + +=head2 EXAMPLE: Mutex for a script + +Assume a script is called from cron or from a web service, but only +one instance can be run at a time. With B<sem> and B<--shebang-wrap> +the script can be made to wait for other instances to finish. Here in +B<bash>: + + #!/usr/bin/sem --shebang-wrap -u --id $0 --fg /bin/bash + + echo This will run + sleep 5 + echo exclusively + +Here B<perl>: + + #!/usr/bin/sem --shebang-wrap -u --id $0 --fg /usr/bin/perl + + print "This will run "; + sleep 5; + print "exclusively\n"; + +Here B<python>: + + #!/usr/local/bin/sem --shebang-wrap -u --id $0 --fg /usr/bin/python + + import time + print "This will run "; + time.sleep(5) + print "exclusively"; + + +=head2 EXAMPLE: Start editor with file names from stdin (standard input) + +You can use GNU B<parallel> to start interactive programs like emacs or vi: + + cat filelist | parallel --tty -X emacs + cat filelist | parallel --tty -X vi + +If there are more files than will fit on a single command line, the +editor will be started again with the remaining files. + + +=head2 EXAMPLE: Running sudo + +B<sudo> requires a password to run a command as root. It caches the +access, so you only need to enter the password again if you have not +used B<sudo> for a while. 
+ +The command: + + parallel sudo echo ::: This is a bad idea + +is no good, as you would be prompted for the sudo password for each of +the jobs. Instead do: + + sudo parallel echo ::: This is a good idea + +This way you only have to enter the sudo password once. + +=head2 EXAMPLE: Run ping in parallel + +B<ping> prints out statistics when killed with CTRL-C. + +Unfortunately, CTRL-C will also normally kill GNU B<parallel>. + +But by using B<--open-tty> and ignoring SIGINT you can get the wanted effect: + + parallel -j0 --open-tty --lb --tag ping '{= $SIG{INT}=sub {} =}' \ + ::: 1.1.1.1 8.8.8.8 9.9.9.9 21.21.21.21 80.80.80.80 88.88.88.88 + +B<--open-tty> will make the B<ping>s receive SIGINT (from CTRL-C). +CTRL-C will not kill GNU B<parallel>, so that will only exit after +B<ping> is done. + + +=head2 EXAMPLE: GNU Parallel as queue system/batch manager + +GNU B<parallel> can work as a simple job queue system or batch manager. +The idea is to put the jobs into a file and have GNU B<parallel> read +from that continuously. As GNU B<parallel> will stop at end of file we +use B<tail> to continue reading: + + true >jobqueue; tail -n+0 -f jobqueue | parallel + +To submit your jobs to the queue: + + echo my_command my_arg >> jobqueue + +You can of course use B<-S> to distribute the jobs to remote +computers: + + true >jobqueue; tail -n+0 -f jobqueue | parallel -S .. + +Output only will be printed when reading the next input after a job +has finished: So you need to submit a job after the first has finished +to see the output from the first job. + +If you keep this running for a long time, jobqueue will grow. A way of +removing the jobs already run is by making GNU B<parallel> stop when +it hits a special value and then restart. To use B<--eof> to make GNU +B<parallel> exit, B<tail> also needs to be forced to exit: + + true >jobqueue; + while true; do + tail -n+0 -f jobqueue | + (parallel -E StOpHeRe -S ..; echo GNU Parallel is now done; + perl -e 'while(<>){/StOpHeRe/ and last};print <>' jobqueue > j2; + (seq 1000 >> jobqueue &); + echo Done appending dummy data forcing tail to exit) + echo tail exited; + mv j2 jobqueue + done + +In some cases you can run on more CPUs and computers during the night: + + # Day time + echo 50% > jobfile + cp day_server_list ~/.parallel/sshloginfile + # Night time + echo 100% > jobfile + cp night_server_list ~/.parallel/sshloginfile + tail -n+0 -f jobqueue | parallel --jobs jobfile -S .. + +GNU B<parallel> discovers if B<jobfile> or B<~/.parallel/sshloginfile> +changes. + + +=head2 EXAMPLE: GNU Parallel as dir processor + +If you have a dir in which users drop files that needs to be processed +you can do this on GNU/Linux (If you know what B<inotifywait> is +called on other platforms file a bug report): + + inotifywait -qmre MOVED_TO -e CLOSE_WRITE --format %w%f my_dir |\ + parallel -u echo + +This will run the command B<echo> on each file put into B<my_dir> or +subdirs of B<my_dir>. + +You can of course use B<-S> to distribute the jobs to remote +computers: + + inotifywait -qmre MOVED_TO -e CLOSE_WRITE --format %w%f my_dir |\ + parallel -S .. -u echo + +If the files to be processed are in a tar file then unpacking one file +and processing it immediately may be faster than first unpacking all +files. Set up the dir processor as above and unpack into the dir. + +Using GNU B<parallel> as dir processor has the same limitations as +using GNU B<parallel> as queue system/batch manager. 
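
As a sketch of a slightly fuller dir processor, each file can be moved
to a I<done> dir after processing, so it is easy to see what is still
pending (I<my_dir>, I<done>, and the B<echo> are only placeholders):

  mkdir -p done
  inotifywait -qmre MOVED_TO -e CLOSE_WRITE --format %w%f my_dir |
    parallel -u 'echo processing {} && mv {} done/'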
+ + +=head2 EXAMPLE: Locate the missing package + +If you have downloaded source and tried compiling it, you may have seen: + + $ ./configure + [...] + checking for something.h... no + configure: error: "libsomething not found" + +Often it is not obvious which package you should install to get that +file. Debian has `apt-file` to search for a file. `tracefile` from +https://codeberg.org/tange/tangetools can tell which files a program +tried to access. In this case we are interested in one of the last +files: + + $ tracefile -un ./configure | tail | parallel -j0 apt-file search + + +=head1 AUTHOR + +When using GNU B<parallel> for a publication please cite: + +O. Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login: +The USENIX Magazine, February 2011:42-47. + +This helps funding further development; and it won't cost you a cent. +If you pay 10000 EUR you should feel free to use GNU Parallel without citing. + +Copyright (C) 2007-10-18 Ole Tange, http://ole.tange.dk + +Copyright (C) 2008-2010 Ole Tange, http://ole.tange.dk + +Copyright (C) 2010-2024 Ole Tange, http://ole.tange.dk and Free +Software Foundation, Inc. + +Parts of the manual concerning B<xargs> compatibility is inspired by +the manual of B<xargs> from GNU findutils 4.4.2. + + +=head1 LICENSE + +This program is free software; you can redistribute it and/or modify +it under the terms of the GNU General Public License as published by +the Free Software Foundation; either version 3 of the License, or +at your option any later version. + +This program is distributed in the hope that it will be useful, +but WITHOUT ANY WARRANTY; without even the implied warranty of +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +GNU General Public License for more details. + +You should have received a copy of the GNU General Public License +along with this program. If not, see <https://www.gnu.org/licenses/>. + +=head2 Documentation license I + +Permission is granted to copy, distribute and/or modify this +documentation under the terms of the GNU Free Documentation License, +Version 1.3 or any later version published by the Free Software +Foundation; with no Invariant Sections, with no Front-Cover Texts, and +with no Back-Cover Texts. A copy of the license is included in the +file LICENSES/GFDL-1.3-or-later.txt. + +=head2 Documentation license II + +You are free: + +=over 9 + +=item B<to Share> + +to copy, distribute and transmit the work + +=item B<to Remix> + +to adapt the work + +=back + +Under the following conditions: + +=over 9 + +=item B<Attribution> + +You must attribute the work in the manner specified by the author or +licensor (but not in any way that suggests that they endorse you or +your use of the work). + +=item B<Share Alike> + +If you alter, transform, or build upon this work, you may distribute +the resulting work only under the same, similar or a compatible +license. + +=back + +With the understanding that: + +=over 9 + +=item B<Waiver> + +Any of the above conditions can be waived if you get permission from +the copyright holder. + +=item B<Public Domain> + +Where the work or any of its elements is in the public domain under +applicable law, that status is in no way affected by the license. 
+ +=item B<Other Rights> + +In no way are any of the following rights affected by the license: + +=over 2 + +=item * + +Your fair dealing or fair use rights, or other applicable +copyright exceptions and limitations; + +=item * + +The author's moral rights; + +=item * + +Rights other persons may have either in the work itself or in +how the work is used, such as publicity or privacy rights. + +=back + +=back + +=over 9 + +=item B<Notice> + +For any reuse or distribution, you must make clear to others the +license terms of this work. + +=back + +A copy of the full license is included in the file as +LICENCES/CC-BY-SA-4.0.txt + + +=head1 SEE ALSO + +B<parallel>(1), B<parallel_tutorial>(7), B<env_parallel>(1), +B<parset>(1), B<parsort>(1), B<parallel_alternatives>(7), +B<parallel_design>(7), B<niceload>(1), B<sql>(1), B<ssh>(1), +B<ssh-agent>(1), B<sshpass>(1), B<ssh-copy-id>(1), B<rsync>(1) + +=cut |