summaryrefslogtreecommitdiffstats
path: root/src/parallel_book.pod
diff options
context:
space:
mode:
Diffstat (limited to 'src/parallel_book.pod')
-rw-r--r--src/parallel_book.pod403
1 files changed, 403 insertions, 0 deletions
diff --git a/src/parallel_book.pod b/src/parallel_book.pod
new file mode 100644
index 0000000..2919af7
--- /dev/null
+++ b/src/parallel_book.pod
@@ -0,0 +1,403 @@
+#!/usr/bin/perl -w
+
+# SPDX-FileCopyrightText: 2021-2024 Ole Tange, http://ole.tange.dk and Free Software and Foundation, Inc.
+# SPDX-License-Identifier: GFDL-1.3-or-later
+# SPDX-License-Identifier: CC-BY-SA-4.0
+
+=encoding utf8
+
+=head1 Why should you read this book?
+
+If you write shell scripts to do the same processing for different
+input, then GNU B<parallel> will make your life easier and make your
+scripts run faster.
+
+The book is written so you get the juicy parts first: The goal is that
+you read just enough to get you going. GNU B<parallel> has an
+overwhelming amount of special features to help in different
+situations, and to avoid overloading you with information, the most
+used features are presented first.
+
+All the examples are tested in Bash, and most will work in other
+shells, too, but there are a few exceptions. So you are recommended to
+use Bash while testing out the examples.
+
+
+=head1 Learn GNU Parallel in 5 minutes
+
+You just need to run commands in parallel. You do not care about fine
+tuning.
+
+To get going please run this to make some example files:
+
+ # If your system does not have 'seq', replace 'seq' with 'jot'
+ seq 5 | parallel seq {} '>' example.{}
+
+=head2 Input sources
+
+GNU B<parallel> reads values from input sources. One input source is
+the command line. The values are put after B<:::> :
+
+ parallel echo ::: 1 2 3 4 5
+
+This makes it easy to run the same program on some files:
+
+ parallel wc ::: example.*
+
+If you give multiple B<:::>s, GNU B<parallel> will generate all
+combinations:
+
+ parallel wc ::: -l -c ::: example.*
+
+GNU B<parallel> can also read the values from stdin (standard input):
+
+ seq 5 | parallel echo
+
+
+=head2 Building the command line
+
+The command line is put before the B<:::>. It can contain contain a
+command and options for the command:
+
+ parallel wc -l ::: example.*
+
+The command can contain multiple programs. Just remember to quote
+characters that are interpreted by the shell (such as B<;>):
+
+ parallel echo counting lines';' wc -l ::: example.*
+
+The value will normally be appended to the command, but can be placed
+anywhere by using the replacement string B<{}>:
+
+ parallel echo counting {}';' wc -l {} ::: example.*
+
+When using multiple input sources you use the positional replacement
+strings B<{1}> and B<{2}>:
+
+ parallel echo count {1} in {2}';' wc {1} {2} ::: -l -c ::: example.*
+
+You can check what will be run with B<--dry-run>:
+
+ parallel --dry-run echo count {1} in {2}';' wc {1} {2} ::: -l -c ::: example.*
+
+This is a good idea to do for every command until you are comfortable
+with GNU B<parallel>.
+
+=head2 Controlling the output
+
+The output will be printed as soon as the command completes. This
+means the output may come in a different order than the input:
+
+ parallel sleep {}';' echo {} done ::: 5 4 3 2 1
+
+You can force GNU B<parallel> to print in the order of the values with
+B<--keep-order>/B<-k>. This will still run the commands in parallel.
+The output of the later jobs will be delayed, until the earlier jobs
+are printed:
+
+ parallel -k sleep {}';' echo {} done ::: 5 4 3 2 1
+
+
+=head2 Controlling the execution
+
+If your jobs are compute intensive, you will most likely run one job
+for each core in the system. This is the default for GNU B<parallel>.
+
+But sometimes you want more jobs running. You control the number of
+job slots with B<-j>. Give B<-j> the number of jobs to run in
+parallel:
+
+ parallel -j50 \
+ wget https://ftpmirror.gnu.org/parallel/parallel-{1}{2}22.tar.bz2 \
+ ::: 2012 2013 2014 2015 2016 \
+ ::: 01 02 03 04 05 06 07 08 09 10 11 12
+
+
+=head2 Pipe mode
+
+GNU B<parallel> can also pass blocks of data to commands on stdin
+(standard input):
+
+ seq 1000000 | parallel --pipe wc
+
+This can be used to process big text files. By default GNU B<parallel>
+splits on \n (newline) and passes a block of around 1 MB to each job.
+
+
+=head2 That's it
+
+You have now learned the basic use of GNU B<parallel>. This will
+probably cover most cases of your use of GNU B<parallel>.
+
+The rest of this document will go into more details on each of the
+sections and cover special use cases.
+
+
+=head1 Learn GNU Parallel in an hour
+
+In this part we will dive deeper into what you learned in the first 5 minutes.
+
+To get going please run this to make some example files:
+
+ seq 6 > seq6
+ seq 6 -1 1 > seq-6
+
+=head2 Input sources
+
+On top of the command line, input sources can also be stdin (standard
+input or '-'), files and fifos and they can be mixed. Files are given
+after B<-a> or B<::::>. So these all do the same:
+
+ parallel echo Dice1={1} Dice2={2} ::: 1 2 3 4 5 6 ::: 6 5 4 3 2 1
+ parallel echo Dice1={1} Dice2={2} :::: <(seq 6) :::: <(seq 6 -1 1)
+ parallel echo Dice1={1} Dice2={2} :::: seq6 seq-6
+ parallel echo Dice1={1} Dice2={2} :::: seq6 :::: seq-6
+ parallel -a seq6 -a seq-6 echo Dice1={1} Dice2={2}
+ parallel -a seq6 echo Dice1={1} Dice2={2} :::: seq-6
+ parallel echo Dice1={1} Dice2={2} ::: 1 2 3 4 5 6 :::: seq-6
+ cat seq-6 | parallel echo Dice1={1} Dice2={2} :::: seq6 -
+
+If stdin (standard input) is the only input source, you do not need the '-':
+
+ cat seq6 | parallel echo Dice1={1}
+
+=head3 Linking input sources
+
+You can link multiple input sources with B<:::+> and B<::::+>:
+
+ parallel echo {1}={2} ::: I II III IV V VI :::+ 1 2 3 4 5 6
+ parallel echo {1}={2} ::: I II III IV V VI ::::+ seq6
+
+The B<:::+> (and B<::::+>) will link each value to the corresponding
+value in the previous input source, so value number 3 from the first
+input source will be linked to value number 3 from the second input
+source.
+
+You can combine B<:::+> and B<:::>, so you link 2 input sources, but
+generate all combinations with other input sources:
+
+ parallel echo Dice1={1}={2} Dice2={3}={4} ::: I II III IV V VI ::::+ seq6 \
+ ::: VI V IV III II I ::::+ seq-6
+
+
+=head2 Building the command line
+
+=head3 The command
+
+The command can be a script, a binary or a Bash function if the
+function is exported using B<export -f>:
+
+ # Works only in Bash
+ my_func() {
+ echo in my_func "$1"
+ }
+ export -f my_func
+ parallel my_func ::: 1 2 3
+
+If the command is complex, it often improves readability to make it
+into a function.
+
+
+=head3 The replacement strings
+
+GNU B<parallel> has some replacement strings to make it easier to
+refer to the input read from the input sources.
+
+If the input is B<mydir/mysubdir/myfile.myext> then:
+
+ {} = mydir/mysubdir/myfile.myext
+ {.} = mydir/mysubdir/myfile
+ {/} = myfile.myext
+ {//} = mydir/mysubdir
+ {/.} = myfile
+ {#} = the sequence number of the job
+ {%} = the job slot number
+
+When a job is started it gets a sequence number that starts at 1 and
+increases by 1 for each new job. The job also gets assigned a slot
+number. This number is from 1 to the number of jobs running in
+parallel. It is unique between the running jobs, but is re-used as
+soon as a job finishes.
+
+=head4 The positional replacement strings
+
+The replacement strings have corresponding positional replacement
+strings. If the value from the 3rd input source is
+B<mydir/mysubdir/myfile.myext>:
+
+ {3} = mydir/mysubdir/myfile.myext
+ {3.} = mydir/mysubdir/myfile
+ {3/} = myfile.myext
+ {3//} = mydir/mysubdir
+ {3/.} = myfile
+
+So the number of the input source is simply prepended inside the {}'s.
+
+
+=head1 Replacement strings
+
+--plus replacement strings
+
+change the replacement string (-I --extensionreplace --basenamereplace --basenamereplace --dirnamereplace --basenameextensionreplace --seqreplace --slotreplace
+
+--header with named replacement string
+
+{= =}
+
+Dynamic replacement strings
+
+=head2 Defining replacement strings
+
+
+
+
+=head2 Copying environment
+
+env_parallel
+
+=head2 Controlling the output
+
+=head3 parset
+
+B<parset> is a shell function to get the output from GNU B<parallel>
+into shell variables.
+
+B<parset> is fully supported for B<Bash/Zsh/Ksh> and partially supported
+for B<ash/dash>. I will assume you run B<Bash>.
+
+To activate B<parset> you have to run:
+
+ . `which env_parallel.bash`
+
+(replace B<bash> with your shell's name).
+
+Then you can run:
+
+ parset a,b,c seq ::: 4 5 6
+ echo "$c"
+
+or:
+
+ parset 'a b c' seq ::: 4 5 6
+ echo "$c"
+
+If you give a single variable, this will become an array:
+
+ parset arr seq ::: 4 5 6
+ echo "${arr[1]}"
+
+B<parset> has one limitation: If it reads from a pipe, the output will
+be lost.
+
+ echo This will not work | parset myarr echo
+ echo Nothing: "${myarr[*]}"
+
+Instead you can do this:
+
+ echo This will work > tempfile
+ parset myarr echo < tempfile
+ echo ${myarr[*]}
+
+sql
+cvs
+
+
+=head2 Controlling the execution
+
+--dryrun -v
+
+=head2 Remote execution
+
+For this section you must have B<ssh> access with no password to 2
+servers: B<$server1> and B<$server2>.
+
+ server1=server.example.com
+ server2=server2.example.net
+
+So you must be able to do this:
+
+ ssh $server1 echo works
+ ssh $server2 echo works
+
+It can be setup by running 'ssh-keygen -t dsa; ssh-copy-id $server1'
+and using an empty passphrase. Or you can use B<ssh-agent>.
+
+=head3 Workers
+
+=head3 --transferfile
+
+B<--transferfile> I<filename> will transfer I<filename> to the
+worker. I<filename> can contain a replacement string:
+
+ parallel -S $server1,$server2 --transferfile {} wc ::: example.*
+ parallel -S $server1,$server2 --transferfile {2} \
+ echo count {1} in {2}';' wc {1} {2} ::: -l -c ::: example.*
+
+A shorthand for B<--transferfile {}> is B<--transfer>.
+
+=head3 --return
+
+
+
+=head3 --cleanup
+
+A shorthand for B<--transfer --return {} --cleanup> is B<--trc {}>.
+
+
+=head2 Pipe mode
+
+--pipepart
+
+
+=head2 That's it
+
+=head1 Advanced usage
+
+parset fifo, cmd substitution, arrayelements, array with var names and cmds, env_parset
+
+
+env_parallel
+
+Interfacing with R.
+
+Interfacing with JSON/jq
+
+4dl() {
+ board="$(printf -- '%s' "${1}" | cut -d '/' -f4)"
+ thread="$(printf -- '%s' "${1}" | cut -d '/' -f6)"
+ wget -qO- "https://a.4cdn.org/${board}/thread/${thread}.json" |
+ jq -r '
+ .posts
+ | map(select(.tim != null))
+ | map((.tim | tostring) + .ext)
+ | map("https://i.4cdn.org/'"${board}"'/"+.)[]
+ ' |
+ parallel --gnu -j 0 wget -nv
+}
+
+Interfacing with XML/?
+
+Interfacing with HTML/?
+
+=head2 Controlling the execution
+
+--termseq
+
+
+=head2 Remote execution
+
+seq 10 | parallel --sshlogin 'ssh -i "key.pem" a@b.com' echo
+
+seq 10 | PARALLEL_SSH='ssh -i "key.pem"' parallel --sshlogin a@b.com echo
+
+seq 10 | parallel --ssh 'ssh -i "key.pem"' --sshlogin a@b.com echo
+
+ssh-agent
+
+The sshlogin file format
+
+Check if servers are up
+
+
+
+=cut