#!/usr/bin/perl -w

# SPDX-FileCopyrightText: 2021-2024 Ole Tange, http://ole.tange.dk and Free Software Foundation, Inc.
# SPDX-License-Identifier: GFDL-1.3-or-later
# SPDX-License-Identifier: CC-BY-SA-4.0

=encoding utf8

=head1 Why should you read this book?

If you write shell scripts to do the same processing for different
input, then GNU B<parallel> will make your life easier and make your
scripts run faster.

The book is written so you get the juicy parts first: the goal is that
you read just enough to get going. GNU B<parallel> has an
overwhelming number of special features to help in different
situations; to avoid overloading you with information, the most
used features are presented first.

All the examples are tested in Bash. Most will also work in other
shells, but there are a few exceptions, so we recommend using Bash
while trying out the examples.


=head1 Learn GNU Parallel in 5 minutes

You just need to run commands in parallel. You do not care about fine
tuning.

To get going please run this to make some example files:

  # If your system does not have 'seq', replace 'seq' with 'jot'
  seq 5 | parallel seq {} '>' example.{}

=head2 Input sources

GNU B<parallel> reads values from input sources. One input source is
the command line. The values are put after B<:::> :

  parallel echo ::: 1 2 3 4 5

This makes it easy to run the same program on some files:

  parallel wc ::: example.*

If you give multiple B<:::>s, GNU B<parallel> will generate all
combinations:

  parallel wc ::: -l -c ::: example.*

GNU B<parallel> can also read the values from stdin (standard input):

  seq 5 | parallel echo


=head2 Building the command line

The command line is put before the B<:::>. It can contain a
command and options for the command:

  parallel wc -l ::: example.*

The command can contain multiple programs. Just remember to quote
characters that are interpreted by the shell (such as B<;>):

  parallel echo counting lines';' wc -l ::: example.*

The value will normally be appended to the command, but can be placed
anywhere by using the replacement string B<{}>:

  parallel echo counting {}';' wc -l {} ::: example.*

When using multiple input sources you use the positional replacement
strings B<{1}> and B<{2}>:

  parallel echo count {1} in {2}';' wc {1} {2} ::: -l -c ::: example.*

You can check what will be run with B<--dry-run>:

  parallel --dry-run echo count {1} in {2}';' wc {1} {2} ::: -l -c ::: example.*

It is a good idea to do this for every command until you are
comfortable with GNU B<parallel>.

=head2 Controlling the output

The output will be printed as soon as the command completes. This
means the output may come in a different order than the input:

  parallel sleep {}';' echo {} done ::: 5 4 3 2 1

You can force GNU B<parallel> to print in the order of the values with
B<--keep-order>/B<-k>. This will still run the commands in parallel.
The output of the later jobs is delayed until the output of the
earlier jobs has been printed:

  parallel -k sleep {}';' echo {} done ::: 5 4 3 2 1


=head2 Controlling the execution

If your jobs are compute intensive, you will most likely run one job
for each core in the system. This is the default for GNU B<parallel>.

But sometimes you want more jobs running. You control the number of
job slots with B<-j>. Give B<-j> the number of jobs to run in
parallel:

  parallel -j50 \
    wget https://ftpmirror.gnu.org/parallel/parallel-{1}{2}22.tar.bz2 \
    ::: 2012 2013 2014 2015 2016 \
    ::: 01 02 03 04 05 06 07 08 09 10 11 12


=head2 Pipe mode

GNU B<parallel> can also pass blocks of data to commands on stdin
(standard input):

  seq 1000000 | parallel --pipe wc

This can be used to process big text files. By default GNU B<parallel>
splits on \n (newline) and passes a block of around 1 MB to each job.


=head2 That's it

You have now learned the basic use of GNU B<parallel>. This will
probably cover most of your uses of GNU B<parallel>.

The rest of this document goes into more detail on each of the
sections and covers special use cases.


=head1 Learn GNU Parallel in an hour

In this part we will dive deeper into what you learned in the first 5 minutes.

To get going please run this to make some example files:

  seq 6 > seq6
  seq 6 -1 1 > seq-6

=head2 Input sources

In addition to the command line, input sources can be stdin (standard
input, written as '-'), files, and fifos, and they can be mixed. Files
are given after B<-a> or B<::::>, so these all do the same:

  parallel echo Dice1={1} Dice2={2} ::: 1 2 3 4 5 6 ::: 6 5 4 3 2 1
  parallel echo Dice1={1} Dice2={2} :::: <(seq 6) :::: <(seq 6 -1 1)
  parallel echo Dice1={1} Dice2={2} :::: seq6 seq-6
  parallel echo Dice1={1} Dice2={2} :::: seq6 :::: seq-6
  parallel -a seq6 -a seq-6 echo Dice1={1} Dice2={2}
  parallel -a seq6 echo Dice1={1} Dice2={2} :::: seq-6
  parallel echo Dice1={1} Dice2={2} ::: 1 2 3 4 5 6 :::: seq-6
  cat seq-6 | parallel echo Dice1={1} Dice2={2} :::: seq6 -

If stdin (standard input) is the only input source, you do not need the '-':

  cat seq6 | parallel echo Dice1={1}

=head3 Linking input sources

You can link multiple input sources with B<:::+> and B<::::+>:

  parallel echo {1}={2} ::: I II III IV V VI :::+ 1 2 3 4 5 6
  parallel echo {1}={2} ::: I II III IV V VI ::::+ seq6

The B<:::+> (and B<::::+>) will link each value to the corresponding
value in the previous input source, so value number 3 from the first
input source will be linked to value number 3 from the second input
source.

You can combine B<:::+> and B<:::>, so you link 2 input sources, but
generate all combinations with other input sources:

  parallel echo Dice1={1}={2} Dice2={3}={4} ::: I II III IV V VI ::::+ seq6 \
    ::: VI V IV III II I ::::+ seq-6


=head2 Building the command line

=head3 The command

The command can be a script, a binary or a Bash function if the
function is exported using B<export -f>:

  # Works only in Bash
  my_func() {
    echo in my_func "$1"
  }
  export -f my_func
  parallel my_func ::: 1 2 3

If the command is complex, it often improves readability to make it
into a function.


=head3 The replacement strings

GNU B<parallel> has some replacement strings to make it easier to
refer to the input read from the input sources.

If the input is B<mydir/mysubdir/myfile.myext> then:

  {} = mydir/mysubdir/myfile.myext
  {.} = mydir/mysubdir/myfile
  {/} = myfile.myext
  {//} = mydir/mysubdir
  {/.} = myfile
  {#} = the sequence number of the job
  {%} = the job slot number

When a job is started it gets a sequence number that starts at 1 and
increases by 1 for each new job. The job also gets assigned a slot
number. This number is from 1 to the number of jobs running in
parallel. It is unique between the running jobs, but is re-used as
soon as a job finishes.

=head4 The positional replacement strings

The replacement strings have corresponding positional replacement
strings. If the value from the 3rd input source is
B<mydir/mysubdir/myfile.myext>:

  {3} = mydir/mysubdir/myfile.myext
  {3.} = mydir/mysubdir/myfile
  {3/} = myfile.myext
  {3//} = mydir/mysubdir
  {3/.} = myfile

So the number of the input source is simply prepended inside the {}'s.


=head1 Replacement strings

--plus replacement strings

change the replacement string (-I --extensionreplace --basenamereplace --dirnamereplace --basenameextensionreplace --seqreplace --slotreplace)

--header with named replacement string

{= =}

Dynamic replacement strings

=head2 Defining replacement strings




=head2 Copying environment

env_parallel

=head2 Controlling the output

=head3 parset

B<parset> is a shell function to get the output from GNU B<parallel>
into shell variables.

B<parset> is fully supported for B<Bash/Zsh/Ksh> and partially supported
for B<ash/dash>. I will assume you run B<Bash>.

To activate B<parset> you have to run:

  . `which env_parallel.bash`

(replace B<bash> with your shell's name).

Then you can run:

  parset a,b,c seq ::: 4 5 6
  echo "$c"

or:

  parset 'a b c' seq ::: 4 5 6
  echo "$c"

If you give a single variable, this will become an array:

  parset arr seq ::: 4 5 6
  echo "${arr[1]}"

B<parset> has one limitation: If it reads from a pipe, the output will
be lost.

  echo This will not work | parset myarr echo
  echo Nothing: "${myarr[*]}"

Instead you can do this:

  echo This will work > tempfile
  parset myarr echo < tempfile
  echo "${myarr[*]}"

sql
csv


=head2 Controlling the execution

--dryrun -v

=head2 Remote execution

For this section you must have B<ssh> access with no password to 2
servers: B<$server1> and B<$server2>.

  server1=server.example.com
  server2=server2.example.net

So you must be able to do this:

  ssh $server1 echo works
  ssh $server2 echo works

It can be set up by running 'ssh-keygen -t ed25519; ssh-copy-id $server1'
and using an empty passphrase, or you can use B<ssh-agent>.

=head3 Workers

=head3 --transferfile

B<--transferfile> I<filename> will transfer I<filename> to the
worker. I<filename> can contain a replacement string:

  parallel -S $server1,$server2 --transferfile {} wc ::: example.*
  parallel -S $server1,$server2 --transferfile {2} \
     echo count {1} in {2}';' wc {1} {2} ::: -l -c ::: example.*

A shorthand for B<--transferfile {}> is B<--transfer>.

=head3 --return



=head3 --cleanup

A shorthand for B<--transfer --return {} --cleanup> is B<--trc {}>.


=head2 Pipe mode

--pipepart


=head2 That's it

=head1 Advanced usage

parset fifo, cmd substitution, arrayelements, array with var names and cmds, env_parset


env_parallel

Interfacing with R.

Interfacing with JSON/jq

  4dl() {
    board="$(printf -- '%s' "${1}" | cut -d '/' -f4)"
    thread="$(printf -- '%s' "${1}" | cut -d '/' -f6)"
    wget -qO- "https://a.4cdn.org/${board}/thread/${thread}.json" |
      jq -r '
        .posts
        | map(select(.tim != null))
        | map((.tim | tostring) + .ext)
        | map("https://i.4cdn.org/'"${board}"'/"+.)[]
      ' |
        parallel --gnu -j 0 wget -nv
  }

Interfacing with XML/?

Interfacing with HTML/?

=head2 Controlling the execution

--termseq


=head2 Remote execution

  seq 10 | parallel --sshlogin 'ssh -i "key.pem" a@b.com' echo

  seq 10 | PARALLEL_SSH='ssh -i "key.pem"' parallel --sshlogin a@b.com echo

  seq 10 | parallel --ssh 'ssh -i "key.pem"' --sshlogin a@b.com echo

ssh-agent

The sshlogin file format

Check if servers are up



=cut