Diffstat (limited to 'upstream/fedora-rawhide/man1/perlpacktut.1')
-rw-r--r-- | upstream/fedora-rawhide/man1/perlpacktut.1 | 1413 |
1 files changed, 1413 insertions, 0 deletions
diff --git a/upstream/fedora-rawhide/man1/perlpacktut.1 b/upstream/fedora-rawhide/man1/perlpacktut.1 new file mode 100644 index 00000000..7ae087e5 --- /dev/null +++ b/upstream/fedora-rawhide/man1/perlpacktut.1 @@ -0,0 +1,1413 @@ +.\" -*- mode: troff; coding: utf-8 -*- +.\" Automatically generated by Pod::Man 5.01 (Pod::Simple 3.43) +.\" +.\" Standard preamble: +.\" ======================================================================== +.de Sp \" Vertical space (when we can't use .PP) +.if t .sp .5v +.if n .sp +.. +.de Vb \" Begin verbatim text +.ft CW +.nf +.ne \\$1 +.. +.de Ve \" End verbatim text +.ft R +.fi +.. +.\" \*(C` and \*(C' are quotes in nroff, nothing in troff, for use with C<>. +.ie n \{\ +. ds C` "" +. ds C' "" +'br\} +.el\{\ +. ds C` +. ds C' +'br\} +.\" +.\" Escape single quotes in literal strings from groff's Unicode transform. +.ie \n(.g .ds Aq \(aq +.el .ds Aq ' +.\" +.\" If the F register is >0, we'll generate index entries on stderr for +.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index +.\" entries marked with X<> in POD. Of course, you'll have to process the +.\" output yourself in some meaningful fashion. +.\" +.\" Avoid warning from groff about undefined register 'F'. +.de IX +.. +.nr rF 0 +.if \n(.g .if rF .nr rF 1 +.if (\n(rF:(\n(.g==0)) \{\ +. if \nF \{\ +. de IX +. tm Index:\\$1\t\\n%\t"\\$2" +.. +. if !\nF==2 \{\ +. nr % 0 +. nr F 2 +. \} +. \} +.\} +.rr rF +.\" ======================================================================== +.\" +.IX Title "PERLPACKTUT 1" +.TH PERLPACKTUT 1 2024-01-25 "perl v5.38.2" "Perl Programmers Reference Guide" +.\" For nroff, turn off justification. Always turn off hyphenation; it makes +.\" way too many mistakes in technical documents. 
+.if n .ad l +.nh +.SH NAME +perlpacktut \- tutorial on "pack" and "unpack" +.SH DESCRIPTION +.IX Header "DESCRIPTION" +\&\f(CW\*(C`pack\*(C'\fR and \f(CW\*(C`unpack\*(C'\fR are two functions for transforming data according +to a user-defined template, between the guarded way Perl stores values +and some well-defined representation as might be required in the +environment of a Perl program. Unfortunately, they're also two of +the most misunderstood and most often overlooked functions that Perl +provides. This tutorial will demystify them for you. +.SH "The Basic Principle" +.IX Header "The Basic Principle" +Most programming languages don't shelter the memory where variables are +stored. In C, for instance, you can take the address of some variable, +and the \f(CW\*(C`sizeof\*(C'\fR operator tells you how many bytes are allocated to +the variable. Using the address and the size, you may access the storage +to your heart's content. +.PP +In Perl, you just can't access memory at random, but the structural and +representational conversion provided by \f(CW\*(C`pack\*(C'\fR and \f(CW\*(C`unpack\*(C'\fR is an +excellent alternative. The \f(CW\*(C`pack\*(C'\fR function converts values to a byte +sequence containing representations according to a given specification, +the so-called "template" argument. \f(CW\*(C`unpack\*(C'\fR is the reverse process, +deriving some values from the contents of a string of bytes. (Be cautioned, +however, that not all that has been packed together can be neatly unpacked \- +a very common experience as seasoned travellers are likely to confirm.) +.PP +Why, you may ask, would you need a chunk of memory containing some values +in binary representation? One good reason is input and output accessing +some file, a device, or a network connection, whereby this binary +representation is either forced on you or will give you some benefit +in processing. 
Another cause is passing data to some system call that +is not available as a Perl function: \f(CW\*(C`syscall\*(C'\fR requires you to provide +parameters stored in the way it happens in a C program. Even text processing +(as shown in the next section) may be simplified with judicious usage +of these two functions. +.PP +To see how (un)packing works, we'll start with a simple template +code where the conversion is in low gear: between the contents of a byte +sequence and a string of hexadecimal digits. Let's use \f(CW\*(C`unpack\*(C'\fR, since +this is likely to remind you of a dump program, or some desperate last +message unfortunate programs are wont to throw at you before they expire +into the wild blue yonder. Assuming that the variable \f(CW$mem\fR holds a +sequence of bytes that we'd like to inspect without assuming anything +about its meaning, we can write +.PP +.Vb 2 +\& my( $hex ) = unpack( \*(AqH*\*(Aq, $mem ); +\& print "$hex\en"; +.Ve +.PP +whereupon we might see something like this, with each pair of hex digits +corresponding to a byte: +.PP +.Vb 1 +\& 41204d414e204120504c414e20412043414e414c2050414e414d41 +.Ve +.PP +What was in this chunk of memory? Numbers, characters, or a mixture of +both? Assuming that we're on a computer where ASCII (or some similar) +encoding is used: hexadecimal values in the range \f(CW0x41\fR \- \f(CW0x5A\fR +indicate an uppercase letter, and \f(CW0x20\fR encodes a space. So we might +assume it is a piece of text, which some are able to read like a tabloid; +but others will have to get hold of an ASCII table and relive that +first-grader feeling. Not caring too much about which way to read this, +we note that \f(CW\*(C`unpack\*(C'\fR with the template code \f(CW\*(C`H\*(C'\fR converts the contents +of a sequence of bytes into the customary hexadecimal notation. 
Since +"a sequence of" is a pretty vague indication of quantity, \f(CW\*(C`H\*(C'\fR has been +defined to convert just a single hexadecimal digit unless it is followed +by a repeat count. An asterisk for the repeat count means to use whatever +remains. +.PP +The inverse operation \- packing byte contents from a string of hexadecimal +digits \- is just as easily written. For instance: +.PP +.Vb 2 +\& my $s = pack( \*(AqH2\*(Aq x 10, 30..39 ); +\& print "$s\en"; +.Ve +.PP +Since we feed a list of ten 2\-digit hexadecimal strings to \f(CW\*(C`pack\*(C'\fR, the +pack template should contain ten pack codes. If this is run on a computer +with ASCII character coding, it will print \f(CW0123456789\fR. +.SH "Packing Text" +.IX Header "Packing Text" +Let's suppose you've got to read in a data file like this: +.PP +.Vb 4 +\& Date |Description | Income|Expenditure +\& 01/24/2001 Zed\*(Aqs Camel Emporium 1147.99 +\& 01/28/2001 Flea spray 24.99 +\& 01/29/2001 Camel rides to tourists 235.00 +.Ve +.PP +How do we do it? You might think first to use \f(CW\*(C`split\*(C'\fR; however, since +\&\f(CW\*(C`split\*(C'\fR collapses blank fields, you'll never know whether a record was +income or expenditure. Oops. Well, you could always use \f(CW\*(C`substr\*(C'\fR: +.PP +.Vb 7 +\& while (<>) { +\& my $date = substr($_, 0, 11); +\& my $desc = substr($_, 12, 27); +\& my $income = substr($_, 40, 7); +\& my $expend = substr($_, 52, 7); +\& ... +\& } +.Ve +.PP +It's not really a barrel of laughs, is it? In fact, it's worse than it +may seem; the eagle-eyed may notice that the first field should only be +10 characters wide, and the error has propagated right through the other +numbers \- which we've had to count by hand. So it's error-prone as well +as horribly unfriendly. +.PP +Or maybe we could use regular expressions: +.PP +.Vb 5 +\& while (<>) { +\& my($date, $desc, $income, $expend) = +\& m|(\ed\ed/\ed\ed/\ed{4}) (.{27}) (.{7})(.*)|; +\& ... +\& } +.Ve +.PP +Urgh. 
Well, it's a bit better, but \- well, would you want to maintain +that? +.PP +Hey, isn't Perl supposed to make this sort of thing easy? Well, it does, +if you use the right tools. \f(CW\*(C`pack\*(C'\fR and \f(CW\*(C`unpack\*(C'\fR are designed to help +you out when dealing with fixed-width data like the above. Let's have a +look at a solution with \f(CW\*(C`unpack\*(C'\fR: +.PP +.Vb 4 +\& while (<>) { +\& my($date, $desc, $income, $expend) = unpack("A10xA27xA7A*", $_); +\& ... +\& } +.Ve +.PP +That looks a bit nicer; but we've got to take apart that weird template. +Where did I pull that out of? +.PP +OK, let's have a look at some of our data again; in fact, we'll include +the headers, and a handy ruler so we can keep track of where we are. +.PP +.Vb 5 +\& 1 2 3 4 5 +\& 1234567890123456789012345678901234567890123456789012345678 +\& Date |Description | Income|Expenditure +\& 01/28/2001 Flea spray 24.99 +\& 01/29/2001 Camel rides to tourists 235.00 +.Ve +.PP +From this, we can see that the date column stretches from column 1 to +column 10 \- ten characters wide. The \f(CW\*(C`pack\*(C'\fR\-ese for "character" is +\&\f(CW\*(C`A\*(C'\fR, and ten of them are \f(CW\*(C`A10\*(C'\fR. So if we just wanted to extract the +dates, we could say this: +.PP +.Vb 1 +\& my($date) = unpack("A10", $_); +.Ve +.PP +OK, what's next? Between the date and the description is a blank column; +we want to skip over that. The \f(CW\*(C`x\*(C'\fR template means "skip forward", so we +want one of those. Next, we have another batch of characters, from 12 to +38. That's 27 more characters, hence \f(CW\*(C`A27\*(C'\fR. (Don't make the fencepost +error \- there are 27 characters between 12 and 38, not 26. Count 'em!) +.PP +Now we skip another character and pick up the next 7 characters: +.PP +.Vb 1 +\& my($date,$description,$income) = unpack("A10xA27xA7", $_); +.Ve +.PP +Now comes the clever bit. Lines in our ledger which are just income and +not expenditure might end at column 46. 
Hence, we don't want to tell our +\&\f(CW\*(C`unpack\*(C'\fR pattern that we \fBneed\fR to find another 12 characters; we'll +just say "if there's anything left, take it". As you might guess from +regular expressions, that's what the \f(CW\*(C`*\*(C'\fR means: "use everything +remaining". +.IP \(bu 3 +Be warned, though, that unlike regular expressions, if the \f(CW\*(C`unpack\*(C'\fR +template doesn't match the incoming data, Perl will scream and die. +.PP +Hence, putting it all together: +.PP +.Vb 2 +\& my ($date, $description, $income, $expend) = +\& unpack("A10xA27xA7xA*", $_); +.Ve +.PP +Now, that's our data parsed. I suppose what we might want to do now is +total up our income and expenditure, and add another line to the end of +our ledger \- in the same format \- saying how much we've brought in and +how much we've spent: +.PP +.Vb 6 +\& while (<>) { +\& my ($date, $desc, $income, $expend) = +\& unpack("A10xA27xA7xA*", $_); +\& $tot_income += $income; +\& $tot_expend += $expend; +\& } +\& +\& $tot_income = sprintf("%.2f", $tot_income); # Get them into +\& $tot_expend = sprintf("%.2f", $tot_expend); # "financial" format +\& +\& $date = POSIX::strftime("%m/%d/%Y", localtime); +\& +\& # OK, let\*(Aqs go: +\& +\& print pack("A10xA27xA7xA*", $date, "Totals", +\& $tot_income, $tot_expend); +.Ve +.PP +Oh, hmm. That didn't quite work. Let's see what happened: +.PP +.Vb 4 +\& 01/24/2001 Zed\*(Aqs Camel Emporium 1147.99 +\& 01/28/2001 Flea spray 24.99 +\& 01/29/2001 Camel rides to tourists 1235.00 +\& 03/23/2001Totals 1235.001172.98 +.Ve +.PP +OK, it's a start, but what happened to the spaces? We put \f(CW\*(C`x\*(C'\fR, didn't +we? Shouldn't it skip forward? Let's look at what "pack" in perlfunc says: +.PP +.Vb 1 +\& x A null byte. +.Ve +.PP +Urgh. No wonder. There's a big difference between "a null byte", +character zero, and "a space", character 32. Perl's put something +between the date and the description \- but unfortunately, we can't see +it! 
+.PP +What we actually need to do is expand the width of the fields. The \f(CW\*(C`A\*(C'\fR +format pads any non-existent characters with spaces, so we can use the +additional spaces to line up our fields, like this: +.PP +.Vb 2 +\& print pack("A11 A28 A8 A*", $date, "Totals", +\& $tot_income, $tot_expend); +.Ve +.PP +(Note that you can put spaces in the template to make it more readable, +but they don't translate to spaces in the output.) Here's what we got +this time: +.PP +.Vb 4 +\& 01/24/2001 Zed\*(Aqs Camel Emporium 1147.99 +\& 01/28/2001 Flea spray 24.99 +\& 01/29/2001 Camel rides to tourists 1235.00 +\& 03/23/2001 Totals 1235.00 1172.98 +.Ve +.PP +That's a bit better, but we still have that last column which needs to +be moved further over. There's an easy way to fix this up: +unfortunately, we can't get \f(CW\*(C`pack\*(C'\fR to right-justify our fields, but we +can get \f(CW\*(C`sprintf\*(C'\fR to do it: +.PP +.Vb 5 +\& $tot_income = sprintf("%.2f", $tot_income); +\& $tot_expend = sprintf("%12.2f", $tot_expend); +\& $date = POSIX::strftime("%m/%d/%Y", localtime); +\& print pack("A11 A28 A8 A*", $date, "Totals", +\& $tot_income, $tot_expend); +.Ve +.PP +This time we get the right answer: +.PP +.Vb 3 +\& 01/28/2001 Flea spray 24.99 +\& 01/29/2001 Camel rides to tourists 1235.00 +\& 03/23/2001 Totals 1235.00 1172.98 +.Ve +.PP +So that's how we consume and produce fixed-width data. Let's recap what +we've seen of \f(CW\*(C`pack\*(C'\fR and \f(CW\*(C`unpack\*(C'\fR so far: +.IP \(bu 3 +Use \f(CW\*(C`pack\*(C'\fR to go from several pieces of data to one fixed-width +version; use \f(CW\*(C`unpack\*(C'\fR to turn a fixed-width-format string into several +pieces of data. +.IP \(bu 3 +The pack format \f(CW\*(C`A\*(C'\fR means "any character"; if you're \f(CW\*(C`pack\*(C'\fRing and +you've run out of things to pack, \f(CW\*(C`pack\*(C'\fR will fill the rest up with +spaces. 
+.IP \(bu 3 +\&\f(CW\*(C`x\*(C'\fR means "skip a byte" when \f(CW\*(C`unpack\*(C'\fRing; when \f(CW\*(C`pack\*(C'\fRing, it means +"introduce a null byte" \- that's probably not what you mean if you're +dealing with plain text. +.IP \(bu 3 +You can follow the formats with numbers to say how many characters +should be affected by that format: \f(CW\*(C`A12\*(C'\fR means "take 12 characters"; +\&\f(CW\*(C`x6\*(C'\fR means "skip 6 bytes" or "character 0, 6 times". +.IP \(bu 3 +Instead of a number, you can use \f(CW\*(C`*\*(C'\fR to mean "consume everything else +left". +.Sp +\&\fBWarning\fR: when packing multiple pieces of data, \f(CW\*(C`*\*(C'\fR only means +"consume all of the current piece of data". That's to say +.Sp +.Vb 1 +\& pack("A*A*", $one, $two) +.Ve +.Sp +packs all of \f(CW$one\fR into the first \f(CW\*(C`A*\*(C'\fR and then all of \f(CW$two\fR into +the second. This is a general principle: each format character +corresponds to one piece of data to be \f(CW\*(C`pack\*(C'\fRed. +.SH "Packing Numbers" +.IX Header "Packing Numbers" +So much for textual data. Let's get onto the meaty stuff that \f(CW\*(C`pack\*(C'\fR +and \f(CW\*(C`unpack\*(C'\fR are best at: handling binary formats for numbers. There is, +of course, not just one binary format \- life would be too simple \- but +Perl will do all the finicky labor for you. +.SS Integers +.IX Subsection "Integers" +Packing and unpacking numbers implies conversion to and from some +\&\fIspecific\fR binary representation. Leaving floating point numbers +aside for the moment, the salient properties of any such representation +are: +.IP \(bu 4 +the number of bytes used for storing the integer, +.IP \(bu 4 +whether the contents are interpreted as a signed or unsigned number, +.IP \(bu 4 +the byte ordering: whether the first byte is the least or most +significant byte (or: little-endian or big-endian, respectively). 
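As a minimal sketch of how these three properties show up in the packed bytes, the example below packs one value with several codes and dumps each result with H*. It borrows the codes v, n, N and the > modifier, which this tutorial only introduces in later sections; the variable names are illustrative, and s> assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

my $value = 20302;                      # 0x4F4E

# Size and byte order are chosen by the template code.
my $le16 = pack( 'v', $value );         # unsigned 16-bit, little-endian
my $be16 = pack( 'n', $value );         # unsigned 16-bit, big-endian
my $be32 = pack( 'N', $value );         # unsigned 32-bit, big-endian

printf "%-2s %s\n", 'v', unpack( 'H*', $le16 );   # 4e4f
printf "%-2s %s\n", 'n', unpack( 'H*', $be16 );   # 4f4e
printf "%-2s %s\n", 'N', unpack( 'H*', $be32 );   # 00004f4e

# Signedness is a matter of interpretation: the same two bytes read two ways.
my $bytes = pack( 'n', 0xFFFE );
my ($unsigned) = unpack( 'n',  $bytes );   # 65534
my ($signed)   = unpack( 's>', $bytes );   # -2
print "unsigned: $unsigned, signed: $signed\n";
```

Dumping with H* rather than printing the packed string directly keeps the output readable regardless of the terminal's character encoding.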
+.PP +So, for instance, to pack 20302 to a signed 16 bit integer in your +computer's representation you write +.PP +.Vb 1 +\& my $ps = pack( \*(Aqs\*(Aq, 20302 ); +.Ve +.PP +Again, the result is a string, now containing 2 bytes. If you print +this string (which is, generally, not recommended) you might see +\&\f(CW\*(C`ON\*(C'\fR or \f(CW\*(C`NO\*(C'\fR (depending on your system's byte ordering) \- or something +entirely different if your computer doesn't use ASCII character encoding. +Unpacking \f(CW$ps\fR with the same template returns the original integer value: +.PP +.Vb 1 +\& my( $s ) = unpack( \*(Aqs\*(Aq, $ps ); +.Ve +.PP +This is true for all numeric template codes. But don't expect miracles: +if the packed value exceeds the allotted byte capacity, high order bits +are silently discarded, and unpack certainly won't be able to pull them +back out of some magic hat. And, when you pack using a signed template +code such as \f(CW\*(C`s\*(C'\fR, an excess value may result in the sign bit +getting set, and unpacking this will smartly return a negative value. +.PP +16 bits won't get you too far with integers, but there is \f(CW\*(C`l\*(C'\fR and \f(CW\*(C`L\*(C'\fR +for signed and unsigned 32\-bit integers. And if this is not enough and +your system supports 64 bit integers you can push the limits much closer +to infinity with pack codes \f(CW\*(C`q\*(C'\fR and \f(CW\*(C`Q\*(C'\fR. A notable exception is provided +by pack codes \f(CW\*(C`i\*(C'\fR and \f(CW\*(C`I\*(C'\fR for signed and unsigned integers of the +"local custom" variety: Such an integer will take up as many bytes as +a local C compiler returns for \f(CWsizeof(int)\fR, but it'll use \fIat least\fR +32 bits. +.PP +Each of the integer pack codes \f(CW\*(C`sSlLqQ\*(C'\fR results in a fixed number of bytes, +no matter where you execute your program. 
This may be useful for some +applications, but it does not provide for a portable way to pass data +structures between Perl and C programs (bound to happen when you call +XS extensions or the Perl function \f(CW\*(C`syscall\*(C'\fR), or when you read or +write binary files. What you'll need in this case are template codes that +depend on what your local C compiler compiles when you code \f(CW\*(C`short\*(C'\fR or +\&\f(CW\*(C`unsigned long\*(C'\fR, for instance. These codes and their corresponding +byte lengths are shown in the table below. Since the C standard leaves +much leeway with respect to the relative sizes of these data types, actual +values may vary, and that's why the values are given as expressions in +C and Perl. (If you'd like to use values from \f(CW%Config\fR in your program +you have to import it with \f(CW\*(C`use Config\*(C'\fR.) +.PP +.Vb 5 +\& signed unsigned byte length in C byte length in Perl +\& s! S! sizeof(short) $Config{shortsize} +\& i! I! sizeof(int) $Config{intsize} +\& l! L! sizeof(long) $Config{longsize} +\& q! Q! sizeof(long long) $Config{longlongsize} +.Ve +.PP +The \f(CW\*(C`i!\*(C'\fR and \f(CW\*(C`I!\*(C'\fR codes aren't different from \f(CW\*(C`i\*(C'\fR and \f(CW\*(C`I\*(C'\fR; they are +tolerated for completeness' sake. +.SS "Unpacking a Stack Frame" +.IX Subsection "Unpacking a Stack Frame" +Requesting a particular byte ordering may be necessary when you work with +binary data coming from some specific architecture whereas your program could +run on a totally different system. 
As an example, assume you have 24 bytes +containing a stack frame as it happens on an Intel 8086: +.PP +.Vb 11 +\& +\-\-\-\-\-\-\-\-\-+ +\-\-\-\-+\-\-\-\-+ +\-\-\-\-\-\-\-\-\-+ +\& TOS: | IP | TOS+4:| FL | FH | FLAGS TOS+14:| SI | +\& +\-\-\-\-\-\-\-\-\-+ +\-\-\-\-+\-\-\-\-+ +\-\-\-\-\-\-\-\-\-+ +\& | CS | | AL | AH | AX | DI | +\& +\-\-\-\-\-\-\-\-\-+ +\-\-\-\-+\-\-\-\-+ +\-\-\-\-\-\-\-\-\-+ +\& | BL | BH | BX | BP | +\& +\-\-\-\-+\-\-\-\-+ +\-\-\-\-\-\-\-\-\-+ +\& | CL | CH | CX | DS | +\& +\-\-\-\-+\-\-\-\-+ +\-\-\-\-\-\-\-\-\-+ +\& | DL | DH | DX | ES | +\& +\-\-\-\-+\-\-\-\-+ +\-\-\-\-\-\-\-\-\-+ +.Ve +.PP +First, we note that this time-honored 16\-bit CPU uses little-endian order, +and that's why the low order byte is stored at the lower address. To +unpack such a (unsigned) short we'll have to use code \f(CW\*(C`v\*(C'\fR. A repeat +count unpacks all 12 shorts: +.PP +.Vb 2 +\& my( $ip, $cs, $flags, $ax, $bx, $cx, $dx, $si, $di, $bp, $ds, $es ) = +\& unpack( \*(Aqv12\*(Aq, $frame ); +.Ve +.PP +Alternatively, we could have used \f(CW\*(C`C\*(C'\fR to unpack the individually +accessible byte registers FL, FH, AL, AH, etc.: +.PP +.Vb 2 +\& my( $fl, $fh, $al, $ah, $bl, $bh, $cl, $ch, $dl, $dh ) = +\& unpack( \*(AqC10\*(Aq, substr( $frame, 4, 10 ) ); +.Ve +.PP +It would be nice if we could do this in one fell swoop: unpack a short, +back up a little, and then unpack 2 bytes. Since Perl \fIis\fR nice, it +proffers the template code \f(CW\*(C`X\*(C'\fR to back up one byte. Putting this all +together, we may now write: +.PP +.Vb 5 +\& my( $ip, $cs, +\& $flags,$fl,$fh, +\& $ax,$al,$ah, $bx,$bl,$bh, $cx,$cl,$ch, $dx,$dl,$dh, +\& $si, $di, $bp, $ds, $es ) = +\& unpack( \*(Aqv2\*(Aq . (\*(AqvXXCC\*(Aq x 5) . \*(Aqv5\*(Aq, $frame ); +.Ve +.PP +(The clumsy construction of the template can be avoided \- just read on!) +.PP +We've taken some pains to construct the template so that it matches +the contents of our frame buffer. 
Otherwise we'd either get undefined values, +or \f(CW\*(C`unpack\*(C'\fR could not unpack all. If \f(CW\*(C`pack\*(C'\fR runs out of items, it will +supply null strings (which are coerced into zeroes whenever the pack code +says so). +.SS "How to Eat an Egg on a Net" +.IX Subsection "How to Eat an Egg on a Net" +The pack code for big-endian (high order byte at the lowest address) is +\&\f(CW\*(C`n\*(C'\fR for 16 bit and \f(CW\*(C`N\*(C'\fR for 32 bit integers. You use these codes +if you know that your data comes from a compliant architecture, but, +surprisingly enough, you should also use these pack codes if you +exchange binary data, across the network, with some system that you +know next to nothing about. The simple reason is that this +order has been chosen as the \fInetwork order\fR, and all standard-fearing +programs ought to follow this convention. (This is, of course, a stern +backing for one of the Lilliputian parties and may well influence the +political development there.) So, if the protocol expects you to send +a message by sending the length first, followed by just so many bytes, +you could write: +.PP +.Vb 1 +\& my $buf = pack( \*(AqN\*(Aq, length( $msg ) ) . $msg; +.Ve +.PP +or even: +.PP +.Vb 1 +\& my $buf = pack( \*(AqNA*\*(Aq, length( $msg ), $msg ); +.Ve +.PP +and pass \f(CW$buf\fR to your send routine. Some protocols demand that the +count should include the length of the count itself: then just add 4 +to the data length. (But make sure to read "Lengths and Widths" before +you really code this!) +.SS "Byte-order modifiers" +.IX Subsection "Byte-order modifiers" +In the previous sections we've learned how to use \f(CW\*(C`n\*(C'\fR, \f(CW\*(C`N\*(C'\fR, \f(CW\*(C`v\*(C'\fR and +\&\f(CW\*(C`V\*(C'\fR to pack and unpack integers with big\- or little-endian byte-order. +While this is nice, it's still rather limited because it leaves out all +kinds of signed integers as well as 64\-bit integers. 
For example, if you +wanted to unpack a sequence of signed big-endian 16\-bit integers in a +platform-independent way, you would have to write: +.PP +.Vb 1 +\& my @data = unpack \*(Aqs*\*(Aq, pack \*(AqS*\*(Aq, unpack \*(Aqn*\*(Aq, $buf; +.Ve +.PP +This is ugly. As of Perl 5.9.2, there's a much nicer way to express your +desire for a certain byte-order: the \f(CW\*(C`>\*(C'\fR and \f(CW\*(C`<\*(C'\fR modifiers. +\&\f(CW\*(C`>\*(C'\fR is the big-endian modifier, while \f(CW\*(C`<\*(C'\fR is the little-endian +modifier. Using them, we could rewrite the above code as: +.PP +.Vb 1 +\& my @data = unpack \*(Aqs>*\*(Aq, $buf; +.Ve +.PP +As you can see, the "big end" of the arrow touches the \f(CW\*(C`s\*(C'\fR, which is a +nice way to remember that \f(CW\*(C`>\*(C'\fR is the big-endian modifier. The same +obviously works for \f(CW\*(C`<\*(C'\fR, where the "little end" touches the code. +.PP +You will probably find these modifiers even more useful if you have +to deal with big\- or little-endian C structures. Be sure to read +"Packing and Unpacking C Structures" for more on that. +.SS "Floating point Numbers" +.IX Subsection "Floating point Numbers" +For packing floating point numbers you have the choice between the +pack codes \f(CW\*(C`f\*(C'\fR, \f(CW\*(C`d\*(C'\fR, \f(CW\*(C`F\*(C'\fR and \f(CW\*(C`D\*(C'\fR. \f(CW\*(C`f\*(C'\fR and \f(CW\*(C`d\*(C'\fR pack into (or unpack +from) single-precision or double-precision representation as it is provided +by your system. If your system supports it, \f(CW\*(C`D\*(C'\fR can be used to pack and +unpack (\f(CW\*(C`long double\*(C'\fR) values, which can offer even more resolution +than \f(CW\*(C`f\*(C'\fR or \f(CW\*(C`d\*(C'\fR. \fBNote that there are different long double formats.\fR +.PP +\&\f(CW\*(C`F\*(C'\fR packs an \f(CW\*(C`NV\*(C'\fR, which is the floating point type used by Perl +internally. 
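To make the difference between these codes concrete, here is a small sketch that sends one value through f, d and F and checks whether it survives the round trip. The variable names are illustrative. The sketch assumes a perl whose NV is a C double with IEEE 754 formats, which is common but not guaranteed; on a perl built with long doubles the d round trip may be lossy as well.

```perl
use strict;
use warnings;

my $x = 0.1;

# Round-trip the same value through each floating point code.
my ($single) = unpack( 'f', pack( 'f', $x ) );   # C float
my ($double) = unpack( 'd', pack( 'd', $x ) );   # C double
my ($native) = unpack( 'F', pack( 'F', $x ) );   # Perl's own NV

for ( [ f => $single ], [ d => $double ], [ F => $native ] ) {
    my ( $code, $back ) = @$_;
    printf "%s: %2d bytes, round trip %s\n",
        $code,
        length( pack( $code, $x ) ),
        ( $back == $x ? 'exact' : 'lossy' );
}
```

Because F stores the NV byte for byte, its round trip is exact on the machine that packed it. That says nothing about portability to another machine.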
+.PP +There is no such thing as a network representation for reals, so if +you want to send your real numbers across computer boundaries, you'd +better stick to text representation, possibly using the hexadecimal +float format (avoiding the decimal conversion loss), unless you're +absolutely sure what's on the other end of the line. For the even more +adventuresome, you can use the byte-order modifiers from the previous +section also on floating point codes. +.SH "Exotic Templates" +.IX Header "Exotic Templates" +.SS "Bit Strings" +.IX Subsection "Bit Strings" +Bits are the atoms in the memory world. Access to individual bits may +have to be used either as a last resort or because it is the most +convenient way to handle your data. Bit string (un)packing converts +between strings containing a series of \f(CW0\fR and \f(CW1\fR characters and +a sequence of bytes each containing a group of 8 bits. This is almost +as simple as it sounds, except that there are two ways the contents of +a byte may be written as a bit string. Let's have a look at an annotated +byte: +.PP +.Vb 5 +\& 7 6 5 4 3 2 1 0 +\& +\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+ +\& | 1 0 0 0 1 1 0 0 | +\& +\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+ +\& MSB LSB +.Ve +.PP +It's egg-eating all over again: Some think that as a bit string this should +be written "10001100" i.e. beginning with the most significant bit, others +insist on "00110001". Well, Perl isn't biased, so that's why we have two bit +string codes: +.PP +.Vb 2 +\& $byte = pack( \*(AqB8\*(Aq, \*(Aq10001100\*(Aq ); # start with MSB +\& $byte = pack( \*(Aqb8\*(Aq, \*(Aq00110001\*(Aq ); # start with LSB +.Ve +.PP +It is not possible to pack or unpack bit fields \- just integral bytes. +\&\f(CW\*(C`pack\*(C'\fR always starts at the next byte boundary and "rounds up" to the +next multiple of 8 by adding zero bits as required. (If you do want bit +fields, there is "vec" in perlfunc. 
Or you could implement bit field +handling at the character string level, using split, substr, and +concatenation on unpacked bit strings.) +.PP +To illustrate unpacking for bit strings, we'll decompose a simple +status register (a "\-" stands for a "reserved" bit): +.PP +.Vb 4 +\& +\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+ +\& | S Z \- A \- P \- C | \- \- \- \- O D I T | +\& +\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-+ +\& MSB LSB MSB LSB +.Ve +.PP +Converting these two bytes to a string can be done with the unpack +template \f(CW\*(Aqb16\*(Aq\fR. To obtain the individual bit values from the bit +string we use \f(CW\*(C`split\*(C'\fR with the "empty" separator pattern which dissects +into individual characters. Bit values from the "reserved" positions are +simply assigned to \f(CW\*(C`undef\*(C'\fR, a convenient notation for "I don't care where +this goes". +.PP +.Vb 3 +\& ($carry, undef, $parity, undef, $auxcarry, undef, $zero, $sign, +\& $trace, $interrupt, $direction, $overflow) = +\& split( //, unpack( \*(Aqb16\*(Aq, $status ) ); +.Ve +.PP +We could have used an unpack template \f(CW\*(Aqb12\*(Aq\fR just as well, since the +last 4 bits can be ignored anyway. +.SS Uuencoding +.IX Subsection "Uuencoding" +Another odd-man-out in the template alphabet is \f(CW\*(C`u\*(C'\fR, which packs a +"uuencoded string". ("uu" is short for Unix-to-Unix.) Chances are that +you won't ever need this encoding technique which was invented to overcome +the shortcomings of old-fashioned transmission mediums that do not support +other than simple ASCII data. The essential recipe is simple: Take three +bytes, or 24 bits. Split them into 4 six-packs, adding a space (0x20) to +each. Repeat until all of the data is blended. Fold groups of 4 bytes into +lines no longer than 60 and garnish them in front with the original byte count +(incremented by 0x20) and a \f(CW"\en"\fR at the end. 
\- The \f(CW\*(C`pack\*(C'\fR chef will +prepare this for you, a la minute, when you select pack code \f(CW\*(C`u\*(C'\fR on the menu: +.PP +.Vb 1 +\& my $uubuf = pack( \*(Aqu\*(Aq, $bindat ); +.Ve +.PP +A repeat count after \f(CW\*(C`u\*(C'\fR sets the number of bytes to put into an +uuencoded line, which is the maximum of 45 by default, but could be +set to some (smaller) integer multiple of three. \f(CW\*(C`unpack\*(C'\fR simply ignores +the repeat count. +.SS "Doing Sums" +.IX Subsection "Doing Sums" +An even stranger template code is \f(CW\*(C`%\*(C'\fR<\fInumber\fR>. First, because +it's used as a prefix to some other template code. Second, because it +cannot be used in \f(CW\*(C`pack\*(C'\fR at all, and third, in \f(CW\*(C`unpack\*(C'\fR, doesn't return the +data as defined by the template code it precedes. Instead it'll give you an +integer of \fInumber\fR bits that is computed from the data value by +doing sums. For numeric unpack codes, no big feat is achieved: +.PP +.Vb 2 +\& my $buf = pack( \*(Aqiii\*(Aq, 100, 20, 3 ); +\& print unpack( \*(Aq%32i3\*(Aq, $buf ), "\en"; # prints 123 +.Ve +.PP +For string values, \f(CW\*(C`%\*(C'\fR returns the sum of the byte values saving +you the trouble of a sum loop with \f(CW\*(C`substr\*(C'\fR and \f(CW\*(C`ord\*(C'\fR: +.PP +.Vb 1 +\& print unpack( \*(Aq%32A*\*(Aq, "\ex01\ex10" ), "\en"; # prints 17 +.Ve +.PP +Although the \f(CW\*(C`%\*(C'\fR code is documented as returning a "checksum": +don't put your trust in such values! Even when applied to a small number +of bytes, they won't guarantee a noticeable Hamming distance. 
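The weakness of these sums is easy to demonstrate. The sketch below is only an illustration with made-up input strings: it shows that reordering bytes leaves the sum unchanged, and that a narrow bit width keeps only the low bits of the total.

```perl
use strict;
use warnings;

# %32C* adds up the byte values, so the order of the bytes is irrelevant.
my $sum_ab = unpack( '%32C*', "ab" );       # 97 + 98 = 195
my $sum_ba = unpack( '%32C*', "ba" );       # 98 + 97 = 195
print "ab: $sum_ab, ba: $sum_ba\n";

# With a width of 8 bits only the low byte of the sum is kept.
my $sum8 = unpack( '%8C*', "\xff\x02" );    # (255 + 2) % 256 = 1
print "8-bit sum: $sum8\n";
```

Two transposed bytes, a very common kind of corruption, therefore go undetected. When integrity matters, a real digest such as the one provided by the core Digest::SHA module is the safer choice.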
+.PP +In connection with \f(CW\*(C`b\*(C'\fR or \f(CW\*(C`B\*(C'\fR, \f(CW\*(C`%\*(C'\fR simply adds bits, and this can be put +to good use to count set bits efficiently: +.PP +.Vb 1 +\& my $bitcount = unpack( \*(Aq%32b*\*(Aq, $mask ); +.Ve +.PP +And an even parity bit can be determined like this: +.PP +.Vb 1 +\& my $evenparity = unpack( \*(Aq%1b*\*(Aq, $mask ); +.Ve +.SS Unicode +.IX Subsection "Unicode" +Unicode is a character set that can represent most characters in most of +the world's languages, providing room for over one million different +characters. Unicode 3.1 specifies 94,140 characters: The Basic Latin +characters are assigned to the numbers 0 \- 127. The Latin\-1 Supplement with +characters that are used in several European languages is in the next +range, up to 255. After some more Latin extensions we find the character +sets from languages using non-Roman alphabets, interspersed with a +variety of symbol sets such as currency symbols, Zapf Dingbats or Braille. +(You might want to visit <https://www.unicode.org/> for a look at some of +them \- my personal favourites are Telugu and Kannada.) +.PP +The Unicode character set associates characters with integers. Encoding +these numbers in an equal number of bytes would more than double the +requirements for storing texts written in Latin alphabets. +The UTF\-8 encoding avoids this by storing the most common (from a western +point of view) characters in a single byte while encoding the rarer +ones in two or more bytes. +.PP +Perl uses UTF\-8, internally, for most Unicode strings. +.PP +So what has this got to do with \f(CW\*(C`pack\*(C'\fR? Well, if you want to compose a +Unicode string (that is internally encoded as UTF\-8), you can do so by +using template code \f(CW\*(C`U\*(C'\fR. 
As an example, let's produce the Euro currency +symbol (code number 0x20AC): +.PP +.Vb 2 +\& $UTF8{Euro} = pack( \*(AqU\*(Aq, 0x20AC ); +\& # Equivalent to: $UTF8{Euro} = "\ex{20ac}"; +.Ve +.PP +Inspecting \f(CW$UTF8{Euro}\fR shows that it contains 3 bytes: +"\exe2\ex82\exac". However, it contains only 1 character, number 0x20AC. +The round trip can be completed with \f(CW\*(C`unpack\*(C'\fR: +.PP +.Vb 1 +\& $Unicode{Euro} = unpack( \*(AqU\*(Aq, $UTF8{Euro} ); +.Ve +.PP +Unpacking using the \f(CW\*(C`U\*(C'\fR template code also works on UTF\-8 encoded byte +strings. +.PP +Usually you'll want to pack or unpack UTF\-8 strings: +.PP +.Vb 3 +\& # pack and unpack the Hebrew alphabet +\& my $alefbet = pack( \*(AqU*\*(Aq, 0x05d0..0x05ea ); +\& my @hebrew = unpack( \*(AqU*\*(Aq, $utf ); +.Ve +.PP +Please note: in the general case, you're better off using +\&\f(CW\*(C`Encode::decode(\*(AqUTF\-8\*(Aq, $utf)\*(C'\fR to decode a UTF\-8 +encoded byte string to a Perl Unicode string, and +\&\f(CW\*(C`Encode::encode(\*(AqUTF\-8\*(Aq, $str)\*(C'\fR to encode a Perl Unicode +string to UTF\-8 bytes. These functions provide means of handling invalid byte +sequences and generally have a friendlier interface. +.SS "Another Portable Binary Encoding" +.IX Subsection "Another Portable Binary Encoding" +The pack code \f(CW\*(C`w\*(C'\fR has been added to support a portable binary data +encoding scheme that goes way beyond simple integers. (Details can +be found at <https://github.com/mworks\-project/mw_scarab/blob/master/Scarab\-0.1.00d19/doc/binary\-serialization.txt>, +the Scarab project.) A BER (Binary Encoded +Representation) compressed unsigned integer stores base 128 +digits, most significant digit first, with as few digits as possible. +Bit eight (the high bit) is set on each byte except the last. There +is no size limit to BER encoding, but Perl won't go to extremes. 
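The encoding rule just described can be sketched by hand and compared against pack. The helper ber_encode below is hypothetical, written only for illustration; it is not part of Perl, and it assumes non-negative integers that fit in a native integer.

```perl
use strict;
use warnings;

# Hypothetical helper: encode a non-negative integer as base 128 digits,
# most significant digit first, high bit set on every byte but the last.
sub ber_encode {
    my ($n) = @_;
    my @digits = ( $n & 0x7f );                  # last byte: high bit clear
    while ( $n >>= 7 ) {
        unshift @digits, ( $n & 0x7f ) | 0x80;   # earlier bytes: high bit set
    }
    return pack( 'C*', @digits );
}

for my $n ( 1, 128, 300, 16511 ) {
    printf "%6d  by hand: %-6s  pack w: %s\n",
        $n,
        unpack( 'H*', ber_encode($n) ),
        unpack( 'H*', pack( 'w', $n ) );
}
```

Both columns should agree for every value, for example 8100 for 128 and 81807f for 16511.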
+.PP +.Vb 1 +\& my $berbuf = pack( \*(Aqw*\*(Aq, 1, 128, 128+1, 128*128+127 ); +.Ve +.PP +A hex dump of \f(CW$berbuf\fR, with spaces inserted at the right places, +shows 01 8100 8101 81807F. Since the last byte is always less than +128, \f(CW\*(C`unpack\*(C'\fR knows where to stop. +.SH "Template Grouping" +.IX Header "Template Grouping" +Prior to Perl 5.8, repetitions of templates had to be made by +\&\f(CW\*(C`x\*(C'\fR\-multiplication of template strings. Now there is a better way as +we may use the pack codes \f(CW\*(C`(\*(C'\fR and \f(CW\*(C`)\*(C'\fR combined with a repeat count. +The \f(CW\*(C`unpack\*(C'\fR template from the Stack Frame example can simply +be written like this: +.PP +.Vb 1 +\& unpack( \*(Aqv2 (vXXCC)5 v5\*(Aq, $frame ) +.Ve +.PP +Let's explore this feature a little more. We'll begin with the equivalent of +.PP +.Vb 1 +\& join( \*(Aq\*(Aq, map( substr( $_, 0, 1 ), @str ) ) +.Ve +.PP +which returns a string consisting of the first character from each string. +Using pack, we can write +.PP +.Vb 1 +\& pack( \*(Aq(A)\*(Aq.@str, @str ) +.Ve +.PP +or, because a repeat count \f(CW\*(C`*\*(C'\fR means "repeat as often as required", +simply +.PP +.Vb 1 +\& pack( \*(Aq(A)*\*(Aq, @str ) +.Ve +.PP +(Note that the template \f(CW\*(C`A*\*(C'\fR would only have packed \f(CW$str[0]\fR in full +length.) +.PP +To pack dates stored as triplets ( day, month, year ) in an array \f(CW@dates\fR +into a sequence of byte, byte, short integer we can write +.PP +.Vb 1 +\& $pd = pack( \*(Aq(CCS)*\*(Aq, map( @$_, @dates ) ); +.Ve +.PP +To swap pairs of characters in a string (with even length) one could use +several techniques. 
First, let's use \f(CW\*(C`x\*(C'\fR and \f(CW\*(C`X\*(C'\fR to skip forward and back: +.PP +.Vb 1 +\& $s = pack( \*(Aq(A)*\*(Aq, unpack( \*(Aq(xAXXAx)*\*(Aq, $s ) ); +.Ve +.PP +We can also use \f(CW\*(C`@\*(C'\fR to jump to an offset, with 0 being the position where +we were when the last \f(CW\*(C`(\*(C'\fR was encountered: +.PP +.Vb 1 +\& $s = pack( \*(Aq(A)*\*(Aq, unpack( \*(Aq(@1A @0A @2)*\*(Aq, $s ) ); +.Ve +.PP +Finally, there is also an entirely different approach by unpacking big +endian shorts and packing them in the reverse byte order: +.PP +.Vb 1 +\& $s = pack( \*(Aq(v)*\*(Aq, unpack( \*(Aq(n)*\*(Aq, $s ) ); +.Ve +.SH "Lengths and Widths" +.IX Header "Lengths and Widths" +.SS "String Lengths" +.IX Subsection "String Lengths" +In the previous section we've seen a network message that was constructed +by prefixing the binary message length to the actual message. You'll find +that packing a length followed by so many bytes of data is a +frequently used recipe since appending a null byte won't work +if a null byte may be part of the data. Here is an example where both +techniques are used: after two null terminated strings with source and +destination address, a Short Message (to a mobile phone) is sent after +a length byte: +.PP +.Vb 1 +\& my $msg = pack( \*(AqZ*Z*CA*\*(Aq, $src, $dst, length( $sm ), $sm ); +.Ve +.PP +Unpacking this message can be done with the same template: +.PP +.Vb 1 +\& ( $src, $dst, $len, $sm ) = unpack( \*(AqZ*Z*CA*\*(Aq, $msg ); +.Ve +.PP +There's a subtle trap lurking in the offing: Adding another field after +the Short Message (in variable \f(CW$sm\fR) is all right when packing, but this +cannot be unpacked naively: +.PP +.Vb 2 +\& # pack a message +\& my $msg = pack( \*(AqZ*Z*CA*C\*(Aq, $src, $dst, length( $sm ), $sm, $prio ); +\& +\& # unpack fails \- $prio remains undefined!
+\& ( $src, $dst, $len, $sm, $prio ) = unpack( \*(AqZ*Z*CA*C\*(Aq, $msg ); +.Ve +.PP +The pack code \f(CW\*(C`A*\*(C'\fR gobbles up all remaining bytes, and \f(CW$prio\fR remains +undefined! Before we let disappointment dampen the morale: Perl's got +the trump card to make this trick too, just a little further up the sleeve. +Watch this: +.PP +.Vb 2 +\& # pack a message: ASCIIZ, ASCIIZ, length/string, byte +\& my $msg = pack( \*(AqZ* Z* C/A* C\*(Aq, $src, $dst, $sm, $prio ); +\& +\& # unpack +\& ( $src, $dst, $sm, $prio ) = unpack( \*(AqZ* Z* C/A* C\*(Aq, $msg ); +.Ve +.PP +Combining two pack codes with a slash (\f(CW\*(C`/\*(C'\fR) associates them with a single +value from the argument list. In \f(CW\*(C`pack\*(C'\fR, the length of the argument is +taken and packed according to the first code while the argument itself +is added after being converted with the template code after the slash. +This saves us the trouble of inserting the \f(CW\*(C`length\*(C'\fR call, but it is +in \f(CW\*(C`unpack\*(C'\fR where we really score: The value of the length byte marks the +end of the string to be taken from the buffer. Since this combination +doesn't make sense except when the second pack code isn't \f(CW\*(C`a*\*(C'\fR, \f(CW\*(C`A*\*(C'\fR +or \f(CW\*(C`Z*\*(C'\fR, Perl won't let you. +.PP +The pack code preceding \f(CW\*(C`/\*(C'\fR may be anything that's fit to represent a +number: All the numeric binary pack codes, and even text codes such as +\&\f(CW\*(C`A4\*(C'\fR or \f(CW\*(C`Z*\*(C'\fR: +.PP +.Vb 4 +\& # pack/unpack a string preceded by its length in ASCII +\& my $buf = pack( \*(AqA4/A*\*(Aq, "Humpty\-Dumpty" ); +\& # unpack $buf: \*(Aq13 Humpty\-Dumpty\*(Aq +\& my $txt = unpack( \*(AqA4/A*\*(Aq, $buf ); +.Ve +.PP +\&\f(CW\*(C`/\*(C'\fR is not implemented in Perls before 5.6, so if your code is required to +work on ancient Perls you'll need to \f(CW\*(C`unpack( \*(AqZ* Z* C\*(Aq)\*(C'\fR to get the length, +then use it to make a new unpack string. 
For example +.PP +.Vb 3 +\& # pack a message: ASCIIZ, ASCIIZ, length, string, byte +\& # (5.005 compatible) +\& my $msg = pack( \*(AqZ* Z* C A* C\*(Aq, $src, $dst, length $sm, $sm, $prio ); +\& +\& # unpack +\& ( undef, undef, $len) = unpack( \*(AqZ* Z* C\*(Aq, $msg ); +\& ($src, $dst, $sm, $prio) = unpack ( "Z* Z* x A$len C", $msg ); +.Ve +.PP +But that second \f(CW\*(C`unpack\*(C'\fR is rushing ahead. It isn't using a simple literal +string for the template. So maybe we should introduce... +.SS "Dynamic Templates" +.IX Subsection "Dynamic Templates" +So far, we've seen literals used as templates. If the list of pack +items doesn't have fixed length, an expression constructing the +template is required (whenever, for some reason, \f(CW\*(C`()*\*(C'\fR cannot be used). +Here's an example: To store named string values in a way that can be +conveniently parsed by a C program, we create a sequence of names and +null terminated ASCII strings, with \f(CW\*(C`=\*(C'\fR between the name and the value, +followed by an additional delimiting null byte. Here's how: +.PP +.Vb 2 +\& my $env = pack( \*(Aq(A*A*Z*)\*(Aq . keys( %Env ) . \*(AqC\*(Aq, +\& map( { ( $_, \*(Aq=\*(Aq, $Env{$_} ) } keys( %Env ) ), 0 ); +.Ve +.PP +Let's examine the cogs of this byte mill, one by one. There's the \f(CW\*(C`map\*(C'\fR +call, creating the items we intend to stuff into the \f(CW$env\fR buffer: +to each key (in \f(CW$_\fR) it adds the \f(CW\*(C`=\*(C'\fR separator and the hash entry value. +Each triplet is packed with the template code sequence \f(CW\*(C`A*A*Z*\*(C'\fR that +is repeated according to the number of keys. (Yes, that's what the \f(CW\*(C`keys\*(C'\fR +function returns in scalar context.) To get the very last null byte, +we add a \f(CW0\fR at the end of the \f(CW\*(C`pack\*(C'\fR list, to be packed with \f(CW\*(C`C\*(C'\fR. +(Attentive readers may have noticed that we could have omitted the 0.) 
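To make the resulting buffer concrete, here is a small sketch of the same construction. The contents of %Env are invented for the illustration, and the keys are sorted only so that the output is predictable.

```perl
use strict;
use warnings;

# %Env and its contents are made up for this illustration.
my %Env   = ( HOME => '/home/me', SHELL => '/bin/sh' );
my @names = sort keys %Env;    # sorted only to get a predictable order

# An array in scalar context yields its size, so for two keys the
# template becomes '(A*A*Z*)2C'.
my $template = '(A*A*Z*)' . @names . 'C';
my $env = pack( $template, ( map { ( $_, '=', $Env{$_} ) } @names ), 0 );

# Show the buffer with null bytes made visible.
( my $visible = $env ) =~ s/\0/\\0/g;
print "$template\n$visible\n";
```

Each value ends in the null byte supplied by Z*, and the final C contributes the extra delimiting null.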
+.PP +For the reverse operation, we'll have to determine the number of items +in the buffer before we can let \f(CW\*(C`unpack\*(C'\fR rip it apart: +.PP +.Vb 2 +\& my $n = $env =~ tr/\e0// \- 1; +\& my %env = map( split( /=/, $_ ), unpack( "(Z*)$n", $env ) ); +.Ve +.PP +The \f(CW\*(C`tr\*(C'\fR counts the null bytes. The \f(CW\*(C`unpack\*(C'\fR call returns a list of +name-value pairs each of which is taken apart in the \f(CW\*(C`map\*(C'\fR block. +.SS "Counting Repetitions" +.IX Subsection "Counting Repetitions" +Rather than storing a sentinel at the end of a data item (or a list of items), +we could precede the data with a count. Again, we pack keys and values of +a hash, preceding each with an unsigned short length count, and up front +we store the number of pairs: +.PP +.Vb 1 +\& my $env = pack( \*(AqS(S/A* S/A*)*\*(Aq, scalar keys( %Env ), %Env ); +.Ve +.PP +This simplifies the reverse operation as the number of repetitions can be +unpacked with the \f(CW\*(C`/\*(C'\fR code: +.PP +.Vb 1 +\& my %env = unpack( \*(AqS/(S/A* S/A*)\*(Aq, $env ); +.Ve +.PP +Note that this is one of the rare cases where you cannot use the same +template for \f(CW\*(C`pack\*(C'\fR and \f(CW\*(C`unpack\*(C'\fR because \f(CW\*(C`pack\*(C'\fR can't determine +a repeat count for a \f(CW\*(C`()\*(C'\fR\-group. +.SS "Intel HEX" +.IX Subsection "Intel HEX" +Intel HEX is a file format for representing binary data, mostly for +programming various chips, as a text file. (See +<https://en.wikipedia.org/wiki/.hex> for a detailed description, and +<https://en.wikipedia.org/wiki/SREC_(file_format)> for the Motorola +S\-record format, which can be unravelled using the same technique.) 
+Each line begins with a colon (':') and is followed by a sequence of +hexadecimal characters, specifying a byte count \fIn\fR (8 bit), +an address (16 bit, big endian), a record type (8 bit), \fIn\fR data bytes +and a checksum (8 bit) computed as the least significant byte of the two's +complement of the sum of the preceding bytes. Example: \f(CW\*(C`:0300300002337A1E\*(C'\fR. +.PP +The first step of processing such a line is the conversion, to binary, +of the hexadecimal data, to obtain the four fields, while checking the +checksum. No surprise here: we'll start with a simple \f(CW\*(C`pack\*(C'\fR call to +convert everything to binary: +.PP +.Vb 1 +\& my $binrec = pack( \*(AqH*\*(Aq, substr( $hexrec, 1 ) ); +.Ve +.PP +The resulting byte sequence is most convenient for checking the checksum. +Don't slow your program down with a for loop adding the \f(CW\*(C`ord\*(C'\fR values +of this string's bytes \- the \f(CW\*(C`unpack\*(C'\fR code \f(CW\*(C`%\*(C'\fR is the thing to use +for computing the 8\-bit sum of all bytes, which must be equal to zero: +.PP +.Vb 1 +\& die unless unpack( "%8C*", $binrec ) == 0; +.Ve +.PP +Finally, let's get those four fields. By now, you shouldn't have any +problems with the first three fields \- but how can we use the byte count +of the data in the first field as a length for the data field? Here +the codes \f(CW\*(C`x\*(C'\fR and \f(CW\*(C`X\*(C'\fR come to the rescue, as they permit jumping +back and forth in the string to unpack. +.PP +.Vb 1 +\& my( $addr, $type, $data ) = unpack( "x n C X4 C x3 /a", $binrec ); +.Ve +.PP +Code \f(CW\*(C`x\*(C'\fR skips a byte, since we don't need the count yet. Code \f(CW\*(C`n\*(C'\fR takes +care of the 16\-bit big-endian integer address, and \f(CW\*(C`C\*(C'\fR unpacks the +record type. Being at offset 4, where the data begins, we need the count. +\&\f(CW\*(C`X4\*(C'\fR brings us back to square one, which is the byte at offset 0.
+Now we pick up the count, and zoom forth to offset 4, where we are +now fully furnished to extract the exact number of data bytes, leaving +the trailing checksum byte alone. +.SH "Packing and Unpacking C Structures" +.IX Header "Packing and Unpacking C Structures" +In previous sections we have seen how to pack numbers and character +strings. If it were not for a couple of snags we could conclude this +section right away with the terse remark that C structures don't +contain anything else, and therefore you already know all there is to it. +Sorry, no: read on, please. +.PP +If you have to deal with a lot of C structures, and don't want to +hack all your template strings manually, you'll probably want to have +a look at the CPAN module \f(CW\*(C`Convert::Binary::C\*(C'\fR. Not only can it parse +your C source directly, but it also has built-in support for all the +odds and ends described further on in this section. +.SS "The Alignment Pit" +.IX Subsection "The Alignment Pit" +In the consideration of speed against memory requirements the balance +has been tilted in favor of faster execution. This has influenced the +way C compilers allocate memory for structures: On architectures +where a 16\-bit or 32\-bit operand can be moved faster between places in +memory, or to or from a CPU register, if it is aligned at an even or +multiple-of-four or even at a multiple-of eight address, a C compiler +will give you this speed benefit by stuffing extra bytes into structures. +If you don't cross the C shoreline this is not likely to cause you any +grief (although you should care when you design large data structures, +or you want your code to be portable between architectures (you do want +that, don't you?)). 
+.PP +To see how this affects \f(CW\*(C`pack\*(C'\fR and \f(CW\*(C`unpack\*(C'\fR, we'll compare these two +C structures: +.PP +.Vb 6 +\& typedef struct { +\& char c1; +\& short s; +\& char c2; +\& long l; +\& } gappy_t; +\& +\& typedef struct { +\& long l; +\& short s; +\& char c1; +\& char c2; +\& } dense_t; +.Ve +.PP +Typically, a C compiler allocates 12 bytes to a \f(CW\*(C`gappy_t\*(C'\fR variable, but +requires only 8 bytes for a \f(CW\*(C`dense_t\*(C'\fR. After investigating this further, +we can draw memory maps, showing where the extra 4 bytes are hidden: +.PP +.Vb 5 +\& 0 +4 +8 +12 +\& +\-\-+\-\-+\-\-+\-\-+\-\-+\-\-+\-\-+\-\-+\-\-+\-\-+\-\-+\-\-+ +\& |c1|xx| s |c2|xx|xx|xx| l | xx = fill byte +\& +\-\-+\-\-+\-\-+\-\-+\-\-+\-\-+\-\-+\-\-+\-\-+\-\-+\-\-+\-\-+ +\& gappy_t +\& +\& 0 +4 +8 +\& +\-\-+\-\-+\-\-+\-\-+\-\-+\-\-+\-\-+\-\-+ +\& | l | s |c1|c2| +\& +\-\-+\-\-+\-\-+\-\-+\-\-+\-\-+\-\-+\-\-+ +\& dense_t +.Ve +.PP +And that's where the first quirk strikes: \f(CW\*(C`pack\*(C'\fR and \f(CW\*(C`unpack\*(C'\fR +templates have to be stuffed with \f(CW\*(C`x\*(C'\fR codes to get those extra fill bytes. +.PP +The natural question: "Why can't Perl compensate for the gaps?" warrants +an answer. One good reason is that C compilers might provide (non-ANSI) +extensions permitting all sorts of fancy control over the way structures +are aligned, even at the level of an individual structure field. And, if +this were not enough, there is an insidious thing called \f(CW\*(C`union\*(C'\fR where +the number of fill bytes cannot be derived from the alignment of the next +item alone. +.PP +OK, so let's bite the bullet.
Here's one way to get the alignment right +by inserting template codes \f(CW\*(C`x\*(C'\fR, which don't take a corresponding item +from the list: +.PP +.Vb 1 +\& my $gappy = pack( \*(Aqcxs cxxx l!\*(Aq, $c1, $s, $c2, $l ); +.Ve +.PP +Note the \f(CW\*(C`!\*(C'\fR after \f(CW\*(C`l\*(C'\fR: We want to make sure that we pack a long +integer as it is compiled by our C compiler. And even now, it will only +work for the platforms where the compiler aligns things as above. +And somebody somewhere has a platform where it doesn't. +[Probably a Cray, where \f(CW\*(C`short\*(C'\fRs, \f(CW\*(C`int\*(C'\fRs and \f(CW\*(C`long\*(C'\fRs are all 8 bytes. :\-)] +.PP +Counting bytes and watching alignments in lengthy structures is bound to +be a drag. Isn't there a way we can create the template with a simple +program? Here's a C program that does the trick: +.PP +.Vb 2 +\& #include <stdio.h> +\& #include <stddef.h> +\& +\& typedef struct { +\& char fc1; +\& short fs; +\& char fc2; +\& long fl; +\& } gappy_t; +\& +\& #define Pt(struct,field,tchar) \e +\& printf( "@%d%s ", offsetof(struct,field), # tchar ); +\& +\& int main() { +\& Pt( gappy_t, fc1, c ); +\& Pt( gappy_t, fs, s! ); +\& Pt( gappy_t, fc2, c ); +\& Pt( gappy_t, fl, l! ); +\& printf( "\en" ); +\& } +.Ve +.PP +The output line can be used as a template in a \f(CW\*(C`pack\*(C'\fR or \f(CW\*(C`unpack\*(C'\fR call: +.PP +.Vb 1 +\& my $gappy = pack( \*(Aq@0c @2s! @4c @8l!\*(Aq, $c1, $s, $c2, $l ); +.Ve +.PP +Gee, yet another template code \- as if we hadn't plenty. But +\&\f(CW\*(C`@\*(C'\fR saves our day by enabling us to specify the offset from the beginning +of the pack buffer to the next item: This is just the value +the \f(CW\*(C`offsetof\*(C'\fR macro (defined in \f(CW\*(C`<stddef.h>\*(C'\fR) returns when +given a \f(CW\*(C`struct\*(C'\fR type and one of its field names ("member-designator" in +C standardese). +.PP +Neither using offsets nor adding \f(CW\*(C`x\*(C'\fR's to bridge the gaps is satisfactory. 
+(Just imagine what happens if the structure changes.) What we really need +is a way of saying "skip as many bytes as required to the next multiple of N". +In fluent templates, you say this with \f(CW\*(C`x!N\*(C'\fR where N is replaced by the +appropriate value. Here's the next version of our struct packaging: +.PP +.Vb 1 +\& my $gappy = pack( \*(Aqc x!2 s c x!4 l!\*(Aq, $c1, $s, $c2, $l ); +.Ve +.PP +That's certainly better, but we still have to know how long all the +integers are, and portability is far away. Rather than \f(CW2\fR, +for instance, we want to say "however long a short is". But this can be +done by enclosing the appropriate pack code in brackets: \f(CW\*(C`[s]\*(C'\fR. So, here's +the very best we can do: +.PP +.Vb 1 +\& my $gappy = pack( \*(Aqc x![s] s c x![l!] l!\*(Aq, $c1, $s, $c2, $l ); +.Ve +.SS "Dealing with Endian-ness" +.IX Subsection "Dealing with Endian-ness" +Now, imagine that we want to pack the data for a machine with a +different byte-order. First, we'll have to figure out how big the data +types on the target machine really are. Let's assume that the longs are +32 bits wide and the shorts are 16 bits wide. You can then rewrite the +template as: +.PP +.Vb 1 +\& my $gappy = pack( \*(Aqc x![s] s c x![l] l\*(Aq, $c1, $s, $c2, $l ); +.Ve +.PP +If the target machine is little-endian, we could write: +.PP +.Vb 1 +\& my $gappy = pack( \*(Aqc x![s] s< c x![l] l<\*(Aq, $c1, $s, $c2, $l ); +.Ve +.PP +This forces the short and the long members to be little-endian, and is +just fine if you don't have too many struct members. But we could also +use the byte-order modifier on a group and write the following: +.PP +.Vb 1 +\& my $gappy = pack( \*(Aq( c x![s] s c x![l] l )<\*(Aq, $c1, $s, $c2, $l ); +.Ve +.PP +This is not as short as before, but it makes it more obvious that we +intend to have little-endian byte-order for a whole group, not only +for individual template codes. It can also be more readable and easier +to maintain. 
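A quick way to see what the group modifier does is to dump the bytes for both byte orders. This is a minimal sketch with arbitrarily chosen sample values; it assumes a Perl recent enough to accept byte-order modifiers on groups, and relies on l always being 32 bits and s always 16 bits.

```perl
use strict;
use warnings;

# Sample values chosen arbitrarily for the illustration.
my ( $c1, $s, $c2, $l ) = ( 1, 2, 3, 4 );

my $little = pack( '( c x![s] s c x![l] l )<', $c1, $s, $c2, $l );
my $big    = pack( '( c x![s] s c x![l] l )>', $c1, $s, $c2, $l );

# Print each buffer as space-separated hex bytes.
for my $buf ( $little, $big ) {
    print join( ' ', unpack( '(H2)*', $buf ) ), "\n";
}
```

Both buffers are 12 bytes long and carry the fill bytes in the same places; only the bytes of the short and the long change order.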
+.SS "Alignment, Take 2" +.IX Subsection "Alignment, Take 2" +I'm afraid that we're not quite through with the alignment catch yet. The +hydra raises another ugly head when you pack arrays of structures: +.PP +.Vb 4 +\& typedef struct { +\& short count; +\& char glyph; +\& } cell_t; +\& +\& typedef cell_t buffer_t[BUFLEN]; +.Ve +.PP +Where's the catch? Padding is neither required before the first field \f(CW\*(C`count\*(C'\fR, +nor between this and the next field \f(CW\*(C`glyph\*(C'\fR, so why can't we simply pack +like this: +.PP +.Vb 3 +\& # something goes wrong here: +\& pack( \*(Aqs!a\*(Aq x @buffer, +\& map{ ( $_\->{count}, $_\->{glyph} ) } @buffer ); +.Ve +.PP +This packs \f(CW\*(C`3*@buffer\*(C'\fR bytes, but it turns out that the size of +\&\f(CW\*(C`buffer_t\*(C'\fR is four times \f(CW\*(C`BUFLEN\*(C'\fR! The moral of the story is that +the required alignment of a structure or array is propagated to the +next higher level where we have to consider padding \fIat the end\fR +of each component as well. Thus the correct template is: +.PP +.Vb 2 +\& pack( \*(Aqs!ax\*(Aq x @buffer, +\& map{ ( $_\->{count}, $_\->{glyph} ) } @buffer ); +.Ve +.SS "Alignment, Take 3" +.IX Subsection "Alignment, Take 3" +And even if you take all the above into account, ANSI still lets this: +.PP +.Vb 3 +\& typedef struct { +\& char foo[2]; +\& } foo_t; +.Ve +.PP +vary in size. The alignment constraint of the structure can be greater than +any of its elements. [And if you think that this doesn't affect anything +common, dismember the next cellphone that you see. Many have ARM cores, and +the ARM structure rules make \f(CW\*(C`sizeof (foo_t)\*(C'\fR == 4] +.SS "Pointers for How to Use Them" +.IX Subsection "Pointers for How to Use Them" +The title of this section indicates the second problem you may run into +sooner or later when you pack C structures. 
If the function you intend +to call expects a, say, \f(CW\*(C`void *\*(C'\fR value, you \fIcannot\fR simply take +a reference to a Perl variable. (Although that value certainly is a +memory address, it's not the address where the variable's contents are +stored.) +.PP +Template code \f(CW\*(C`P\*(C'\fR promises to pack a "pointer to a fixed length string". +Isn't this what we want? Let's try: +.PP +.Vb 3 +\& # allocate some storage and pack a pointer to it +\& my $memory = "\ex00" x $size; +\& my $memptr = pack( \*(AqP\*(Aq, $memory ); +.Ve +.PP +But wait: doesn't \f(CW\*(C`pack\*(C'\fR just return a sequence of bytes? How can we pass this +string of bytes to some C code expecting a pointer which is, after all, +nothing but a number? The answer is simple: We have to obtain the numeric +address from the bytes returned by \f(CW\*(C`pack\*(C'\fR. +.PP +.Vb 1 +\& my $ptr = unpack( \*(AqL!\*(Aq, $memptr ); +.Ve +.PP +Obviously this assumes that it is possible to typecast a pointer +to an unsigned long and vice versa, which frequently works but should not +be taken as a universal law. \- Now that we have this pointer the next question +is: How can we put it to good use? We need a call to some C function +where a pointer is expected. 
The \fBread\fR\|(2) system call comes to mind: +.PP +.Vb 1 +\& ssize_t read(int fd, void *buf, size_t count); +.Ve +.PP +After reading perlfunc explaining how to use \f(CW\*(C`syscall\*(C'\fR we can write +this Perl function copying a file to standard output: +.PP +.Vb 12 +\& require \*(Aqsyscall.ph\*(Aq; # run h2ph to generate this file +\& sub cat($){ +\& my $path = shift(); +\& my $size = \-s $path; +\& my $memory = "\ex00" x $size; # allocate some memory +\& my $ptr = unpack( \*(AqL!\*(Aq, pack( \*(AqP\*(Aq, $memory ) ); +\& open( F, $path ) || die( "$path: cannot open ($!)\en" ); +\& my $fd = fileno(F); +\& my $res = syscall( &SYS_read, $fd, $ptr, $size ); +\& print $memory; +\& close( F ); +\& } +.Ve +.PP +This is neither a specimen of simplicity nor a paragon of portability but +it illustrates the point: We are able to sneak behind the scenes and +access Perl's otherwise well-guarded memory! (Important note: Perl's +\&\f(CW\*(C`syscall\*(C'\fR does \fInot\fR require you to construct pointers in this roundabout +way. You simply pass a string variable, and Perl forwards the address.) +.PP +How does \f(CW\*(C`unpack\*(C'\fR with \f(CW\*(C`P\*(C'\fR work? Imagine some pointer in the buffer +about to be unpacked: If it isn't the null pointer (which will smartly +produce the \f(CW\*(C`undef\*(C'\fR value) we have a start address \- but then what? +Perl has no way of knowing how long this "fixed length string" is, so +it's up to you to specify the actual size as an explicit length after \f(CW\*(C`P\*(C'\fR. +.PP +.Vb 2 +\& my $mem = "abcdefghijklmn"; +\& print unpack( \*(AqP5\*(Aq, pack( \*(AqP\*(Aq, $mem ) ); # prints "abcde" +.Ve +.PP +As a consequence, \f(CW\*(C`pack\*(C'\fR ignores any number or \f(CW\*(C`*\*(C'\fR after \f(CW\*(C`P\*(C'\fR. +.PP +Now that we have seen \f(CW\*(C`P\*(C'\fR at work, we might as well give \f(CW\*(C`p\*(C'\fR a whirl. +Why do we need a second template code for packing pointers at all?
The +answer lies behind the simple fact that an \f(CW\*(C`unpack\*(C'\fR with \f(CW\*(C`p\*(C'\fR promises +a null-terminated string starting at the address taken from the buffer, +and that implies a length for the data item to be returned: +.PP +.Vb 2 +\& my $buf = pack( \*(Aqp\*(Aq, "abc\ex00efhijklmn" ); +\& print unpack( \*(Aqp\*(Aq, $buf ); # prints "abc" +.Ve +.PP +This is apt to be confusing, though: As a consequence of the length being +implied by the string's length, a number after pack code \f(CW\*(C`p\*(C'\fR is a repeat +count, not a length as after \f(CW\*(C`P\*(C'\fR. +.PP +Using \f(CW\*(C`pack(..., $x)\*(C'\fR with \f(CW\*(C`P\*(C'\fR or \f(CW\*(C`p\*(C'\fR to get the address where \f(CW$x\fR is +actually stored calls for circumspection. Perl's internal machinery +considers the relation between a variable and that address as its very own +private matter and doesn't really care that we have obtained a copy. Therefore: +.IP \(bu 4 +Do not use \f(CW\*(C`pack\*(C'\fR with \f(CW\*(C`p\*(C'\fR or \f(CW\*(C`P\*(C'\fR to obtain the address of a variable +that's bound to go out of scope (thereby freeing its memory) before you +are done with using the memory at that address. +.IP \(bu 4 +Be very careful with Perl operations that change the value of the +variable. Appending something to the variable, for instance, might require +reallocation of its storage, leaving you with a pointer into no-man's land. +.IP \(bu 4 +Don't think that you can get the address of a Perl variable +when it is stored as an integer or double number! \f(CW\*(C`pack(\*(AqP\*(Aq, $x)\*(C'\fR will +force the variable's internal representation to string, just as if you +had written something like \f(CW\*(C`$x .= \*(Aq\*(Aq\*(C'\fR. +.PP +It's safe, however, to P\- or p\-pack a string literal, because Perl simply +allocates an anonymous variable.
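The different meaning of a number after p and after P can be tried out directly. The sketch below uses made-up variable names and keeps both strings in named variables that stay in scope and are not modified, in line with the cautions above.

```perl
use strict;
use warnings;

# Both strings live in named variables that stay in scope and are not
# modified, so the packed addresses remain valid throughout this sketch.
my ( $first, $second ) = ( "abc", "defgh" );

# After 'p' the number is a repeat count: two pointers are packed ...
my $buf = pack( 'p2', $first, $second );

# ... and two null-terminated strings come back.
my @strings = unpack( 'p2', $buf );

# After 'P' the number is a length: take 3 bytes at the address of $second.
my $prefix = unpack( 'P3', pack( 'P', $second ) );

print "@strings $prefix\n";
```

The size of $buf depends on the platform's pointer width, so the sketch deliberately makes no assumption about it.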
+.SH "Pack Recipes" +.IX Header "Pack Recipes" +Here is a collection of (possibly) useful canned recipes for \f(CW\*(C`pack\*(C'\fR +and \f(CW\*(C`unpack\*(C'\fR: +.PP +.Vb 2 +\& # Convert IP address for socket functions +\& pack( "C4", split /\e./, "123.4.5.6" ); +\& +\& # Count the bits in a chunk of memory (e.g. a select vector) +\& unpack( \*(Aq%32b*\*(Aq, $mask ); +\& +\& # Determine the endianness of your system +\& $is_little_endian = unpack( \*(Aqc\*(Aq, pack( \*(Aqs\*(Aq, 1 ) ); +\& $is_big_endian = unpack( \*(Aqxc\*(Aq, pack( \*(Aqs\*(Aq, 1 ) ); +\& +\& # Determine the number of bits in a native integer +\& $bits = unpack( \*(Aq%32b*\*(Aq, pack( \*(AqI!\*(Aq, ~0 ) ); +\& +\& # Prepare argument for the nanosleep system call +\& my $timespec = pack( \*(AqL!L!\*(Aq, $secs, $nanosecs ); +.Ve +.PP +For a simple memory dump we unpack some bytes into just as +many pairs of hex digits, and use \f(CW\*(C`map\*(C'\fR to handle the traditional +spacing \- 16 bytes to a line: +.PP +.Vb 4 +\& my $i; +\& print map( ++$i % 16 ? "$_ " : "$_\en", +\& unpack( \*(AqH2\*(Aq x length( $mem ), $mem ) ), +\& length( $mem ) % 16 ? "\en" : \*(Aq\*(Aq; +.Ve +.SH "Funnies Section" +.IX Header "Funnies Section" +.Vb 5 +\& # Pulling digits out of nowhere... +\& print unpack( \*(AqC\*(Aq, pack( \*(Aqx\*(Aq ) ), +\& unpack( \*(Aq%B*\*(Aq, pack( \*(AqA\*(Aq ) ), +\& unpack( \*(AqH\*(Aq, pack( \*(AqA\*(Aq ) ), +\& unpack( \*(AqA\*(Aq, unpack( \*(AqC\*(Aq, pack( \*(AqA\*(Aq ) ) ), "\en"; +\& +\& # One for the road ;\-) +\& my $advice = pack( \*(Aqall u can in a van\*(Aq ); +.Ve +.SH Authors +.IX Header "Authors" +Simon Cozens and Wolfgang Laun.