@c GNU Version-sort ordering documentation @c Copyright (C) 2019--2022 Free Software Foundation, Inc. @c Permission is granted to copy, distribute and/or modify this document @c under the terms of the GNU Free Documentation License, Version 1.3 or @c any later version published by the Free Software Foundation; with no @c Invariant Sections, no Front-Cover Texts, and no Back-Cover @c Texts. A copy of the license is included in the ``GNU Free @c Documentation License'' file as part of this distribution. @c Written by Assaf Gordon @node Version sort ordering @chapter Version sort ordering @node Version sort overview @section Version sort overview @dfn{Version sort} puts items such as file names and lines of text in an order that feels natural to people, when the text contains a mixture of letters and digits. Lexicographic sorting usually does not produce the order that one expects because comparisons are made on a character-by-character basis. Compare the sorting of the following items: @example Lexicographic sort: Version Sort: a1 a1 a120 a2 a13 a13 a2 a120 @end example Version sort functionality in GNU Coreutils is available in the @samp{ls -v}, @samp{ls --sort=version}, @samp{sort -V}, and @samp{sort --version-sort} commands. @node Using version sort in GNU Coreutils @subsection Using version sort in GNU Coreutils Two GNU Coreutils programs use version sort: @command{ls} and @command{sort}. To list files in version sort order, use @command{ls} with the @option{-v} or @option{--sort=version} option: @example default sort: version sort: $ ls -1 $ ls -1 -v a1 a1 a100 a1.4 a1.13 a1.13 a1.4 a1.40 a1.40 a2 a2 a100 @end example To sort text files in version sort order, use @command{sort} with the @option{-V} or @option{--version-sort} option: @example $ cat input b3 b11 b1 b20 lexicographic order: version sort order: $ sort input $ sort -V input b1 b1 b11 b3 b20 b11 b3 b20 @end example To sort a specific field in a file, use @option{-k/--key} with @samp{V} type sorting, which is often combined with @samp{b} to ignore leading blanks in the field: @example $ cat input2 100 b3 apples 2000 b11 oranges 3000 b1 potatoes 4000 b20 bananas $ sort -k 2bV,2 input2 3000 b1 potatoes 100 b3 apples 2000 b11 oranges 4000 b20 bananas @end example @node Version sort and natural sort @subsection Version sort and natural sort In GNU Coreutils, the name @dfn{version sort} was chosen because it is based on Debian GNU/Linux's algorithm of sorting packages' versions. Its goal is to answer questions like ``Which package is newer, @file{firefox-60.7.2} or @file{firefox-60.12.3}?'' In Coreutils this algorithm was slightly modified to work on more general input such as textual strings and file names (see @ref{Differences from Debian version sort}). In other contexts, such as other programs and other programming languages, a similar sorting functionality is called @uref{https://en.wikipedia.org/wiki/Natural_sort_order,natural sort}. @node Variations in version sort order @subsection Variations in version sort order Currently there is no standard for version sort. That is: there is no one correct way or universally agreed-upon way to order items. Each program and each programming language can decide its own ordering algorithm and call it ``version sort'', ``natural sort'', or other names. See @ref{Other version/natural sort implementations} for many examples of differing sorting possibilities, each with its own rules and variations. If you find a bug in the Coreutils implementation of version-sort, please report it. @xref{Reporting version sort bugs}. @node Version sort implementation @section Version sort implementation GNU Coreutils version sort is based on the ``upstream version'' part of @uref{https://www.debian.org/doc/debian-policy/ch-controlfields.html#version, Debian's versioning scheme}. This section describes the GNU Coreutils sort ordering rules. The next section (@ref{Differences from Debian version sort}) describes some differences between GNU Coreutils and Debian version sort. @node Version-sort ordering rules @subsection Version-sort ordering rules The version sort ordering rules are: @enumerate @item The strings are compared from left to right. @item First the initial part of each string consisting entirely of non-digit bytes is determined. @enumerate A @item These two parts (either of which may be empty) are compared lexically. If a difference is found it is returned. @item The lexical comparison is a lexicographic comparison of byte strings, except that: @enumerate a @item ASCII letters sort before other bytes. @item A tilde sorts before anything, even an empty string. @end enumerate @end enumerate @item Then the initial part of the remainder of each string that contains all the leading digits is determined. The numerical values represented by these two parts are compared, and any difference found is returned as the result of the comparison. @enumerate A @item For these purposes an empty string (which can only occur at the end of one or both version strings being compared) counts as zero. @item Because the numerical value is used, non-identical strings can compare equal. For example, @samp{123} compares equal to @samp{00123}, and the empty string compares equal to @samp{0}. @end enumerate @item These two steps (comparing and removing initial non-digit strings and initial digit strings) are repeated until a difference is found or both strings are exhausted. @end enumerate Consider the version-sort comparison of two file names: @file{foo07.7z} and @file{foo7a.7z}. The two strings will be broken down to the following parts, and the parts compared respectively from each string: @example foo @r{vs} foo @r{(rule 2, non-digits)} 07 @r{vs} 7 @r{(rule 3, digits)} . @r{vs} a. @r{(rule 2)} 7 @r{vs} 7 @r{(rule 3)} z @r{vs} z @r{(rule 2)} @end example Comparison flow based on above algorithm: @enumerate @item The first parts (@samp{foo}) are identical. @item The second parts (@samp{07} and @samp{7}) are compared numerically, and compare equal. @item The third parts (@samp{.} vs @samp{a.}) are compared lexically by ASCII value (rule 2.B). @item The first byte of the first string (@samp{.}) is compared to the first byte of the second string (@samp{a}). @item Rule 2.B.a says letters sorts before non-letters. Hence, @samp{a} comes before @samp{.}. @item The returned result is that @file{foo7a.7z} comes before @file{foo07.7z}. @end enumerate Result when using sort: @example $ cat input3 foo07.7z foo7a.7z $ sort -V input3 foo7a.7z foo07.7z @end example See @ref{Differences from Debian version sort} for additional rules that extend the Debian algorithm in Coreutils. @node Version sort is not the same as numeric sort @subsection Version sort is not the same as numeric sort Consider the following text file: @example $ cat input4 8.10 8.5 8.1 8.01 8.010 8.100 8.49 Numerical Sort: Version Sort: $ sort -n input4 $ sort -V input4 8.01 8.01 8.010 8.1 8.1 8.5 8.10 8.010 8.100 8.10 8.49 8.49 8.5 8.100 @end example Numeric sort (@samp{sort -n}) treats the entire string as a single numeric value, and compares it to other values. For example, @samp{8.1}, @samp{8.10} and @samp{8.100} are numerically equivalent, and are ordered together. Similarly, @samp{8.49} is numerically less than @samp{8.5}, and appears before first. Version sort (@samp{sort -V}) first breaks down the string into digit and non-digit parts, and only then compares each part (see annotated example in @ref{Version-sort ordering rules}). Comparing the string @samp{8.1} to @samp{8.01}, first the @samp{8}s are compared (and are identical), then the dots (@samp{.}) are compared and are identical, and lastly the remaining digits are compared numerically (@samp{1} and @samp{01}) - which are numerically equal. Hence, @samp{8.01} and @samp{8.1} are grouped together. Similarly, comparing @samp{8.5} to @samp{8.49} -- the @samp{8} and @samp{.} parts are identical, then the numeric values @samp{5} and @samp{49} are compared. The resulting @samp{5} appears before @samp{49}. This sorting order (where @samp{8.5} comes before @samp{8.49}) is common when assigning versions to computer programs (while perhaps not intuitive or ``natural'' for people). @node Version sort punctuation @subsection Version sort punctuation Punctuation is sorted by ASCII order (rule 2.B). @example $ touch 1.0.5_src.tar.gz 1.0_src.tar.gz $ ls -v -1 1.0.5_src.tar.gz 1.0_src.tar.gz @end example Why is @file{1.0.5_src.tar.gz} listed before @file{1.0_src.tar.gz}? Based on the version-sort ordering rules, the strings are broken down into the following parts: @example 1 @r{vs} 1 @r{(rule 3, all digits)} . @r{vs} . @r{(rule 2, all non-digits)} 0 @r{vs} 0 @r{(rule 3)} . @r{vs} _src.tar.gz @r{(rule 2)} 5 @r{vs} empty string @r{(no more bytes in the file name)} _src.tar.gz @r{vs} empty string @end example The fourth parts (@samp{.} and @samp{_src.tar.gz}) are compared lexically by ASCII order. The @samp{.} (ASCII value 46) is less than @samp{_} (ASCII value 95) -- and should be listed before it. Hence, @file{1.0.5_src.tar.gz} is listed first. If a different byte appears instead of the underscore (for example, percent sign @samp{%} ASCII value 37, which is less than dot's ASCII value of 46), that file will be listed first: @example $ touch 1.0.5_src.tar.gz 1.0%zzzzz.gz 1.0%zzzzz.gz 1.0.5_src.tar.gz @end example The same reasoning applies to the following example, as @samp{.} with ASCII value 46 is less than @samp{/} with ASCII value 47: @example $ cat input5 3.0/ 3.0.5 $ sort -V input5 3.0.5 3.0/ @end example @node Punctuation vs letters @subsection Punctuation vs letters Rule 2.B.a says letters sort before non-letters (after breaking down a string to digit and non-digit parts). @example $ cat input6 a% az $ sort -V input6 az a% @end example The input strings consist entirely of non-digits, and based on the above algorithm have only one part, all non-digits (@samp{a%} vs @samp{az}). Each part is then compared lexically, byte-by-byte; @samp{a} compares identically in both strings. Rule 2.B.a says a letter like @samp{z} sorts before a non-letter like @samp{%} -- hence @samp{az} appears first (despite @samp{z} having ASCII value of 122, much larger than @samp{%} with ASCII value 37). @node The tilde @samp{~} @subsection The tilde @samp{~} Rule 2.B.b says the tilde @samp{~} (ASCII 126) sorts before other bytes, and before an empty string. @example $ cat input7 1 1% 1.2 1~ ~ $ sort -V input7 ~ 1~ 1 1% 1.2 @end example The sorting algorithm starts by breaking down the string into non-digit (rule 2) and digit parts (rule 3). In the above input file, only the last line in the input file starts with a non-digit (@samp{~}). This is the first part. All other lines in the input file start with a digit -- their first non-digit part is empty. Based on rule 2.B.b, tilde @samp{~} sorts before other bytes and before the empty string -- hence it comes before all other strings, and is listed first in the sorted output. The remaining lines (@samp{1}, @samp{1%}, @samp{1.2}, @samp{1~}) follow similar logic: The digit part is extracted (1 for all strings) and compares equal. The following extracted parts for the remaining input lines are: empty part, @samp{%}, @samp{.}, @samp{~}. Tilde sorts before all others, hence the line @samp{1~} appears next. The remaining lines (@samp{1}, @samp{1%}, @samp{1.2}) are sorted based on previously explained rules. @node Version sort ignores locale @subsection Version sort ignores locale In version sort, Unicode characters are compared byte-by-byte according to their binary representation, ignoring their Unicode value or the current locale. Most commonly, Unicode characters are encoded as UTF-8 bytes; for example, GREEK SMALL LETTER ALPHA (U+03B1, @samp{α}) is encoded as the UTF-8 sequence @samp{0xCE 0xB1}). The encoding is compared byte-by-byte, e.g., first @samp{0xCE} (decimal value 206) then @samp{0xB1} (decimal value 177). @example $ touch aa az "a%" "aα" $ ls -1 -v aa az a% aα @end example Ignoring the first letter (@samp{a}) which is identical in all strings, the compared values are: @samp{a} and @samp{z} are letters, and sort before all other non-digits. Then, percent sign @samp{%} (ASCII value 37) is compared to the first byte of the UTF-8 sequence of @samp{α}, which is 0xCE or 206). The value 37 is smaller, hence @samp{a%} is listed before @samp{aα}. @node Differences from Debian version sort @section Differences from Debian version sort GNU Coreutils version sort differs slightly from the official Debian algorithm, in order to accommodate more general usage and file name listing. @node Hyphen-minus and colon @subsection Hyphen-minus @samp{-} and colon @samp{:} In Debian's version string syntax the version consists of three parts: @example [epoch:]upstream_version[-debian_revision] @end example The @samp{epoch} and @samp{debian_revision} parts are optional. Example of such version strings: @example 60.7.2esr-1~deb9u1 52.9.0esr-1~deb9u1 1:2.3.4-1+b2 327-2 1:1.0.13-3 2:1.19.2-1+deb9u5 @end example If the @samp{debian_revision part} is not present, hyphens @samp{-} are not allowed. If epoch is not present, colons @samp{:} are not allowed. If these parts are present, hyphen and/or colons can appear only once in valid Debian version strings. In GNU Coreutils, such restrictions are not reasonable (a file name can have many hyphens, a line of text can have many colons). As a result, in GNU Coreutils hyphens and colons are treated exactly like all other punctuation, i.e., they are sorted after letters. @xref{Version sort punctuation}. In Debian, these characters are treated differently than in Coreutils: a version string with hyphen will sort before similar strings without hyphens. Compare: @example $ touch 1ab-cd 1abb $ ls -v -1 1abb 1ab-cd $ if dpkg --compare-versions 1abb lt 1ab-cd > then echo sorted > else echo out of order > fi out of order @end example For further details, see @ref{Comparing two strings using Debian's algorithm} and @uref{https://bugs.gnu.org/35939,GNU Bug 35939}. @node Special priority in GNU Coreutils version sort @subsection Special priority in GNU Coreutils version sort In GNU Coreutils version sort, the following items have special priority and sort before all other strings (listed in order): @enumerate @item The empty string @item The string @samp{.} (a single dot, ASCII 46) @item The string @samp{..} (two dots) @item Strings starting with dot (@samp{.}) sort before strings starting with any other byte. @end enumerate Example: @example $ printf '%s\n' a "" b "." c ".." ".d20" ".d3" | sort -V . .. .d3 .d20 a b c @end example These priorities make perfect sense for @samp{ls -v}: The special files dot @samp{.} and dot-dot @samp{..} will be listed first, followed by any hidden files (files starting with a dot), followed by non-hidden files. For @samp{sort -V} these priorities might seem arbitrary. However, because the sorting code is shared between the @command{ls} and @command{sort} program, the ordering rules are the same. @node Special handling of file extensions @subsection Special handling of file extensions GNU Coreutils version sort implements specialized handling of strings that look like file names with extensions. This enables slightly more natural ordering of file names. The following additional rules apply when comparing two strings where both begin with non-@samp{.}. They also apply when comparing two strings where both begin with @samp{.} but neither is @samp{.} or @samp{..}. @enumerate @item A suffix (i.e., a file extension) is defined as: a dot, followed by an ASCII letter or tilde, followed by zero or more ASCII letters, digits, or tildes; all repeated zero or more times, and ending at string end. This is equivalent to matching the extended regular expression @code{(\.[A-Za-z~][A-Za-z0-9~]*)*$} in the C locale. @item The suffixes are temporarily removed, and the strings are compared without them, using version sort (see @ref{Version-sort ordering rules}) without special priority (see @ref{Special priority in GNU Coreutils version sort}). @item If the suffix-less strings do not compare equal, this comparison result is used and the suffixes are effectively ignored. @item If the suffix-less strings compare equal, the suffixes are restored and the entire strings are compared using version sort. @end enumerate Examples for rule 1: @itemize @item @samp{hello-8.txt}: the suffix is @samp{.txt} @item @samp{hello-8.2.txt}: the suffix is @samp{.txt} (@samp{.2} is not included because the dot is not followed by a letter) @item @samp{hello-8.0.12.tar.gz}: the suffix is @samp{.tar.gz} (@samp{.0.12} is not included) @item @samp{hello-8.2}: no suffix (suffix is an empty string) @item @samp{hello.foobar65}: the suffix is @samp{.foobar65} @item @samp{gcc-c++-10.8.12-0.7rc2.fc9.tar.bz2}: the suffix is @samp{.fc9.tar.bz2} (@samp{.7rc2} is not included as it begins with a digit) @item @samp{.autom4te.cfg}: the suffix is the entire string. @end itemize Examples for rule 2: @itemize @item Comparing @samp{hello-8.txt} to @samp{hello-8.2.12.txt}, the @samp{.txt} suffix is temporarily removed from both strings. @item Comparing @samp{foo-10.3.tar.gz} to @samp{foo-10.tar.xz}, the suffixes @samp{.tar.gz} and @samp{.tar.xz} are temporarily removed from the strings. @end itemize Example for rule 3: @itemize @item Comparing @samp{hello.foobar65} to @samp{hello.foobar4}, the suffixes (@samp{.foobar65} and @samp{.foobar4}) are temporarily removed. The remaining strings are identical (@samp{hello}). The suffixes are then restored, and the entire strings are compared (@samp{hello.foobar4} comes first). @end itemize Examples for rule 4: @itemize @item When comparing the strings @samp{hello-8.2.txt} and @samp{hello-8.10.txt}, the suffixes (@samp{.txt}) are temporarily removed. The remaining strings (@samp{hello-8.2} and @samp{hello-8.10}) are compared as previously described (@samp{hello-8.2} comes first). @slanted{(In this case the suffix removal algorithm does not have a noticeable effect on the resulting order.)} @end itemize @b{How does the suffix-removal algorithm effect ordering results?} Consider the comparison of hello-8.txt and hello-8.2.txt. Without the suffix-removal algorithm, the strings will be broken down to the following parts: @example hello- @r{vs} hello- @r{(rule 2, all non-digits)} 8 @r{vs} 8 @r{(rule 3, all digits)} .txt @r{vs} . @r{(rule 2)} empty @r{vs} 2 empty @r{vs} .txt @end example The comparison of the third parts (@samp{.} vs @samp{.txt}) will determine that the shorter string comes first - resulting in @file{hello-8.2.txt} appearing first. Indeed this is the order in which Debian's @command{dpkg} compares the strings. A more natural result is that @file{hello-8.txt} should come before @file{hello-8.2.txt}, and this is where the suffix-removal comes into play: The suffixes (@samp{.txt}) are removed, and the remaining strings are broken down into the following parts: @example hello- @r{vs} hello- @r{(rule 2, all non-digits)} 8 @r{vs} 8 @r{(rule 3, all digits)} empty @r{vs} . @r{(rule 2)} empty @r{vs} 2 @end example As empty strings sort before non-empty strings, the result is @samp{hello-8} being first. A real-world example would be listing files such as: @file{gcc_10.fc9.tar.gz} and @file{gcc_10.8.12.7rc2.fc9.tar.bz2}: Debian's algorithm would list @file{gcc_10.8.12.7rc2.fc9.tar.bz2} first, while @samp{ls -v} will list @file{gcc_10.fc9.tar.gz} first. These priorities make sense for @samp{ls -v}: Versioned files will be listed in a more natural order. For @samp{sort -V} these priorities might seem arbitrary. However, because the sorting code is shared between the @command{ls} and @command{sort} program, the ordering rules are the same. @node Comparing two strings using Debian's algorithm @subsection Comparing two strings using Debian's algorithm The Debian program @command{dpkg} (available on all Debian and Ubuntu installations) can compare two strings using the @option{--compare-versions} option. To use it, create a helper shell function (simply copy & paste the following snippet to your shell command-prompt): @example compver() @{ if dpkg --compare-versions "$1" lt "$2" then printf '%s\n' "$1" "$2" else printf '%s\n' "$2" "$1" fi @} @end example Then compare two strings by calling @command{compver}: @example $ compver 8.49 8.5 8.5 8.49 @end example Note that @command{dpkg} will warn if the strings have invalid syntax: @example $ compver "foo07.7z" "foo7a.7z" dpkg: warning: version 'foo07.7z' has bad syntax: version number does not start with digit dpkg: warning: version 'foo7a.7z' has bad syntax: version number does not start with digit foo7a.7z foo07.7z $ compver "3.0/" "3.0.5" dpkg: warning: version '3.0/' has bad syntax: invalid character in version number 3.0.5 3.0/ @end example To illustrate the different handling of hyphens between Debian and Coreutils algorithms (see @ref{Hyphen-minus and colon}): @example $ compver abb ab-cd 2>/dev/null $ printf 'abb\nab-cd\n' | sort -V ab-cd abb abb ab-cd @end example To illustrate the different handling of file extension: (see @ref{Special handling of file extensions}): @example $ compver hello-8.txt hello-8.2.txt 2>/dev/null hello-8.2.txt hello-8.txt $ printf '%s\n' hello-8.txt hello-8.2.txt | sort -V hello-8.txt hello-8.2.txt @end example @node Advanced version sort topics @section Advanced Topics @node Reporting version sort bugs @subsection Reporting version sort bugs If you suspect a bug in GNU Coreutils version sort (i.e., in the output of @samp{ls -v} or @samp{sort -V}), please first check the following: @enumerate @item Is the result consistent with Debian's own ordering (using @command{dpkg}, see @ref{Comparing two strings using Debian's algorithm})? If it is, then this is not a bug -- please do not report it. @item If the result differs from Debian's, is it explained by one of the sections in @ref{Differences from Debian version sort}? If it is, then this is not a bug -- please do not report it. @item If you have a question about specific ordering which is not explained here, please write to @email{coreutils@@gnu.org}, and provide a concise example that will help us diagnose the issue. @item If you still suspect a bug which is not explained by the above, please write to @email{bug-coreutils@@gnu.org} with a concrete example of the suspected incorrect output, with details on why you think it is incorrect. @end enumerate @node Other version/natural sort implementations @subsection Other version/natural sort implementations As previously mentioned, there are multiple variations on version/natural sort, each with its own rules. Some examples are: @itemize @item Natural Sorting variants in @uref{https://rosettacode.org/wiki/Natural_sorting,Rosetta Code}. @item Python's @uref{https://pypi.org/project/natsort/,natsort package} (includes detailed description of their sorting rules: @uref{https://natsort.readthedocs.io/en/master/howitworks.html, natsort -- how it works}). @item Ruby's @uref{https://github.com/github/version_sorter,version_sorter}. @item Perl has multiple packages for natual and version sorts (each likely with its own rules and nuances): @uref{https://metacpan.org/pod/Sort::Naturally,Sort::Naturally}, @uref{https://metacpan.org/pod/Sort::Versions,Sort::Versions}, @uref{https://metacpan.org/pod/CPAN::Version,CPAN::Version}. @item PHP has a built-in function @uref{https://www.php.net/manual/en/function.natsort.php,natsort}. @item NodeJS's @uref{https://www.npmjs.com/package/natural-sort,natural-sort package}. @item In zsh, the @uref{http://zsh.sourceforge.net/Doc/Release/Expansion.html#Glob-Qualifiers, glob modifier} @samp{*(n)} will expand to files in natural sort order. @item When writing C programs, the GNU libc library (@samp{glibc}) provides the @uref{http://man7.org/linux/man-pages/man3/strverscmp.3.html, strvercmp(3)} function to compare two strings, and @uref{http://man7.org/linux/man-pages/man3/versionsort.3.html,versionsort(3)} function to compare two directory entries (despite the names, they are not identical to GNU Coreutils version sort ordering). @item Using Debian's sorting algorithm in: @itemize @item python: @uref{https://stackoverflow.com/a/4957741, Stack Overflow Example #4957741}. @item NodeJS: @uref{https://www.npmjs.com/package/deb-version-compare, deb-version-compare}. @end itemize @end itemize @node Related source code @subsection Related source code @itemize @item Debian's code which splits a version string into @code{epoch/upstream_version/debian_revision} parts: @uref{https://git.dpkg.org/cgit/dpkg/dpkg.git/tree/lib/dpkg/parsehelp.c#n191, parsehelp.c:parseversion()}. @item Debian's code which performs the @code{upstream_version} comparison: @uref{https://git.dpkg.org/cgit/dpkg/dpkg.git/tree/lib/dpkg/version.c#n140, version.c}. @item Gnulib code (used by GNU Coreutils) which performs the version comparison: @uref{https://git.savannah.gnu.org/cgit/gnulib.git/tree/lib/filevercmp.c, filevercmp.c}. @end itemize