summaryrefslogtreecommitdiffstats
path: root/docs/canonical_filenames.html
diff options
context:
space:
mode:
Diffstat (limited to '')
-rw-r--r--docs/canonical_filenames.html156
1 files changed, 156 insertions, 0 deletions
diff --git a/docs/canonical_filenames.html b/docs/canonical_filenames.html
new file mode 100644
index 0000000..c1c03b1
--- /dev/null
+++ b/docs/canonical_filenames.html
@@ -0,0 +1,156 @@
+<HTML>
+<HEAD><TITLE>APR Canonical Filenames</TITLE></HEAD>
+<BODY>
+<h1>APR Canonical Filename</h1>
+
+<h2>Requirements</h2>
+
+<p>APR porters need to address the underlying discrepancies between
+file systems. To achieve a reasonable degree of security, the
+program depending upon APR needs to know that two paths may be
+compared, and that a mismatch is guaranteed to reflect that the
+two paths do not return the same resource</p>.
+
+<p>The first discrepancy is in volume roots. Unix and pure derivatives
+have only one root path, "/". Win32 and OS2 share root paths of
+the form "D:/", D: is the volume designation. However, this can
+be specified as "//./D:/" as well, indicating D: volume of the
+'this' machine. Win32 and OS2 also may employ a UNC root path,
+of the form "//server/share/" where share is a share-point of the
+specified network server. Finally, NetWare root paths are of the
+form "server/volume:/", or the simpler "volume:/" syntax for 'this'
+machine. All these non-Unix file systems accept volume:path,
+without a slash following the colon, as a path relative to the
+current working directory, which APR will treat as ambiguous, that
+is, neither an absolute nor a relative path per se.</p>
+
+<p>The second discrepancy is in the meaning of the 'this' directory.
+In general, 'this' must be eliminated from the path where it occurs.
+The syntax "path/./" and "path/" are both aliases to path. However,
+this isn't file system independent, since the double slash "//" has
+a special meaning on OS2 and Win32 at the start of the path name,
+and is invalid on those platforms before the "//server/share/" UNC
+root path is completed. Finally, as noted above, "//./volume/" is
+legal root syntax on WinNT, and perhaps others.</p>
+
+<p>The third discrepancy is in the context of the 'parent' directory.
+When "parent/path/.." occurs, the path must be unwound to "parent".
+It's also critical to simply truncate leading "/../" paths to "/",
+since the parent of the root is root. This gets tricky on the
+Win32 and OS2 platforms, since the ".." element is invalid before
+the "//server/share/" is complete, and the "//server/share/../"
+sequence is the complete UNC root "//server/share/". In relative
+paths, leading ".." elements are significant, until they are merged
+with an absolute path. The relative form must only retain the ".."
+segments as leading segments, to be resolved once merged to another
+relative or an absolute path.</p>
+
+<p>The fourth discrepancy occurs with acceptance of alternate character
+codes for the same element. Path separators are not retained within
+the APR canonical forms. The OS filesystem and APR (slashed) forms
+can both be returned as strings, to be used in the proper context.
+Unix, Win32 and Netware all accept slashes and backslashes as the
+same path separator symbol, although unix strictly accepts slashes.
+While the APR form of the name strictly uses slashes, always consider
+that there could be a platform that actually accepts slashes as a
+character within a segment name.</p>
+
+<p>The fifth and worst discrepancy plagues Win32, OS2, Netware, and some
+filesystems mounted in Unix. Case insensitivity can permit the same
+file to slip through in both it's proper case and alternate cases.
+Simply changing the case is insufficient for any character set beyond
+ASCII, since various dialectic forms of characters suffer from one to
+many or many to one translations. An example would be u-umlaut, which
+might be accepted as a single character u-umlaut, a two character
+sequence u and the zero-width umlaut, the upper case form of the same,
+or perhaps even a capital U alone. This can be handled in different
+ways depending on the purposes of the APR based program, but the one
+requirement is that the path must be absolute in order to resolve these
+ambiguities. Methods employed include comparison of device and inode
+file uniqifiers, which is a fairly fast operation, or querying the OS
+for the true form of the name, which can be much slower. Only the
+acknowledgement of the file names by the OS can validate the equality
+of two different cases of the same filename.</p>
+
+<p>The sixth discrepancy, illegal or insignificant characters, is especially
+significant in non-unix file systems. Trailing periods are accepted
+but never stored, therefore trailing periods must be ignored for any
+form of comparison. And all OS's have certain expectations of what
+characters are illegal (or undesirable due to confusion.)</p>
+
+<p>A final warning, canonical functions don't transform or resolve case
+or character ambiguity issues until they are resolved into an absolute
+path. The relative canonical path, while useful, while useful for URL
+or similar identifiers, cannot be used for testing or comparison of file
+system objects.</p>
+
+<hr>
+
+<h2>Canonical API</h2>
+
+Functions to manipulate the apr_canon_file_t (an opaque type) include:
+
+<ul>
+<li>Create canon_file_t (from char* path and canon_file_t parent path)
+<li>Merged canon_file_t (from path and parent, both canon_file_t)
+<li>Get char* path of all or some segments
+<li>Get path flags of IsRelative, IsVirtualRoot, and IsAbsolute
+<li>Compare two canon_file_t structures for file equality
+</ul>
+
+<p>The path is corrected to the file system case only if is in absolute
+form. The apr_canon_file_t should be preserved as long as possible and
+used as the parent to create child entries to reduce the number of expensive
+stat and case canonicalization calls to the OS.</p>
+
+<p>The comparison operation provides that the APR can postpone correction
+of case by simply relying upon the device and inode for equivalence. The
+stat implementation provides that two files are the same, while their
+strings are not equivalent, and eliminates the need for the operating
+system to return the proper form of the name.</p>
+
+<p>In any case, returning the char* path, with a flag to request the proper
+case, forces the OS calls to resolve the true names of each segment. Where
+there is a penalty for this operation and the stat device and inode test
+is faster, case correction is postponed until the char* result is requested.
+On platforms that identify the inode, device, or proper name interchangably
+with no penalties, this may occur when the name is initially processed.</p>
+
+<hr>
+
+<h2>Unix Example</h2>
+
+<p>First the simplest case:</p>
+
+<pre>
+Parse Canonical Name
+accepts parent path as canonical_t
+ this path as string
+
+Split this path Segments on '/'
+
+For each of this path Segments
+ If first Segment
+ If this Segment is Empty ([nothing]/)
+ Append this Root Segment (don't merge)
+ Continue to next Segment
+ Else is relative
+ Append parent Segments (to merge)
+ Continue with this Segment
+ If Segment is '.' or empty (2 slashes)
+ Discard this Segment
+ Continue with next Segment
+ If Segment is '..'
+ If no previous Segment or previous Segment is '..'
+ Append this Segment
+ Continue with next Segment
+ If previous Segment and previous is not Root Segment
+ Discard previous Segment
+ Discard this Segment
+ Continue with next Segment
+ Append this Relative Segment
+ Continue with next Segment
+</pre>
+
+</BODY>
+</HTML>