summaryrefslogtreecommitdiffstats
path: root/www/appfileformat.html
diff options
context:
space:
mode:
Diffstat (limited to '')
-rw-r--r--www/appfileformat.html557
1 files changed, 557 insertions, 0 deletions
diff --git a/www/appfileformat.html b/www/appfileformat.html
new file mode 100644
index 0000000..e590453
--- /dev/null
+++ b/www/appfileformat.html
@@ -0,0 +1,557 @@
+<!DOCTYPE html>
+<html><head>
+<meta name="viewport" content="width=device-width, initial-scale=1.0">
+<meta http-equiv="content-type" content="text/html; charset=UTF-8">
+<link href="sqlite.css" rel="stylesheet">
+<title>SQLite As An Application File Format</title>
+<!-- path= -->
+</head>
+<body>
+<div class=nosearch>
+<a href="index.html">
+<img class="logo" src="images/sqlite370_banner.gif" alt="SQLite" border="0">
+</a>
+<div><!-- IE hack to prevent disappearing logo --></div>
+<div class="tagline desktoponly">
+Small. Fast. Reliable.<br>Choose any three.
+</div>
+<div class="menu mainmenu">
+<ul>
+<li><a href="index.html">Home</a>
+<li class='mobileonly'><a href="javascript:void(0)" onclick='toggle_div("submenu")'>Menu</a>
+<li class='wideonly'><a href='about.html'>About</a>
+<li class='desktoponly'><a href="docs.html">Documentation</a>
+<li class='desktoponly'><a href="download.html">Download</a>
+<li class='wideonly'><a href='copyright.html'>License</a>
+<li class='desktoponly'><a href="support.html">Support</a>
+<li class='desktoponly'><a href="prosupport.html">Purchase</a>
+<li class='search' id='search_menubutton'>
+<a href="javascript:void(0)" onclick='toggle_search()'>Search</a>
+</ul>
+</div>
+<div class="menu submenu" id="submenu">
+<ul>
+<li><a href='about.html'>About</a>
+<li><a href='docs.html'>Documentation</a>
+<li><a href='download.html'>Download</a>
+<li><a href='support.html'>Support</a>
+<li><a href='prosupport.html'>Purchase</a>
+</ul>
+</div>
+<div class="searchmenu" id="searchmenu">
+<form method="GET" action="search">
+<select name="s" id="searchtype">
+<option value="d">Search Documentation</option>
+<option value="c">Search Changelog</option>
+</select>
+<input type="text" name="q" id="searchbox" value="">
+<input type="submit" value="Go">
+</form>
+</div>
+</div>
+<script>
+function toggle_div(nm) {
+var w = document.getElementById(nm);
+if( w.style.display=="block" ){
+w.style.display = "none";
+}else{
+w.style.display = "block";
+}
+}
+function toggle_search() {
+var w = document.getElementById("searchmenu");
+if( w.style.display=="block" ){
+w.style.display = "none";
+} else {
+w.style.display = "block";
+setTimeout(function(){
+document.getElementById("searchbox").focus()
+}, 30);
+}
+}
+function div_off(nm){document.getElementById(nm).style.display="none";}
+window.onbeforeunload = function(e){div_off("submenu");}
+/* Disable the Search feature if we are not operating from CGI, since */
+/* Search is accomplished using CGI and will not work without it. */
+if( !location.origin || !location.origin.match || !location.origin.match(/http/) ){
+document.getElementById("search_menubutton").style.display = "none";
+}
+/* Used by the Hide/Show button beside syntax diagrams, to toggle the */
+function hideorshow(btn,obj){
+var x = document.getElementById(obj);
+var b = document.getElementById(btn);
+if( x.style.display!='none' ){
+x.style.display = 'none';
+b.innerHTML='show';
+}else{
+x.style.display = '';
+b.innerHTML='hide';
+}
+return false;
+}
+var antiRobot = 0;
+function antiRobotGo(){
+if( antiRobot!=3 ) return;
+antiRobot = 7;
+var j = document.getElementById("mtimelink");
+if(j && j.hasAttribute("data-href")) j.href=j.getAttribute("data-href");
+}
+function antiRobotDefense(){
+document.body.onmousedown=function(){
+antiRobot |= 2;
+antiRobotGo();
+document.body.onmousedown=null;
+}
+document.body.onmousemove=function(){
+antiRobot |= 2;
+antiRobotGo();
+document.body.onmousemove=null;
+}
+setTimeout(function(){
+antiRobot |= 1;
+antiRobotGo();
+}, 100)
+antiRobotGo();
+}
+antiRobotDefense();
+</script>
+
+
+
+<h1 align="center">
+SQLite As An Application File Format
+</h1>
+
+<h2>Executive Summary</h2>
+
+<p>An SQLite database file with a defined schema
+often makes an excellent application file format.
+Here are a dozen reasons why this is so:
+
+<ol>
+<li> Simplified Application Development
+<li> Single-File Documents
+<li> High-Level Query Language
+<li> Accessible Content
+<li> Cross-Platform
+<li> Atomic Transactions
+<li> Incremental And Continuous Updates
+<li> Easily Extensible
+<li> Performance
+<li> Concurrent Use By Multiple Processes
+<li> Multiple Programming Languages
+<li> Better Applications
+</ol>
+
+<p>Each of these points will be described in more detail below,
+after first considering more closely the meaning of
+"application file format". See also the <a href="aff_short.html">short version</a> of this
+whitepaper.
+
+<h2>What Is An Application File Format?</h2>
+
+<p>
+An "application file format" is the file format
+used to persist application state to disk or to exchange
+information between programs.
+There are thousands of application file formats in use today.
+Here are just a few examples:
+
+<ul>
+<li>DOC - Word Perfect and Microsoft Office documents
+<li>DWG - AutoCAD drawings
+<li>PDF - Portable Document Format from Adobe
+<li>XLS - Microsoft Excel Spreadsheet
+<li>GIT - Git source code repository
+<li>EPUB - The Electronic Publication format used by non-Kindle eBooks
+<li>ODT - The Open Document format used by OpenOffice and others
+<li>PPT - Microsoft PowerPoint presentations
+<li>ODP - The Open Document presentation format used by OpenOffice and others
+</ul>
+
+<p>We make a distinction between a "file format" and an "application format".
+A file format is used to store a single object. So, for example, a GIF or
+JPEG file stores a single image, and an XHTML file stores text,
+so those are "file formats" and not "application formats". An EPUB file,
+in contrast, stores both text and images (as contained XHTML and GIF/JPEG
+files) and so it is considered an "application format". This article is
+about "application formats".
+
+<p>The boundary between a file format and an application format is fuzzy.
+This article calls JPEG a file format, but for an image editor, JPEG
+might be considered the application format. Much depends on context.
+For this article, let us say that a file format stores a single object
+and an application format stores many different objects and their relationships
+to one another.
+
+<p>Most application formats fit into one of these three categories:
+
+<ol>
+<li><p><b>Fully Custom Formats.</b>
+Custom formats are specifically designed for a single application.
+DOC, DWG, PDF, XLS, and PPT are examples of custom formats. Custom
+formats are usually contained within a single file, for ease of transport.
+They are also usually binary, though the DWG format is a notable exception.
+Custom file formats require specialized application code
+to read and write and are not normally accessible from commonly
+available tools such as unix command-line programs and text editors.
+In other words, custom formats are usually "opaque blobs".
+To access the content of a custom application file format, one needs
+a tool specifically engineered to read and/or write that format.
+
+<li><p><b>Pile-of-Files Formats.</b>
+Sometimes the application state is stored as a hierarchy of
+files. Git is a prime example of this, though the phenomenon occurs
+frequently in one-off and bespoke applications. A pile-of-files format
+essentially uses the filesystem as a key/value database, storing small
+chunks of information into separate files. This gives the
+advantage of making the content more accessible to common utility
+programs such as text editors or "awk" or "grep". But even if many
+of the files in a pile-of-files format
+are easily readable, there are usually some files that have their
+own custom format (example: Git "Packfiles") and are hence
+"opaque blobs" that are not readable
+or writable without specialized tools. It is also much less convenient
+to move a pile-of-files from one place or machine to another, than
+it is to move a single file. And it is hard to make a pile-of-files
+document into an email attachment, for example. Finally, a pile-of-files
+format breaks the "document metaphor":
+there is no one file that a user can point to
+that is "the document".
+
+<li><p><b>Wrapped Pile-of-Files Formats.</b>
+Some applications use a Pile-of-Files that is then encapsulated into
+some kind of single-file container, usually a ZIP archive.
+EPUB, ODT,and ODP are examples of this approach.
+An EPUB book is really just a ZIP archive that contains various
+XHTML files for the text of book chapters, GIF and JPEG images for
+the artwork, and a specialized catalog file that tells the eBook
+reader how all the XML and image files fit together. OpenOffice
+documents (ODT and ODP) are also ZIP archives containing XML and
+images that represent their content as well as "catalog" files that
+show the interrelationships between the component parts.
+
+<p>A wrapped pile-of-files format is a compromise between a full
+custom file format and a pure pile-of-files format.
+A wrapped pile-of-files format is not an opaque blob in the same sense
+as a custom format, since the component parts can still be accessed
+using any common ZIP archiver, but the format is not quite as accessible
+as a pure pile-of-files format because one does still need the ZIP
+archiver, and one cannot normally use command-line tools like "find"
+on the file hierarchy without first un-zipping it. On the other
+hand, a wrapped pile-of-files format does preserve the document
+metaphor by putting all content into a single disk file. And
+because it is compressed, the wrapped pile-of-files format tends to
+be more compact.
+
+<p>As with custom file formats, and unlike pure pile-of-file formats,
+a wrapped pile-of-files format is not as easy to edit, since
+usually the entire file must be rewritten in order to change any
+component part.
+</ol>
+
+<p>The purpose of this document is to argue in favor of a fourth
+new category of application file format: An SQLite database file.
+
+<h2>SQLite As The Application File Format</h2>
+
+<p>
+Any application state that can be recorded in a pile-of-files can
+also be recorded in an SQLite database with a simple key/value schema
+like this:
+<blockquote><pre>
+CREATE TABLE files(filename TEXT PRIMARY KEY, content BLOB);
+</pre></blockquote>
+If the content is compressed, then such an <a href="sqlar.html">SQLite Archive</a> database is
+<a href="affcase1.html#smaller">the same size</a> (&#177;1%)
+as an equivalent ZIP archive, and it has the advantage
+of being able to update individual "files" without rewriting
+the entire document.
+
+<p>
+But an SQLite database is not limited to a simple key/value structure
+like a pile-of-files database. An SQLite database can have dozens
+or hundreds or thousands of different tables, with dozens or
+hundreds or thousands of fields per table, each with different datatypes
+and constraints and particular meanings, all cross-referencing each other,
+appropriately and automatically indexed for rapid retrieval,
+and all stored efficiently and compactly in a single disk file.
+And all of this structure is succinctly documented for humans
+by the SQL schema.
+
+<p>In other words, an SQLite database can do everything that a
+pile-of-files or wrapped pile-of-files format can do, plus much more,
+and with greater lucidity.
+An SQLite database is a more versatile container than key/value
+filesystem or a ZIP archive. (For a detailed example, see the
+<a href="affcase1.html">OpenOffice case study</a> essay.)
+
+<p>The power of an SQLite database could, in theory, be achieved using
+a custom file format. But any custom file format that is as expressive
+as a relational database would likely require an enormous design specification
+and many tens or hundreds of thousands of lines of code to
+implement. And the end result would be an "opaque blob" that is
+inaccessible without specialized tools.
+
+<p>
+Hence, in comparison to other approaches, the use of
+an SQLite database as an application file format has
+compelling advantages. Here are a few of these advantages,
+enumerated and expounded:
+</p>
+
+<ol>
+<li><p><b>Simplified Application Development.</b>
+No new code is needed for reading or writing the application file.
+One has merely to link against the SQLite library, or include the
+<a href="amalgamation.html">single "sqlite3.c" source file</a> with the rest of the
+application C code, and SQLite will take care of all of the application
+file I/O. This can reduce application code size by many thousands of
+lines, with corresponding saving in development and maintenance costs.
+
+<p>SQLite is one of the
+<a href="mostdeployed.html">most used</a> software libraries in the world.
+There are literally tens of billions of SQLite database files in use
+daily, on smartphones and gadgets and in desktop applications.
+SQLite is <a href="testing.html">carefully tested</a> and proven reliable. It is not
+a component that needs much tuning or debugging, allowing developers
+to stay focused on application logic.
+
+<li><p><b>Single-File Documents.</b>
+An SQLite database is contained in a single file, which is easily
+copied or moved or attached. The "document" metaphor is preserved.
+
+<p>SQLite does not have any file naming requirements
+and so the application can use any custom file suffix that it wants
+to help identify the file as "belonging" to the application.
+SQLite database files contain a 4-byte <a href="fileformat2.html#appid">Application ID</a> in
+their headers that can be set to an application-defined value
+and then used to identify the "type" of the document for utility
+programs such as <a href="http://linux.die.net/man/1/file">file(1)</a>, further
+enhancing the document metaphor.
+
+
+<li><p><b>High-Level Query Language.</b>
+SQLite is a complete relational database engine, which means that the
+application can access content using high-level queries. Application
+developers need not spend time thinking about "how" to retrieve the
+information they need from a document. Developers write SQL that
+expresses "what" information they want and let the database engine
+to figure out how to best retrieve that content. This helps developers
+operate "heads up" and remain focused on solving the user's problem,
+and avoid time spent "heads down" fiddling with low-level file
+formatting details.
+
+<p>A pile-of-files format can be viewed as a key/value database.
+A key/value database is better than no database at all.
+But without transactions or indices or a high-level query language or
+a proper schema,
+it is much harder and more error prone to use a key/value database than
+a relational database.
+
+<li><p><b>Accessible Content.</b>
+Information held in an SQLite database file is accessible using
+commonly available open-source command-line tools - tools that
+are installed by default on Mac and Linux systems and that are
+freely available as a self-contained EXE file on Windows.
+Unlike custom file formats, application-specific programs are
+not required to read or write content in an SQLite database.
+An SQLite database file is not an opaque blob. It is true
+that command-line tools such as text editors or "grep" or "awk" are
+not useful on an SQLite database, but the SQL query language is a much
+more powerful and convenient way for examining the content, so the
+inability to use "grep" and "awk" and the like is not seen as a loss.
+
+<p>An SQLite database is a <a href="fileformat2.html">well-defined and well-documented</a>
+file format that is in widespread use by literally millions of applications
+and is backwards compatible to its inception in 2004 and which promises
+to continue to be compatible in decades to come. The longevity of
+SQLite database files is particularly important to bespoke applications,
+since it allows the document content to be accessed far in the
+future, long after all traces of the original application have been lost.
+Data lives longer than code.
+SQLite databases are <a href="locrsf.html">recommended by the US Library of Congress</a>
+as a storage format for long-term preservation of digital content.
+
+
+<li><p><b>Cross-Platform.</b>
+SQLite database files are portable between 32-bit and 64-bit machines and
+between big-endian and little-endian architectures and between any of the
+various flavors of Windows and Unix-like operating systems.
+The application using an SQLite application file format can store
+binary numeric data without having to worry about the byte-order of
+integers or floating point numbers.
+Text content can be read or written as UTF-8, UTF-16LE, or UTF-16BE and
+SQLite will automatically perform any necessary translations on-the-fly.
+
+<li><p><b>Atomic Transactions.</b>
+Writes to an SQLite database are <a href="atomiccommit.html">atomic</a>.
+They either happen completely
+or not at all, even during system crashes or power failures. So
+there is no danger of corrupting a document just because the power happened
+to go out at the same instant that a change was being written to disk.
+
+<p>SQLite is transactional, meaning that multiple changes can be grouped
+together such that either all or none of them occur, and so that the
+changes can be rolled back if a problem is found prior to commit.
+This allows an application to make a change incrementally, then run
+various sanity and consistency checks on the resulting data prior to
+committing the changes to disk. The
+<a href="http://www.fossil-scm.org/">Fossil</a> DVCS
+<a href="http://www.fossil-scm.org/fossil/doc/tip/www/selfcheck.wiki">uses this technique</a>
+to verify that no repository history has been lost prior to each change.
+
+<li><p><b>Incremental And Continuous Updates.</b>
+When writing to an SQLite database file, only those parts of the file that
+actually change are written out to disk. This makes the writing happen faster
+and saves wear on SSDs. This is an enormous advantage over custom
+and wrapped pile-of-files formats, both of which usually require a
+rewrite of the entire document in order to change a single byte.
+Pure pile-of-files formats can also
+do incremental updates to some extent, though the granularity of writes is
+usually larger with pile-of-file formats (a single file) than with SQLite
+(a single page).
+
+<p>SQLite also supports continuous update.
+Instead of collecting changes in memory and then writing
+them to disk only on a File/Save action, changes can be written back to
+the disk as they occur. This avoids loss of work on a system crash or
+power failure. An <a href="undoredo.html">automated undo/redo stack</a>, managed using triggers,
+can be kept in the on-disk database, meaning that undo/redo can occur
+across session boundaries.
+
+<li><p><b>Easily Extensible.</b>
+As an application grows, new features can be added to an
+SQLite application file format simply by adding new tables to the schema
+or by adding new columns to existing tables. Adding columns or tables
+does not change the meaning of prior queries, so with a
+modicum of care to ensuring that the meaning of legacy columns and
+tables are preserved, backwards compatibility is maintained.
+
+<p>It is possible to extend custom or pile-of-files formats too, of course,
+but doing is often much harder. If indices are added, then all application
+code that changes the corresponding tables must be located and modified to
+keep those indices up-to-date. If columns are added, then all application
+code that accesses the corresponding table must be located and modified to
+take into account the new columns.
+
+<li><p><b>Performance.</b>
+In many cases, an SQLite application file format will be
+<a href="fasterthanfs.html">faster than a pile-of-files format</a> or
+a custom format. In addition to being faster for raw read and
+writes, SQLite can often dramatically improves start-up times because
+instead of having to
+read and parse the entire document into memory, the application can
+do queries to extract only the information needed for the initial screen.
+As the application progresses, it only needs to load as much material as
+is needed to draw the next screen, and can discard information from
+prior screens that is no longer in use. This helps keep the memory
+footprint of the application under control.
+
+<p>A pile-of-files format can be read incrementally just like SQLite.
+But many developers are surprised to learn that SQLite can read and
+write smaller BLOBs (less than about 100KB in size) from its database
+faster than those same blobs can be read or written as separate files
+from the filesystem. (See
+<a href="fasterthanfs.html">35% Faster Than The Filesystem</a> and
+<a href="intern-v-extern-blob.html">Internal Versus External BLOBs</a> for further information.)
+There is overhead associated with operating a relational
+database engine, however one should not assume that direct file I/O
+is faster than SQLite database I/O, as often it is not.
+
+<p>In either case, if performance problems do arise in an SQLite application
+those problems can often be resolved by adding one or two <a href="lang_createindex.html">CREATE INDEX</a>
+statements to the schema or perhaps running <a href="lang_analyze.html">ANALYZE</a> one time
+and without having to touch a single line of
+application code. But if a performance problem comes up in a custom or
+pile-of-files format, the fix will often require extensive changes
+to application code to add and maintain new indices or to extract
+information using different algorithms.
+
+<li><p><b>Concurrent Use By Multiple Processes.</b>
+SQLite automatically coordinates concurrent access to the same
+document from multiple threads and/or processes. Two or more
+applications can connect and read from the same document at the
+same time. Writes are serialized, but as writes normally only
+take milliseconds, applications simply take turns writing.
+SQLite automatically ensures that the low-level format of the
+document is uncorrupted. Accomplishing the same with a custom
+or pile-of-files format, in contrast, requires extensive support
+in the application. And the application logic needed to support
+concurrency is a notorious bug-magnet.
+
+<li><p><b>Multiple Programming Languages.</b>
+Though SQLite is itself written in ANSI-C, interfaces exist for
+just about every other programming language you can think of:
+C++, C#, Objective-C, Java, Tcl, Perl, Python, Ruby, Erlang,
+JavaScript, and so forth. So programmers can develop in whatever
+language they are most comfortable with and which best matches
+the needs of the project.
+
+<p>An SQLite application file format is a great
+choice in cases where there is a collection or "federation" of
+separate programs, often written in different languages and by
+different development teams.
+This comes up commonly in research or laboratory
+environments where one team is responsible for data acquisition
+and other teams are responsible for various stages of analysis.
+Each team can use whatever hardware, operating system,
+programming language and development methodology that they are
+most comfortable with, and as long as all programs use an SQLite
+database with a common schema, they can all interoperate.
+
+
+<li><p><b>Better Applications.</b>
+If the application file format is an SQLite database, the complete
+documentation for that file format consists of the database schema,
+with perhaps a few extra words about what each table and column
+represents. The description of a custom file format,
+on the other hand, typically runs on for hundreds of
+pages. A pile-of-files format, while much simpler and easier to
+describe than a fully custom format, still tends to be much larger
+and more complex than an SQL schema dump, since the names and format
+for the individual files must still be described.
+
+<p>This is not a trivial point. A clear, concise, and easy to understand
+file format is a crucial part of any application design.
+Fred Brooks, in his all-time best-selling computer science text,
+<i>The Mythical Man-Month</i> says:
+<blockquote><i>Representation is the
+essence of computer programming.<br />...<br />
+Show me your flowcharts and conceal your tables, and I shall
+continue to be mystified. Show me your tables, and I won't usually
+need your flowcharts; they'll be obvious.</i></blockquote>
+<p>Rob Pike, in his
+<i>Rules of Programming</i> expresses the same idea this way:
+<blockquote>
+<i>Data dominates. If you've chosen the right data structures
+and organized things well, the algorithms will almost always
+be self-evident. Data structures, not algorithms, are central
+to programming.</i></blockquote>
+<p> Linus Torvalds used different words to say
+much the same thing on the Git mailing list on 2006-06-27:
+<blockquote>
+<i>Bad programmers worry about the code. Good programmers worry
+about data structures and their relationships.</i>
+</blockquote>
+
+<p>The point is this: an SQL database schema almost always does
+a far better job of defining and organizing the tables and
+data structures and their relationships.
+And having clear, concise, and well-defined representation
+almost always results in an application that performs better,
+has fewer problems, and is easier to develop and maintain.
+</ol>
+
+<h2>Conclusion</h2>
+
+<p>
+SQLite is not the perfect application file format for every situation.
+But in many cases, SQLite is a far better choice than either a custom
+file format, a pile-of-files, or a wrapped pile-of-files.
+SQLite is a high-level, stable, reliable, cross-platform, widely-deployed,
+extensible, performant, accessible, concurrent file format. It deserves
+your consideration as the standard file format on your next application
+design.
+<p align="center"><small><i>This page last modified on <a href="https://sqlite.org/docsrc/honeypot" id="mtimelink" data-href="https://sqlite.org/docsrc/finfo/pages/appfileformat.in?m=dcc7b088c6d63d3ce">2018-05-29 18:35:58</a> UTC </small></i></p>
+