From 18657a960e125336f704ea058e25c27bd3900dcb Mon Sep 17 00:00:00 2001 From: Daniel Baumann Date: Sun, 5 May 2024 19:28:19 +0200 Subject: Adding upstream version 3.40.1. Signed-off-by: Daniel Baumann --- www/affcase1.html | 686 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 686 insertions(+) create mode 100644 www/affcase1.html (limited to 'www/affcase1.html') diff --git a/www/affcase1.html b/www/affcase1.html new file mode 100644 index 0000000..d1c8e80 --- /dev/null +++ b/www/affcase1.html @@ -0,0 +1,686 @@ + + + + + +What If OpenDocument Used SQLite? + + + +
+ + + +
+
+Small. Fast. Reliable.
Choose any three. +
+ + +
+
+ + + +
+
+
+ + + + +

+What If OpenDocument Used SQLite?

+ +

Introduction

+ +

Suppose the +OpenDocument file format, +and specifically the "ODP" OpenDocument Presentation format, were +built around SQLite. Benefits would include: +

+ +

+Note that this is only a thought experiment. +We are not suggesting that OpenDocument be changed. +Nor is this article a criticism of the current OpenDocument +design. The point of this essay is to suggest ways to improve +future file format designs. + +

About OpenDocument And OpenDocument Presentation

+ +

+The OpenDocument file format is used for office applications: +word processors, spreadsheets, and presentations. It was originally +designed for the OpenOffice suite but has since been incorporated into +other desktop application suites. The OpenOffice application has been +forked and renamed a few times. This author's primary use for OpenDocument is +building slide presentations with either +NeoOffice on Mac, or +LibreOffice on Linux and Windows. + +

+An OpenDocument Presentation or "ODP" file is a +ZIP archive containing +XML files describing presentation slides and separate image files for the +various images that are included as part of the presentation. +(OpenDocument word processor and spreadsheet files are similarly +structured but are not considered by this article.) The reader can +easily see the content of an ODP file by using the "zip -l" command. +For example, the following is the "zip -l" output from a 49-slide presentation +about SQLite from the 2014 +SouthEast LinuxFest +conference: + +

+Archive:  self2014.odp
+  Length      Date    Time    Name
+---------  ---------- -----   ----
+       47  2014-06-21 12:34   mimetype
+        0  2014-06-21 12:34   Configurations2/statusbar/
+        0  2014-06-21 12:34   Configurations2/accelerator/current.xml
+        0  2014-06-21 12:34   Configurations2/floater/
+        0  2014-06-21 12:34   Configurations2/popupmenu/
+        0  2014-06-21 12:34   Configurations2/progressbar/
+        0  2014-06-21 12:34   Configurations2/menubar/
+        0  2014-06-21 12:34   Configurations2/toolbar/
+        0  2014-06-21 12:34   Configurations2/images/Bitmaps/
+    54702  2014-06-21 12:34   Pictures/10000000000001F40000018C595A5A3D.png
+    46269  2014-06-21 12:34   Pictures/100000000000012C000000A8ED96BFD9.png
+... 58 other pictures omitted...
+    13013  2014-06-21 12:34   Pictures/10000000000000EE0000004765E03BA8.png
+  1005059  2014-06-21 12:34   Pictures/10000000000004760000034223EACEFD.png
+   211831  2014-06-21 12:34   content.xml
+    46169  2014-06-21 12:34   styles.xml
+     1001  2014-06-21 12:34   meta.xml
+     9291  2014-06-21 12:34   Thumbnails/thumbnail.png
+    38705  2014-06-21 12:34   Thumbnails/thumbnail.pdf
+     9664  2014-06-21 12:34   settings.xml
+     9704  2014-06-21 12:34   META-INF/manifest.xml
+---------                     -------
+ 10961006                     78 files
+
+ +

+The ODP ZIP archive contains four different XML files: +content.xml, styles.xml, meta.xml, and settings.xml. Those four files +define the slide layout, text content, and styling. This particular +presentation contains 62 images, ranging from full-screen pictures to +tiny icons, each stored as a separate file in the Pictures +folder. The "mimetype" file contains a single line of text that says: + +

+application/vnd.oasis.opendocument.presentation
+
+ +

The purpose of the other files and folders is presently +unknown to the author but is probably not difficult to figure out. + +

Limitations Of The OpenDocument Presentation Format

+ +

+The use of a ZIP archive to encapsulate XML files plus resources is an +elegant approach to an application file format. +It is clearly superior to a custom binary file format. +But using an SQLite database as the +container, instead of ZIP, would be more elegant still. + +

A ZIP archive is basically a key/value database, optimized for +the case of write-once/read-many and for a relatively small number +of distinct keys (a few hundred to a few thousand) each with a large BLOB +as its value. A ZIP archive can be viewed as a "pile-of-files" +database. This works, but it has some shortcomings relative to an +SQLite database, as follows: + +

    +
  1. Incremental update is hard. +

    +It is difficult to update individual entries in a ZIP archive. +It is especially difficult to update individual entries in a ZIP +archive in a way that does not destroy +the entire document if the computer loses power and/or crashes +in the middle of the update. It is not impossible to do this, but +it is sufficiently difficult that nobody actually does it. Instead, whenever +the user selects "File/Save", the entire ZIP archive is rewritten. +Hence, "File/Save" takes longer than it ought, especially on +older hardware. Newer machines are faster, but it is still bothersome +that changing a single character in a 50 megabyte presentation causes one +to burn through 50 megabytes of the finite write life on the SSD. + +

  2. Startup is slow. +

    +In keeping with the pile-of-files theme, OpenDocument stores all slide +content in a single big XML file named "content.xml". +LibreOffice reads and parses this entire file just to display +the first slide. +LibreOffice also seems to +read all images into memory as well, which makes sense seeing as when +the user does "File/Save" it is going to have to write them all back out +again, even though none of them changed. The net effect is that +start-up is slow. Double-clicking an OpenDocument file brings up a +progress bar rather than the first slide. +This results in a bad user experience. +The situation grows ever more annoying as +the document size increases. + +

  3. More memory is required. +

    +Because ZIP archives are optimized for storing big chunks of content, they +encourage a style of programming where the entire document is read into +memory at startup, all editing occurs in memory, then the entire document +is written to disk during "File/Save". OpenOffice and its descendants +embrace that pattern. + +

    +One might argue that it is ok, in this era of multi-gigabyte desktops, to +read the entire document into memory. +But it is not ok. +For one, the amount of memory used far exceeds the (compressed) file size +on disk. So a 50MB presentation might take 200MB or more RAM. +That still is not a problem if one only edits a single document at a time. +But when working on a talk, this author will typically have 10 or 15 different +presentations up all at the same +time (to facilitate copy/paste of slides from past presentation) and so +gigabytes of memory are required. +Add in an open web browser or two and a few other +desktop apps, and suddenly the disk is whirling and the machine is swapping. +And even having just a single document is a problem when working +on an inexpensive Chromebook retrofitted with Ubuntu. +Using less memory is always better. +

    + +
  4. Crash recovery is difficult. +

    +The descendants of OpenOffice tend to segfault more often than commercial +competitors. Perhaps for this reason, the OpenOffice forks make +periodic backups of their in-memory documents so that users do not lose +all pending edits when the inevitable application crash does occur. +This causes frustrating pauses in the application for the few seconds +while each backup is being made. +After restarting from a crash, the user is presented with a dialog box +that walks them through the recovery process. Managing the crash +recovery this way involves lots of extra application logic and is +generally an annoyance to the user. + +

  5. Content is inaccessible. +

    +One cannot easily view, change, or extract the content of an +OpenDocument presentation using generic tools. +The only reasonable way to view or edit an OpenDocument document is to open +it up using an application that is specifically designed to read or write +OpenDocument (read: LibreOffice or one of its cousins). The situation +could be worse. One can extract and view individual images (say) from +a presentation using just the "zip" archiver tool. But it is not reasonable +try to extract the text from a slide. Remember that all content is stored +in a single "context.xml" file. That file is XML, so it is a text file. +But it is not a text file that can be managed with an ordinary text +editor. For the example presentation above, the content.xml file +consist of exactly two lines. The first line of the file is just: + +

    +<?xml version="1.0" encoding="UTF-8"?>
    +
    + +

    The second line of the file contains 211792 characters of +impenetrable XML. Yes, 211792 characters all on one line. +This file is a good stress-test for a text editor. +Thankfully, the file is not some obscure +binary format, but in terms of accessibility, it might as well be +written in Klingon. +

+ +

First Improvement: Replace ZIP with SQLite

+ +

+Let us suppose that instead of using a ZIP archive to store its files, +OpenDocument used a very simple SQLite database with the following +single-table schema: + +

+CREATE TABLE OpenDocTree(
+  filename TEXT PRIMARY KEY,  -- Name of file
+  filesize BIGINT,            -- Size of file after decompression
+  content BLOB                -- Compressed file content
+);
+
+ +

+For this first experiment, nothing else about the file format is changed. +The OpenDocument is still a pile-of-files, only now each file is a row +in an SQLite database rather than an entry in a ZIP archive. +This simple change does not use the power of a relational +database. Even so, this simple change shows some improvements. + + + +

+Surprisingly, using SQLite in place of ZIP makes the presentation +file smaller. Really. One would think that a relational database file +would be larger than a ZIP archive, but at least in the case of NeoOffice +that is not so. The following is an actual screen-scrape showing +the sizes of the same NeoOffice presentation, both in its original +ZIP archive format as generated by NeoOffice (self2014.odp), and +as repacked as an SQLite database using the +SQLAR utility: + +

+-rw-r--r--  1 drh  staff  10514994 Jun  8 14:32 self2014.odp
+-rw-r--r--  1 drh  staff  10464256 Jun  8 14:37 self2014.sqlar
+-rw-r--r--  1 drh  staff  10416644 Jun  8 14:40 zip.odp
+
+ +

+The SQLite database file ("self2014.sqlar") is about a +half percent smaller than the equivalent ODP file! How can this be? +Apparently the ZIP archive generator logic in NeoOffice +is not as efficient as it could be, because when the same pile-of-files +is recompressed using the command-line "zip" utility, one gets a file +("zip.odp") that is smaller still, by another half percent, as seen +in the third line above. So, a well-written ZIP archive +can be slightly smaller than the equivalent SQLite database, as one would +expect. But the difference is slight. The key take-away is that an +SQLite database is size-competitive with a ZIP archive. + +

+The other advantage to using SQLite in place of +ZIP is that the document can now be updated incrementally, without risk +of corrupting the document if a power loss or other crash occurs in the +middle of the update. (Remember that writes to +SQLite databases are atomic.) True, all the +content is still kept in a single big XML file ("content.xml") which must +be completely rewritten if so much as a single character changes. But +with SQLite, only that one file needs to change. The other 77 files in the +repository can remain unaltered. They do not all have to be rewritten, +which in turn makes "File/Save" run much faster and saves wear on SSDs. + +

Second Improvement: Split content into smaller pieces

+ +

+A pile-of-files encourages content to be stored in a few large chunks. +In the case of ODP, there are just four XML files that define the layout +off all slides in a presentation. An SQLite database allows storing +information in a few large chunks, but SQLite is also adept and efficient +at storing information in numerous smaller pieces. + +

+So then, instead of storing all content for all slides in a single +oversized XML file ("content.xml"), suppose there was a separate table +for storing the content of each slide separately. The table schema +might look something like this: + +

+CREATE TABLE slide(
+  pageNumber INTEGER,   -- The slide page number
+  slideContent TEXT     -- Slide content as XML or JSON
+);
+CREATE INDEX slide_pgnum ON slide(pageNumber); -- Optional
+
+ +

The content of each slide could still be stored as compressed XML. +But now each page is stored separately. So when opening a new document, +the application could simply run: + +

+SELECT slideContent FROM slide WHERE pageNumber=1;
+
+ +

This query will quickly and efficiently return the content of the first +slide, which could then be speedily parsed and displayed to the user. +Only one page needs to be read and parsed in order render the first screen, +which means that the first screen appears much faster and +there is no longer a need for an annoying progress bar. + +

If the application wanted +to keep all content in memory, it could continue reading and parsing the +other pages using a background thread after drawing the first page. Or, +since reading from SQLite is so efficient, the application might +instead choose to reduce its memory footprint and only keep a single +slide in memory at a time. Or maybe it keeps the current slide and the +next slide in memory, to facility rapid transitions to the next slide. + +

+Notice that dividing up the content into smaller pieces using an SQLite +table gives flexibility to the implementation. The application can choose +to read all content into memory at startup. Or it can read just a +few pages into memory and keep the rest on disk. Or it can read just +single page into memory at a time. And different versions of the application +can make different choices without having to make any changes to the +file format. Such options are not available when all content is in +a single big XML file in a ZIP archive. + +

+Splitting content into smaller pieces also helps File/Save operations +to go faster. Instead of having to write back the content of all pages +when doing a File/Save, the application only has to write back those +pages that have actually changed. + +

+One minor downside of splitting content into smaller pieces is that +compression does not work as well on shorter texts and so the size of +the document might increase. But as the bulk of the document space +is used to store images, a small reduction in the compression efficiency +of the text content will hardly be noticeable, and is a small price +to pay for an improved user experience. + +

Third Improvement: Versioning

+ +

+Once one is comfortable with the concept of storing each slide separately, +it is a small step to support versioning of the presentation. Consider +the following schema: + +

+CREATE TABLE slide(
+  slideId INTEGER PRIMARY KEY,
+  derivedFrom INTEGER REFERENCES slide,
+  content TEXT     -- XML or JSON or whatever
+);
+CREATE TABLE version(
+  versionId INTEGER PRIMARY KEY,
+  priorVersion INTEGER REFERENCES version,
+  checkinTime DATETIME,   -- When this version was saved
+  comment TEXT,           -- Description of this version
+  manifest TEXT           -- List of integer slideIds
+);
+
+ +

+In this schema, instead of each slide having a page number that determines +its order within the presentation, each slide has a unique +integer identifier that is unrelated to where it occurs in sequence. +The order of slides in the presentation is determined by a list of +slideIds, stored as a text string in the MANIFEST column of the VERSION +table. +Since multiple entries are allowed in the VERSION table, that means that +multiple presentations can be stored in the same document. + +

+On startup, the application first decides which version it +wants to display. Since the versionId will naturally increase in time +and one would normally want to see the latest version, an appropriate +query might be: + +

+SELECT manifest, versionId FROM version ORDER BY versionId DESC LIMIT 1;
+
+ +

+Or perhaps the application would rather use the +most recent checkinTime: + +

+SELECT manifest, versionId, max(checkinTime) FROM version;
+
+ +

+Using a single query such as the above, the application obtains a list +of the slideIds for all slides in the presentation. The application then +queries for the content of the first slide, and parses and displays that +content, as before. + +

(Aside: Yes, that second query above that uses "max(checkinTime)" +really does work and really does return a well-defined answer in SQLite. +Such a query either returns an undefined answer or generates an error +in many other SQL database engines, but in SQLite it does what you would +expect: it returns the manifest and versionId of the entry that has the +maximum checkinTime.) + +

When the user does a "File/Save", instead of overwriting the modified +slides, the application can now make new entries in the SLIDE table for +just those slides that have been added or altered. Then it creates a +new entry in the VERSION table containing the revised manifest. + +

The VERSION table shown above has columns to record a check-in comment +(presumably supplied by the user) and the time and date at which the File/Save +action occurred. It also records the parent version to record the history +of changes. Perhaps the manifest could be stored as a delta from the +parent version, though typically the manifest will be small enough that +storing a delta might be more trouble than it is worth. The SLIDE table +also contains a derivedFrom column which could be used for delta encoding +if it is determined that saving the slide content as a delta from its +previous version is a worthwhile optimization. + +

So with this simple change, the ODP file now stores not just the most +recent edit to the presentation, but a history of all historic edits. The +user would normally want to see just the most recent edition of the +presentation, but if desired, the user can now go backwards in time to +see historical versions of the same presentation. + +

Or, multiple presentations could be stored within the same document. + +

With such a schema, the application would no longer need to make +periodic backups of the unsaved changes to a separate file to avoid lost +work in the event of a crash. Instead, a special "pending" version could +be allocated and unsaved changes could be written into the pending version. +Because only changes would need to be written, not the entire document, +saving the pending changes would only involve writing a few kilobytes of +content, not multiple megabytes, and would take milliseconds instead of +seconds, and so it could be done frequently and silently in the background. +Then when a crash occurs and the user reboots, all (or almost all) +of their work is retained. If the user decides to discard unsaved changes, +they simply go back to the previous version. + +

+There are details to fill in here. +Perhaps a screen can be provided that displays a history changes +(perhaps with a graph) allowing the user to select which version they +want to view or edit. Perhaps some facility can be provided to merge +forks that might occur in the version history. And perhaps the +application should provide a means to purge old and unwanted versions. +The key point is that using an SQLite database to store the content, +rather than a ZIP archive, makes all of these features much, much easier +to implement, which increases the possibility that they will eventually +get implemented. + +

And So Forth...

+ +

+In the previous sections, we have seen how moving from a key/value +store implemented as a ZIP archive to a simple SQLite database +with just three tables can add significant capabilities to an application +file format. +We could continue to enhance the schema with new tables, with indexes +added for performance, with triggers and views for programming convenience, +and constraints to enforce consistency of content even in the face of +programming errors. Further enhancement ideas include: +

+ +

+An SQLite database has a lot of capability, which +this essay has only begun to touch upon. But hopefully this quick glimpse +has convinced some readers that using an SQL database as an application +file format is worth a second look. + +

+Some readers might resist using SQLite as an application +file format due to prior exposure to enterprise SQL databases and +the caveats and limitations of those other systems. +For example, many enterprise database +engines advise against storing large strings or BLOBs in the database +and instead suggest that large strings and BLOBs be stored as separate +files and the filename stored in the database. But SQLite +is not like that. Any column of an SQLite database can hold +a string or BLOB up to about a gigabyte in size. And for strings and +BLOBs of 100 kilobytes or less, +I/O performance is better than using separate +files. + +

+Some readers might be reluctant to consider SQLite as an application +file format because they have been inculcated with the idea that all +SQL database schemas must be factored into third normal form and store +only small primitive data types such as strings and integers. Certainly +relational theory is important and designers should strive to understand +it. But, as demonstrated above, it is often quite acceptable to store +complex information as XML or JSON in text fields of a database. +Do what works, not what your database professor said you ought to do. + +

Review Of The Benefits Of Using SQLite

+ +

+In summary, +the claim of this essay is that using SQLite as a container for an application +file format like OpenDocument +and storing lots of smaller objects in that container +works out much better than using a ZIP archive holding a few larger objects. +To wit: + +

    +
  1. +An SQLite database file is approximately the same size, and in some cases +smaller, than a ZIP archive holding the same information. + +

  2. +The atomic update capabilities +of SQLite allow small incremental changes +to be safely written into the document. This reduces total disk I/O +and improves File/Save performance, enhancing the user experience. + +

  3. +Startup time is reduced by allowing the application to read in only the +content shown for the initial screen. This largely eliminates the +need to show a progress bar when opening a new document. The document +just pops up immediately, further enhancing the user experience. + +

  4. +The memory footprint of the application can be dramatically reduced by +only loading content that is relevant to the current display and keeping +the bulk of the content on disk. The fast query capability of SQLite +make this a viable alternative to keeping all content in memory at all times. +And when applications use less memory, it makes the entire computer more +responsive, further enhancing the user experience. + +

  5. +The schema of an SQL database is able to represent information more directly +and succinctly than a key/value database such as a ZIP archive. This makes +the document content more accessible to third-party applications and scripts +and facilitates advanced features such as built-in document versioning, and +incremental saving of work in progress for recovery after a crash. +

+ +

+These are just a few of the benefits of using SQLite as an application file +format — the benefits that seem most likely to improve the user +experience for applications like OpenOffice. Other applications might +benefit from SQLite in different ways. See the Application File Format +document for additional ideas. + +

+Finally, let us reiterate that this essay is a thought experiment. +The OpenDocument format is well-established and already well-designed. +Nobody really believes that OpenDocument should be changed to use SQLite +as its container instead of ZIP. Nor is this article a criticism of +OpenDocument for not choosing SQLite as its container since OpenDocument +predates SQLite. Rather, the point of this article is to use OpenDocument +as a concrete example of how SQLite can be used to build better +application file formats for future projects. +

This page last modified on 2018-11-30 21:04:18 UTC

+ -- cgit v1.2.3