summaryrefslogtreecommitdiffstats
path: root/www/atomiccommit.html
diff options
context:
space:
mode:
authorDaniel Baumann <daniel.baumann@progress-linux.org>2024-05-05 17:28:19 +0000
committerDaniel Baumann <daniel.baumann@progress-linux.org>2024-05-05 17:28:19 +0000
commit18657a960e125336f704ea058e25c27bd3900dcb (patch)
tree17b438b680ed45a996d7b59951e6aa34023783f2 /www/atomiccommit.html
parentInitial commit. (diff)
downloadsqlite3-18657a960e125336f704ea058e25c27bd3900dcb.tar.xz
sqlite3-18657a960e125336f704ea058e25c27bd3900dcb.zip
Adding upstream version 3.40.1.upstream/3.40.1upstream
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to '')
-rw-r--r--www/atomiccommit.html1577
1 files changed, 1577 insertions, 0 deletions
diff --git a/www/atomiccommit.html b/www/atomiccommit.html
new file mode 100644
index 0000000..39ebd3a
--- /dev/null
+++ b/www/atomiccommit.html
@@ -0,0 +1,1577 @@
+<!DOCTYPE html>
+<html><head>
+<meta name="viewport" content="width=device-width, initial-scale=1.0">
+<meta http-equiv="content-type" content="text/html; charset=UTF-8">
+<link href="sqlite.css" rel="stylesheet">
+<title>Atomic Commit In SQLite</title>
+<!-- path= -->
+</head>
+<body>
+<div class=nosearch>
+<a href="index.html">
+<img class="logo" src="images/sqlite370_banner.gif" alt="SQLite" border="0">
+</a>
+<div><!-- IE hack to prevent disappearing logo --></div>
+<div class="tagline desktoponly">
+Small. Fast. Reliable.<br>Choose any three.
+</div>
+<div class="menu mainmenu">
+<ul>
+<li><a href="index.html">Home</a>
+<li class='mobileonly'><a href="javascript:void(0)" onclick='toggle_div("submenu")'>Menu</a>
+<li class='wideonly'><a href='about.html'>About</a>
+<li class='desktoponly'><a href="docs.html">Documentation</a>
+<li class='desktoponly'><a href="download.html">Download</a>
+<li class='wideonly'><a href='copyright.html'>License</a>
+<li class='desktoponly'><a href="support.html">Support</a>
+<li class='desktoponly'><a href="prosupport.html">Purchase</a>
+<li class='search' id='search_menubutton'>
+<a href="javascript:void(0)" onclick='toggle_search()'>Search</a>
+</ul>
+</div>
+<div class="menu submenu" id="submenu">
+<ul>
+<li><a href='about.html'>About</a>
+<li><a href='docs.html'>Documentation</a>
+<li><a href='download.html'>Download</a>
+<li><a href='support.html'>Support</a>
+<li><a href='prosupport.html'>Purchase</a>
+</ul>
+</div>
+<div class="searchmenu" id="searchmenu">
+<form method="GET" action="search">
+<select name="s" id="searchtype">
+<option value="d">Search Documentation</option>
+<option value="c">Search Changelog</option>
+</select>
+<input type="text" name="q" id="searchbox" value="">
+<input type="submit" value="Go">
+</form>
+</div>
+</div>
+<script>
+function toggle_div(nm) {
+var w = document.getElementById(nm);
+if( w.style.display=="block" ){
+w.style.display = "none";
+}else{
+w.style.display = "block";
+}
+}
+function toggle_search() {
+var w = document.getElementById("searchmenu");
+if( w.style.display=="block" ){
+w.style.display = "none";
+} else {
+w.style.display = "block";
+setTimeout(function(){
+document.getElementById("searchbox").focus()
+}, 30);
+}
+}
+function div_off(nm){document.getElementById(nm).style.display="none";}
+window.onbeforeunload = function(e){div_off("submenu");}
+/* Disable the Search feature if we are not operating from CGI, since */
+/* Search is accomplished using CGI and will not work without it. */
+if( !location.origin || !location.origin.match || !location.origin.match(/http/) ){
+document.getElementById("search_menubutton").style.display = "none";
+}
+/* Used by the Hide/Show button beside syntax diagrams, to toggle the */
+function hideorshow(btn,obj){
+var x = document.getElementById(obj);
+var b = document.getElementById(btn);
+if( x.style.display!='none' ){
+x.style.display = 'none';
+b.innerHTML='show';
+}else{
+x.style.display = '';
+b.innerHTML='hide';
+}
+return false;
+}
+var antiRobot = 0;
+function antiRobotGo(){
+if( antiRobot!=3 ) return;
+antiRobot = 7;
+var j = document.getElementById("mtimelink");
+if(j && j.hasAttribute("data-href")) j.href=j.getAttribute("data-href");
+}
+function antiRobotDefense(){
+document.body.onmousedown=function(){
+antiRobot |= 2;
+antiRobotGo();
+document.body.onmousedown=null;
+}
+document.body.onmousemove=function(){
+antiRobot |= 2;
+antiRobotGo();
+document.body.onmousemove=null;
+}
+setTimeout(function(){
+antiRobot |= 1;
+antiRobotGo();
+}, 100)
+antiRobotGo();
+}
+antiRobotDefense();
+</script>
+<div class=fancy>
+<div class=nosearch>
+<div class="fancy_title">
+Atomic Commit In SQLite
+</div>
+<div class="fancy_toc">
+<a onclick="toggle_toc()">
+<span class="fancy_toc_mark" id="toc_mk">&#x25ba;</span>
+Table Of Contents
+</a>
+<div id="toc_sub"><div class="fancy-toc1"><a href="#_introduction">1. Introduction</a></div>
+<div class="fancy-toc1"><a href="#_hardware_assumptions">2. Hardware Assumptions</a></div>
+<div class="fancy-toc1"><a href="#_single_file_commit">3. Single File Commit</a></div>
+<div class="fancy-toc2"><a href="#_initial_state">3.1. Initial State</a></div>
+<div class="fancy-toc2"><a href="#_acquiring_a_read_lock">3.2. Acquiring A Read Lock</a></div>
+<div class="fancy-toc2"><a href="#_reading_information_out_of_the_database">3.3. Reading Information Out Of The Database</a></div>
+<div class="fancy-toc2"><a href="#_obtaining_a_reserved_lock">3.4. Obtaining A Reserved Lock</a></div>
+<div class="fancy-toc2"><a href="#_creating_a_rollback_journal_file">3.5. Creating A Rollback Journal File</a></div>
+<div class="fancy-toc2"><a href="#_changing_database_pages_in_user_space">3.6. Changing Database Pages In User Space</a></div>
+<div class="fancy-toc2"><a href="#_flushing_the_rollback_journal_file_to_mass_storage">3.7. Flushing The Rollback Journal File To Mass Storage</a></div>
+<div class="fancy-toc2"><a href="#_obtaining_an_exclusive_lock">3.8. Obtaining An Exclusive Lock</a></div>
+<div class="fancy-toc2"><a href="#_writing_changes_to_the_database_file">3.9. Writing Changes To The Database File</a></div>
+<div class="fancy-toc2"><a href="#0_flushing_changes_to_mass_storage">3.10. 0 Flushing Changes To Mass Storage</a></div>
+<div class="fancy-toc2"><a href="#1_deleting_the_rollback_journal">3.11. 1 Deleting The Rollback Journal</a></div>
+<div class="fancy-toc2"><a href="#2_releasing_the_lock">3.12. 2 Releasing The Lock</a></div>
+<div class="fancy-toc1"><a href="#_rollback">4. Rollback</a></div>
+<div class="fancy-toc2"><a href="#_when_something_goes_wrong_">4.1. When Something Goes Wrong...</a></div>
+<div class="fancy-toc2"><a href="#_hot_rollback_journals">4.2. Hot Rollback Journals</a></div>
+<div class="fancy-toc2"><a href="#_obtaining_an_exclusive_lock_on_the_database">4.3. Obtaining An Exclusive Lock On The Database</a></div>
+<div class="fancy-toc2"><a href="#_rolling_back_incomplete_changes">4.4. Rolling Back Incomplete Changes</a></div>
+<div class="fancy-toc2"><a href="#_deleting_the_hot_journal">4.5. Deleting The Hot Journal</a></div>
+<div class="fancy-toc2"><a href="#_continue_as_if_the_uncompleted_writes_had_never_happened">4.6. Continue As If The Uncompleted Writes Had Never Happened</a></div>
+<div class="fancy-toc1"><a href="#_multi_file_commit">5. Multi-file Commit</a></div>
+<div class="fancy-toc2"><a href="#_separate_rollback_journals_for_each_database">5.1. Separate Rollback Journals For Each Database</a></div>
+<div class="fancy-toc2"><a href="#_the_super_journal_file">5.2. The Super-Journal File</a></div>
+<div class="fancy-toc2"><a href="#_updating_rollback_journal_headers">5.3. Updating Rollback Journal Headers</a></div>
+<div class="fancy-toc2"><a href="#_updating_the_database_files">5.4. Updating The Database Files</a></div>
+<div class="fancy-toc2"><a href="#_delete_the_super_journal_file">5.5. Delete The Super-Journal File</a></div>
+<div class="fancy-toc2"><a href="#_clean_up_the_rollback_journals">5.6. Clean Up The Rollback Journals</a></div>
+<div class="fancy-toc1"><a href="#_additional_details_of_the_commit_process">6. Additional Details Of The Commit Process</a></div>
+<div class="fancy-toc2"><a href="#_always_journal_complete_sectors">6.1. Always Journal Complete Sectors</a></div>
+<div class="fancy-toc2"><a href="#_dealing_with_garbage_written_into_journal_files">6.2. Dealing With Garbage Written Into Journal Files</a></div>
+<div class="fancy-toc2"><a href="#_cache_spill_prior_to_commit">6.3. Cache Spill Prior To Commit</a></div>
+<div class="fancy-toc1"><a href="#_optimizations">7. Optimizations</a></div>
+<div class="fancy-toc2"><a href="#_cache_retained_between_transactions">7.1. Cache Retained Between Transactions</a></div>
+<div class="fancy-toc2"><a href="#_exclusive_access_mode">7.2. Exclusive Access Mode</a></div>
+<div class="fancy-toc2"><a href="#_do_not_journal_freelist_pages">7.3. Do Not Journal Freelist Pages</a></div>
+<div class="fancy-toc2"><a href="#_single_page_updates_and_atomic_sector_writes">7.4. Single Page Updates And Atomic Sector Writes</a></div>
+<div class="fancy-toc2"><a href="#_filesystems_with_safe_append_semantics">7.5. Filesystems With Safe Append Semantics</a></div>
+<div class="fancy-toc2"><a href="#_persistent_rollback_journals">7.6. Persistent Rollback Journals</a></div>
+<div class="fancy-toc1"><a href="#_testing_atomic_commit_behavior">8. Testing Atomic Commit Behavior</a></div>
+<div class="fancy-toc1"><a href="#_things_that_can_go_wrong">9. Things That Can Go Wrong</a></div>
+<div class="fancy-toc2"><a href="#_broken_locking_implementations">9.1. Broken Locking Implementations</a></div>
+<div class="fancy-toc2"><a href="#_incomplete_disk_flushes">9.2. Incomplete Disk Flushes</a></div>
+<div class="fancy-toc2"><a href="#_partial_file_deletions">9.3. Partial File Deletions</a></div>
+<div class="fancy-toc2"><a href="#_garbage_written_into_files">9.4. Garbage Written Into Files</a></div>
+<div class="fancy-toc2"><a href="#_deleting_or_renaming_a_hot_journal">9.5. Deleting Or Renaming A Hot Journal</a></div>
+<div class="fancy-toc1"><a href="#_future_directions_and_conclusion">10. Future Directions And Conclusion</a></div>
+</div>
+</div>
+<script>
+function toggle_toc(){
+var sub = document.getElementById("toc_sub")
+var mk = document.getElementById("toc_mk")
+if( sub.style.display!="block" ){
+sub.style.display = "block";
+mk.innerHTML = "&#x25bc;";
+} else {
+sub.style.display = "none";
+mk.innerHTML = "&#x25ba;";
+}
+}
+</script>
+</div>
+
+
+
+
+<h1 id="_introduction"><span>1. </span> Introduction</h1>
+
+<p>An important feature of transactional databases like SQLite
+is "atomic commit".
+Atomic commit means that either all database changes within a single
+transaction occur or none of them occur. With atomic commit, it
+is as if many different writes to different sections of the database
+file occur instantaneously and simultaneously.
+Real hardware serializes writes to mass storage, and writing
+a single sector takes a finite amount of time.
+So it is impossible to truly write many different sectors of a
+database file simultaneously and/or instantaneously.
+But the atomic commit logic within
+SQLite makes it appear as if the changes for a transaction
+are all written instantaneously and simultaneously.</p>
+
+<p>SQLite has the important property that transactions appear
+to be atomic even if the transaction is interrupted by an
+operating system crash or power failure.</p>
+
+<p>This article describes the techniques used by SQLite to create the
+illusion of atomic commit.</p>
+
+<p>The information in this article applies only when SQLite is operating
+in "rollback mode", or in other words when SQLite is not
+using a <a href="wal.html">write-ahead log</a>. SQLite still supports atomic commit when
+write-ahead logging is enabled, but it accomplishes atomic commit by
+a different mechanism from the one described in this article. See
+the <a href="wal.html">write-ahead log documentation</a> for additional information on how
+SQLite supports atomic commit in that context.</p>
+
+<a name="hardware"></a>
+
+<h1 id="_hardware_assumptions"><span>2. </span> Hardware Assumptions</h1>
+
+<p>Throughout this article, we will call the mass storage device "disk"
+even though the mass storage device might really be flash memory.</p>
+
+<p>We assume that disk is written in chunks which we call a "sector".
+It is not possible to modify any part of the disk smaller than a sector.
+To change a part of the disk smaller than a sector, you have to read in
+the full sector that contains the part you want to change, make the
+change, then write back out the complete sector.</p>
+
+<p>On a traditional spinning disk, a sector is the minimum unit of transfer
+in both directions, both reading and writing. On flash memory, however,
+the minimum size of a read is typically much smaller than a minimum write.
+SQLite is only concerned with the minimum write amount and so for the
+purposes of this article, when we say "sector" we mean the minimum amount
+of data that can be written to mass storage in a single go.</p>
+
+<p>
+ Prior to SQLite version 3.3.14, a sector size of 512 bytes was
+ assumed in all cases. There was a compile-time option to change
+ this but the code had never been tested with a larger value. The
+ 512 byte sector assumption seemed reasonable since until very recently
+ all disk drives used a 512 byte sector internally. However, there
+ has recently been a push to increase the sector size of disks to
+ 4096 bytes. Also the sector size
+ for flash memory is usually larger than 512 bytes. For these reasons,
+ versions of SQLite beginning with 3.3.14 have a method in the OS
+ interface layer that interrogates the underlying filesystem to find
+ the true sector size. As currently implemented (version 3.5.0) this
+ method still returns a hard-coded value of 512 bytes, since there
+ is no standard way of discovering the true sector size on either
+ Unix or Windows. But the method is available for embedded device
+ manufacturers to tweak according to their own needs. And we have
+ left open the possibility of filling in a more meaningful implementation
+ on Unix and Windows in the future.</p>
+
+<p>SQLite has traditionally assumed that a sector write is <u>not</u> atomic.
+However, SQLite does always assume that a sector write is linear. By "linear"
+we mean that SQLite assumes that when writing a sector, the hardware begins
+at one end of the data and writes byte by byte until it gets to
+the other end. The write might go from beginning to end or from
+end to beginning. If a power failure occurs in the middle of a
+sector write it might be that part of the sector was modified
+and another part was left unchanged. The key assumption by SQLite
+is that if any part of the sector gets changed, then either the
+first or the last bytes will be changed. So the hardware will
+never start writing a sector in the middle and work towards the
+ends. We do not know if this assumption is always true but it
+seems reasonable.</p>
+
+<p>The previous paragraph states that SQLite does not assume that
+sector writes are atomic. This is true by default. But as of
+SQLite version 3.5.0, there is a new interface called the
+Virtual File System (<a href="vfs.html">VFS</a>) interface. The <a href="vfs.html">VFS</a> is the only means
+by which SQLite communicates to the underlying filesystem. The
+code comes with default VFS implementations for Unix and Windows
+and there is a mechanism for creating new custom VFS implementations
+at runtime. In this new VFS interface there is a method called
+xDeviceCharacteristics. This method interrogates the underlying
+filesystem to discover various properties and behaviors that the
+filesystem may or may not exhibit. The xDeviceCharacteristics
+method might indicate that sector writes are atomic, and if it does
+so indicate, SQLite will try to take advantage of that fact. But
+the default xDeviceCharacteristics method for both Unix and Windows
+does not indicate atomic sector writes and so these optimizations
+are normally omitted.</p>
+
+<p>SQLite assumes that the operating system will buffer writes and
+that a write request will return before data has actually been stored
+in the mass storage device.
+SQLite further assumes that write operations will be reordered by
+the operating system.
+For this reason, SQLite does a "flush" or "fsync" operation at key
+points. SQLite assumes that the flush or fsync will not return until
+all pending write operations for the file that is being flushed have
+completed. We are told that the flush and fsync primitives
+are broken on some versions of Windows and Linux. This is unfortunate.
+It opens SQLite up to the possibility of database corruption following
+a power loss in the middle of a commit. However, there is nothing
+that SQLite can do to test for or remedy the situation. SQLite
+assumes that the operating system that it is running on works as
+advertised. If that is not quite the case, well then hopefully you
+will not lose power too often.</p>
+
+<p>SQLite assumes that when a file grows in length that the new
+file space originally contains garbage and then later is filled in
+with the data actually written. In other words, SQLite assumes that
+the file size is updated before the file content. This is a
+pessimistic assumption and SQLite has to do some extra work to make
+sure that it does not cause database corruption if power is lost
+between the time when the file size is increased and when the
+new content is written. The xDeviceCharacteristics method of
+the <a href="vfs.html">VFS</a> might indicate that the filesystem will always write the
+data before updating the file size. (This is the
+SQLITE_IOCAP_SAFE_APPEND property for those readers who are looking
+at the code.) When the xDeviceCharacteristics method indicates
+that files content is written before the file size is increased,
+SQLite can forego some of its pedantic database protection steps
+and thereby decrease the amount of disk I/O needed to perform a
+commit. The current implementation, however, makes no such assumptions
+for the default VFSes for Windows and Unix.</p>
+
+<p>SQLite assumes that a file deletion is atomic from the
+point of view of a user process. By this we mean that if SQLite
+requests that a file be deleted and the power is lost during the
+delete operation, once power is restored either the file will
+exist completely with all if its original content unaltered, or
+else the file will not be seen in the filesystem at all. If
+after power is restored the file is only partially deleted,
+if some of its data has been altered or erased,
+or the file has been truncated but not completely removed, then
+database corruption will likely result.</p>
+
+<p>SQLite assumes that the detection and/or correction of
+bit errors caused by cosmic rays, thermal noise, quantum
+fluctuations, device driver bugs, or other mechanisms, is the
+responsibility of the underlying hardware and operating system.
+SQLite does not add any redundancy to the database file for
+the purpose of detecting corruption or I/O errors.
+SQLite assumes that the data it reads is exactly the same data
+that it previously wrote.</p>
+
+<p>By default, SQLite assumes that an operating system call to write
+a range of bytes will not damage or alter any bytes outside of that range
+even if a power loss or OS crash occurs during that write. We
+call this the "<a href="psow.html">powersafe overwrite</a>" property.
+Prior to <a href="releaselog/3_7_9.html">version 3.7.9</a> (2011-11-01),
+SQLite did not assume powersafe overwrite. But with the standard
+sector size increasing from 512 to 4096 bytes on most disk drives, it
+has become necessary to assume powersafe overwrite in order to maintain
+historical performance levels and so powersafe overwrite is assumed by
+default in recent versions of SQLite. The assumption of powersafe
+overwrite property can be disabled at compile-time or a run-time if
+desired. See the <a href="psow.html">powersafe overwrite documentation</a> for further
+details.
+
+
+<a name="section_3_0"></a>
+</p><h1 id="_single_file_commit"><span>3. </span> Single File Commit</h1>
+
+<p>We begin with an overview of the steps SQLite takes in order to
+perform an atomic commit of a transaction against a single database
+file. The details of file formats used to guard against damage from
+power failures and techniques for performing an atomic commit across
+multiple databases are discussed in later sections.</p>
+
+<a name="initstate"></a>
+
+<h2 id="_initial_state"><span>3.1. </span> Initial State</h2>
+
+<img src="images/ac/commit-0.gif" align="right" hspace="15">
+
+<p>The state of the computer when a database connection is
+first opened is shown conceptually by the diagram at the
+right.
+The area of the diagram on the extreme right (labeled "Disk") represents
+information stored on the mass storage device. Each rectangle is
+a sector. The blue color represents that the sectors contain
+original data.
+The middle area is the operating systems disk cache. At the
+onset of our example, the cache is cold and this is represented
+by leaving the rectangles of the disk cache empty.
+The left area of the diagram shows the content of memory for
+the process that is using SQLite. The database connection has
+just been opened and no information has been read yet, so the
+user space is empty.
+</p>
+<br clear="both">
+
+<a name="rdlck"></a>
+
+<h2 id="_acquiring_a_read_lock"><span>3.2. </span> Acquiring A Read Lock</h2>
+
+<img src="images/ac/commit-1.gif" align="right" hspace="15">
+
+<p>Before SQLite can write to a database, it must first read
+the database to see what is there already. Even if it is just
+appending new data, SQLite still has to read in the database
+schema from the "<a href="schematab.html">sqlite_schema</a>" table so that it can know
+how to parse the INSERT statements and discover where in the
+database file the new information should be stored.</p>
+
+<p>The first step toward reading from the database file
+is obtaining a shared lock on the database file. A "shared"
+lock allows two or more database connections to read from the
+database file at the same time. But a shared lock prevents
+another database connection from writing to the database file
+while we are reading it. This is necessary because if another
+database connection were writing to the database file at the
+same time we are reading from the database file, we might read
+some data before the change and other data after the change.
+This would make it appear as if the change made by the other
+process is not atomic.</p>
+
+<p>Notice that the shared lock is on the operating system
+disk cache, not on the disk itself. File locks
+really are just flags within the operating system kernel,
+usually. (The details depend on the specific OS layer
+interface.) Hence, the lock will instantly vanish if the
+operating system crashes or if there is a power loss. It
+is usually also the case that the lock will vanish if the
+process that created the lock exits.</p>
+
+<br clear="both">
+
+<a name="section_3_3"></a>
+<h2 id="_reading_information_out_of_the_database"><span>3.3. </span> Reading Information Out Of The Database</h2>
+
+<img src="images/ac/commit-2.gif" align="right" hspace="15">
+
+<p>After the shared lock is acquired, we can begin reading
+information from the database file. In this scenario, we
+are assuming a cold cache, so information must first be
+read from mass storage into the operating system cache then
+transferred from operating system cache into user space.
+On subsequent reads, some or all of the information might
+already be found in the operating system cache and so only
+the transfer to user space would be required.</p>
+
+<p>Usually only a subset of the pages in the database file
+are read. In this example we are showing three
+pages out of eight being read. In a typical application, a
+database will have thousands of pages and a query will normally
+only touch a small percentage of those pages.</p>
+
+<br clear="both">
+
+<a name="rsvdlock"></a>
+
+<h2 id="_obtaining_a_reserved_lock"><span>3.4. </span> Obtaining A Reserved Lock</h2>
+
+<img src="images/ac/commit-3.gif" align="right" hspace="15">
+
+<p>Before making changes to the database, SQLite first
+obtains a "reserved" lock on the database file. A reserved
+lock is similar to a shared lock in that both a reserved lock
+and shared lock allow other processes to read from the database
+file. A single reserve lock can coexist with multiple shared
+locks from other processes. However, there can only be a
+single reserved lock on the database file. Hence only a
+single process can be attempting to write to the database
+at one time.</p>
+
+<p>The idea behind a reserved lock is that it signals that
+a process intends to modify the database file in the near
+future but has not yet started to make the modifications.
+And because the modifications have not yet started, other
+processes can continue to read from the database. However,
+no other process should also begin trying to write to the
+database.</p>
+
+<br clear="both">
+<a name="section_3_5"></a>
+<h2 id="_creating_a_rollback_journal_file"><span>3.5. </span> Creating A Rollback Journal File</h2>
+<img src="images/ac/commit-4.gif" align="right" hspace="15">
+
+<p>Prior to making any changes to the database file, SQLite first
+creates a separate rollback journal file and writes into the
+rollback journal the original
+content of the database pages that are to be altered.
+The idea behind the rollback journal is that it contains
+all information needed to restore the database back to
+its original state.</p>
+
+<p>The rollback journal contains a small header (shown in green
+in the diagram) that records the original size of the database
+file. So if a change causes the database file to grow, we
+will still know the original size of the database. The page
+number is stored together with each database page that is
+written into the rollback journal.</p>
+
+<p>
+ When a new file is created, most desktop operating systems
+ (Windows, Linux, Mac OS X) will not actually write anything to
+ disk. The new file is created in the operating systems disk
+ cache only. The file is not created on mass storage until sometime
+ later, when the operating system has a spare moment. This creates
+ the impression to users that I/O is happening much faster than
+ is possible when doing real disk I/O. We illustrate this idea in
+ the diagram to the right by showing that the new rollback journal
+ appears in the operating system disk cache only and not on the
+ disk itself.</p>
+
+<br clear="both">
+<a name="section_3_6"></a>
+<h2 id="_changing_database_pages_in_user_space"><span>3.6. </span> Changing Database Pages In User Space</h2>
+<img src="images/ac/commit-5.gif" align="right" hspace="15">
+
+<p>After the original page content has been saved in the rollback
+journal, the pages can be modified in user memory. Each database
+connection has its own private copy of user space, so the changes
+that are made in user space are only visible to the database connection
+that is making the changes. Other database connections still see
+the information in operating system disk cache buffers which have
+not yet been changed. And so even though one process is busy
+modifying the database, other processes can continue to read their
+own copies of the original database content.</p>
+
+<br clear="both">
+<a name="section_3_7"></a>
+<h2 id="_flushing_the_rollback_journal_file_to_mass_storage"><span>3.7. </span> Flushing The Rollback Journal File To Mass Storage</h2>
+<img src="images/ac/commit-6.gif" align="right" hspace="15">
+
+<p>The next step is to flush the content of the rollback journal
+file to nonvolatile storage.
+As we will see later,
+this is a critical step in insuring that the database can survive
+an unexpected power loss.
+This step also takes a lot of time, since writing to nonvolatile
+storage is normally a slow operation.</p>
+
+<p>This step is usually more complicated than simply flushing
+the rollback journal to the disk. On most platforms two separate
+flush (or fsync()) operations are required. The first flush writes
+out the base rollback journal content. Then the header of the
+rollback journal is modified to show the number of pages in the
+rollback journal. Then the header is flushed to disk. The details
+on why we do this header modification and extra flush are provided
+in a later section of this paper.</p>
+
+<br clear="both">
+<a name="section_3_8"></a>
+<h2 id="_obtaining_an_exclusive_lock"><span>3.8. </span> Obtaining An Exclusive Lock</h2>
+<img src="images/ac/commit-7.gif" align="right" hspace="15">
+
+<p>Prior to making changes to the database file itself, we must
+obtain an exclusive lock on the database file. Obtaining an
+exclusive lock is really a two-step process. First SQLite obtains
+a "pending" lock. Then it escalates the pending lock to an
+exclusive lock.</p>
+
+<p>A pending lock allows other processes that already have a
+shared lock to continue reading the database file. But it
+prevents new shared locks from being established. The idea
+behind a pending lock is to prevent writer starvation caused
+by a large pool of readers. There might be dozens, even hundreds,
+of other processes trying to read the database file. Each process
+acquires a shared lock before it starts reading, reads what it
+needs, then releases the shared lock. If, however, there are
+many different processes all reading from the same database, it
+might happen that a new process always acquires its shared lock before
+the previous process releases its shared lock. And so there is
+never an instant when there are no shared locks on the database
+file and hence there is never an opportunity for the writer to
+seize the exclusive lock. A pending lock is designed to prevent
+that cycle by allowing existing shared locks to proceed but
+blocking new shared locks from being established. Eventually
+all shared locks will clear and the pending lock will then be
+able to escalate into an exclusive lock.</p>
+
+<br clear="both">
+<a name="section_3_9"></a>
+<h2 id="_writing_changes_to_the_database_file"><span>3.9. </span> Writing Changes To The Database File</h2>
+<img src="images/ac/commit-8.gif" align="right" hspace="15">
+
+<p>Once an exclusive lock is held, we know that no other
+processes are reading from the database file and it is
+safe to write changes into the database file. Usually
+those changes only go as far as the operating systems disk
+cache and do not make it all the way to mass storage.</p>
+
+<br clear="both">
+<a name="section_3_10"></a>
+<h2 id="0_flushing_changes_to_mass_storage"><span>3.10. </span>0 Flushing Changes To Mass Storage</h2>
+<img src="images/ac/commit-9.gif" align="right" hspace="15">
+
+<p>Another flush must occur to make sure that all the
+database changes are written into nonvolatile storage.
+This is a critical step to ensure that the database will
+survive a power loss without damage. However, because
+of the inherent slowness of writing to disk or flash memory,
+this step together with the rollback journal file flush in section
+3.7 above takes up most of the time required to complete a
+transaction commit in SQLite.</p>
+
+<br clear="both">
+<a name="section_3_11"></a>
+<h2 id="1_deleting_the_rollback_journal"><span>3.11. </span>1 Deleting The Rollback Journal</h2>
+<img src="images/ac/commit-A.gif" align="right" hspace="15">
+
+<p>After the database changes are all safely on the mass
+storage device, the rollback journal file is deleted.
+This is the instant where the transaction commits.
+If a power failure or system crash occurs prior to this
+point, then recovery processes to be described later make
+it appear as if no changes were ever made to the database
+file. If a power failure or system crash occurs after
+the rollback journal is deleted, then it appears as if
+all changes have been written to disk. Thus, SQLite gives
+the appearance of having made no changes to the database
+file or having made the complete set of changes to the
+database file depending on whether or not the rollback
+journal file exists.</p>
+
+<p>Deleting a file is not really an atomic operation, but
+it appears to be from the point of view of a user process.
+A process is always able to ask the operating system "does
+this file exist?" and the process will get back a yes or no
+answer. After a power failure that occurs during a
+transaction commit, SQLite will ask the operating system
+whether or not the rollback journal file exists. If the
+answer is "yes" then the transaction is incomplete and is
+rolled back. If the answer is "no" then it means the transaction
+did commit.</p>
+
+<p>The existence of a transaction depends on whether or
+not the rollback journal file exists and the deletion
+of a file appears to be an atomic operation from the point of
+view of a user-space process. Therefore,
+a transaction appears to be an atomic operation.</p>
+
+<p>The act of deleting a file is expensive on many systems.
+As an optimization, SQLite can be configured to truncate
+the journal file to zero bytes in length
+or overwrite the journal file header with zeros. In either
+case, the resulting journal file is no longer capable of rolling
+back and so the transaction still commits. Truncating a file
+to zero length, like deleting a file, is assumed to be an atomic
+operation from the point of view of a user process. Overwriting
+the header of the journal with zeros is not atomic, but if any
+part of the header is malformed the journal will not roll back.
+Hence, one can say that the commit occurs as soon as the header
+is sufficiently changed to make it invalid. Typically this happens
+as soon as the first byte of the header is zeroed.</p>
+
+<br clear="both">
+<a name="section_3_12"></a>
+<h2 id="2_releasing_the_lock"><span>3.12. </span>2 Releasing The Lock</h2>
+<img src="images/ac/commit-B.gif" align="right" hspace="15">
+
+<p>The last step in the commit process is to release the
+exclusive lock so that other processes can once again
+start accessing the database file.</p>
+
+<p>In the diagram at the right, we show that the information
+that was held in user space is cleared when the lock is released.
+This used to be literally true for older versions of SQLite. But
+more recent versions of SQLite keep the user space information
+in memory in case it might be needed again at the start of the
+next transaction. It is cheaper to reuse information that is
+already in local memory than to transfer the information back
+from the operating system disk cache or to read it off of the
+disk drive again. Prior to reusing the information in user space,
+we must first reacquire the shared lock and then we have to check
+to make sure that no other process modified the database file while
+we were not holding a lock. There is a counter in the first page
+of the database that is incremented every time the database file
+is modified. We can find out if another process has modified the
+database by checking that counter. If the database was modified,
+then the user space cache must be cleared and reread. But it is
+commonly the case that no changes have been made and the user
+space cache can be reused for a significant performance savings.</p>
+
+<br clear="both">
+<a name="rollback"></a>
+
+<h1 id="_rollback"><span>4. </span> Rollback</h1>
+
+<p>An atomic commit is supposed to happen instantaneously. But the processing
+described above clearly takes a finite amount of time.
+Suppose the power to the computer were cut
+part way through the commit operation described above. In order
+to maintain the illusion that the changes were instantaneous, we
+have to "rollback" any partial changes and restore the database to
+the state it was in prior to the beginning of the transaction.</p>
+
+<a name="crisis"></a>
+
+<h2 id="_when_something_goes_wrong_"><span>4.1. </span> When Something Goes Wrong...</h2>
+<img src="images/ac/rollback-0.gif" align="right" hspace="15">
+
+<p>Suppose the power loss occurred
+during <a href="#section_3_10">step 3.10</a> above,
+while the database changes were being written to disk.
+After power is restored, the situation might be something
+like what is shown to the right. We were trying to change
+three pages of the database file but only one page was
+successfully written. Another page was partially written
+and a third page was not written at all.</p>
+
+<p>The rollback journal is complete and intact on disk when
+the power is restored. This is a key point. The reason for
+the flush operation in <a href="#section_3_7">step 3.7</a>
+is to make absolutely sure that
+all of the rollback journal is safely on nonvolatile storage
+prior to making any changes to the database file itself.</p>
+
+<br clear="both">
+<a name="section_4_2"></a>
+<h2 id="_hot_rollback_journals"><span>4.2. </span> Hot Rollback Journals</h2>
+<img src="images/ac/rollback-1.gif" align="right" hspace="15">
+
+<p>The first time that any SQLite process attempts to access
+the database file, it obtains a shared lock as described in
+<a href="section_3_2">section 3.2</a> above.
+But then it notices that there is a
+rollback journal file present. SQLite then checks to see if
+the rollback journal is a "hot journal". A hot journal is
+a rollback journal that needs to be played back in order to
+restore the database to a sane state. A hot journal only
+exists when an earlier process was in the middle of committing
+a transaction when it crashed or lost power.</p>
+
+<p>A rollback journal is a "hot" journal if all of the following
+are true:</p>
+
+<ul>
+<li>The rollback journal exists.
+</li><li>The rollback journal is not an empty file.
+</li><li>There is no reserved lock on the main database file.
+</li><li>The header of the rollback journal is well-formed and in particular
+ has not been zeroed out.
+</li><li>The rollback journal does not
+contain the name of a super-journal file (see
+<a href="#section_5_5">section 5.5</a> below) or if does
+contain the name of a super-journal, then that super-journal
+file exists.
+</li></ul>
+
+<p>The presence of a hot journal is our indication
+that a previous process was trying to commit a transaction but
+it aborted for some reason prior to the completion of the
+commit. A hot journal means that
+the database file is in an inconsistent state and needs to
+be repaired (by rollback) prior to being used.</p>
+
+<br clear="both">
+<a name="exlock"></a>
+
+<h2 id="_obtaining_an_exclusive_lock_on_the_database"><span>4.3. </span> Obtaining An Exclusive Lock On The Database</h2>
+<img src="images/ac/rollback-2.gif" align="right" hspace="15">
+
+<p>The first step toward dealing with a hot journal is to
+obtain an exclusive lock on the database file. This prevents two
+or more processes from trying to rollback the same hot journal
+at the same time.</p>
+
+<br clear="both">
+<a name="section_4_4"></a>
+<h2 id="_rolling_back_incomplete_changes"><span>4.4. </span> Rolling Back Incomplete Changes</h2>
+<img src="images/ac/rollback-3.gif" align="right" hspace="15">
+
+<p>Once a process obtains an exclusive lock, it is permitted
+to write to the database file. It then proceeds to read the
+original content of pages out of the rollback journal and write
+that content back to where it came from in the database file.
+Recall that the header of the rollback journal records the original
+size of the database file prior to the start of the aborted
+transaction. SQLite uses this information to truncate the
+database file back to its original size in cases where the
+incomplete transaction caused the database to grow. At the
+end of this step, the database should be the same size and
+contain the same information as it did before the start of
+the aborted transaction.</p>
+
+<br clear="both">
+<a name="delhotjrnl"></a>
+
+<h2 id="_deleting_the_hot_journal"><span>4.5. </span> Deleting The Hot Journal</h2>
+<img src="images/ac/rollback-4.gif" align="right" hspace="15">
+
+<p>After all information in the rollback journal has been
+played back into the database file (and flushed to disk in case
+we encounter yet another power failure), the hot rollback journal
+can be deleted.</p>
+
+<p>As in <a href="#section_3_11">section 3.11</a>, the journal
+file might be truncated to zero length or its header might
+be overwritten with zeros as an optimization on systems where
+deleting a file is expensive. Either way, the journal is no
+longer hot after this step.</p>
+
+<br clear="both">
+<a name="cont"></a>
+
+<h2 id="_continue_as_if_the_uncompleted_writes_had_never_happened"><span>4.6. </span> Continue As If The Uncompleted Writes Had Never Happened</h2>
+<img src="images/ac/rollback-5.gif" align="right" hspace="15">
+
+<p>The final recovery step is to reduce the exclusive lock back
+to a shared lock. Once this happens, the database is back in the
+state that it would have been if the aborted transaction had never
+started. Since all of this recovery activity happens completely
+automatically and transparently, it appears to the program using
+SQLite as if the aborted transaction had never begun.</p>
+
+<br clear="both">
+<a name="multicommit"></a>
+
+<h1 id="_multi_file_commit"><span>5. </span> Multi-file Commit</h1>
+
+<p>SQLite allows a single
+<a href="c3ref/sqlite3.html">database connection</a> to talk to
+two or more database files simultaneously through the use of
+the <a href="lang_attach.html">ATTACH DATABASE</a> command.
+When multiple database files are modified within a single
+transaction, all files are updated atomically.
+In other words, either all of the database files are updated or
+else none of them are.
+Achieving an atomic commit across multiple database files is
+more complex that doing so for a single file. This section
+describes how SQLite works that bit of magic.</p>
+
+<a name="multijrnl"></a>
+
+<h2 id="_separate_rollback_journals_for_each_database"><span>5.1. </span> Separate Rollback Journals For Each Database</h2>
+<img src="images/ac/multi-0.gif" align="right" hspace="15">
+
+<p>When multiple database files are involved in a transaction,
+each database has its own rollback journal and each database
+is locked separately. The diagram at the right shows a scenario
+where three different database files have been modified within
+one transaction. The situation at this step is analogous to
+the single-file transaction scenario at
+<a href="#section_3_6">step 3.6</a>. Each database file has
+a reserved lock. For each database, the original content of pages
+that are being changed have been written into the rollback journal
+for that database, but the content of the journals have not yet
+been flushed to disk. No changes have been made to the database
+file itself yet, though presumably there are changes being held
+in user memory.</p>
+
+<p>For brevity, the diagrams in this section are simplified from
+those that came before. Blue color still signifies original content
+and pink still signifies new content. But the individual pages
+in the rollback journal and the database file are not shown and
+we are not making the distinction between information in the
+operating system cache and information that is on disk. All of
+these factors still apply in a multi-file commit scenario. They
+just take up a lot of space in the diagrams and they do not add
+any new information, so they are omitted here.</p>
+
+<br clear="both">
+<a name="sprjrnl"></a>
+
+<h2 id="_the_super_journal_file"><span>5.2. </span> The Super-Journal File</h2>
+<img src="images/ac/multi-1.gif" align="right" hspace="15">
+
+<p>The next step in a multi-file commit is the creation of a
+"super-journal" file. The name of the super-journal file is
+the same name as the original database filename (the database
+that was opened using the
+<a href="c3ref/open.html">sqlite3_open()</a> interface,
+not one of the <a href="lang_attach.html">ATTACHed</a> auxiliary
+databases) with the text "<b>-mj</b><i>HHHHHHHH</i>" appended where
+<i>HHHHHHHH</i> is a random 32-bit hexadecimal number. The
+random <i>HHHHHHHH</i> suffix changes for every new super-journal.</p>
+
+<p><i>(Nota bene: The formula for computing the super-journal filename
+given in the previous paragraph corresponds to the implementation as
+of SQLite version 3.5.0. But this formula is not part of the SQLite
+specification and is subject to change in future releases.)</i></p>
+
+<p>Unlike the rollback journals, the super-journal does not contain
+any original database page content. Instead, the super-journal contains
+the full pathnames for rollback journals for every database that is
+participating in the transaction.</p>
+
+<p>After the super-journal is constructed, its content is flushed
+to disk before any further actions are taken. On Unix, the directory
+that contains the super-journal is also synced in order to make sure
+the super-journal file will appear in the directory following a
+power failure.</p>
+
+<p>The purpose of the super-journal is to ensure that multi-file
+transactions are atomic across a power-loss. But if the database files
+have other settings that compromise integrity across a power-loss event
+(such as <a href="pragma.html#pragma_synchronous">PRAGMA synchronous=OFF</a> or <a href="pragma.html#pragma_journal_mode">PRAGMA journal_mode=MEMORY</a>) then
+the creation of the super-journal is omitted, as an optimization.
+
+<br clear="both">
+<a name="multijrnlupdate"></a>
+
+</p><h2 id="_updating_rollback_journal_headers"><span>5.3. </span> Updating Rollback Journal Headers</h2>
+<img src="images/ac/multi-2.gif" align="right" hspace="15">
+
+<p>The next step is to record the full pathname of the super-journal file
+in the header of every rollback journal. Space to hold the
+super-journal filename was reserved at the beginning of each rollback journal
+as the rollback journals were created.</p>
+
+<p>The content of each rollback journal is flushed to disk both before
+and after the super-journal filename is written into the rollback
+journal header. It is important to do both of these flushes. Fortunately,
+the second flush is usually inexpensive since typically only a single
+page of the journal file (the first page) has changed.</p>
+
+<p>This step is analogous to
+<a href="#section_3_7">step 3.7</a> in the single-file commit
+scenario described above.</p>
+
+<br clear="both">
+<a name="multidbupdate"></a>
+
+<h2 id="_updating_the_database_files"><span>5.4. </span> Updating The Database Files</h2>
+<img src="images/ac/multi-3.gif" align="right" hspace="15">
+
+<p>Once all rollback journal files have been flushed to disk, it
+is safe to begin updating database files. We have to obtain an
+exclusive lock on all database files before writing the changes.
+After all the changes are written, it is important to flush the
+changes to disk so that they will be preserved in the event of
+a power failure or operating system crash.</p>
+
+<p>This step corresponds to steps
+<a href="#section_3_8">3.8</a>,
+<a href="#section_3_9">3.9</a>, and
+<a href="#section_3_10">3.10</a> in the single-file commit
+scenario described previously.</p>
+
+
+<br clear="both">
+<a name="section_5_5"></a>
+<h2 id="_delete_the_super_journal_file"><span>5.5. </span> Delete The Super-Journal File</h2>
+<img src="images/ac/multi-4.gif" align="right" hspace="15">
+
+<p>The next step is to delete the super-journal file.
+This is the point where the multi-file transaction commits.
+This step corresponds to
+<a href="#section_3_11">step 3.11</a> in the single-file
+commit scenario where the rollback journal is deleted.</p>
+
+<p>If a power failure or operating system crash occurs at this
+point, the transaction will not rollback when the system reboots
+even though there are rollback journals present. The
+difference is the super-journal pathname in the header of the
+rollback journal. Upon restart, SQLite only considers a journal
+to be hot and will only playback the journal if there is no
+super-journal filename in the header (which is the case for
+a single-file commit) or if the super-journal file still
+exists on disk.</p>
+
+<br clear="both">
+<a name="cleanup"></a>
+
+<h2 id="_clean_up_the_rollback_journals"><span>5.6. </span> Clean Up The Rollback Journals</h2>
+<img src="images/ac/multi-5.gif" align="right" hspace="15">
+
+<p>The final step in a multi-file commit is to delete the
+individual rollback journals and drop the exclusive locks on
+the database files so that other processes can see the changes.
+This corresponds to
+<a href="#section_3_12">step 3.12</a> in the single-file
+commit sequence.</p>
+
+<p>The transaction has already committed at this point so timing
+is not critical in the deletion of the rollback journals.
+The current implementation deletes a single rollback journal
+then unlocks the corresponding database file before proceeding
+to the next rollback journal. But in the future we might change
+this so that all rollback journals are deleted before any database
+files are unlocked. As long as the rollback journal is deleted before
+its corresponding database file is unlocked it does not matter in what
+order the rollback journals are deleted or the database files are
+unlocked.</p>
+
+<a name="moredetail"></a>
+
+<h1 id="_additional_details_of_the_commit_process"><span>6. </span> Additional Details Of The Commit Process</h1>
+
+<p><a href="#section_3_0">Section 3.0</a> above provides an overview of
+how atomic commit works in SQLite. But it glosses over a number of
+important details. The following subsections will attempt to fill
+in the gaps.</p>
+
+<a name="completesectors"></a>
+
+<h2 id="_always_journal_complete_sectors"><span>6.1. </span> Always Journal Complete Sectors</h2>
+
+<p>When the original content of a database page is written into
+the rollback journal (as shown in <a href="#section_3_5">section 3.5</a>),
+SQLite always writes a complete sector of data, even if the
+page size of the database is smaller than the sector size.
+Historically, the sector size in SQLite has been hard coded to 512
+bytes and since the minimum page size is also 512 bytes, this has never
+been an issue. But beginning with SQLite version 3.3.14, it is possible
+for SQLite to use mass storage devices with a sector size larger than 512
+bytes. So, beginning with version 3.3.14, whenever any page within a
+sector is written into the journal file, all pages in that same sector
+are stored with it.</p>
+
+<p>It is important to store all pages of a sector in the rollback
+journal in order to prevent database corruption following a power
+loss while writing the sector. Suppose that pages 1, 2, 3, and 4 are
+all stored in sector 1 and that page 2 is modified. In order to write
+the changes to page 2, the underlying hardware must also rewrite the
+content of pages 1, 3, and 4 since the hardware must write the complete
+sector. If this write operation is interrupted by a power outage,
+one or more of the pages 1, 3, or 4 might be left with incorrect data.
+Hence, to avoid lasting corruption to the database, the original content
+of all of those pages must be contained in the rollback journal.</p>
+
+<a name="journalgarbage"></a>
+
+<h2 id="_dealing_with_garbage_written_into_journal_files"><span>6.2. </span> Dealing With Garbage Written Into Journal Files</h2>
+
+<p>When data is appended to the end of the rollback journal,
+SQLite normally makes the pessimistic assumption that the file
+is first extended with invalid "garbage" data and that afterwards
+the correct data replaces the garbage. In other words, SQLite assumes
+that the file size is increased first and then afterwards the content
+is written into the file. If a power failure occurs after the file
+size has been increased but before the file content has been written,
+the rollback journal can be left containing garbage data. If after
+power is restored, another SQLite process sees the rollback journal
+containing the garbage data and tries to roll it back into the original
+database file, it might copy some of the garbage into the database file
+and thus corrupt the database file.</p>
+
+<p>SQLite uses two defenses against this problem. In the first place,
+SQLite records the number of pages in the rollback journal in the header
+of the rollback journal. This number is initially zero. So during an
+attempt to rollback an incomplete (and possibly corrupt) rollback
+journal, the process doing the rollback will see that the journal
+contains zero pages and will thus make no changes to the database. Prior
+to a commit, the rollback journal is flushed to disk to ensure that
+all content has been synced to disk and there is no "garbage" left
+in the file, and only then is the page count in the header changed from
+zero to true number of pages in the rollback journal. The rollback journal
+header is always kept in a separate sector from any page data so that
+it can be overwritten and flushed without risking damage to a data
+page if a power outage occurs. Notice that the rollback journal
+is flushed to disk twice: once to write the page content and a second
+time to write the page count in the header.</p>
+
+<p>The previous paragraph describes what happens when the
+synchronous pragma setting is "full".</p>
+
+<blockquote>
+PRAGMA synchronous=FULL;
+</blockquote>
+
+<p>The default synchronous setting is full so the above is what usually
+happens. However, if the synchronous setting is lowered to "normal",
+SQLite only flushes the rollback journal once, after the page count has
+been written.
+This carries a risk of corruption because it might happen that the
+modified (non-zero) page count reaches the disk surface before all
+of the data does. The data will have been written first, but SQLite
+assumes that the underlying filesystem can reorder write requests and
+that the page count can be burned into oxide first even though its
+write request occurred last. So as a second line of defense, SQLite
+also uses a 32-bit checksum on every page of data in the rollback
+journal. This checksum is evaluated for each page during rollback
+while rolling back a journal as described in
+<a href="#section_4_4">section 4.4</a>. If an incorrect checksum
+is seen, the rollback is abandoned. Note that the checksum does
+not guarantee that the page data is correct since there is a small
+but finite probability that the checksum might be right even if the data is
+corrupt. But the checksum does at least make such an error unlikely.
+</p>
+
+<p>Note that the checksums in the rollback journal are not necessary
+if the synchronous setting is FULL. We only depend on the checksums
+when synchronous is lowered to NORMAL. Nevertheless, the checksums
+never hurt and so they are included in the rollback journal regardless
+of the synchronous setting.</p>
+
+<a name="cachespill"></a>
+
+<h2 id="_cache_spill_prior_to_commit"><span>6.3. </span> Cache Spill Prior To Commit</h2>
+
+<p>The commit process shown in <a href="#section_3_0">section 3.0</a>
+assumes that all database changes fit in memory until it is time to
+commit. This is the common case. But sometimes a larger change will
+overflow the user-space cache prior to transaction commit. In those
+cases, the cache must spill to the database before the transaction
+is complete.</p>
+
+<p>At the beginning of a cache spill, the status of the database
+connection is as shown in <a href="#section_3_6">step 3.6</a>.
+Original page content has been saved in the rollback journal and
+modifications of the pages exist in user memory. To spill the cache,
+SQLite executes steps <a href="#section_3_7">3.7</a> through
+<a href="#section_3_9">3.9</a>. In other words, the rollback journal
+is flushed to disk, an exclusive lock is acquired, and changes are
+written into the database. But the remaining steps are deferred
+until the transaction really commits. A new journal header is
+appended to the end of the rollback journal (in its own sector)
+and the exclusive database lock is retained, but otherwise processing
+returns to <a href="#section_3_6">step 3.6</a>. When the transaction
+commits, or if another cache spill occurs, steps
+<a href="#section_3_7">3.7</a> and <a href="#section_3_9">3.9</a> are
+repeated. (Step <a href="#section_3_8">3.8</a> is omitted on second
+and subsequent passes since an exclusive database lock is already held
+due to the first pass.)</p>
+
+<p>A cache spill causes the lock on the database file to
+escalate from reserved to exclusive. This reduces concurrency.
+A cache spill also causes extra disk flush or fsync operations to
+occur and these operations are slow, hence a cache spill can
+seriously reduce performance.
+For these reasons a cache spill is avoided whenever possible.</p>
+
+<a name="opts"></a>
+
+<h1 id="_optimizations"><span>7. </span> Optimizations</h1>
+
+<p>Profiling indicates that for most systems and in most circumstances
+SQLite spends most of its time doing disk I/O. It follows then that
+anything we can do to reduce the amount of disk I/O will likely have a
+large positive impact on the performance of SQLite. This section
+describes some of the techniques used by SQLite to try to reduce the
+amount of disk I/O to a minimum while still preserving atomic commit.</p>
+
+<a name="keepcache"></a>
+
+<h2 id="_cache_retained_between_transactions"><span>7.1. </span> Cache Retained Between Transactions</h2>
+
+<p><a href="#section_3_12">Step 3.12</a> of the commit process shows
+that once the shared lock has been released, all user-space cache
+images of database content must be discarded. This is done because
+without a shared lock, other processes are free to modify the database
+file content and so any user-space image of that content might become
+obsolete. Consequently, each new transaction would begin by rereading
+data which had previously been read. This is not as bad as it sounds
+at first since the data being read is still likely in the operating
+systems file cache. So the "read" is really just a copy of data
+from kernel space into user space. But even so, it still takes time.</p>
+
+<p>Beginning with SQLite version 3.3.14 a mechanism has been added
+to try to reduce the needless rereading of data. In newer versions
+of SQLite, the data in the user-space pager cache is retained when
+the lock on the database file is released. Later, after the
+shared lock is acquired at the beginning of the next transaction,
+SQLite checks to see if any other process has modified the database
+file. If the database has been changed in any way since the lock
+was last released, the user-space cache is erased at that point.
+But commonly the database file is unchanged and the user-space cache
+can be retained, and some unnecessary read operations can be avoided.</p>
+
+<p>In order to determine whether or not the database file has changed,
+SQLite uses a counter in the database header (in bytes 24 through 27)
+which is incremented during every change operation. SQLite saves a copy
+of this counter prior to releasing its database lock. Then after
+acquiring the next database lock it compares the saved counter value
+against the current counter value and erases the cache if the values
+are different, or reuses the cache if they are the same.</p>
+
+<a name="section_7_2"></a>
+<h2 id="_exclusive_access_mode"><span>7.2. </span> Exclusive Access Mode</h2>
+
+<p>SQLite version 3.3.14 adds the concept of "Exclusive Access Mode".
+In exclusive access mode, SQLite retains the exclusive
+database lock at the conclusion of each transaction. This prevents
+other processes from accessing the database, but in many deployments
+only a single process is using a database so this is not a
+serious problem. The advantage of exclusive access mode is that
+disk I/O can be reduced in three ways:</p>
+
+<ol>
+<li><p>It is not necessary to increment the change counter in the
+database header for transactions after the first transaction. This
+will often save a write of page one to both the rollback
+journal and the main database file.</p></li>
+
+<li><p>No other processes can change the database so there is never
+a need to check the change counter and clear the user-space cache
+at the beginning of a transaction.</p></li>
+
+<li><p>Each transaction can be committed by overwriting the rollback
+journal header with zeros rather than deleting the journal file.
+This avoids having to modify the directory entry for the journal file
+and it avoids having to deallocate disk sectors associated with the
+journal. Furthermore, the next transaction will overwrite existing
+journal file content rather than append new content and on most systems
+overwriting is much faster than appending.</p></li>
+</ol>
+
+<p>The third optimization, zeroing the journal file header rather than
+deleting the rollback journal file,
+does not depend on holding an exclusive lock at all times.
+This optimization can be set independently of exclusive lock mode
+using the <a href="pragma.html#pragma_journal_mode">journal_mode pragma</a>
+as described in <a href="#section_7_6">section 7.6</a> below.</p>
+
+<a name="freelistjrnl"></a>
+
+<h2 id="_do_not_journal_freelist_pages"><span>7.3. </span> Do Not Journal Freelist Pages</h2>
+
+<p>When information is deleted from an SQLite database, the pages used
+to hold the deleted information are added to a "<a href="fileformat2.html#freelist">freelist</a>". Subsequent
+inserts will draw pages off of this freelist rather than expanding the
+database file.</p>
+
+<p>Some freelist pages contain critical data; specifically the locations
+of other freelist pages. But most freelist pages contain nothing useful.
+These latter freelist pages are called "leaf" pages. We are free to
+modify the content of a leaf freelist page in the database without
+changing the meaning of the database in any way.</p>
+
+<p>Because the content of leaf freelist pages is unimportant, SQLite
+avoids storing leaf freelist page content in the rollback journal
+in <a href="#section_3_5">step 3.5</a> of the commit process.
+If a leaf freelist page is changed and that change does not get rolled back
+during a transaction recovery, the database is not harmed by the omission.
+Similarly, the content of a new freelist page is never written back
+into the database at <a href="#section_3_9">step 3.9</a> nor
+read from the database at <a href="#section_3_3">step 3.3</a>.
+These optimizations can greatly reduce the amount of I/O that occurs
+when making changes to a database file that contains free space.</p>
+
+<a name="atomicsector"></a>
+
+<h2 id="_single_page_updates_and_atomic_sector_writes"><span>7.4. </span> Single Page Updates And Atomic Sector Writes</h2>
+
+<p>Beginning in SQLite version 3.5.0, the new Virtual File System (VFS)
+interface contains a method named xDeviceCharacteristics which reports
+on special properties that the underlying mass storage device
+might have. Among the special properties that
+xDeviceCharacteristics might report is the ability of to do an
+atomic sector write.</p>
+
+<p>Recall that by default SQLite assumes that sector writes are
+linear but not atomic. A linear write starts at one end of the
+sector and changes information byte by byte until it gets to the
+other end of the sector. If a power loss occurs in the middle of
+a linear write then part of the sector might be modified while the
+other end is unchanged. In an atomic sector write, either the entire
+sector is overwritten or else nothing in the sector is changed.</p>
+
+<p>We believe that most modern disk drives implement atomic sector
+writes. When power is lost, the drive uses energy stored in capacitors
+and/or the angular momentum of the disk platter to provide power to
+complete any operation in progress. Nevertheless, there are so many
+layers in between the write system call and the on-board disk drive
+electronics that we take the safe approach in both Unix and w32 VFS
+implementations and assume that sector writes are not atomic. On the
+other hand, device
+manufacturers with more control over their filesystems might want
+to consider enabling the atomic write property of xDeviceCharacteristics
+if their hardware really does do atomic writes.</p>
+
+<p>When sector writes are atomic and the page size of a database is
+the same as a sector size, and when there is a database change that
+only touches a single database page, then SQLite skips the whole
+journaling and syncing process and simply writes the modified page
+directly into the database file. The change counter in the first
+page of the database file is modified separately since no harm is
+done if power is lost before the change counter can be updated.</p>
+
+<a name="safeappend"></a>
+
+<h2 id="_filesystems_with_safe_append_semantics"><span>7.5. </span> Filesystems With Safe Append Semantics</h2>
+
+<p>Another optimization introduced in SQLite version 3.5.0 makes
+use of "safe append" behavior of the underlying disk.
+Recall that SQLite assumes that when data is appended to a file
+(specifically to the rollback journal) that the size of the file
+is increased first and that the content is written second. So
+if power is lost after the file size is increased but before the
+content is written, the file is left containing invalid "garbage"
+data. The xDeviceCharacteristics method of the VFS might, however,
+indicate that the filesystem implements "safe append" semantics.
+This means that the content is written before the file size is
+increased so that it is impossible for garbage to be introduced
+into the rollback journal by a power loss or system crash.</p>
+
+<p>When safe append semantics are indicated for a filesystem,
+SQLite always stores the special value of -1 for the page count
+in the header of the rollback journal. The -1 page count value
+tells any process attempting to rollback the journal that the
+number of pages in the journal should be computed from the journal
+size. This -1 value is never changed. So that when a commit
+occurs, we save a single flush operation and a sector write of
+the first page of the journal file. Furthermore, when a cache
+spill occurs we no longer need to append a new journal header
+to the end of the journal; we can simply continue appending
+new pages to the end of the existing journal.</p>
+
+<a name="section_7_6"></a>
+<h2 id="_persistent_rollback_journals"><span>7.6. </span> Persistent Rollback Journals</h2>
+
+<p>Deleting a file is an expensive operation on many systems.
+So as an optimization, SQLite can be configured to avoid the
+delete operation of <a href="#section_3_11">section 3.11</a>.
+Instead of deleting the journal file in order to commit a transaction,
+the file is either truncated to zero bytes in length or its
+header is overwritten with zeros. Truncating the file to zero
+length saves having to make modifications to the directory containing
+the file since the file is not removed from the directory.
+Overwriting the header has the additional savings of not having
+to update the length of the file (in the "inode" on many systems)
+and not having to deal with newly freed disk sectors. Furthermore,
+at the next transaction the journal will be created by overwriting
+existing content rather than appending new content onto the end
+of a file, and overwriting is often much faster than appending.</p>
+
+<p>SQLite can be configured to commit transactions by overwriting
+the journal header with zeros instead of deleting the journal file
+by setting the "PERSIST" journaling mode using the
+<a href="pragma.html#pragma_journal_mode">journal_mode</a> PRAGMA.
+For example:</p>
+
+<blockquote><pre>
+PRAGMA journal_mode=PERSIST;
+</pre></blockquote>
+
+<p>The use of persistent journal mode provides a noticeable performance
+improvement on many systems. Of course, the drawback is that the
+journal files remain on the disk, using disk space and cluttering
+directories, long after the transaction commits. The only safe way
+to delete a persistent journal file is to commit a transaction
+with journaling mode set to DELETE:</p>
+
+<blockquote><pre>
+PRAGMA journal_mode=DELETE;
+BEGIN EXCLUSIVE;
+COMMIT;
+</pre></blockquote>
+
+<p>Beware of deleting persistent journal files by any other means
+since the journal file might be hot, in which case deleting it will
+corrupt the corresponding database file.</p>
+
+<p>Beginning in SQLite <a href="releaselog/3_6_4.html">version 3.6.4</a> (2008-10-15),
+the TRUNCATE journal mode is
+also supported:</p>
+
+<blockquote><pre>
+PRAGMA journal_mode=TRUNCATE;
+</pre></blockquote>
+
+<p>In truncate journal mode, the transaction is committed by truncating
+the journal file to zero length rather than deleting the journal file
+(as in DELETE mode) or by zeroing the header (as in PERSIST mode).
+TRUNCATE mode shares the advantage of PERSIST mode that the directory
+that contains the journal file and database does not need to be updated.
+Hence truncating a file is often faster than deleting it. TRUNCATE has
+the additional advantage that it is not followed by a
+system call (ex: fsync()) to synchronize the change to disk. It might
+be safer if it did.
+But on many modern filesystems, a truncate is an atomic and
+synchronous operation and so we think that TRUNCATE will usually be safe
+in the face of power failures. If you are uncertain about whether or
+not TRUNCATE will be synchronous and atomic on your filesystem and it is
+important to you that your database survive a power loss or operating
+system crash that occurs during the truncation operation, then you might
+consider using a different journaling mode.</p>
+
+<p>On embedded systems with synchronous filesystems, TRUNCATE results
+in slower behavior than PERSIST. The commit operation is the same speed.
+But subsequent transactions are slower following a TRUNCATE because it is
+faster to overwrite existing content than to append to the end of a file.
+New journal file entries will always be appended following a TRUNCATE but
+will usually overwrite with PERSIST.</p>
+
+<a name="testing"></a>
+
+<h1 id="_testing_atomic_commit_behavior"><span>8. </span> Testing Atomic Commit Behavior</h1>
+
+<p>The developers of SQLite are confident that it is robust
+in the face of power failures and system crashes because the
+automatic test procedures do extensive checks on
+the ability of SQLite to recover from simulated power loss.
+We call these the "crash tests".</p>
+
+<p>Crash tests in SQLite use a modified VFS that can simulate
+the kinds of filesystem damage that occur during a power
+loss or operating system crash. The crash-test VFS can simulate
+incomplete sector writes, pages filled with garbage data because
+a write has not completed, and out of order writes, all occurring
+at varying points during a test scenario. Crash tests execute
+transactions over and over, varying the time at which a simulated
+power loss occurs and the properties of the damage inflicted.
+Each test then reopens the database after the simulated crash and
+verifies that the transaction either occurred completely
+or not at all and that the database is in a completely
+consistent state.</p>
+
+<p>The crash tests in SQLite have discovered a number of very
+subtle bugs (now fixed) in the recovery mechanism. Some of
+these bugs were very obscure and unlikely to have been found
+using only code inspection and analysis techniques. From this
+experience, the developers of SQLite feel confident that any other
+database system that does not use a similar crash test system
+likely contains undetected bugs that will lead to database
+corruption following a system crash or power failure.</p>
+
+<a name="sect_9_0"></a>
+
+<h1 id="_things_that_can_go_wrong"><span>9. </span> Things That Can Go Wrong</h1>
+
+<p>The atomic commit mechanism in SQLite has proven to be robust,
+but it can be circumvented by a sufficiently creative
+adversary or a sufficiently broken operating system implementation.
+This section describes a few of the ways in which an SQLite database
+might be corrupted by a power failure or system crash.
+(See also: <a href="howtocorrupt.html">How To Corrupt Your Database Files</a>.)</p>
+
+<a name="brokenlocks"></a>
+
+<h2 id="_broken_locking_implementations"><span>9.1. </span> Broken Locking Implementations</h2>
+
+<p>SQLite uses filesystem locks to make sure that only one
+process and database connection is trying to modify the database
+at a time. The filesystem locking mechanism is implemented
+in the VFS layer and is different for every operating system.
+SQLite depends on this implementation being correct. If something
+goes wrong and two or more processes are able to write the same
+database file at the same time, severe damage can result.</p>
+
+<p>We have received reports of implementations of both
+Windows network filesystems and NFS in which locking was
+subtly broken. We can not verify these reports, but as
+locking is difficult to get right on a network filesystem
+we have no reason to doubt them. You are advised to
+avoid using SQLite on a network filesystem in the first place,
+since performance will be slow. But if you must use a
+network filesystem to store SQLite database files, consider
+using a secondary locking mechanism to prevent simultaneous
+writes to the same database even if the native filesystem
+locking mechanism malfunctions.</p>
+
+<p>The versions of SQLite that come preinstalled on Apple
+Mac OS X computers contain a version of SQLite that has been
+extended to use alternative locking strategies that work on
+all network filesystems that Apple supports. These extensions
+used by Apple work great as long as all processes are accessing
+the database file in the same way. Unfortunately, the locking
+mechanisms do not exclude one another, so if one process is
+accessing a file using (for example) AFP locking and another
+process (perhaps on a different machine) is using dot-file locks,
+the two processes might collide because AFP locks do not exclude
+dot-file locks or vice versa.</p>
+
+<a name="fsync"></a>
+
+<h2 id="_incomplete_disk_flushes"><span>9.2. </span> Incomplete Disk Flushes</h2>
+
+<p>SQLite uses the fsync() system call on Unix and the FlushFileBuffers()
+system call on w32 in order to sync the file system buffers onto disk
+oxide as shown in <a href="#section_3_7">step 3.7</a> and
+<a href="#section_3_10">step 3.10</a>. Unfortunately, we have received
+reports that neither of these interfaces works as advertised on many
+systems. We hear that FlushFileBuffers() can be completely disabled
+using registry settings on some Windows versions. Some historical
+versions of Linux contain versions of fsync() which are no-ops on
+some filesystems, we are told. Even on systems where
+FlushFileBuffers() and fsync() are said to be working, often
+the IDE disk control lies and says that data has reached oxide
+while it is still held only in the volatile control cache.</p>
+
+<p>On the Mac, you can set this pragma:</p>
+
+<blockquote>
+PRAGMA fullfsync=ON;
+</blockquote>
+
+<p>Setting fullfsync on a Mac will guarantee that data really does
+get pushed out to the disk platter on a flush. But the implementation
+of fullfsync involves resetting the disk controller. And so not only
+is it profoundly slow, it also slows down other unrelated disk I/O.
+So its use is not recommended.</p>
+
+<a name="filedel"></a>
+
+<h2 id="_partial_file_deletions"><span>9.3. </span> Partial File Deletions</h2>
+
+<p>SQLite assumes that file deletion is an atomic operation from the
+point of view of a user process. If power fails in the middle of
+a file deletion, then after power is restored SQLite expects to see
+either the entire file with all of its original data intact, or it
+expects not to find the file at all. Transactions may not be atomic
+on systems that do not work this way.</p>
+
+<a name="filegarbage"></a>
+
+<h2 id="_garbage_written_into_files"><span>9.4. </span> Garbage Written Into Files</h2>
+
+<p>SQLite database files are ordinary disk files that can be
+opened and written by ordinary user processes. A rogue process
+can open an SQLite database and fill it with corrupt data.
+Corrupt data might also be introduced into an SQLite database
+by bugs in the operating system or disk controller; especially
+bugs triggered by a power failure. There is nothing SQLite can
+do to defend against these kinds of problems.</p>
+
+<a name="mvhotjrnl"></a>
+
+<h2 id="_deleting_or_renaming_a_hot_journal"><span>9.5. </span> Deleting Or Renaming A Hot Journal</h2>
+
+<p>If a crash or power loss does occur and a hot journal is left on
+the disk, it is essential that the original database file and the hot
+journal remain on disk with their original names until the database
+file is opened by another SQLite process and rolled back.
+During recovery at <a href="#section_4_2">step 4.2</a> SQLite locates
+the hot journal by looking for a file in the same directory as the
+database being opened and whose name is derived from the name of the
+file being opened. If either the original database file or the
+hot journal have been moved or renamed, then the hot journal will
+not be seen and the database will not be rolled back.</p>
+
+<p>We suspect that a common failure mode for SQLite recovery happens
+like this: A power failure occurs. After power is restored, a well-meaning
+user or system administrator begins looking around on the disk for
+damage. They see their database file named "important.data". This file
+is perhaps familiar to them. But after the crash, there is also a
+hot journal named "important.data-journal". The user then deletes
+the hot journal, thinking that they are helping to cleanup the system.
+We know of no way to prevent this other than user education.</p>
+
+<p>If there are multiple (hard or symbolic) links to a database file,
+the journal will be created using the name of the link through which
+the file was opened. If a crash occurs and the database is opened again
+using a different link, the hot journal will not be located and no
+rollback will occur.</p>
+
+<p>Sometimes a power failure will cause a filesystem to be corrupted
+such that recently changed filenames are forgotten and the file is
+moved into a "/lost+found" directory. When that happens, the hot
+journal will not be found and recovery will not occur.
+SQLite tries to prevent this
+by opening and syncing the directory containing the rollback journal
+at the same time it syncs the journal file itself. However, the
+movement of files into /lost+found can be caused by unrelated processes
+creating unrelated files in the same directory as the main database file.
+And since this is out from under the control of SQLite, there is nothing
+that SQLite can do to prevent it. If you are running on a system that
+is vulnerable to this kind of filesystem namespace corruption (most
+modern journalling filesystems are immune, we believe) then you might
+want to consider putting each SQLite database file in its own private
+subdirectory.</p>
+
+<a name="future"></a>
+
+<h1 id="_future_directions_and_conclusion"><span>10. </span> Future Directions And Conclusion</h1>
+
+<p>Every now and then someone discovers a new failure mode for
+the atomic commit mechanism in SQLite and the developers have to
+put in a patch. This is happening less and less and the
+failure modes are becoming more and more obscure. But it would
+still be foolish to suppose that the atomic commit logic of
+SQLite is entirely bug-free. The developers are committed to fixing
+these bugs as quickly as they might be found.</p>
+
+<p>
+The developers are also on the lookout for new ways to
+optimize the commit mechanism. The current VFS implementations
+for Unix (Linux and Mac OS X) and Windows make pessimistic assumptions about
+the behavior of those systems. After consultation with experts
+on how these systems work, we might be able to relax some of the
+assumptions on these systems and allow them to run faster. In
+particular, we suspect that most modern filesystems exhibit the
+safe append property and that many of them might support atomic
+sector writes. But until this is known for certain, SQLite will
+take the conservative approach and assume the worst.</p>
+<p align="center"><small><i>This page last modified on <a href="https://sqlite.org/docsrc/honeypot" id="mtimelink" data-href="https://sqlite.org/docsrc/finfo/pages/atomiccommit.in?m=daed65c9ccb4a9262">2021-10-05 17:51:47</a> UTC </small></i></p>
+