Diffstat (limited to 'doc/wiki/MailboxFormat.mbox.txt')
-rw-r--r-- | doc/wiki/MailboxFormat.mbox.txt | 290 |
1 files changed, 290 insertions, 0 deletions
diff --git a/doc/wiki/MailboxFormat.mbox.txt b/doc/wiki/MailboxFormat.mbox.txt
new file mode 100644
index 0000000..438f9d8
--- /dev/null
+++ b/doc/wiki/MailboxFormat.mbox.txt
@@ -0,0 +1,290 @@
+Mbox Mailbox Format
+===================
+
+Contents
+
+
+ 1. Mbox Mailbox Format
+
+     1. Locking
+
+         1. Dotlock
+
+         2. Deadlocks
+
+     2. Directory Structure
+
+     3. Dovecot's Metadata
+
+     4. Dovecot's Speed Optimizations
+
+     5. From Escaping
+
+     6. Mbox Variants
+
+     7. References
+
+Usually UNIX systems are configured by default to deliver mail to
+'/var/mail/username' or '/var/spool/mail/username' mboxes. In the IMAP world
+these files are called INBOX mailboxes. The IMAP protocol supports multiple
+mailboxes, however, so there needs to be a place for them as well. Typically
+they're stored in '~/mail/' or '~/Mail/' directories.
+
+The mbox file contains all the messages of a single mailbox. Because of this,
+the mbox format is typically thought of as a slow format. However with
+Dovecot's indexing this isn't true. Only expunging messages from the beginning
+of a large mbox file is slow with Dovecot; most other operations should be
+fast. Also because all the mails are in a single file, searching is much faster
+than with maildir.
+
+Modifications to mbox may require moving data around within the file, so
+interruptions (e.g. power failures) can cause the mbox to break more or less
+badly. Although Dovecot tries to minimize the damage by moving the data in a
+way that data should never get lost (only duplicated), mboxes still aren't
+recommended to be used for important data.
+
+Locking
+-------
+
+Locking is a mess with mboxes. There are multiple different ways to lock an
+mbox, and software often uses incompatible locking. See <MboxLocking.txt> for
+how to check what locking methods some commonly used programs use.
+
+There are at least four different ways to lock an mbox:
+
+ * *dotlock*: 'mailboxname.lock' file created by almost all software when
+   writing to mboxes.
+   This grants the writer an exclusive lock over the mbox,
+   so it's usually not used while reading the mbox so that other processes can
+   also read it at the same time. So while using a dotlock typically prevents
+   actual mailbox corruption, it doesn't protect against read errors if the
+   mailbox is modified while a process is reading.
+ * *flock*: 'flock()' system call is quite commonly used for both read and
+   write locking. The read lock allows multiple processes to obtain a read lock
+   for the mbox, so it works well for reading as well. The one downside to it
+   is that it doesn't work if mailboxes are stored on NFS.
+ * *fcntl*: Very similar to *flock*, also commonly used by software. In some
+   systems this 'fcntl()' system call is compatible with 'flock()', but in
+   other systems it's not, so you shouldn't rely on it. *fcntl* works with NFS
+   if you're using the lockd daemon on both the NFS server and client.
+ * *lockf*: POSIX 'lockf()' locking. Because it allows creating only exclusive
+   locks, it's somewhat useless so Dovecot doesn't support it. On Linux
+   'lockf()' is internally compatible with 'fcntl()' locks, but again you
+   shouldn't rely on this.
+
+Dotlock
+-------
+
+Another problem with dotlocks is that if the mailboxes exist in '/var/mail/',
+the user may not have write access to the directory, so the dotlock file can't
+be created. There are a couple of ways to work around this:
+
+ * Give a mail group write access to the directory and then make sure that all
+   software requiring access to the directory runs with the group's privileges.
+   This may mean making the binary itself setgid-mail, or using a separate
+   dotlock helper program which is setgid-mail. With Dovecot this can be done
+   by setting 'mail_privileged_group = mail'.
+ * Set the sticky bit on the directory ('chmod +t /var/mail'). This makes it
+   somewhat safe to use, because users can't delete each other's mailboxes, but
+   they can still create new files (the dotlock files).
+   The downside to this is
+   that users can create whatever files they wish in there, such as an mbox for
+   a newly created user who hasn't yet received mail.
+
+Deadlocks
+---------
+
+If multiple lock methods are used, which is usually the case since dotlocks
+aren't typically used for read locking, the order in which the locking is done
+is important. Consider two programs running at the same time, both using
+dotlock and fcntl locking but in a different order:
+
+ * Program A: fcntl locks the mbox
+ * Program B at the same time: dotlocks the mbox
+ * Program A continues: tries to dotlock the mbox, but since it's already
+   dotlocked by B, it starts waiting
+ * Program B continues: tries to fcntl lock the mbox, but since it's already
+   fcntl locked by A, it starts waiting
+
+Now both of them are waiting for each other's locks. Finally after a couple of
+minutes they time out and fail the operation.
+
+Directory Structure
+-------------------
+
+By default, when listing mailboxes, Dovecot simply assumes that all files it
+sees are mboxes and all directories mean that they contain sub-mailboxes. There
+are two special cases however which aren't listed:
+
+ * '.subscriptions' file contains IMAP's mailbox subscriptions.
+ * '.imap/' directory contains Dovecot's index files.
+
+Because it's not possible to have a file which is also a directory, it's not
+normally possible to create a mailbox and child mailboxes under it.
+
+However if you really want to be able to have mailboxes containing both
+messages and child mailboxes under mbox, then Dovecot can be configured to do
+this, subject to certain provisos; see <MboxChildFolders.txt>.
+
+Dovecot's Metadata
+------------------
+
+Dovecot uses C-Client (i.e. UW-IMAP, Pine) compatible headers in mbox messages
+to store metadata.
+These headers are:
+
+ * X-IMAPbase: Contains UIDVALIDITY, last used UID and list of used keywords
+ * X-IMAP: Same as X-IMAPbase but also specifies that the message is a "pseudo
+   message"
+ * X-UID: Message's allocated UID
+ * Status: R (\Seen) and O (non-\Recent) flags
+ * X-Status: A (\Answered), F (\Flagged), T (\Draft) and D (\Deleted) flags
+ * X-Keywords: Message's keywords
+ * Content-Length: Length of the message body in bytes
+
+Whenever any of these headers exist, Dovecot treats them as its own private
+metadata. It does sanity checks for them, so the headers may also be modified
+or removed completely. None of these headers are sent to IMAP/POP3 clients when
+they read the mail.
+
+*The MTA, MDA or LDA should strip all these headers _case-insensitively_ before
+writing the mail to the mbox.*
+
+Only the first message contains the X-IMAP or X-IMAPbase header. The difference
+is that when all the messages are deleted from the mbox file, a "pseudo
+message" containing the X-IMAP header is written to the mbox. This is the
+"DON'T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA" message which you hate
+seeing when using non-C-client and non-Dovecot software. It is however
+important for preventing abuse: otherwise the first mail which is received
+could contain a faked X-IMAPbase header, which could cause trouble.
+
+If a message contains an X-Keywords header, it contains a space-separated list
+of keywords for the mail. Since the same header can come from the mail's
+sender, only the keywords that are listed in the X-IMAP header are used.
+
+The UID for a new message is calculated as the "last used UID" in the X-IMAP
+header + 1. This is always done, so fake X-UID headers don't really matter.
+This is also why the pseudo message is important. Otherwise the UIDs could
+easily grow over 2^31, which some clients start treating as negative numbers,
+which then causes all kinds of problems. Also, when 2^32 is exceeded, Dovecot
+itself will start having some problems.
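
The keyword filtering and UID assignment described above can be sketched
roughly as follows. This is only an illustration, not Dovecot's code: the
function names are hypothetical, the sample header value is invented, and the
whitespace-separated layout "UIDVALIDITY, last used UID, keywords" is assumed
from the X-IMAPbase description in the header list. The 2^32 check is likewise
an illustrative choice.

```python
def parse_imapbase(value):
    """Split an X-IMAPbase/X-IMAP value into (uidvalidity, last_uid, keywords).

    Assumed layout: "<uidvalidity> <last used UID> [keyword ...]".
    """
    fields = value.split()
    if len(fields) < 2:
        raise ValueError("expected at least UIDVALIDITY and last used UID")
    uidvalidity, last_uid = int(fields[0]), int(fields[1])
    return uidvalidity, last_uid, fields[2:]


def next_uid(last_uid):
    """A new message gets last used UID + 1; a sender-supplied X-UID is ignored."""
    uid = last_uid + 1
    if uid >= 2**32:
        # Illustrative guard only: the text just says problems start past 2^32.
        raise OverflowError("UID space exhausted")
    return uid


def trusted_keywords(x_keywords, known_keywords):
    """Keep only the X-Keywords entries that the mailbox header already lists."""
    known = set(known_keywords)
    return [kw for kw in x_keywords.split() if kw in known]


# Invented sample value: UIDVALIDITY, zero-padded last used UID, two keywords.
uidvalidity, last_uid, known = parse_imapbase("1096038620 0000010517 Work $Label1")
print(next_uid(last_uid))                       # 10518
print(trusted_keywords("Work Spoofed", known))  # ['Work']
```

Because the next UID depends only on the mailbox-level header, a forged X-UID
or an unknown keyword in an incoming mail has no effect in this sketch.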
+
+Content-Length is used as long as another valid mail starts after that many
+bytes. Because the byte count must be exact, it's quite unlikely that abusing
+it can cause messages to be skipped (or rather appended to the previous
+message's body).
+
+Status and X-Status headers are trusted completely, so it's a pretty good idea
+to filter them in the LDA if possible.
+
+Dovecot's Speed Optimizations
+-----------------------------
+
+Updating messages' flags and keywords can be a slow operation since you may
+have to insert a new header (Status, X-Status, X-Keywords) or at least insert
+data in the header's value. Some mbox MUAs do this simply by rewriting all of
+the mbox after the inserted data. If the mbox is large, this can be very slow.
+Dovecot optimizes this by always leaving some space characters after some of
+its internal headers. It can use this space to move only the minimal amount of
+data needed to get the new data inserted. Also if data is removed, it just
+grows these space areas.
+
+'mbox_lazy_writes' setting works by adding and/or updating Dovecot's metadata
+headers only after closing the mailbox or when messages are expunged from the
+mailbox. C-Client works the same way. The upside of this is that it reduces
+writes because multiple flag updates to the same message can be grouped, and
+sometimes the writes don't have to be done at all if the whole message is
+expunged. The downside is that other processes don't notice the changes
+immediately (but other Dovecot processes do notice because the changes are in
+index files).
+
+'mbox_dirty_syncs' setting tries to avoid re-reading the mbox every time
+something changes. Whenever the mbox changes (i.e. timestamp or size), it first
+checks if the mailbox's size changed. If it didn't, it most likely means that
+only message flags were changed, so it does a full mbox read to find them. If
+the mailbox shrank, it means that mails were expunged and again Dovecot does a
+full sync.
+Usually however the only thing besides Dovecot that modifies the mbox is
+the LDA which appends new mails to the mbox. So if the mbox size has grown,
+Dovecot first checks if the last known message is still where it was last time.
+If it is, Dovecot reads only the newly added messages and goes into a "dirty
+mode". As long as Dovecot is in dirty mode, it can't be certain that mails are
+where it expects them to be, so whenever accessing some mail, it first verifies
+that it really is the correct mail by finding its X-UID header. If the X-UID
+header is different, it falls back to a full sync to find the mail's correct
+position. The dirty mode goes away after a full sync. If 'mbox_lazy_writes' is
+enabled and the mail doesn't yet have an X-UID header, Dovecot uses the MD5 sum
+of a couple of headers to compare the mails.
+
+'mbox_very_dirty_syncs' does the same as 'mbox_dirty_syncs', but the dirty
+state is kept also when opening the mailbox. Normally opening the mailbox does
+a full sync if it had been changed outside Dovecot.
+
+From Escaping
+-------------
+
+In mboxes a new mail always begins with a "From " line, commonly referred to as
+From_-line. To avoid confusion, lines beginning with "From " in message bodies
+are usually prefixed with a '>' character while the message is being written to
+the mbox.
+
+Dovecot doesn't currently do this escaping however. Instead it prevents this
+confusion by adding Content-Length headers so it knows later where the next
+message begins. Nor does Dovecot remove the '>' characters before sending
+the data to clients. Both of these will probably be implemented later.
+
+Mbox Variants
+-------------
+
+There are a few minor variants of this format:
+
+*mboxo* is the name of the original mbox format, which originated with Unix
+System V. Messages are stored in a single file, with each message beginning
+with a line containing "From SENDER DATE".
+If "From " (case-sensitive, with the space)
+occurs at the beginning of a line anywhere in the email, it is escaped with a
+greater-than sign (to ">From "). Lines already quoted as such, for example
+">From " or ">>>From " are *not* quoted again, which leads to irrecoverable
+corruption of the message content.
+
+*mboxrd* was named for Rahul Dhesi in June 1995, though several people came up
+with the same idea around the same time. An issue with the mboxo format was
+that if the text ">From " appeared in the body of an email (such as from a
+reply quote), it was not possible to distinguish this from the mailbox format's
+quoted ">From ". mboxrd fixes this by always quoting already quoted "From "
+lines (e.g. ">From ", ">>From ", ">>>From ", etc.) as well, so readers can just
+remove the first ">" character. This format is used by qmail and getmail
+(>=4.35.0).
+
+The *mboxcl* format originated with the Unix System V Release 4 mail tools. It
+adds a Content-Length field which indicates the number of bytes in the message.
+This is used to determine message boundaries. It still quotes "From " as the
+original mboxo format does (and *not* as mboxrd does it).
+
+*mboxcl2* is like mboxcl but does away with the "From " quoting.
+
+*MMDF* (Multi-channel Memorandum Distribution Facility mailbox format)
+originated with the MMDF daemon. The format surrounds each message with lines
+containing four control-A's. This eliminates the need to escape "From " lines.
+
+Dovecot currently uses the mboxcl2 format internally, but it's planned to move
+to a combination of mboxrd and mboxcl.
+
+*How is a message stored in an mbox file read?*
+
+ * A mail reader scans through the mbox file looking for From_ lines.
+ * Any From_ line marks the beginning of a message.
+ * Once the reader finds a message, it extracts a (possibly corrupted) envelope
+   sender and delivery date out of the From_ line.
+ * It then reads until the next From_ line or the end of the file,
+   whichever comes first.
+ * It removes the last blank line and deletes the quoting of >From_ lines and
+   >>From_ lines and so on.
+
+References
+----------
+
+ * Wikipedia [http://en.wikipedia.org/wiki/Mbox]
+ * Qmail mbox [http://www.qmail.org/man/man5/mbox.html]
+ * Mbox family
+   [http://homepage.ntlworld.com/jonathan.deboynepollard/FGA/mail-mbox-formats.html]
+ * CommuniGatePro mbox
+   [http://www.communigate.com/CommuniGatePro/Mailboxes.html#mbox]
+ * MBOX File Viewer [http://www.freeviewer.org/mbox/]
+
+(This file was created from the wiki on 2019-06-19 12:42)