summaryrefslogtreecommitdiffstats
path: root/doc/wiki/Design.Indexes.TransactionLog.txt
blob: e86751b9c570c604aba1c56de178e0adf21b7262 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
Transaction log
===============

The transaction log is a bit similar to transaction logs in databases. All the
updates to the main index files are first written to the transaction log, and
only after that the main index file is updated. There are several advantages to
this:

 * It provides atomic transactions: The transaction either succeeds, or it
   doesn't. For example if a transaction sets a flag to one message and removes
   it from another, it's guaranteed that both changes happen.
    * When updating the changes to the main index file, the last thing that's
      done is to update the "transaction log position" in the header. So if
      Dovecot crashes after having updated only the first flag, the next time
      the mailbox is opened both of the changes are done all over again.
 * It allows another process to quickly see what changes have been made. For
   example IMAP needs to get a list of external changes after each command.
    * This is also important when storing the index files in NFS or in a
      clustered filesystem. Instead of re-reading the whole index file after
      each external change, Dovecot can simply read the new changes from the
      transaction log and apply them to the in-memory copy of the main index.
      In-memory caching of 'dovecot.index.cache' file also relies on the
      transaction log telling what parts of the file has changed.
 * In future the transaction logs can be somewhat easily used to implement
   replication.

Internal vs. external
---------------------

Transactions are either internal or external. The difference is that external
transactions describe changes that were already made to the mailbox, while
internal transactions are commands to do something to the mailbox. When
beginning to synchronize a mailbox with index files, the index file is first
updated with all the external changes, and the uncommitted internal
transactions are applied on top of them.

When synchronizing the mailbox, using the synchronization transaction writes
only external transactions. Also if the index file is updated when saving new
mails to the mailbox, the append transactions must be external. This is because
the changes are already in the mailbox at the time the transaction is read.

Reading and writing
-------------------

Reading transaction logs doesn't require any locking at all. Writing is
exclusively locked using the index files' default lock method (as specified by
the 'lock_method' setting).

A new log is created by first creating a 'dovecot.index.log.newlock' dotlock
file. Once you have the dotlock, check again that the 'dovecot.index.log'
wasn't created (or recreated) by another process. If not, go ahead and write
the log header to the dotlock file and finally 'rename()' it to
'dovecot.index.log'.

Currently there doesn't exist actual transaction boundaries in the log file.
All the changes in a transaction are simply written as separate records to the
file. Each record begins with a 'struct mail_transaction_header', which
contains the record's size and type. The size is in <lockless integer format>
[Design.Indexes.txt].

The first transaction record is written with the size field being 0. Once the
whole transaction has been written, the 0 is updated with the actual size. This
way the transaction log readers won't see partial transactions because they
stop at the size=0 if the transaction isn't fully written yet.

Note that because there are no transaction boundaries, there's a small race
condition here with mmap()ed log files:

 1. Process A: write() half of the transaction
 2. Process B: mmap() the file.
 3. Process A: write() the rest of the transaction, updating the size=0 also
 4. Process B: parse the log file. it'll go past the original size=0 because
    the size had changed in the mmap, but it stops in the middle of the
    transaction because the mmap size doesn't contain the whole transaction

This probably isn't a big problem, because I've never seen this happen even
with stress tests. Should be fixed at some point anyway.

Header
------

The transaction log's header never changes, except the indexid field may be
overwritten with 0 if the log is found to be corrupted. The fields are:

major_version:
  If this doesn't match 'MAIL_TRANSACTION_LOG_MAJOR_VERSION', don't try to
  parse it. If Dovecot sees this, it'll recreate the log file.

minor_version:
  If this doesn't match 'MAIL_TRANSACTION_LOG_MINOR_VERSION', the log file
  contains some backwards compatible changes. Currently you can just ignore
  this field.

hdr_size:
  Size of the log file's header. Use this instead of 'sizeof(struct
  mail_transaction_log_header)', so that it's possible to add new fields and
  still be backwards compatible.

indexid:
  This field must match to main index file's indexid field.

file_seq:
  The file's creation sequence. Must be increasing.

prev_file_seq, prev_file_offset:
  Contains the sequence and offset of where the last transaction log ended.
  When transaction log is rotated and the reader's "sync position" still points
  to the previous log file, these fields allow it to easily check if there had
  been any more changes in the previous file.

create_stamp:
  UNIX timestamp when the file was created. Used in determining when to rotate
  the log file.

Record header
-------------

The transaction record header ('struct mail_transaction_header') contains
'size' and 'type' fields. The 'size' field is in <lockless integer format>
[Design.Indexes.txt]. A single transaction record may contain multiple changes
of the same type, although some types don't allow this. Because the size of the
transaction record for each type is known (or can be determined from the
type-specific record contents), the 'size' field can be used to figure out how
many changes need to be done. So for example a record can contain:

 * 'struct mail_transaction_header { type = MAIL_TRANSACTION_APPEND, size =
   sizeof(struct mail_index_record) * 2 }'
 * 'struct mail_index_record { uid = 1, flags = 0 } '
 * 'struct mail_index_record { uid = 2, flags = 0 } '

UIDs
----

Many record types contain 'uint32_t uid1, uid2' fields. This means that the
changes apply to all the messages in uid1..uid2 range. The messages don't
really have to exist in the range, so for example if the first messages in the
mailbox had UIDs 1, 100 and 1000, it would be possible to use uid1=1, uid2=1000
to describe changes made to these 3 messages. This also means that it's safe to
write transactions describing changes to messages that were just expunged by
another process (and already written to the log file before our changes).

Appends
-------

As described above, the appends must be in external transactions. The append
transaction's contents is simply the 'struct mail_index_record', so it contains
only the message's UID and flags. The message contents aren't written to
transaction log. Also if the message had any keywords when it was appended,
they're in a separate transaction record.

Expunges
--------

Because expunges actually destroy messages, they deserve some extra protection
to make it less likely to accidentally expunge wrong messages in case of for
example file corruption. The expunge transactions must have
'MAIL_TRANSACTION_EXPUNGE_PROT' ORed to the transaction type field. If an
expunge type is found without it, assume a corrupted transaction log.

Flag changes
------------

The flag changes are described in:

---%<-------------------------------------------------------------------------
struct mail_transaction_flag_update {
        uint32_t uid1, uid2;
        uint8_t add_flags;
        uint8_t remove_flags;
        uint16_t padding;
};
---%<-------------------------------------------------------------------------

The 'padding' is ignored completely. A single flag update structure can add new
flags or remove existing flags. Replacing all the files works by setting
'remove_flags = 0xFF' and the 'add_flags' containing the new flags.

Keyword changes
---------------

Specific keywords can be added or removed one keyword at a time:

---%<-------------------------------------------------------------------------
struct mail_transaction_keyword_update {
        uint8_t modify_type; /* enum modify_type : MODIFY_ADD / MODIFY_REMOVE
*/
        uint8_t padding;
        uint16_t name_size;
        /* unsigned char name[];
           array of { uint32_t uid1, uid2; }
        */
};
---%<-------------------------------------------------------------------------

There is padding after 'name[]' so that uid1 begins from a 32bit aligned
offset.

If you want to replace all the keywords (eg. IMAP's 'STORE 1:* FLAGS (keyword)'
command), you'll first have to remove all of them with
'MAIL_TRANSACTION_KEYWORD_RESET' and then add the new keywords.

Extensions
----------

Extension records allow creating and updating extension-specific header and
message record data. For example messages' offsets to cache file or mbox file
are stored in extensions.

Whenever using an extension, you'll need to first write
'MAIL_TRANSACTION_EXT_INTRO' record. This is a bit kludgy and hopefully will be
replaced by something better in future. The intro contains:

---%<-------------------------------------------------------------------------
struct mail_transaction_ext_intro {
        /* old extension: set ext_id. don't set name.
           new extension: ext_id = (uint32_t)-1. give name. */
        uint32_t ext_id;
        uint32_t reset_id;
        uint32_t hdr_size;
        uint16_t record_size;
        uint16_t record_align;
        uint16_t unused_padding;
        uint16_t name_size;
        /* unsigned char name[]; */
};
---%<-------------------------------------------------------------------------

If the extension already exists in the index file (it can't be removed), you
can use the 'ext_id' field directly. Otherwise you'll need to give a name to
the extension. It's always possible to just give the name if you don't know the
existing extension ID, but this uses more space of course.

'reset_id' contains kind of a "transaction validity" field. It's updated with
'MAIL_TRANSACTION_EXT_RESET' record, which also causes the extension records'
contents to be zeroed. If an introduction's 'reset_id' doesn't match the last
EXT_RESET, it means that the extension changes are stale and they must be
ignored. For example:

 * 'dovecot.index.cache' file's 'file_seq' header is used as a 'reset_id'.
   Initially it's 1.
 * Process A: Begins a cache transaction, updating some fields in it.
 * Process B: Decides to compress the cache file, and issues a 'reset_id = 2'
   change.
 * Process A: Commits the transaction with 'reset_id = 1', but the cache file
   offsets point to the old file, so the changes must be ignored.

'hdr_size' specifies the number of bytes the extension wants to have in the
index file's header.'record_size' specifies the number of bytes it wants to use
for each record. The sizes may grow or shrink any time.'record_align' contains
the required alignmentation for the field. For example if the extension
contains a 32bit integer, you want it to be 32bit aligned so that the process
won't crash in CPUs which require proper alignmentation. Then again if you want
to access the field as 4 bytes, the alignmentation can be 1.

Extension record updates typically are message-specific, so the changes must be
done for each message separately:

---%<-------------------------------------------------------------------------
struct mail_transaction_ext_rec_update {
        uint32_t uid;
        /* unsigned char data[]; */
};
---%<-------------------------------------------------------------------------

(This file was created from the wiki on 2019-06-19 12:42)