summaryrefslogtreecommitdiffstats
path: root/doc/wiki/Plugins.FTS.Squat.txt
diff options
context:
space:
mode:
Diffstat (limited to 'doc/wiki/Plugins.FTS.Squat.txt')
-rw-r--r--doc/wiki/Plugins.FTS.Squat.txt128
1 files changed, 128 insertions, 0 deletions
diff --git a/doc/wiki/Plugins.FTS.Squat.txt b/doc/wiki/Plugins.FTS.Squat.txt
new file mode 100644
index 0000000..07dc867
--- /dev/null
+++ b/doc/wiki/Plugins.FTS.Squat.txt
@@ -0,0 +1,128 @@
+Squat Full Text Search Indexing
+===============================
+
+NOTE: The Squat code is quite slow for large mailboxes. There are also a few
+bugs that are unlikely to be fixed. In v2.1+ it's recommended to use
+<fts-lucene> [Plugins.FTS.Lucene.txt] instead. There's also a "New Squat"
+redesign/reimplementation going on, which will likely end up having a new name.
+
+The main difference between Squat indexes and the others is that Squat provides
+support for substring searches, while pretty much all other FTS indexes support
+only matching from the beginning of words. By strictly reading the IMAP RFC it
+requires substring matching, so to optimize regular TEXT and BODY searches you
+must use Squat with Dovecot v2.0. The other indexes are used only for Dovecot's
+non-standard X-TEXT-FAST and X-BODY-FAST searches. However, almost all other
+commonly used IMAP servers no longer care about this requirement, so Dovecot
+v2.1 also no longer makes this distinction. In Dovecot v2.1 TEXT and BODY
+searches are done using any indexes available.
+
+ * The Squat indexes are currently updated only when SEARCH TEXT or SEARCH BODY
+ command is used. It's planned that in future they could be updated
+ immediately when messages are being saved.
+ * The Squat indexes take about 30% of the mailbox size (with partial=4
+ full=4).
+ * The initial Squat index building for large mailboxes can be very CPU and
+ memory hungry.
+ * The Squat indexes are stored among Dovecot's other index files. They're
+ called 'dovecot.index.search' and 'dovecot.index.search.uids'. It's safe to
+ delete both of these files to trigger a rebuild.
+
+Some statistics with Core2 duo 6600, 2 GB memory (partial=4 full=4):
+
+ * 1,4 GB mbox with 368000 messages: SEARCH BODY asdfgh
+ * Without Squat: 2 min 35 sec CPU, process RSS size 24 MB
+ * Initial Squat build: 6 min 36 sec CPU, process RSS size peaks at 800 MB,
+ drops to 96 MB at end.
+ * Subsequent searches: 0.10 sec CPU, 1.18 sec real with cold cache, process
+ RSS size 2 MB.
+ * Most of the time was spent verifying Squat results. Searching for a
+ 1..4 letter word gives results in 0.00 CPU secs and 0.80 real secs
+ with cold cache.
+
+ * 32 MB mbox with 7500 messages: SEARCH BODY asdfgh
+ * Without Squat: 2.3 sec CPU, 3.6 sec real with cold cache, process RSS
+ size 5 MB
+ * Initial Squat build: 8.9 sec CPU, 10.0 sec real with cold cache, process
+ RSS size peaks at 61 MB, drops to 20 MB at end.
+ * 48 MB maildir with 10000 messages: SEARCH BODY asdfgh
+ * Without Squat: 2.6 sec CPU, 9.7 sec real with cold cache
+ * Initial Squat build: 9.7 sec CPU, 15.3 sec real with cold cache
+
+Internal workings
+-----------------
+
+Squat works by building a trie of all 1..4 character combinations of all words
+in messages. For example given a word "world" it indexes "worl", "orld", "rld",
+"ld" and "d". Matching message UIDs are stored to each trie node, so Squat
+supports giving definitive answers to any searches with a length of 1..4
+characters. For longer search words the word is looked up in 4 letter pieces
+and the results are merged. The resulting list is a list of messages where the
+word *may* exist. This means that those messages are still opened and the word
+is searched from them the slow way.
+
+Squat also supports indexing more characters from the beginning of words. For
+example if 5 first characters are indexed, the word "character" would be cut to
+"chara". If a user then searches for "chara", Dovecot does:
+
+ 1. Search for messages having "char" (reply example: 1,5,10)
+ 2. Search for messages having "hara" (reply example: 3,5,10)
+ 3. Store the intersection of the above searches to "maybe UIDs" (example:
+ 5,10)
+ 4. Search for full word "chara" and store it as "definite UIDs" (reply
+ example: 5)
+ 5. Remove definite UIDs (5) from maybe UIDs (5,10 -> 10)
+
+In the above example Dovecot would know that message UID 5 has the word
+"chara", but it still has to check if UID 10 has it as a substring (e.g. in
+word "achara") by reading the message contents. If the user's search words
+match mostly non-substrings (which is common), using long enough full word
+indexing can improve the search performance a lot, especially when the word
+matches a lot of messages.
+
+The Squat name comes from Cyrus IMAP [http://cyrusimap.web.cmu.edu/] which
+implements slightly similar Squat indexes ("Search QUery Answer Tool").
+Dovecot's implementation and file format however is completely different. The
+main visible difference is that Dovecot allows updating the index incrementally
+instead of requiring to re-read the entire mailbox to build it. Cyrus Squat
+also doesn't support 1..3 characters long searches.
+
+Configuration
+-------------
+
+'fts_squat' setting can be used to change Squat options:
+
+ * partial=n: Length of the substring blocks to index. Default is 4 characters
+ and it's probably not a good idea to change it.
+ * full=n: Index n first characters from the beginning of words. Default is 4,
+ but it could be useful to increase this to e.g. 10 or so. However larger
+ values take more disk space.
+
+---%<-------------------------------------------------------------------------
+mail_plugins = $mail_plugins fts fts_squat
+
+plugin {
+ fts = squat
+ fts_squat = partial=4 full=10
+}
+---%<-------------------------------------------------------------------------
+
+Using both Squat and non-Squat (v2.0-only)
+------------------------------------------
+
+It's possible to use both Squat and non-Squat indices at the same time, so that
+TEXT/BODY are looked up from Squat indexes and X-TEXT-FAST/X-BODY-FAST are
+looked up from the non-Squat index. This of course means that indices will have
+to be built and updated for both, so it might not be that good idea.
+
+---%<-------------------------------------------------------------------------
+protocol imap {
+..
+ mail_plugins = fts fts_squat fts_solr
+}
+...
+plugin {
+ fts = squat solr
+}
+---%<-------------------------------------------------------------------------
+
+(This file was created from the wiki on 2019-06-19 12:42)