summaryrefslogtreecommitdiffstats
path: root/doc/src/sgml/bloom.sgml
diff options
context:
space:
mode:
Diffstat (limited to 'doc/src/sgml/bloom.sgml')
-rw-r--r--doc/src/sgml/bloom.sgml290
1 files changed, 290 insertions, 0 deletions
diff --git a/doc/src/sgml/bloom.sgml b/doc/src/sgml/bloom.sgml
new file mode 100644
index 0000000..095f84c
--- /dev/null
+++ b/doc/src/sgml/bloom.sgml
@@ -0,0 +1,290 @@
+<!-- doc/src/sgml/bloom.sgml -->
+
+<sect1 id="bloom" xreflabel="bloom">
+ <title>bloom</title>
+
+ <indexterm zone="bloom">
+ <primary>bloom</primary>
+ </indexterm>
+
+ <para>
+ <literal>bloom</literal> provides an index access method based on
+ <ulink url="https://en.wikipedia.org/wiki/Bloom_filter">Bloom filters</ulink>.
+ </para>
+
+ <para>
+ A Bloom filter is a space-efficient data structure that is used to test
+ whether an element is a member of a set. In the case of an index access
+ method, it allows fast exclusion of non-matching tuples via signatures
+ whose size is determined at index creation.
+ </para>
+
+ <para>
+ A signature is a lossy representation of the indexed attribute(s), and as
+ such is prone to reporting false positives; that is, it may be reported
+ that an element is in the set, when it is not. So index search results
+ must always be rechecked using the actual attribute values from the heap
+ entry. Larger signatures reduce the odds of a false positive and thus
+ reduce the number of useless heap visits, but of course also make the index
+ larger and hence slower to scan.
+ </para>
+
+ <para>
+ This type of index is most useful when a table has many attributes and
+ queries test arbitrary combinations of them. A traditional btree index is
+ faster than a bloom index, but it can require many btree indexes to support
+ all possible queries where one needs only a single bloom index. Note
+ however that bloom indexes only support equality queries, whereas btree
+ indexes can also perform inequality and range searches.
+ </para>
+
+ <sect2>
+ <title>Parameters</title>
+
+ <para>
+ A <literal>bloom</literal> index accepts the following parameters in its
+ <literal>WITH</literal> clause:
+ </para>
+
+ <variablelist>
+ <varlistentry>
+ <term><literal>length</literal></term>
+ <listitem>
+ <para>
+ Length of each signature (index entry) in bits. It is rounded up to the
+ nearest multiple of <literal>16</literal>. The default is
+ <literal>80</literal> bits and the maximum is <literal>4096</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ <variablelist>
+ <varlistentry>
+ <term><literal>col1 &mdash; col32</literal></term>
+ <listitem>
+ <para>
+ Number of bits generated for each index column. Each parameter's name
+ refers to the number of the index column that it controls. The default
+ is <literal>2</literal> bits and the maximum is <literal>4095</literal>.
+ Parameters for index columns not actually used are ignored.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </sect2>
+
+ <sect2>
+ <title>Examples</title>
+
+ <para>
+ This is an example of creating a bloom index:
+ </para>
+
+<programlisting>
+CREATE INDEX bloomidx ON tbloom USING bloom (i1,i2,i3)
+ WITH (length=80, col1=2, col2=2, col3=4);
+</programlisting>
+
+ <para>
+ The index is created with a signature length of 80 bits, with attributes
+ i1 and i2 mapped to 2 bits, and attribute i3 mapped to 4 bits. We could
+ have omitted the <literal>length</literal>, <literal>col1</literal>,
+ and <literal>col2</literal> specifications since those have the default values.
+ </para>
+
+ <para>
+ Here is a more complete example of bloom index definition and usage, as
+ well as a comparison with equivalent btree indexes. The bloom index is
+ considerably smaller than the btree index, and can perform better.
+ </para>
+
+<programlisting>
+=# CREATE TABLE tbloom AS
+ SELECT
+ (random() * 1000000)::int as i1,
+ (random() * 1000000)::int as i2,
+ (random() * 1000000)::int as i3,
+ (random() * 1000000)::int as i4,
+ (random() * 1000000)::int as i5,
+ (random() * 1000000)::int as i6
+ FROM
+ generate_series(1,10000000);
+SELECT 10000000
+</programlisting>
+
+ <para>
+ A sequential scan over this large table takes a long time:
+<programlisting>
+=# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;
+ QUERY PLAN
+-------------------------------------------------------------------&zwsp;-----------------------------------
+ Seq Scan on tbloom (cost=0.00..2137.14 rows=3 width=24) (actual time=15.480..15.480 rows=0 loops=1)
+ Filter: ((i2 = 898732) AND (i5 = 123451))
+ Rows Removed by Filter: 100000
+ Planning Time: 0.340 ms
+ Execution Time: 15.501 ms
+(5 rows)
+</programlisting>
+ </para>
+
+ <para>
+ Even with the btree index defined the result will still be a
+ sequential scan:
+<programlisting>
+=# CREATE INDEX btreeidx ON tbloom (i1, i2, i3, i4, i5, i6);
+CREATE INDEX
+=# SELECT pg_size_pretty(pg_relation_size('btreeidx'));
+ pg_size_pretty
+----------------
+ 3976 kB
+(1 row)
+=# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;
+ QUERY PLAN
+-------------------------------------------------------------------&zwsp;-----------------------------------
+ Seq Scan on tbloom (cost=0.00..2137.00 rows=2 width=24) (actual time=12.604..12.604 rows=0 loops=1)
+ Filter: ((i2 = 898732) AND (i5 = 123451))
+ Rows Removed by Filter: 100000
+ Planning Time: 0.155 ms
+ Execution Time: 12.617 ms
+(5 rows)
+</programlisting>
+ </para>
+
+ <para>
+ Having the bloom index defined on the table is better than btree in
+ handling this type of search:
+<programlisting>
+=# CREATE INDEX bloomidx ON tbloom USING bloom (i1, i2, i3, i4, i5, i6);
+CREATE INDEX
+=# SELECT pg_size_pretty(pg_relation_size('bloomidx'));
+ pg_size_pretty
+----------------
+ 1584 kB
+(1 row)
+=# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;
+ QUERY PLAN
+-------------------------------------------------------------------&zwsp;--------------------------------------------------
+ Bitmap Heap Scan on tbloom (cost=1792.00..1799.69 rows=2 width=24) (actual time=0.384..0.384 rows=0 loops=1)
+ Recheck Cond: ((i2 = 898732) AND (i5 = 123451))
+ Rows Removed by Index Recheck: 26
+ Heap Blocks: exact=26
+ -&gt; Bitmap Index Scan on bloomidx (cost=0.00..1792.00 rows=2 width=0) (actual time=0.350..0.350 rows=26 loops=1)
+ Index Cond: ((i2 = 898732) AND (i5 = 123451))
+ Planning Time: 0.122 ms
+ Execution Time: 0.407 ms
+(8 rows)
+</programlisting>
+ </para>
+
+ <para>
+ Now, the main problem with the btree search is that btree is inefficient
+ when the search conditions do not constrain the leading index column(s).
+ A better strategy for btree is to create a separate index on each column.
+ Then the planner will choose something like this:
+<programlisting>
+=# CREATE INDEX btreeidx1 ON tbloom (i1);
+CREATE INDEX
+=# CREATE INDEX btreeidx2 ON tbloom (i2);
+CREATE INDEX
+=# CREATE INDEX btreeidx3 ON tbloom (i3);
+CREATE INDEX
+=# CREATE INDEX btreeidx4 ON tbloom (i4);
+CREATE INDEX
+=# CREATE INDEX btreeidx5 ON tbloom (i5);
+CREATE INDEX
+=# CREATE INDEX btreeidx6 ON tbloom (i6);
+CREATE INDEX
+=# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;
+ QUERY PLAN
+-------------------------------------------------------------------&zwsp;--------------------------------------------------------
+ Bitmap Heap Scan on tbloom (cost=24.34..32.03 rows=2 width=24) (actual time=0.032..0.033 rows=0 loops=1)
+ Recheck Cond: ((i5 = 123451) AND (i2 = 898732))
+ -&gt; BitmapAnd (cost=24.34..24.34 rows=2 width=0) (actual time=0.029..0.030 rows=0 loops=1)
+ -&gt; Bitmap Index Scan on btreeidx5 (cost=0.00..12.04 rows=500 width=0) (actual time=0.029..0.029 rows=0 loops=1)
+ Index Cond: (i5 = 123451)
+ -&gt; Bitmap Index Scan on btreeidx2 (cost=0.00..12.04 rows=500 width=0) (never executed)
+ Index Cond: (i2 = 898732)
+ Planning Time: 0.537 ms
+ Execution Time: 0.064 ms
+(9 rows)
+</programlisting>
+ Although this query runs much faster than with either of the single
+ indexes, we pay a penalty in index size. Each of the single-column
+ btree indexes occupies 2 MB, so the total space needed is 12 MB,
+ eight times the space used by the bloom index.
+ </para>
+ </sect2>
+
+ <sect2>
+ <title>Operator Class Interface</title>
+
+ <para>
+ An operator class for bloom indexes requires only a hash function for the
+ indexed data type and an equality operator for searching. This example
+ shows the operator class definition for the <type>text</type> data type:
+ </para>
+
+<programlisting>
+CREATE OPERATOR CLASS text_ops
+DEFAULT FOR TYPE text USING bloom AS
+ OPERATOR 1 =(text, text),
+ FUNCTION 1 hashtext(text);
+</programlisting>
+ </sect2>
+
+ <sect2>
+ <title>Limitations</title>
+ <para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ Only operator classes for <type>int4</type> and <type>text</type> are
+ included with the module.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ Only the <literal>=</literal> operator is supported for search. But
+ it is possible to add support for arrays with union and intersection
+ operations in the future.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>bloom</literal> access method doesn't support
+ <literal>UNIQUE</literal> indexes.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <literal>bloom</literal> access method doesn't support searching for
+ <literal>NULL</literal> values.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+ </sect2>
+
+ <sect2>
+ <title>Authors</title>
+
+ <para>
+ Teodor Sigaev <email>teodor@postgrespro.ru</email>,
+ Postgres Professional, Moscow, Russia
+ </para>
+
+ <para>
+ Alexander Korotkov <email>a.korotkov@postgrespro.ru</email>,
+ Postgres Professional, Moscow, Russia
+ </para>
+
+ <para>
+ Oleg Bartunov <email>obartunov@postgrespro.ru</email>,
+ Postgres Professional, Moscow, Russia
+ </para>
+ </sect2>
+
+</sect1>