doc/src/sgml/html/bloom.html


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>F.7. bloom</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="pgsql-docs@lists.postgresql.org" /><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot" /><link rel="prev" href="basic-archive.html" title="F.6. basic_archive" /><link rel="next" href="btree-gin.html" title="F.8. btree_gin" /></head><body id="docContent" class="container-fluid col-10"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">F.7. bloom</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="basic-archive.html" title="F.6. basic_archive">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="contrib.html" title="Appendix F. Additional Supplied Modules">Up</a></td><th width="60%" align="center">Appendix F. Additional Supplied Modules</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 15.7 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="btree-gin.html" title="F.8. btree_gin">Next</a></td></tr></table><hr /></div><div class="sect1" id="BLOOM"><div class="titlepage"><div><div><h2 class="title" style="clear: both">F.7. bloom</h2></div></div></div><div class="toc"><dl class="toc"><dt><span class="sect2"><a href="bloom.html#id-1.11.7.16.7">F.7.1. Parameters</a></span></dt><dt><span class="sect2"><a href="bloom.html#id-1.11.7.16.8">F.7.2. Examples</a></span></dt><dt><span class="sect2"><a href="bloom.html#id-1.11.7.16.9">F.7.3. Operator Class Interface</a></span></dt><dt><span class="sect2"><a href="bloom.html#id-1.11.7.16.10">F.7.4. Limitations</a></span></dt><dt><span class="sect2"><a href="bloom.html#id-1.11.7.16.11">F.7.5. Authors</a></span></dt></dl></div><a id="id-1.11.7.16.2" class="indexterm"></a><p>
  <code class="literal">bloom</code> provides an index access method based on
  <a class="ulink" href="https://en.wikipedia.org/wiki/Bloom_filter" target="_top">Bloom filters</a>.
 </p><p>
  A Bloom filter is a space-efficient data structure that is used to test
  whether an element is a member of a set.  In the case of an index access
  method, it allows fast exclusion of non-matching tuples via signatures
  whose size is determined at index creation.
 </p><p>
  A signature is a lossy representation of the indexed attribute(s), and as
  such is prone to reporting false positives; that is, it may be reported
  that an element is in the set, when it is not.  So index search results
  must always be rechecked using the actual attribute values from the heap
  entry.  Larger signatures reduce the odds of a false positive and thus
  reduce the number of useless heap visits, but of course also make the index
  larger and hence slower to scan.
 </p><p>
  This type of index is most useful when a table has many attributes and
  queries test arbitrary combinations of them.  A traditional btree index is
  faster than a bloom index, but it can require many btree indexes to support
  all possible queries where one needs only a single bloom index.  Note
  however that bloom indexes only support equality queries, whereas btree
  indexes can also perform inequality and range searches.
 </p><div class="sect2" id="id-1.11.7.16.7"><div class="titlepage"><div><div><h3 class="title">F.7.1. Parameters</h3></div></div></div><p>
   A <code class="literal">bloom</code> index accepts the following parameters in its
   <code class="literal">WITH</code> clause:
  </p><div class="variablelist"><dl class="variablelist"><dt><span class="term"><code class="literal">length</code></span></dt><dd><p>
      Length of each signature (index entry) in bits. It is rounded up to the
      nearest multiple of <code class="literal">16</code>. The default is
      <code class="literal">80</code> bits and the maximum is <code class="literal">4096</code>.
     </p></dd></dl></div><div class="variablelist"><dl class="variablelist"><dt><span class="term"><code class="literal">col1 — col32</code></span></dt><dd><p>
      Number of bits generated for each index column. Each parameter's name
      refers to the number of the index column that it controls.  The default
      is <code class="literal">2</code> bits and the maximum is <code class="literal">4095</code>.
      Parameters for index columns not actually used are ignored.
     </p></dd></dl></div></div><div class="sect2" id="id-1.11.7.16.8"><div class="titlepage"><div><div><h3 class="title">F.7.2. Examples</h3></div></div></div><p>
   This is an example of creating a bloom index:
  </p><pre class="programlisting">
CREATE INDEX bloomidx ON tbloom USING bloom (i1,i2,i3)
       WITH (length=80, col1=2, col2=2, col3=4);
</pre><p>
   The index is created with a signature length of 80 bits, with attributes
   i1 and i2 mapped to 2 bits, and attribute i3 mapped to 4 bits.  We could
   have omitted the <code class="literal">length</code>, <code class="literal">col1</code>,
   and <code class="literal">col2</code> specifications since those have the default values.
  </p><p>
   Here is a more complete example of bloom index definition and usage, as
   well as a comparison with equivalent btree indexes.  The bloom index is
   considerably smaller than the btree index, and can perform better.
  </p><pre class="programlisting">
=# CREATE TABLE tbloom AS
   SELECT
     (random() * 1000000)::int as i1,
     (random() * 1000000)::int as i2,
     (random() * 1000000)::int as i3,
     (random() * 1000000)::int as i4,
     (random() * 1000000)::int as i5,
     (random() * 1000000)::int as i6
   FROM
  generate_series(1,10000000);
SELECT 10000000
</pre><p>
   A sequential scan over this large table takes a long time:
</p><pre class="programlisting">
=# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Seq Scan on tbloom  (cost=0.00..2137.14 rows=3 width=24) (actual time=16.971..16.971 rows=0 loops=1)
   Filter: ((i2 = 898732) AND (i5 = 123451))
   Rows Removed by Filter: 100000
 Planning Time: 0.346 ms
 Execution Time: 16.988 ms
(5 rows)
</pre><p>
  </p><p>
   Even with the btree index defined the result will still be a
   sequential scan:
</p><pre class="programlisting">
=# CREATE INDEX btreeidx ON tbloom (i1, i2, i3, i4, i5, i6);
CREATE INDEX
=# SELECT pg_size_pretty(pg_relation_size('btreeidx'));
 pg_size_pretty
----------------
 3976 kB
(1 row)
=# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Seq Scan on tbloom  (cost=0.00..2137.00 rows=2 width=24) (actual time=12.805..12.805 rows=0 loops=1)
   Filter: ((i2 = 898732) AND (i5 = 123451))
   Rows Removed by Filter: 100000
 Planning Time: 0.138 ms
 Execution Time: 12.817 ms
(5 rows)
</pre><p>
  </p><p>
   Having the bloom index defined on the table is better than btree in
   handling this type of search:
</p><pre class="programlisting">
=# CREATE INDEX bloomidx ON tbloom USING bloom (i1, i2, i3, i4, i5, i6);
CREATE INDEX
=# SELECT pg_size_pretty(pg_relation_size('bloomidx'));
 pg_size_pretty
----------------
 1584 kB
(1 row)
=# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on tbloom  (cost=1792.00..1799.69 rows=2 width=24) (actual time=0.388..0.388 rows=0 loops=1)
   Recheck Cond: ((i2 = 898732) AND (i5 = 123451))
   Rows Removed by Index Recheck: 29
   Heap Blocks: exact=28
   -&gt;  Bitmap Index Scan on bloomidx  (cost=0.00..1792.00 rows=2 width=0) (actual time=0.356..0.356 rows=29 loops=1)
         Index Cond: ((i2 = 898732) AND (i5 = 123451))
 Planning Time: 0.099 ms
 Execution Time: 0.408 ms
(8 rows)
</pre><p>
  </p><p>
   Now, the main problem with the btree search is that btree is inefficient
   when the search conditions do not constrain the leading index column(s).
   A better strategy for btree is to create a separate index on each column.
   Then the planner will choose something like this:
</p><pre class="programlisting">
=# CREATE INDEX btreeidx1 ON tbloom (i1);
CREATE INDEX
=# CREATE INDEX btreeidx2 ON tbloom (i2);
CREATE INDEX
=# CREATE INDEX btreeidx3 ON tbloom (i3);
CREATE INDEX
=# CREATE INDEX btreeidx4 ON tbloom (i4);
CREATE INDEX
=# CREATE INDEX btreeidx5 ON tbloom (i5);
CREATE INDEX
=# CREATE INDEX btreeidx6 ON tbloom (i6);
CREATE INDEX
=# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on tbloom  (cost=24.34..32.03 rows=2 width=24) (actual time=0.028..0.029 rows=0 loops=1)
   Recheck Cond: ((i5 = 123451) AND (i2 = 898732))
   -&gt;  BitmapAnd  (cost=24.34..24.34 rows=2 width=0) (actual time=0.027..0.027 rows=0 loops=1)
         -&gt;  Bitmap Index Scan on btreeidx5  (cost=0.00..12.04 rows=500 width=0) (actual time=0.026..0.026 rows=0 loops=1)
               Index Cond: (i5 = 123451)
         -&gt;  Bitmap Index Scan on btreeidx2  (cost=0.00..12.04 rows=500 width=0) (never executed)
               Index Cond: (i2 = 898732)
 Planning Time: 0.491 ms
 Execution Time: 0.055 ms
(9 rows)
</pre><p>
   Although this query runs much faster than with either of the single
   indexes, we pay a penalty in index size.  Each of the single-column
   btree indexes occupies 2 MB, so the total space needed is 12 MB,
   eight times the space used by the bloom index.
  </p></div><div class="sect2" id="id-1.11.7.16.9"><div class="titlepage"><div><div><h3 class="title">F.7.3. Operator Class Interface</h3></div></div></div><p>
   An operator class for bloom indexes requires only a hash function for the
   indexed data type and an equality operator for searching. This example
   shows the operator class definition for the <code class="type">text</code> data type:
  </p><pre class="programlisting">
CREATE OPERATOR CLASS text_ops
DEFAULT FOR TYPE text USING bloom AS
    OPERATOR    1   =(text, text),
    FUNCTION    1   hashtext(text);
</pre></div><div class="sect2" id="id-1.11.7.16.10"><div class="titlepage"><div><div><h3 class="title">F.7.4. Limitations</h3></div></div></div><p>
   </p><div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem"><p>
      Only operator classes for <code class="type">int4</code> and <code class="type">text</code> are
      included with the module.
     </p></li><li class="listitem"><p>
      Only the <code class="literal">=</code> operator is supported for search.  But
      it is possible to add support for arrays with union and intersection
      operations in the future.
     </p></li><li class="listitem"><p>
       <code class="literal">bloom</code> access method doesn't support
       <code class="literal">UNIQUE</code> indexes.
     </p></li><li class="listitem"><p>
       <code class="literal">bloom</code> access method doesn't support searching for
       <code class="literal">NULL</code> values.
     </p></li></ul></div><p>
  </p></div><div class="sect2" id="id-1.11.7.16.11"><div class="titlepage"><div><div><h3 class="title">F.7.5. Authors</h3></div></div></div><p>
   Teodor Sigaev <code class="email">&lt;<a class="email" href="mailto:teodor@postgrespro.ru">teodor@postgrespro.ru</a>&gt;</code>,
   Postgres Professional, Moscow, Russia
  </p><p>
   Alexander Korotkov <code class="email">&lt;<a class="email" href="mailto:a.korotkov@postgrespro.ru">a.korotkov@postgrespro.ru</a>&gt;</code>,
   Postgres Professional, Moscow, Russia
  </p><p>
   Oleg Bartunov <code class="email">&lt;<a class="email" href="mailto:obartunov@postgrespro.ru">obartunov@postgrespro.ru</a>&gt;</code>,
   Postgres Professional, Moscow, Russia
  </p></div></div><div class="navfooter"><hr /><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="basic-archive.html" title="F.6. basic_archive">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="contrib.html" title="Appendix F. Additional Supplied Modules">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="btree-gin.html" title="F.8. btree_gin">Next</a></td></tr><tr><td width="40%" align="left" valign="top">F.6. basic_archive </td><td width="20%" align="center"><a accesskey="h" href="index.html" title="PostgreSQL 15.7 Documentation">Home</a></td><td width="40%" align="right" valign="top"> F.8. btree_gin</td></tr></table></div></body></html>