1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
1164
1165
1166
1167
1168
1169
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
1185
1186
1187
1188
1189
1190
1191
1192
1193
1194
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215
1216
1217
1218
1219
1220
1221
1222
1223
1224
1225
1226
1227
1228
1229
1230
1231
1232
1233
1234
1235
1236
1237
1238
1239
1240
1241
1242
1243
1244
1245
1246
1247
1248
1249
1250
1251
1252
1253
1254
1255
1256
1257
1258
1259
1260
1261
1262
1263
1264
1265
1266
1267
1268
1269
1270
1271
1272
1273
1274
1275
1276
1277
1278
1279
1280
1281
1282
1283
1284
1285
1286
1287
1288
1289
1290
1291
1292
1293
1294
1295
1296
1297
1298
1299
1300
1301
1302
1303
1304
1305
1306
1307
1308
1309
1310
1311
1312
1313
1314
1315
1316
1317
1318
1319
1320
1321
1322
1323
1324
1325
1326
1327
1328
1329
1330
1331
1332
1333
1334
1335
1336
1337
1338
1339
1340
1341
1342
1343
1344
1345
1346
1347
1348
1349
1350
1351
1352
1353
1354
1355
1356
1357
1358
1359
1360
1361
1362
1363
1364
1365
1366
1367
1368
1369
1370
1371
1372
1373
1374
1375
1376
1377
1378
1379
1380
1381
1382
1383
1384
1385
1386
1387
1388
1389
1390
1391
1392
1393
1394
1395
1396
1397
1398
1399
1400
1401
1402
1403
1404
1405
1406
1407
1408
1409
1410
1411
1412
1413
1414
1415
1416
1417
1418
1419
1420
1421
1422
1423
1424
1425
1426
1427
1428
1429
1430
1431
1432
1433
1434
1435
1436
1437
1438
1439
1440
1441
1442
1443
1444
1445
1446
1447
1448
1449
1450
1451
1452
1453
1454
1455
1456
1457
1458
1459
1460
1461
1462
1463
1464
1465
1466
1467
1468
1469
1470
1471
1472
1473
1474
1475
1476
1477
1478
1479
1480
1481
1482
1483
1484
1485
1486
1487
1488
1489
1490
1491
1492
1493
1494
1495
1496
1497
1498
1499
1500
1501
1502
1503
1504
1505
1506
1507
1508
1509
1510
1511
1512
1513
1514
1515
1516
1517
1518
1519
1520
1521
1522
1523
1524
1525
1526
1527
1528
1529
1530
1531
1532
1533
1534
1535
1536
1537
1538
1539
1540
1541
1542
1543
1544
1545
1546
1547
1548
1549
1550
1551
1552
1553
1554
1555
1556
1557
1558
1559
1560
1561
1562
1563
1564
1565
1566
1567
1568
1569
1570
1571
1572
1573
1574
1575
1576
1577
1578
1579
1580
1581
1582
1583
1584
1585
1586
1587
1588
1589
1590
1591
1592
1593
1594
1595
1596
1597
1598
1599
1600
1601
1602
1603
1604
1605
1606
1607
1608
1609
1610
1611
1612
1613
1614
1615
1616
1617
1618
1619
1620
1621
1622
1623
1624
1625
1626
1627
1628
1629
1630
1631
1632
1633
1634
1635
1636
1637
1638
1639
1640
1641
1642
1643
1644
1645
1646
1647
1648
1649
1650
1651
1652
1653
1654
1655
1656
1657
1658
1659
1660
1661
1662
1663
1664
1665
1666
1667
1668
1669
1670
1671
1672
1673
1674
1675
1676
1677
1678
1679
1680
1681
1682
1683
1684
1685
1686
1687
1688
1689
1690
1691
1692
1693
1694
1695
1696
1697
1698
1699
1700
1701
1702
1703
1704
1705
1706
1707
1708
1709
1710
1711
1712
1713
1714
1715
1716
1717
1718
1719
1720
1721
1722
1723
1724
1725
1726
1727
1728
1729
1730
1731
1732
1733
1734
1735
1736
1737
1738
1739
1740
1741
1742
1743
1744
1745
1746
1747
1748
1749
1750
1751
1752
1753
1754
1755
1756
1757
1758
1759
1760
1761
1762
1763
1764
1765
1766
1767
1768
1769
1770
1771
1772
1773
1774
1775
1776
1777
1778
1779
1780
1781
1782
1783
1784
1785
1786
1787
1788
1789
1790
1791
1792
1793
1794
1795
1796
1797
1798
1799
1800
1801
1802
1803
1804
1805
1806
1807
1808
1809
1810
1811
1812
1813
1814
1815
1816
1817
1818
1819
1820
1821
1822
1823
1824
1825
1826
1827
1828
1829
1830
1831
1832
1833
1834
1835
1836
1837
1838
1839
1840
1841
1842
1843
1844
1845
1846
1847
1848
1849
1850
1851
1852
1853
1854
1855
1856
1857
1858
1859
1860
1861
1862
1863
1864
1865
1866
1867
1868
1869
1870
1871
1872
1873
1874
1875
1876
1877
1878
1879
1880
1881
1882
1883
1884
1885
1886
1887
1888
1889
1890
1891
1892
1893
1894
1895
1896
1897
1898
1899
1900
1901
1902
1903
1904
1905
1906
1907
1908
1909
1910
1911
1912
1913
1914
1915
1916
1917
1918
1919
1920
1921
1922
1923
1924
1925
1926
1927
1928
1929
1930
1931
1932
1933
1934
1935
1936
1937
1938
1939
1940
1941
1942
1943
1944
1945
1946
1947
1948
1949
1950
1951
1952
1953
1954
1955
1956
1957
1958
1959
1960
1961
1962
1963
1964
1965
1966
1967
1968
1969
1970
1971
1972
1973
1974
1975
1976
1977
1978
1979
1980
1981
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
2025
2026
2027
2028
2029
2030
2031
2032
2033
2034
2035
2036
2037
2038
2039
2040
2041
2042
2043
2044
2045
2046
2047
2048
2049
2050
2051
2052
2053
2054
2055
2056
2057
2058
2059
2060
2061
2062
2063
2064
2065
2066
2067
2068
2069
2070
2071
2072
2073
2074
2075
2076
2077
2078
2079
2080
2081
2082
2083
2084
2085
2086
2087
2088
2089
2090
2091
2092
2093
2094
2095
2096
2097
2098
2099
2100
2101
2102
2103
2104
2105
2106
2107
2108
2109
2110
2111
2112
2113
2114
2115
2116
2117
2118
2119
2120
2121
2122
2123
2124
2125
2126
2127
2128
2129
2130
2131
2132
2133
2134
2135
2136
2137
2138
2139
2140
2141
2142
2143
2144
2145
2146
2147
2148
2149
2150
2151
2152
2153
2154
2155
2156
2157
2158
2159
2160
2161
2162
2163
2164
2165
2166
2167
2168
2169
2170
2171
2172
2173
2174
2175
2176
2177
2178
2179
2180
2181
2182
2183
2184
2185
2186
2187
2188
2189
2190
2191
2192
2193
2194
2195
2196
2197
2198
2199
2200
2201
2202
2203
2204
2205
2206
2207
2208
2209
2210
2211
2212
2213
2214
2215
2216
2217
2218
2219
2220
2221
2222
2223
2224
2225
2226
2227
2228
2229
2230
2231
2232
2233
2234
2235
2236
2237
2238
2239
2240
2241
2242
2243
2244
2245
2246
2247
2248
2249
2250
2251
2252
2253
2254
2255
2256
2257
2258
2259
2260
2261
2262
2263
2264
2265
2266
2267
2268
2269
2270
2271
2272
2273
2274
2275
2276
2277
2278
2279
2280
2281
2282
2283
2284
2285
2286
2287
2288
2289
2290
2291
2292
2293
2294
2295
2296
2297
2298
2299
2300
2301
2302
2303
2304
2305
2306
2307
2308
2309
2310
2311
2312
2313
2314
2315
2316
2317
2318
2319
2320
2321
2322
2323
2324
2325
2326
2327
2328
2329
2330
2331
2332
2333
2334
2335
2336
2337
2338
2339
2340
2341
2342
2343
2344
2345
2346
2347
2348
2349
2350
2351
2352
2353
2354
2355
2356
2357
2358
2359
2360
2361
2362
2363
2364
2365
2366
2367
2368
2369
2370
2371
2372
2373
2374
2375
2376
2377
2378
2379
2380
2381
2382
2383
2384
2385
2386
2387
2388
2389
2390
2391
2392
2393
2394
2395
2396
2397
2398
2399
2400
2401
2402
2403
2404
2405
2406
2407
2408
2409
2410
2411
2412
2413
2414
2415
2416
2417
2418
2419
2420
2421
2422
2423
2424
2425
2426
2427
2428
2429
2430
2431
2432
2433
2434
2435
2436
2437
2438
2439
2440
2441
2442
2443
2444
2445
2446
2447
2448
2449
2450
2451
2452
2453
2454
2455
2456
2457
2458
2459
2460
2461
2462
2463
2464
2465
2466
2467
2468
2469
2470
2471
2472
2473
2474
2475
2476
2477
2478
2479
2480
2481
2482
2483
2484
2485
2486
2487
2488
2489
2490
2491
2492
2493
2494
2495
2496
2497
2498
2499
2500
2501
2502
2503
2504
2505
2506
2507
2508
2509
2510
2511
2512
2513
2514
2515
2516
2517
2518
2519
2520
2521
2522
2523
2524
2525
2526
2527
2528
2529
2530
2531
2532
2533
2534
2535
2536
2537
2538
2539
2540
2541
2542
2543
2544
2545
2546
2547
2548
2549
2550
2551
2552
2553
2554
2555
2556
2557
2558
2559
2560
2561
2562
2563
2564
2565
2566
2567
2568
2569
2570
2571
2572
2573
2574
2575
2576
2577
2578
2579
2580
2581
2582
2583
2584
2585
2586
2587
2588
2589
2590
2591
2592
2593
2594
2595
2596
2597
2598
2599
2600
2601
2602
2603
2604
2605
2606
2607
2608
2609
2610
2611
2612
2613
2614
2615
2616
2617
2618
2619
2620
2621
2622
2623
2624
2625
2626
2627
2628
2629
2630
2631
2632
2633
2634
2635
2636
2637
2638
2639
2640
2641
2642
2643
2644
2645
2646
2647
2648
2649
2650
2651
2652
2653
2654
2655
2656
2657
2658
2659
2660
2661
2662
2663
2664
2665
2666
2667
2668
2669
2670
2671
2672
2673
2674
2675
2676
2677
2678
2679
2680
2681
2682
2683
2684
2685
2686
2687
2688
2689
2690
2691
2692
2693
2694
2695
2696
2697
2698
2699
2700
2701
2702
2703
2704
2705
2706
2707
2708
2709
2710
2711
2712
2713
2714
2715
2716
2717
2718
2719
2720
2721
2722
2723
2724
2725
2726
2727
2728
2729
2730
2731
2732
2733
2734
2735
2736
2737
2738
2739
2740
2741
2742
2743
2744
2745
2746
2747
2748
2749
2750
2751
2752
2753
2754
2755
2756
2757
2758
2759
2760
2761
2762
2763
2764
2765
2766
2767
2768
2769
2770
2771
2772
2773
2774
2775
2776
2777
2778
2779
2780
2781
2782
2783
2784
2785
2786
2787
2788
2789
2790
2791
2792
2793
2794
2795
2796
2797
2798
2799
2800
2801
2802
2803
2804
2805
2806
2807
2808
2809
2810
2811
2812
2813
2814
2815
2816
2817
2818
2819
2820
2821
2822
2823
2824
2825
2826
2827
2828
2829
2830
2831
2832
2833
2834
2835
2836
2837
2838
2839
2840
2841
2842
2843
2844
2845
2846
2847
2848
2849
2850
2851
2852
2853
2854
2855
2856
2857
2858
2859
2860
2861
2862
2863
2864
2865
2866
2867
2868
2869
2870
2871
2872
2873
2874
2875
2876
2877
2878
2879
2880
2881
2882
2883
2884
2885
2886
2887
2888
2889
2890
2891
2892
2893
2894
2895
2896
2897
2898
2899
2900
2901
2902
2903
2904
2905
2906
2907
2908
2909
2910
2911
2912
2913
2914
2915
2916
2917
2918
2919
2920
2921
2922
2923
2924
2925
2926
2927
2928
2929
2930
2931
2932
2933
2934
2935
2936
2937
2938
2939
2940
2941
2942
2943
2944
2945
2946
2947
2948
2949
2950
2951
2952
2953
2954
2955
2956
2957
2958
2959
2960
2961
2962
2963
2964
2965
2966
2967
2968
2969
2970
2971
2972
2973
2974
2975
2976
2977
2978
2979
2980
2981
2982
2983
2984
2985
2986
2987
2988
2989
2990
2991
2992
2993
2994
2995
2996
2997
2998
2999
3000
3001
3002
3003
3004
3005
3006
3007
3008
3009
3010
3011
3012
3013
3014
3015
3016
3017
3018
3019
3020
3021
3022
3023
3024
3025
3026
3027
3028
3029
3030
3031
3032
3033
3034
3035
3036
3037
3038
3039
3040
3041
3042
3043
3044
3045
3046
3047
3048
3049
3050
3051
3052
3053
3054
3055
3056
3057
3058
3059
3060
3061
3062
3063
3064
3065
3066
3067
3068
3069
3070
3071
3072
3073
3074
3075
3076
3077
3078
3079
3080
3081
3082
3083
3084
3085
3086
3087
3088
3089
3090
3091
3092
3093
3094
3095
3096
3097
3098
3099
3100
3101
3102
3103
3104
3105
3106
3107
3108
3109
3110
3111
3112
3113
3114
3115
3116
3117
3118
3119
3120
3121
3122
3123
3124
3125
3126
3127
3128
3129
3130
3131
3132
3133
3134
3135
3136
3137
3138
3139
3140
3141
3142
3143
3144
3145
3146
3147
3148
3149
3150
3151
3152
3153
3154
3155
3156
3157
3158
3159
3160
3161
3162
3163
3164
3165
3166
3167
3168
3169
3170
3171
3172
3173
3174
3175
3176
3177
3178
3179
3180
3181
3182
3183
3184
3185
3186
3187
3188
3189
3190
3191
3192
3193
3194
3195
3196
3197
3198
3199
3200
3201
3202
3203
3204
3205
3206
3207
3208
3209
3210
3211
3212
3213
3214
3215
3216
3217
3218
3219
3220
3221
3222
3223
3224
3225
3226
3227
3228
3229
3230
3231
3232
3233
3234
3235
3236
3237
3238
3239
3240
3241
3242
3243
3244
3245
3246
3247
3248
3249
3250
3251
3252
3253
3254
3255
3256
3257
3258
3259
3260
3261
3262
3263
3264
3265
3266
3267
3268
3269
3270
3271
3272
3273
3274
3275
3276
3277
3278
3279
3280
3281
3282
3283
3284
3285
3286
3287
3288
3289
3290
3291
3292
3293
3294
3295
3296
3297
3298
3299
3300
3301
3302
3303
3304
3305
3306
3307
3308
3309
3310
3311
3312
3313
3314
3315
3316
3317
3318
3319
3320
3321
3322
3323
3324
3325
3326
3327
3328
3329
3330
3331
3332
3333
3334
3335
3336
3337
3338
3339
3340
3341
3342
3343
3344
3345
3346
3347
3348
3349
3350
3351
3352
3353
3354
3355
3356
3357
3358
3359
3360
3361
3362
3363
3364
3365
3366
3367
3368
3369
3370
3371
3372
3373
3374
3375
3376
3377
3378
3379
3380
3381
3382
3383
3384
3385
3386
3387
3388
3389
3390
3391
3392
3393
3394
3395
3396
3397
3398
3399
3400
3401
3402
3403
3404
3405
3406
3407
3408
3409
3410
3411
3412
3413
3414
3415
3416
3417
3418
3419
3420
3421
3422
3423
3424
3425
3426
3427
3428
3429
3430
3431
3432
3433
3434
3435
3436
3437
3438
3439
3440
3441
3442
3443
3444
3445
3446
3447
3448
3449
3450
3451
3452
3453
3454
3455
3456
3457
3458
3459
3460
3461
3462
3463
3464
3465
3466
3467
3468
3469
3470
3471
3472
3473
3474
3475
3476
3477
3478
3479
3480
3481
3482
3483
3484
3485
3486
3487
3488
3489
3490
3491
3492
3493
3494
3495
3496
3497
3498
3499
3500
3501
3502
3503
3504
3505
3506
3507
3508
3509
3510
3511
3512
3513
3514
3515
3516
3517
3518
3519
3520
3521
3522
3523
3524
3525
3526
3527
3528
3529
3530
3531
3532
3533
3534
3535
3536
3537
3538
3539
3540
3541
3542
3543
3544
3545
3546
3547
3548
3549
3550
3551
3552
3553
3554
3555
3556
3557
3558
3559
3560
3561
3562
3563
3564
3565
3566
3567
3568
3569
3570
3571
3572
3573
3574
3575
3576
3577
3578
3579
3580
3581
3582
3583
3584
3585
3586
3587
3588
3589
3590
3591
3592
3593
3594
3595
3596
3597
3598
3599
3600
3601
3602
3603
3604
3605
3606
3607
3608
3609
3610
3611
3612
3613
3614
3615
3616
3617
3618
3619
3620
3621
3622
3623
3624
3625
3626
3627
3628
3629
3630
3631
3632
3633
3634
3635
3636
3637
3638
3639
3640
3641
3642
3643
3644
3645
3646
3647
3648
3649
3650
3651
3652
3653
3654
3655
3656
3657
3658
3659
3660
3661
3662
3663
3664
3665
3666
3667
3668
3669
3670
3671
3672
3673
3674
3675
3676
3677
3678
3679
3680
3681
3682
3683
3684
3685
3686
3687
3688
3689
3690
3691
3692
3693
3694
3695
3696
3697
3698
3699
3700
3701
3702
3703
3704
3705
3706
3707
3708
3709
3710
3711
3712
3713
3714
3715
3716
3717
3718
3719
3720
3721
3722
3723
3724
3725
3726
3727
3728
3729
3730
3731
3732
3733
3734
3735
3736
3737
3738
3739
3740
3741
3742
3743
3744
3745
3746
3747
3748
3749
3750
3751
3752
3753
3754
3755
3756
3757
3758
3759
3760
3761
3762
3763
3764
3765
3766
3767
3768
3769
3770
3771
3772
3773
3774
3775
3776
3777
3778
3779
3780
3781
3782
3783
3784
3785
3786
3787
3788
3789
3790
3791
3792
3793
3794
3795
3796
3797
3798
3799
3800
3801
3802
3803
3804
3805
3806
3807
3808
3809
3810
3811
3812
3813
3814
3815
3816
3817
3818
3819
3820
3821
3822
3823
3824
3825
3826
3827
3828
3829
3830
3831
3832
3833
3834
3835
3836
3837
3838
3839
3840
3841
3842
3843
3844
3845
3846
3847
3848
3849
3850
3851
3852
3853
3854
3855
3856
3857
3858
3859
3860
3861
3862
3863
3864
3865
3866
3867
3868
3869
3870
3871
3872
3873
3874
3875
3876
3877
3878
3879
3880
3881
3882
3883
3884
3885
3886
3887
3888
3889
3890
3891
3892
3893
3894
3895
3896
3897
3898
3899
3900
3901
3902
|
<!DOCTYPE html>
<html><head>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<link href="sqlite.css" rel="stylesheet">
<title>SQLite FTS5 Extension</title>
<!-- path= -->
</head>
<body>
<div class=nosearch>
<a href="index.html">
<img class="logo" src="images/sqlite370_banner.gif" alt="SQLite" border="0">
</a>
<div><!-- IE hack to prevent disappearing logo --></div>
<div class="tagline desktoponly">
Small. Fast. Reliable.<br>Choose any three.
</div>
<div class="menu mainmenu">
<ul>
<li><a href="index.html">Home</a>
<li class='mobileonly'><a href="javascript:void(0)" onclick='toggle_div("submenu")'>Menu</a>
<li class='wideonly'><a href='about.html'>About</a>
<li class='desktoponly'><a href="docs.html">Documentation</a>
<li class='desktoponly'><a href="download.html">Download</a>
<li class='wideonly'><a href='copyright.html'>License</a>
<li class='desktoponly'><a href="support.html">Support</a>
<li class='desktoponly'><a href="prosupport.html">Purchase</a>
<li class='search' id='search_menubutton'>
<a href="javascript:void(0)" onclick='toggle_search()'>Search</a>
</ul>
</div>
<div class="menu submenu" id="submenu">
<ul>
<li><a href='about.html'>About</a>
<li><a href='docs.html'>Documentation</a>
<li><a href='download.html'>Download</a>
<li><a href='support.html'>Support</a>
<li><a href='prosupport.html'>Purchase</a>
</ul>
</div>
<div class="searchmenu" id="searchmenu">
<form method="GET" action="search">
<select name="s" id="searchtype">
<option value="d">Search Documentation</option>
<option value="c">Search Changelog</option>
</select>
<input type="text" name="q" id="searchbox" value="">
<input type="submit" value="Go">
</form>
</div>
</div>
<script>
function toggle_div(nm) {
var w = document.getElementById(nm);
if( w.style.display=="block" ){
w.style.display = "none";
}else{
w.style.display = "block";
}
}
function toggle_search() {
var w = document.getElementById("searchmenu");
if( w.style.display=="block" ){
w.style.display = "none";
} else {
w.style.display = "block";
setTimeout(function(){
document.getElementById("searchbox").focus()
}, 30);
}
}
function div_off(nm){document.getElementById(nm).style.display="none";}
window.onbeforeunload = function(e){div_off("submenu");}
/* Disable the Search feature if we are not operating from CGI, since */
/* Search is accomplished using CGI and will not work without it. */
if( !location.origin || !location.origin.match || !location.origin.match(/http/) ){
document.getElementById("search_menubutton").style.display = "none";
}
/* Used by the Hide/Show button beside syntax diagrams, to toggle the */
function hideorshow(btn,obj){
var x = document.getElementById(obj);
var b = document.getElementById(btn);
if( x.style.display!='none' ){
x.style.display = 'none';
b.innerHTML='show';
}else{
x.style.display = '';
b.innerHTML='hide';
}
return false;
}
var antiRobot = 0;
function antiRobotGo(){
if( antiRobot!=3 ) return;
antiRobot = 7;
var j = document.getElementById("mtimelink");
if(j && j.hasAttribute("data-href")) j.href=j.getAttribute("data-href");
}
function antiRobotDefense(){
document.body.onmousedown=function(){
antiRobot |= 2;
antiRobotGo();
document.body.onmousedown=null;
}
document.body.onmousemove=function(){
antiRobot |= 2;
antiRobotGo();
document.body.onmousemove=null;
}
setTimeout(function(){
antiRobot |= 1;
antiRobotGo();
}, 100)
antiRobotGo();
}
antiRobotDefense();
</script>
<div class=fancy>
<div class=nosearch>
<div class="fancy_title">
SQLite FTS5 Extension
</div>
<div class="fancy_toc">
<a onclick="toggle_toc()">
<span class="fancy_toc_mark" id="toc_mk">►</span>
Table Of Contents
</a>
<div id="toc_sub"><div class="fancy-toc1"><a href="#overview_of_fts5">1. Overview of FTS5</a></div>
<div class="fancy-toc1"><a href="#compiling_and_using_fts5">2. Compiling and Using FTS5</a></div>
<div class="fancy-toc2"><a href="#building_fts5_as_part_of_sqlite">2.1. Building FTS5 as part of SQLite</a></div>
<div class="fancy-toc2"><a href="#building_a_loadable_extension">2.2. Building a Loadable Extension</a></div>
<div class="fancy-toc1"><a href="#full_text_query_syntax">3. Full-text Query Syntax</a></div>
<div class="fancy-toc2"><a href="#fts5_strings">3.1. FTS5 Strings</a></div>
<div class="fancy-toc2"><a href="#fts5_phrases">3.2. FTS5 Phrases</a></div>
<div class="fancy-toc2"><a href="#fts5_prefix_queries">3.3. FTS5 Prefix Queries</a></div>
<div class="fancy-toc2"><a href="#fts5_initial_token_queries">3.4. FTS5 Initial Token Queries</a></div>
<div class="fancy-toc2"><a href="#fts5_near_queries">3.5. FTS5 NEAR Queries</a></div>
<div class="fancy-toc2"><a href="#fts5_column_filters">3.6. FTS5 Column Filters</a></div>
<div class="fancy-toc2"><a href="#fts5_boolean_operators">3.7. FTS5 Boolean Operators</a></div>
<div class="fancy-toc1"><a href="#fts5_table_creation_and_initialization">4. FTS5 Table Creation and Initialization</a></div>
<div class="fancy-toc2"><a href="#the_unindexed_column_option">4.1. The UNINDEXED column option</a></div>
<div class="fancy-toc2"><a href="#prefix_indexes">4.2. Prefix Indexes</a></div>
<div class="fancy-toc2"><a href="#tokenizers">4.3. Tokenizers</a></div>
<div class="fancy-toc3"><a href="#unicode61_tokenizer">4.3.1. Unicode61 Tokenizer</a></div>
<div class="fancy-toc3"><a href="#ascii_tokenizer">4.3.2. Ascii Tokenizer</a></div>
<div class="fancy-toc3"><a href="#porter_tokenizer">4.3.3. Porter Tokenizer</a></div>
<div class="fancy-toc3"><a href="#the_trigram_tokenizer">4.3.4. The Trigram Tokenizer</a></div>
<div class="fancy-toc2"><a href="#external_content_and_contentless_tables">4.4. External Content and Contentless Tables</a></div>
<div class="fancy-toc3"><a href="#contentless_tables">4.4.1. Contentless Tables</a></div>
<div class="fancy-toc3"><a href="#contentless_delete_tables">4.4.2. Contentless-Delete Tables</a></div>
<div class="fancy-toc3"><a href="#external_content_tables">4.4.3. External Content Tables</a></div>
<div class="fancy-toc3"><a href="#external_content_table_pitfalls">4.4.4. External Content Table Pitfalls</a></div>
<div class="fancy-toc2"><a href="#the_columnsize_option">4.5. The Columnsize Option</a></div>
<div class="fancy-toc2"><a href="#the_detail_option">4.6. The Detail Option</a></div>
<div class="fancy-toc2"><a href="#the_tokendata_option">4.7. The Tokendata Option</a></div>
<div class="fancy-toc1"><a href="#_auxiliary_functions_">5. Auxiliary Functions </a></div>
<div class="fancy-toc2"><a href="#built_in_auxiliary_functions">5.1. Built-in Auxiliary Functions</a></div>
<div class="fancy-toc3"><a href="#the_bm25_function">5.1.1. The bm25() function</a></div>
<div class="fancy-toc3"><a href="#the_highlight_function">5.1.2. The highlight() function</a></div>
<div class="fancy-toc3"><a href="#the_snippet_function">5.1.3. The snippet() function</a></div>
<div class="fancy-toc2"><a href="#sorting_by_auxiliary_function_results">5.2. Sorting by Auxiliary Function Results</a></div>
<div class="fancy-toc1"><a href="#special_insert_commands">6. Special INSERT Commands</a></div>
<div class="fancy-toc2"><a href="#the_automerge_configuration_option">6.1. The 'automerge' Configuration Option</a></div>
<div class="fancy-toc2"><a href="#the_crisismerge_configuration_option">6.2. The 'crisismerge' Configuration Option</a></div>
<div class="fancy-toc2"><a href="#the_delete_command">6.3. The 'delete' Command</a></div>
<div class="fancy-toc2"><a href="#the_delete_all_command">6.4. The 'delete-all' Command</a></div>
<div class="fancy-toc2"><a href="#the_deletemerge_configuration_option">6.5. The 'deletemerge' Configuration Option</a></div>
<div class="fancy-toc2"><a href="#the_integrity_check_command">6.6. The 'integrity-check' Command</a></div>
<div class="fancy-toc2"><a href="#the_merge_command">6.7. The 'merge' Command</a></div>
<div class="fancy-toc2"><a href="#the_optimize_command">6.8. The 'optimize' Command</a></div>
<div class="fancy-toc2"><a href="#the_pgsz_configuration_option">6.9. The 'pgsz' Configuration Option</a></div>
<div class="fancy-toc2"><a href="#the_rank_configuration_option">6.10. The 'rank' Configuration Option</a></div>
<div class="fancy-toc2"><a href="#the_rebuild_command">6.11. The 'rebuild' Command</a></div>
<div class="fancy-toc2"><a href="#the_secure_delete_configuration_option">6.12. The 'secure-delete' Configuration Option</a></div>
<div class="fancy-toc2"><a href="#the_usermerge_configuration_option">6.13. The 'usermerge' Configuration Option</a></div>
<div class="fancy-toc1"><a href="#extending_fts5">7. Extending FTS5</a></div>
<div class="fancy-toc2"><a href="#custom_tokenizers">7.1. Custom Tokenizers</a></div>
<div class="fancy-toc3"><a href="#synonym_support">7.1.1. Synonym Support</a></div>
<div class="fancy-toc2"><a href="#custom_auxiliary_functions">7.2. Custom Auxiliary Functions</a></div>
<div class="fancy-toc3"><a href="#custom_auxiliary_functions_api_overview">7.2.1. Custom Auxiliary Functions API Overview</a></div>
<div class="fancy-toc3"><a href="#custom_auxiliary_functions_api_reference">7.2.2. Custom Auxiliary Functions API Reference</a></div>
<div class="fancy-toc1"><a href="#the_fts5vocab_virtual_table_module">8. The fts5vocab Virtual Table Module</a></div>
<div class="fancy-toc1"><a href="#fts5_data_structures">9. FTS5 Data Structures</a></div>
<div class="fancy-toc2"><a href="#varint_format">9.1. Varint Format</a></div>
<div class="fancy-toc2"><a href="#the_fts_index_idx_and_data_tables_">9.2. The FTS Index (%_idx and %_data tables)</a></div>
<div class="fancy-toc3"><a href="#data_structure">9.2.1. The %_data Table Rowid Space</a></div>
<div class="fancy-toc3"><a href="#structure_record_format">9.2.2. Structure Record Format</a></div>
<div class="fancy-toc3"><a href="#averages_record_format">9.2.3. Averages Record Format</a></div>
<div class="fancy-toc3"><a href="#segment_b_tree_format">9.2.4. Segment B-Tree Format</a></div>
<div class="fancy-toc4"><a href="#the_key_doclist_format">9.2.4.1. The Key/Doclist Format</a></div>
<div class="fancy-toc4"><a href="#data_pagination">9.2.4.2. Pagination</a></div>
<div class="fancy-toc4"><a href="#data_term_index">9.2.4.3. Segment Index Format</a></div>
<div class="fancy-toc4"><a href="#doclist_index_format">9.2.4.4. Doclist Index Format</a></div>
<div class="fancy-toc2"><a href="#document_sizes_table_docsize_table_">9.3. Document Sizes Table (%_docsize table)</a></div>
<div class="fancy-toc2"><a href="#the_table_contents_content_table_">9.4. The Table Contents (%_content table)</a></div>
<div class="fancy-toc2"><a href="#configuration_options_config_table_">9.5. Configuration Options (%_config table)</a></div>
<div class="fancy-toc1"><a href="#appendix_a">
Appendix A: Comparison with FTS3/4
</a></div>
<div class="fancy-toc2"><a href="#_application_porting_guide_"> Application Porting Guide </a></div>
<div class="fancy-toc3"><a href="#_changes_to_create_virtual_table_statements_"> Changes to CREATE VIRTUAL TABLE statements </a></div>
<div class="fancy-toc3"><a href="#_changes_to_select_statements_"> Changes to SELECT statements </a></div>
<div class="fancy-toc3"><a href="#_auxiliary_function_changes_"> Auxiliary Function Changes </a></div>
<div class="fancy-toc3"><a href="#_other_issues"> Other Issues</a></div>
<div class="fancy-toc2"><a href="#_summary_of_technical_differences_">
Summary of Technical Differences
</a></div>
</div>
</div>
<script>
function toggle_toc(){
var sub = document.getElementById("toc_sub")
var mk = document.getElementById("toc_mk")
if( sub.style.display!="block" ){
sub.style.display = "block";
mk.innerHTML = "▼";
} else {
sub.style.display = "none";
mk.innerHTML = "►";
}
}
</script>
</div>
<h1 id="overview_of_fts5"><span>1. </span>Overview of FTS5</h1>
<p>FTS5 is an SQLite <a href="c3ref/module.html">virtual table module</a> that provides
<a href="https://en.wikipedia.org/wiki/Full_text_search">full-text search</a>
functionality to database applications. In their most elementary form,
full-text search engines allow the user to efficiently search a large
collection of documents for the subset that contain one or more instances of a
search term. The search functionality provided to world wide web users by
<a href="https://www.google.com/">Google</a> is, among other things, a full-text search
engine, as it allows users to search for all documents on the web that contain,
for example, the term "fts5".
</p><p>To use FTS5, the user creates an FTS5 virtual table with one or more
columns. For example:
</p><div class="codeblock"><pre>CREATE VIRTUAL TABLE email USING fts5(sender, title, body);
</pre></div>
<p>It is an error to add types, constraints or <a href="lang_createtable.html#primkeyconst">PRIMARY KEY</a> declarations to
a CREATE VIRTUAL TABLE statement used to create an FTS5 table. Once created,
an FTS5 table may be populated using <a href="lang_insert.html">INSERT</a>, <a href="lang_update.html">UPDATE</a> or <a href="lang_delete.html">DELETE</a> statements
like any other table. Like any other table with no PRIMARY KEY declaration, an
FTS5 table has an implicit INTEGER PRIMARY KEY field named rowid.
</p><p>Not shown in the example above is that there are also
<a href="fts5.html#fts5_table_creation_and_initialization">various options</a> that may be provided to FTS5 as
part of the CREATE VIRTUAL TABLE statement to configure various aspects of the
new table. These may be used to modify the way in which the FTS5 table extracts
terms from documents and queries, to create extra indexes on disk to speed up
prefix queries, or to create an FTS5 table that acts as an index on content
stored elsewhere.
</p><p>Once populated, there are three ways to execute a full-text query against
the contents of an FTS5 table:
</p><ul><li> Using a MATCH operator in the WHERE clause of a SELECT statement, or
</li><li> Using an equals ("=") operator in the WHERE clause of a SELECT statement, or
</li><li> using the <a href="vtab.html#tabfunc2">table-valued function</a> syntax.
</li></ul>
<p>If using the MATCH or = operators, the expression to the left of the MATCH
operator is usually the name of the FTS5 table (the exception is when
<a href="fts5.html#fts5_column_filters">specifying a column-filter</a>). The expression on the right
must be a text value specifying the term to search for. For the table-valued
function syntax, the term to search for is specified as the first table argument.
For example:
</p><div class="codeblock"><pre><i>-- Query for all rows that contain at least once instance of the term</i>
<i>-- "fts5" (in any column). The following three queries are equivalent.</i>
SELECT * FROM email WHERE email MATCH 'fts5';
SELECT * FROM email WHERE email = 'fts5';
SELECT * FROM email('fts5');
</pre></div>
<p> By default, FTS5 full-text searches are case-independent. Like any other
SQL query that does not contain an ORDER BY clause, the example above returns
results in an arbitrary order. To sort results by relevance (most to least
relevant), an ORDER BY may be added to a full-text query as follows:
</p><div class="codeblock"><pre><i>-- Query for all rows that contain at least once instance of the term</i>
<i>-- "fts5" (in any column). Return results in order from best to worst</i>
<i>-- match. </i>
SELECT * FROM email WHERE email MATCH 'fts5' ORDER BY rank;
</pre></div>
<p> As well as the column values and rowid of a matching row, an application
may use <a href="fts5.html#_auxiliary_functions_">FTS5 auxiliary functions</a> to retrieve extra information regarding
the matched row. For example, an auxiliary function may be used to retrieve
a copy of a column value for a matched row with all instances of the matched
term surrounded by html <b></b> tags. Auxiliary functions are
invoked in the same way as SQLite <a href="lang_corefunc.html">scalar functions</a>, except that the name
of the FTS5 table is specified as the first argument. For example:
</p><div class="codeblock"><pre><i>-- Query for rows that match "fts5". Return a copy of the "body" column</i>
<i>-- of each row with the matches surrounded by <b></b> tags.</i>
SELECT highlight(email, 2, '<b>', '</b>') FROM email('fts5');
</pre></div>
<p>A description of the available auxiliary functions, and more details
regarding configuration of the special "rank" column, are
<a href="fts5.html#_auxiliary_functions_">available below</a>. <a href="fts5.html#custom_auxiliary_functions">Custom auxiliary functions</a> may also be implemented in C and registered with
FTS5, just as custom SQL functions may be registered with the SQLite core.
</p><p> As well as searching for all rows that contain a term, FTS5 allows
the user to search for rows that contain:
</p><ul>
<li> any terms that begin with a specified prefix,
</li><li> "phrases" - sequences of terms or prefix terms that must feature in a
document for it to match the query,
</li><li> sets of terms, prefix terms or phrases that appear within a specified
proximity of each other (these are called "NEAR queries"), or
</li><li> boolean combinations of any of the above.
</li></ul>
<p> Such advanced searches are requested by providing a more complicated
FTS5 query string as the text to the right of the MATCH operator (or =
operator, or as the first argument to a table-valued function syntax). The
full query syntax is <a href="fts5.html#full_text_query_syntax">described here</a>.
</p><a name="compiling_and_using_fts5"></a>
<h1 tags="FTS5 building" id="compiling_and_using_fts5"><span>2. </span>Compiling and Using FTS5</h1>
<h2 id="building_fts5_as_part_of_sqlite"><span>2.1. </span>Building FTS5 as part of SQLite</h2>
<p>As of <a href="releaselog/3_9_0.html">version 3.9.0</a> (2015-10-14),
FTS5 is included as part of the SQLite <a href="amalgamation.html">amalgamation</a>.
If using one of the two autoconf build system, FTS5 is
enabled by specifying the "--enable-fts5" option when running the configure
script. (FTS5 is currently disabled by default for the
source-tree configure script and enabled by default for
the amalgamation configure script, but these defaults might
change in the future.)
</p><p>Or, if sqlite3.c is compiled using some other build system, by arranging for
the SQLITE_ENABLE_FTS5 pre-processor symbol to be defined.
</p><h2 id="building_a_loadable_extension"><span>2.2. </span>Building a Loadable Extension</h2>
<p>Alternatively, FTS5 may be built as a loadable extension.
</p><p>The canonical FTS5 source code consists of a series of *.c and other files
in the "ext/fts5" directory of the SQLite source tree. A build process reduces
this to just two files - "fts5.c" and "fts5.h" - which may be used to build an
SQLite loadable extension.
</p><ol>
<li> Obtain the latest SQLite code from fossil.
</li><li> Create a Makefile as described in <a href="howtocompile.html">How To Compile SQLite</a>.
</li><li> Build the "fts5.c" target. Which also creates fts5.h.
</li></ol>
<div class="codeblock"><pre>$ wget -c https://www.sqlite.org/src/tarball/SQLite-trunk.tgz?uuid=trunk -O SQLite-trunk.tgz
.... output ...
$ tar -xzf SQLite-trunk.tgz
$ cd SQLite-trunk
$ ./configure && make fts5.c
... lots of output ...
$ ls fts5.[ch]
fts5.c fts5.h
</pre></div>
<p>
The code in "fts5.c" may then be compiled into a loadable extension or
statically linked into an application as described in
<a href="loadext.html#build">Compiling Loadable Extensions</a>. There are two entry points defined, both
of which do the same thing:
</p><ul>
<li> sqlite3_fts_init
</li><li> sqlite3_fts5_init
</li></ul>
<p>
The other file, "fts5.h", is not required to compile the FTS5 extension.
It is used by applications that implement <a href="fts5.html#extending_fts5">custom FTS5 tokenizers or auxiliary functions</a>.
</p><a name="full_text_query_syntax"></a>
<h1 tags="FTS5 query syntax" id="full_text_query_syntax"><span>3. </span>Full-text Query Syntax</h1>
<p>
The following block contains a summary of the FTS query syntax in BNF form.
A detailed explanation follows.
</p><div class="codeblock"><pre><phrase> := string [*]
<phrase> := <phrase> + <phrase>
<neargroup> := NEAR ( <phrase> <phrase> ... [, N] )
<query> := [ [-] <colspec> :] [^] <phrase>
<query> := [ [-] <colspec> :] <neargroup>
<query> := [ [-] <colspec> :] ( <query> )
<query> := <query> AND <query>
<query> := <query> OR <query>
<query> := <query> NOT <query>
<colspec> := colname
<colspec> := { colname1 colname2 ... }
</pre></div>
<a name="fts5_strings"></a>
<h2 tags="FTS5 Strings" id="fts5_strings"><span>3.1. </span>FTS5 Strings</h2>
<p>
Within an FTS expression a <b>string</b> may be specified in one of two ways:
</p><ul>
<li> <p>By enclosing it in double quotes ("). Within a string, any embedded
double quote characters may be escaped SQL-style - by adding a second
double-quote character.
</p></li><li> <p>As an FTS5 bareword that is not "AND", "OR" or "NOT" (case sensitive).
An FTS5 bareword is a string of one or more consecutive characters that
are all either:
</p><ul>
<li> Non-ASCII range characters (i.e. unicode codepoints greater
than 127), or
</li><li> One of the 52 upper and lower case ASCII characters, or
</li><li> One of the 10 decimal digit ASCII characters, or
</li><li> The underscore character (unicode codepoint 96).
</li><li> The substitute character (unicode codepoint 26).
</li></ul>
Strings that include any other characters must be quoted. Characters
that are not currently allowed in barewords, are not quote characters and
do not currently serve any special purpose in FTS5 query expressions may
at some point in the future be allowed in barewords or used to implement
new query functionality. This means that queries that are currently
syntax errors because they include such a character outside of a quoted
string may be interpreted differently by some future version of FTS5.
</li></ul>
<a name="fts5_phrases"></a>
<h2 tags="FTS5 Phrases" id="fts5_phrases"><span>3.2. </span>FTS5 Phrases</h2>
<p>
Each string in an fts5 query is parsed ("tokenized") by the
<a href="fts5.html#tokenizers">tokenizer</a> and a list of zero or more <b>tokens</b>, or
terms, extracted. For example, the default tokenizer tokenizes the string "alpha
beta gamma" to three separate tokens - "alpha", "beta" and "gamma" - in that
order.
</p><p>
FTS queries are made up of <b>phrases</b>. A phrase is an ordered list of
one or more tokens. The tokens from each string in the query each make up a
single phrase. Two phrases can be concatenated into a single large phrase
using the "+" operator. For example, assuming the tokenizer module being used
tokenizes the input "one.two.three" to three separate tokens, the following
four queries all specify the same phrase:
</p><div class="codeblock"><pre>... MATCH '"one two three"'
... MATCH 'one + two + three'
... MATCH '"one two" + three'
... MATCH 'one.two.three'
</pre></div>
<p>
A phrase matches a document if the document contains at least one sub-sequence
of tokens that matches the sequence of tokens that make up the phrase.
</p><a name="fts5_prefix_queries"></a>
<h2 tags="FTS5 prefix queries" id="fts5_prefix_queries"><span>3.3. </span>FTS5 Prefix Queries</h2>
<p>
If a "*" character follows a string within an FTS expression, then the final
token extracted from the string is marked as a <b>prefix token</b>. As you
might expect, a prefix token matches any document token of which it is a
prefix. For example, the first two queries in the following block will match
any document that contains the token "one" immediately followed by the token
"two" and then any token that begins with "thr".
</p><div class="codeblock"><pre>... MATCH '"one two thr" * '
... MATCH 'one + two + thr*'
... MATCH '"one two thr*"' <b>-- May not work as expected!</b>
</pre></div>
<p>The final query in the block above may not work as expected. Because the
"*" character is inside the double-quotes, it will be passed to the tokenizer,
which will likely discard it (or perhaps, depending on the specific tokenizer
in use, include it as part of the final token) instead of recognizing it as
a special FTS character.
<a name="carrotq"></a>
</p><a name="fts5_initial_token_queries"></a>
<h2 tags="FTS5 initial token queries" id="fts5_initial_token_queries"><span>3.4. </span>FTS5 Initial Token Queries</h2>
<p>
If a "^" character appears immediately before a phrase that is not part of a
NEAR query, then that phrase only matches a document only if it starts at the
first token in a column. The "^" syntax may be combined with a
<a href="fts5.html#fts5_column_filters">column filter</a>, but may not be inserted into the middle of
a phrase.
</p><div class="codeblock"><pre>... MATCH '^one' <i>-- first token in any column must be "one"</i>
... MATCH '^ one + two' <i>-- phrase "one two" must appear at start of a column</i>
... MATCH '^ "one two"' <i>-- same as previous </i>
... MATCH 'a : ^two' <i>-- first token of column "a" must be "two"</i>
... MATCH 'NEAR(^one, two)' <b>-- syntax error! </b>
... MATCH 'one + ^two' <b>-- syntax error! </b>
... MATCH '"^one two"' <b>-- May not work as expected!</b>
</pre></div>
<a name="fts5_near_queries"></a>
<h2 tags="FTS5 NEAR queries" id="fts5_near_queries"><span>3.5. </span>FTS5 NEAR Queries</h2>
<p>Two or more phrases may be grouped into a <b>NEAR group</b>. A NEAR group
is specified by the token "NEAR" (case sensitive) followed by an open
parenthesis character, followed by two or more whitespace separated phrases, optionally followed by a comma and the numeric parameter <i>N</i>, followed by
a close parenthesis. For example:
</p><div class="codeblock"><pre>... MATCH 'NEAR("one two" "three four", 10)'
... MATCH 'NEAR("one two" thr* + four)'
</pre></div>
<p>If no <i>N</i> parameter is supplied, it defaults to 10. A NEAR group
matches a document if the document contains at least one clump of tokens that:
</p><ol>
<li> contains at least one instance of each phrase, and
</li><li> for which the number of tokens between the end of the first phrase
and the beginning of the last phrase in the clump is less than or equal to <i>N</i>.
</li></ol>
<p>For example:
</p><div class="codeblock"><pre>CREATE VIRTUAL TABLE f USING fts5(x);
INSERT INTO f(rowid, x) VALUES(1, 'A B C D x x x E F x');
... MATCH 'NEAR(e d, 4)'; <i>-- Matches!</i>
... MATCH 'NEAR(e d, 3)'; <i>-- Matches!</i>
... MATCH 'NEAR(e d, 2)'; <i>-- Does not match!</i>
... MATCH 'NEAR("c d" "e f", 3)'; <i>-- Matches!</i>
... MATCH 'NEAR("c" "e f", 3)'; <i>-- Does not match!</i>
... MATCH 'NEAR(a d e, 6)'; <i>-- Matches!</i>
... MATCH 'NEAR(a d e, 5)'; <i>-- Does not match!</i>
... MATCH 'NEAR("a b c d" "b c" "e f", 4)'; <i>-- Matches!</i>
... MATCH 'NEAR("a b c d" "b c" "e f", 3)'; <i>-- Does not match!</i>
</pre></div>
<a name="fts5_column_filters"></a>
<h2 tags="FTS5 column filters" id="fts5_column_filters"><span>3.6. </span>FTS5 Column Filters</h2>
<p>
A single phrase or NEAR group may be restricted to matching text within a
specified column of the FTS table by prefixing it with the column name
followed by a colon character. Or to a set of columns by prefixing it
with a whitespace separated list of column names enclosed in parenthesis
("curly brackets") followed by a colon character. Column names may be specified
using either of the two forms described for strings above. Unlike strings that
are part of phrases, column names are not passed to the tokenizer module.
Column names are case-insensitive in the usual way for SQLite column names -
upper/lower case equivalence is understood for ASCII-range characters only.
</p><div class="codeblock"><pre>... MATCH 'colname : NEAR("one two" "three four", 10)'
... MATCH '"colname" : one + two + three'
... MATCH '{col1 col2} : NEAR("one two" "three four", 10)'
... MATCH '{col2 col1 col3} : one + two + three'
</pre></div>
<p>
If a column filter specification is preceded by a "-" character, then
it is interpreted as a list of column not to match against. For example:
</p><div class="codeblock"><pre><i>-- Search for matches in all columns except "colname"</i>
... MATCH '- colname : NEAR("one two" "three four", 10)'
<i>-- Search for matches in all columns except "col1", "col2" and "col3"</i>
... MATCH '- {col2 col1 col3} : one + two + three'
</pre></div>
<p>
Column filter specifications may also be applied to arbitrary expressions
enclosed in parenthesis. In this case the column filter applies to all
phrases within the expression. Nested column filter operations may only
further restrict the subset of columns matched, they can not be used to
re-enable filtered columns. For example:
</p><div class="codeblock"><pre><i>-- The following are equivalent:</i>
... MATCH '{a b} : ( {b c} : "hello" AND "world" )'
... MATCH '(b : "hello") AND ({a b} : "world")'
</pre></div>
<p>
Finally, a column filter for a single column may be specified by using
the column name as the LHS of a MATCH operator (instead of the usual
table name). For example:
</p><div class="codeblock"><pre><i>-- Given the following table</i>
CREATE VIRTUAL TABLE ft USING fts5(a, b, c);
<i>-- The following are equivalent</i>
SELECT * FROM ft WHERE b MATCH 'uvw AND xyz';
SELECT * FROM ft WHERE ft MATCH 'b : (uvw AND xyz)';
<i>-- This query cannot match any rows (since all columns are filtered out): </i>
SELECT * FROM ft WHERE b MATCH 'a : xyz';
</pre></div>
<a name="fts5_boolean_operators"></a>
<h2 tags="FTS5 boolean operators" id="fts5_boolean_operators"><span>3.7. </span>FTS5 Boolean Operators</h2>
<p>
Phrases and NEAR groups may be arranged into expressions using <b>boolean
operators</b>. In order of precedence, from highest (tightest grouping) to
lowest (loosest grouping), the operators are:
</p><table striped="1" style="margin:1em auto; width:80%; border-spacing:0">
<tr style="text-align:left"><th>Operator </th><th>Function
</th></tr><tr style="text-align:left;background-color:#DDDDDD"><td><code><query1> NOT <query2></code>
</td><td>Matches if query1 matches and query2 does not match.
</td></tr><tr style="text-align:left"><td><code><query1> AND <query2></code>
</td><td>Matches if both query1 and query2 match.
</td></tr><tr style="text-align:left;background-color:#DDDDDD"><td><code><query1> OR <query2></code>
</td><td>Matches if either query1 or query2 match.
</td></tr></table>
<p>
Parenthesis may be used to group expressions in order to modify operator
precedence in the usual ways. For example:
</p><div class="codeblock"><pre><i>-- Because NOT groups more tightly than OR, either of the following may</i>
<i>-- be used to match all documents that contain the token "two" but not</i>
<i>-- "three", or contain the token "one". </i>
... MATCH 'one OR two NOT three'
... MATCH 'one OR (two NOT three)'
<i>-- Matches documents that contain at least one instance of either "one"</i>
<i>-- or "two", but do not contain any instances of token "three".</i>
... MATCH '(one OR two) NOT three'
</pre></div>
<p>
Phrases and NEAR groups may also be connected by <b>implicit AND operators</b>.
For simplicity, these are not shown in the BNF grammar above. Essentially, any
sequence of phrases or NEAR groups (including those restricted to matching
specified columns) separated only by whitespace are handled as if there were an
implicit AND operator between each pair of phrases or NEAR groups. Implicit
AND operators are never inserted after or before an expression enclosed in
parenthesis. Implicit AND operators group more tightly than all other
operators, including NOT. For example:
</p><div class="codeblock"><pre>... MATCH 'one two three' <i>-- 'one AND two AND three'</i>
... MATCH 'three "one two"' <i>-- 'three AND "one two"'</i>
... MATCH 'NEAR(one two) three' <i>-- 'NEAR(one two) AND three'</i>
... MATCH 'one OR two three' <i>-- 'one OR two AND three'</i>
... MATCH 'one NOT two three' <i>-- 'one NOT (two AND three)'</i>
... MATCH '(one OR two) three' <i>-- Syntax error!</i>
... MATCH 'func(one two)' <i>-- Syntax error!</i>
</pre></div>
<a name="fts5_table_creation_and_initialization"></a>
<h1 tags="FTS5 CREATE TABLE Options" id="fts5_table_creation_and_initialization"><span>4. </span>FTS5 Table Creation and Initialization</h1>
<p>Each argument specified as part of a "CREATE VIRTUAL TABLE ... USING fts5
..." statement is either a column declaration or a configuration option. A
<b>column declaration</b> consists of one or more whitespace separated FTS5
barewords or string literals quoted in any manner acceptable to SQLite.
</p><p>The first string or bareword in a column declaration is the column name. It
is an error to attempt to name an fts5 table column "rowid" or "rank", or to
assign the same name to a column as is used by the table itself. This is not
supported.
</p><p>Each subsequent string or bareword in a column declaration is a column
option that modifies the behaviour of that column. Column options are
case-independent. Unlike the SQLite core, FTS5 considers unrecognized column
options to be errors. Currently, the only option recognized is
<a href="fts5.html#the_unindexed_column_option">"UNINDEXED" (see below)</a>.
</p><p>A <b>configuration option</b> consists of an FTS5 bareword - the option name -
followed by an "=" character, followed by the option value. The option value is
specified using either a single FTS5 bareword or a string literal, again quoted
in any manner acceptable to the SQLite core. For example:
</p><div class="codeblock"><pre>CREATE VIRTUAL TABLE mail USING fts5(sender, title, body, tokenize = 'porter ascii');
</pre></div>
<p> There are currently the following configuration options:
</p><ul>
<li> The "tokenize" option, used to configure a <a href="fts5.html#tokenizers">custom tokenizer</a>.
</li><li> The "prefix" option, used to add <a href="fts5.html#prefix_indexes">prefix indexes</a>
to an FTS5 table.
</li><li> The "content" option, used to make the FTS5 table an
<a href="fts5.html#external_content_and_contentless_tables">external content or contentless table</a>.
</li><li> The "content_rowid" option, used to set the rowid field of an
<a href="fts5.html#external_content_tables">external content table</a>.
</li><li> The <a href="fts5.html#the_columnsize_option">"columnsize" option</a>, used to configure
whether or not the size in tokens of each value in the FTS5 table is
stored separately within the database.
</li><li> The <a href="fts5.html#the_detail_option">"detail" option</a>. This option may be used
to reduce the size of the FTS index on disk by omitting some information
from it.
</li></ul>
<a name="the_unindexed_column_option"></a>
<h2 tags="unindexed" id="the_unindexed_column_option"><span>4.1. </span>The UNINDEXED column option</h2>
<p>The contents of columns qualified with the UNINDEXED column option are not
added to the FTS index. This means that for the purposes of MATCH queries and
<a href="fts5.html#_auxiliary_functions_">FTS5 auxiliary functions</a>, the column contains no matchable tokens.
</p><p>For example, to avoid adding the contents of the "uuid" field to the FTS
index:
</p><div class="codeblock"><pre>CREATE VIRTUAL TABLE customers USING fts5(name, addr, uuid UNINDEXED);
</pre></div>
<a name="prefix_indexes"></a>
<h2 tags="FTS5 prefix indexes" id="prefix_indexes"><span>4.2. </span>Prefix Indexes</h2>
<p> By default, FTS5 maintains a single index recording the location of each
token instance within the document set. This means that querying for complete
tokens is fast, as it requires a single lookup, but querying for a prefix
token can be slow, as it requires a range scan. For example, to query for
the prefix token "abc*" requires a range scan of all tokens greater than
or equal to "abc" and less than "abd".
</p><p> A prefix index is a separate index that records the location of all
instances of prefix tokens of a certain length in characters used to speed
up queries for prefix tokens. For example, optimizing a query for prefix
token "abc*" requires a prefix index of three-character prefixes.
</p><p> To add prefix indexes to an FTS5 table, the "prefix" option is set to
either a single positive integer or a text value containing a white-space
separated list of one or more positive integer values. A prefix index is
created for each integer specified. If more than one "prefix" option is
specified as part of a single CREATE VIRTUAL TABLE statement, all apply.
</p><div class="codeblock"><pre><i>-- Two ways to create an FTS5 table that maintains prefix indexes for
-- two and three character prefix tokens.</i>
CREATE VIRTUAL TABLE ft USING fts5(a, b, prefix='2 3');
CREATE VIRTUAL TABLE ft USING fts5(a, b, prefix=2, prefix=3);
</pre></div>
<a name="tokenizers"></a>
<h2 tags="FTS5 tokenizers" id="tokenizers"><span>4.3. </span>Tokenizers</h2>
<p> The CREATE VIRTUAL TABLE "tokenize" option is used to configure the
specific tokenizer used by the FTS5 table. The option argument must be either
an FTS5 bareword, or an SQL text literal. The text of the argument is itself
treated as a white-space series of one or more FTS5 barewords or SQL text
literals. The first of these is the name of the tokenizer to use. The second
and subsequent list elements, if they exist, are arguments passed to the
tokenizer implementation.
</p><p> Unlike option values and column names, SQL text literals intended as
tokenizers must be quoted using single quote characters. For example:
</p><div class="codeblock"><pre><i>-- The following are all equivalent</i>
CREATE VIRTUAL TABLE t1 USING fts5(x, tokenize = 'porter ascii');
CREATE VIRTUAL TABLE t1 USING fts5(x, tokenize = "porter ascii");
CREATE VIRTUAL TABLE t1 USING fts5(x, tokenize = "'porter' 'ascii'");
CREATE VIRTUAL TABLE t1 USING fts5(x, tokenize = '''porter'' ''ascii''');
<i>-- But this will fail:</i>
CREATE VIRTUAL TABLE t1 USING fts5(x, tokenize = '"porter" "ascii"');
<i>-- This will fail too:</i>
CREATE VIRTUAL TABLE t1 USING fts5(x, tokenize = 'porter' 'ascii');
</pre></div>
<p>
FTS5 features four built-in tokenizer modules, described in subsequent
sections:
</p><ul>
<li> The <b>unicode61</b> tokenizer, based on the Unicode 6.1 standard. This
is the default.
</li><li> The <b>ascii</b> tokenizer, which assumes all characters outside of
the ASCII codepoint range (0-127) are to be treated as token characters.
</li><li> The <b>porter</b> tokenizer, which implements the
<a href="https://tartarus.org/martin/PorterStemmer" /="1">porter stemming algorithm</a>.
</li><li> The <b>trigram</b> tokenizer, which treats each contiguous sequence of
three characters as a token, allowing FTS5 to support more general substring matching.
</li></ul>
<p> It is also possible to create custom tokenizers for FTS5. The API for doing so is <a href="fts5.html#custom_tokenizers">described here</a>.
</p><h3 id="unicode61_tokenizer"><span>4.3.1. </span>Unicode61 Tokenizer</h3>
<p> The unicode tokenizer classifies all unicode characters as either
"separator" or "token" characters. By default all space and punctuation
characters, as defined by Unicode 6.1, are considered separators, and all
other characters as token characters. More specifically, all unicode
characters assigned to a
<a href="https://en.wikipedia.org/wiki/Unicode_character_property#General_Category">
general category</a> beginning with "L" or "N" (letters and numbers,
specifically) or to category "Co" ("other, private use") are considered tokens.
All other characters are separators.
</p><p>Each contiguous run of one or more token characters is considered to be a
token. The tokenizer is case-insensitive according to the rules defined by
Unicode 6.1.
</p><p> By default, diacritics are removed from all Latin script characters. This
means, for example, that "A", "a", "À", "à", "Â" and "â"
are all considered to be equivalent.
</p><p> Any arguments following "unicode61" in the token specification are treated
as a list of alternating option names and values. Unicode61 supports the
following options:
</p><table striped="1" style="margin:1em auto; width:80%; border-spacing:0">
<tr style="text-align:left"><th> Option </th><th> Usage
</th></tr><tr style="text-align:left;background-color:#DDDDDD"><td> remove_diacritics
</td><td>This option should be set to "0", "1" or "2". The default value is "1".
If it is set to "1" or "2", then diacritics are removed from Latin script
characters as described above. However, if it is set to "1", then diacritics
are not removed in the fairly uncommon case where a single unicode codepoint
is used to represent a character with more that one diacritic. For example,
diacritics are not removed from codepoint 0x1ED9 ("LATIN SMALL LETTER O WITH
CIRCUMFLEX AND DOT BELOW"). This is technically a bug, but cannot be fixed
without creating backwards compatibility problems. If this option is set to
"2", then diacritics are correctly removed from all Latin characters.
</td></tr><tr style="text-align:left"><td> categories
</td><td>This option may be used to modify the set of Unicode general categories
that are considered to correspond to token characters. The argument must
consist of a space separated list of two-character general category
abbreviations (e.g. "Lu" or "Nd"), or of the same with the second character
replaced with an asterisk ("*"), interpreted as a glob pattern. The default
value is "L* N* Co".
</td></tr><tr style="text-align:left;background-color:#DDDDDD"><td> tokenchars
</td><td> This option is used to specify additional unicode characters that
should be considered token characters, even if they are white-space or
punctuation characters according to Unicode 6.1. All characters in the
string that this option is set to are considered token characters.
</td></tr><tr style="text-align:left"><td> separators
</td><td> This option is used to specify additional unicode characters that
should be considered as separator characters, even if they are token
characters according to Unicode 6.1. All characters in the string that
this option is set to are considered separators.
</td></tr></table>
<p> For example:
</p><div class="codeblock"><pre><i>-- Create an FTS5 table that does not remove diacritics from Latin
-- script characters, and that considers hyphens and underscore characters
-- to be part of tokens. </i>
CREATE VIRTUAL TABLE ft USING fts5(a, b,
tokenize = "unicode61 remove_diacritics 0 tokenchars '-_'"
);
</pre></div>
<p> or:
</p><div class="codeblock"><pre><i>-- Create an FTS5 table that, as well as the default token character classes,</i>
<i>-- considers characters in class "Mn" to be token characters.</i>
CREATE VIRTUAL TABLE ft USING fts5(a, b,
tokenize = "unicode61 categories 'L* N* Co Mn'"
);
</pre></div>
<p> The fts5 unicode61 tokenizer is byte-for-byte compatible with the fts3/4
unicode61 tokenizer.
</p><h3 id="ascii_tokenizer"><span>4.3.2. </span>Ascii Tokenizer</h3>
<p> The Ascii tokenizer is similar to the Unicode61 tokenizer, except that:
</p><ul>
<li> All non-ASCII characters (those with codepoints greater than 127) are
always considered token characters. If any non-ASCII characters are specified
as part of the separators option, they are ignored.
</li><li> Case-folding is only performed for ASCII characters. So while "A" and
"a" are considered to be equivalent, "Ã" and "ã" are distinct.
</li><li> The remove_diacritics option is not supported.
</li></ul>
<p> For example:
</p><div class="codeblock"><pre><i>-- Create an FTS5 table that uses the ascii tokenizer, but does not
-- consider numeric characters to be part of tokens.</i>
CREATE VIRTUAL TABLE ft USING fts5(a, b,
tokenize = "ascii separators '0123456789'"
);
</pre></div>
<h3 id="porter_tokenizer"><span>4.3.3. </span>Porter Tokenizer</h3>
<p> The porter tokenizer is a wrapper tokenizer. It takes the output of some
other tokenizer and applies the
<a href="https://tartarus.org/martin/PorterStemmer/">porter stemming algorithm</a>
to each token before it returns it to FTS5. This allows search terms like
"correction" to match similar words such as "corrected" or "correcting". The
porter stemmer algorithm is designed for use with English language terms
only - using it with other languages may or may not improve search utility.
</p><p> By default, the porter tokenizer operates as a wrapper around the default
tokenizer (unicode61). Or, if one or more extra arguments are added to the
"tokenize" option following "porter", they are treated as a specification for
the underlying tokenizer that the porter stemmer uses. For example:
</p><div class="codeblock"><pre><i>-- Two ways to create an FTS5 table that uses the porter tokenizer to
-- stem the output of the default tokenizer (unicode61). </i>
CREATE VIRTUAL TABLE t1 USING fts5(x, tokenize = porter);
CREATE VIRTUAL TABLE t1 USING fts5(x, tokenize = 'porter unicode61');
<i>-- A porter tokenizer used to stem the output of the unicode61 tokenizer,
-- with diacritics removed before stemming.</i>
CREATE VIRTUAL TABLE t1 USING fts5(x, tokenize = 'porter unicode61 remove_diacritics 1');
</pre></div>
<a name="trigramidx"></a>
<h3 id="the_trigram_tokenizer"><span>4.3.4. </span>The Trigram Tokenizer</h3>
<p>
The trigram tokenizer extends FTS5 to support substring
matching in general, instead of the usual token matching. When using the
trigram tokenizer, a query or phrase token may match any sequence of characters
within a row, not just a complete token. For example:
</p><div class="codeblock"><pre>CREATE VIRTUAL TABLE tri USING fts5(a, tokenize="trigram");
INSERT INTO tri VALUES('abcdefghij KLMNOPQRST uvwxyz');
<i>-- The following queries all match the single row in the table</i>
SELECT * FROM tri('cdefg');
SELECT * FROM tri('cdefg AND pqr');
SELECT * FROM tri('"hij klm" NOT stuv');
</pre></div>
<p>
The trigram tokenizer supports the following options:
</p><table striped="1" style="margin:1em auto; width:80%; border-spacing:0">
<tr style="text-align:left"><th> Option </th><th> Usage
</th></tr><tr style="text-align:left;background-color:#DDDDDD"><td> case_sensitive
</td><td> This value may be set to 1 or 0 (the default). If it is set to 1,
then matching is case sensitive. Otherwise, if this option is set to
0, matching is case insensitive.
</td></tr><tr style="text-align:left"><td> remove_diacritics
</td><td> This value may also be set to 1 or 0 (the default). It may only
be set to 1 if the case_sensitive options is set to 0 - setting both
options to 1 is an error. If this option is set, then diacritics are
removed from the text before matching (e.g. so that "á" matches "a").
</td></tr></table>
<div class="codeblock"><pre><i>-- A case-sensitive trigram index</i>
CREATE VIRTUAL TABLE tri USING fts5(a, tokenize="trigram case_sensitive 1");
</pre></div>
<p>
Unless the remove_diacritics option is set, FTS5 tables that use the trigram
tokenizer also support indexed GLOB and LIKE pattern matching. For example:
</p><div class="codeblock"><pre>SELECT * FROM tri WHERE a LIKE '%cdefg%';
SELECT * FROM tri WHERE a GLOB '*ij klm*xyz';
</pre></div>
<p>
If an FTS5 trigram tokenizer is created with the case_sensitive option set to 1,
it may only index GLOB queries, not LIKE.
</p><p>
Notes:
</p><ul>
<li> Substrings consisting of fewer than 3 unicode characters do not match any
rows when used with a full-text query. If a LIKE or GLOB pattern does not
contain at least one sequence of non-wildcard unicode characters, FTS5
falls back to a linear scan of the entire table.
</li><li> If the FTS5 table is created with the detail=none or detail=column option
specified, full-text queries may not contain any tokens longer than 3
unicode characters. LIKE and GLOB pattern matching may be slightly slower,
but still works. If the index is to be used only for LIKE and/or GLOB
pattern matching, these options are worth experimenting with to reduce
the index size.
</li><li> The index cannot be used to optimize LIKE patterns if the LIKE operator
has an ESCAPE clause.
</li></ul>
<a name="external_content_and_contentless_tables"></a>
<h2 tags="FTS5 content option" id="external_content_and_contentless_tables"><span>4.4. </span>External Content and Contentless Tables</h2>
<p>
Normally, when a row is inserted into an FTS5 table, in addition to building
the index, FTS5 makes a copy of the original row content.
When column values are requested from the FTS5 table by the user or by an
auxiliary function implementation, those values are
read from that private copy of the content. The "content" option may be used
to create an FTS5 table that stores only FTS full-text index entries.
Because the column values themselves are usually much larger than the
associated full-text index entries, this can save significant database space.
</p><p>
There are two ways to use the "content" option:
</p><ul>
<li> By setting it to an empty string to create a contentless FTS5 table. In
this case FTS5 assumes that the original column values are unavailable
to it when processing queries. Full-text queries and some auxiliary
functions can still be used, but no column values apart from the rowid
may be read from the table.
</li><li> By setting it to the name of a database object (table, virtual table or
view) that may be queried by FTS5 at any time to retrieve the column
values. This is known as an "external content" table. In this case all
FTS5 functionality may be used, but it is the responsibility of the user
to ensure that the contents of the full-text index are consistent with
the named database object. If they are not, query results may be
unpredictable.
</li></ul>
<a name="contentless_tables"></a>
<h3 tags="FTS5 contentless tables" id="contentless_tables"><span>4.4.1. </span>Contentless Tables</h3>
<p> A contentless FTS5 table is created by setting the "content" option to
an empty string. For example:
</p><div class="codeblock"><pre>CREATE VIRTUAL TABLE f1 USING fts5(a, b, c, content='');
</pre></div>
<p> Contentless FTS5 tables do not support UPDATE or DELETE statements, or
INSERT statements that do not supply a non-NULL value for the rowid field.
Contentless tables do not support REPLACE conflict handling. REPLACE
and INSERT OR REPLACE statements are treated as regular INSERT statements.
Rows may be deleted from a contentless table using an <a href="fts5.html#the_delete_command">FTS5 delete command</a>.
</p><p> Attempting to read any column value except the rowid from a contentless
FTS5 table returns an SQL NULL value.
<a name="clssdeltab"></a>
</p><a name="contentless_delete_tables"></a>
<h3 tags="FTS5 contentless-delete tables" id="contentless_delete_tables"><span>4.4.2. </span>Contentless-Delete Tables</h3>
<p>As of version 3.43.0, also available are contentless-delete tables.
A contentless-delete table is created by setting the content option to an
empty string and also setting the contentless_delete option to 1. For example:
</p><div class="codeblock"><pre>CREATE VIRTUAL TABLE f1 USING fts5(a, b, c, content='', contentless_delete=1);
</pre></div>
<p>A contentless-delete table differs from a contentless table in that:
</p><ul>
<li> Contentless-delete tables support both DELETE and "INSERT OR REPLACE
INTO" statements.
</li><li> Contentless-delete tables support UPDATE statements, but only if new
values are supplied for all user-defined columns of the fts5 table.
</li><li> Contentless-delete tables do <b>not</b> support the
<a href="fts5.html#the_delete_command">FTS5 delete command</a>.
</li></ul>
<div class="codeblock"><pre><i>-- Supported UPDATE statement:</i>
UPDATE f1 SET a=?, b=?, c=? WHERE rowid=?;
<i>-- This UPDATE is not supported, as it does not supply a new value</i>
<i>-- for column "c".</i>
UPDATE f1 SET a=?, b=? WHERE rowid=?;
</pre></div>
<p> Unless backwards compatibility is required, new code should prefer
contentless-delete tables to contentless tables.
</p><a name="external_content_tables"></a>
<h3 tags="FTS5 external content tables" id="external_content_tables"><span>4.4.3. </span>External Content Tables</h3>
<p> An external content FTS5 table is created by setting the content
option to the name of a table, virtual table or view (hereafter the "content
table") within the same database. Whenever column values are required by
FTS5, it queries the content table as follows, with the rowid of the row
for which values are required bound to the SQL variable:
</p><div class="codeblock"><pre>SELECT <content_rowid>, <cols> FROM <content> WHERE <content_rowid> = ?;
</pre></div>
<p> In the above, <content> is replaced by the name of the content table.
By default, <content_rowid> is replaced by the literal text "rowid". Or,
if the "content_rowid" option is set within the CREATE VIRTUAL TABLE statement,
by the value of that option. <cols> is replaced by a comma-separated list
of the FTS5 table column names. For example:
</p><div class="codeblock"><pre><i>-- If the database schema is: </i>
CREATE TABLE tbl (a, b, c, d INTEGER PRIMARY KEY);
CREATE VIRTUAL TABLE fts USING fts5(a, c, content=tbl, content_rowid=d);
<i>-- Fts5 may issue queries such as:</i>
SELECT d, a, c FROM tbl WHERE d = ?;
</pre></div>
<p> The content table may also be queried as follows:
</p><div class="codeblock"><pre>SELECT <content_rowid>, <cols> FROM <content> ORDER BY <content_rowid> ASC;
SELECT <content_rowid>, <cols> FROM <content> ORDER BY <content_rowid> DESC;
</pre></div>
<p> It is still the responsibility of the user to ensure that the contents of
an external content FTS5 table are kept up to date with the content table.
One way to do this is with triggers. For example:
</p><div class="codeblock"><pre><i>-- Create a table. And an external content fts5 table to index it.</i>
CREATE TABLE tbl(a INTEGER PRIMARY KEY, b, c);
CREATE VIRTUAL TABLE fts_idx USING fts5(b, c, content='tbl', content_rowid='a');
<i>-- Triggers to keep the FTS index up to date.</i>
CREATE TRIGGER tbl_ai AFTER INSERT ON tbl BEGIN
INSERT INTO fts_idx(rowid, b, c) VALUES (new.a, new.b, new.c);
END;
CREATE TRIGGER tbl_ad AFTER DELETE ON tbl BEGIN
INSERT INTO fts_idx(fts_idx, rowid, b, c) VALUES('delete', old.a, old.b, old.c);
END;
CREATE TRIGGER tbl_au AFTER UPDATE ON tbl BEGIN
INSERT INTO fts_idx(fts_idx, rowid, b, c) VALUES('delete', old.a, old.b, old.c);
INSERT INTO fts_idx(rowid, b, c) VALUES (new.a, new.b, new.c);
END;
</pre></div>
<p> Like contentless tables, external content tables do not support REPLACE
conflict handling. Any operations that specify REPLACE conflict handling are
handled using ABORT.
</p><a name="external_content_table_pitfalls"></a>
<h3 tags="FTS5 external content pitfalls" id="external_content_table_pitfalls"><span>4.4.4. </span>External Content Table Pitfalls</h3>
<p>
It is the responsibility of the user to ensure that an FTS5 external content
table (one with a non-empty content= option) is kept consistent with the
content table itself (the table named by the content= option). If these are
allowed to become inconsistent, then the results of queries against the FTS5
table may become unintuitive and appear inconsistent.
</p><p>
In these situations, the apparently inconsistent results produced by queries
against the FTS5 external content table may be understood as follows:
</p><ul>
<li><p> If the query does not use the full-text index - does not contain a
MATCH operator or equivalent table-valued function syntax - then the
query is effectively passed through to the external content table. In
this case the contents of the FTS index have no effect on the results
of the query.
</p></li><li><p> If the query does use the full text index, then the FTS5 module
queries it for the set of rowid values corresponding to documents that match
the query. For each such rowid, it then runs a query similar to the following
to retrieve any required column values, where '?' is replaced by the rowid
value, and <content> and <content_rowid> by the values specified
for the content= and content_rowid= options:
</p></li></ul>
<div class="codeblock"><pre>SELECT <content_rowid>, <cols> FROM <content> WHERE <content_rowid> = ?;
</pre></div>
<p>
For example, if a database is created using the following script:
</p><div class="codeblock"><pre><i>-- Create and populate a table. </i>
CREATE TABLE tbl(a INTEGER PRIMARY KEY, t TEXT);
INSERT INTO tbl VALUES(1, 'all that glitters');
INSERT INTO tbl VALUES(2, 'is not gold');
<i>-- Create an external content FTS5 table </i>
CREATE VIRTUAL TABLE ft USING fts5(t, content='tbl', content_rowid='a');
</pre></div>
<p>then the content table contains two rows, but the FTS index contains no
entries corresponding to them. In this case the following queries will return
inconsistent results as follows:
</p><div class="codeblock"><pre><i>-- Returns 2 rows. Because the query does not use the FTS index, it is</i>
<i>-- effectively executed against table 'tbl' directly, and so returns</i>
<i>-- both rows.</i>
SELECT * FROM t1;
<i>-- Returns 0 rows. This query does use the FTS index, which currently</i>
<i>-- contains no entries. So it returns 0 rows.</i>
SELECT rowid, t FROM t1('gold')
</pre></div>
<p>
Alternatively, if the database were created and populated as follows:
</p><div class="codeblock"><pre><i>-- Create and populate a table. </i>
CREATE TABLE tbl(a INTEGER PRIMARY KEY, t TEXT);
<i>-- Create an external content FTS5 table </i>
CREATE VIRTUAL TABLE ft USING fts5(t, content='tbl', content_rowid='a');
INSERT INTO ft(rowid, t) VALUES(1, 'all that glitters');
INSERT INTO ft(rowid, t) VALUES(2, 'is not gold');
</pre></div>
<p>then the content table is empty, but the FTS index contains entries for
6 different tokens. In this case the following queries will return
inconsistent results as follows:
</p><div class="codeblock"><pre><i>-- Returns 0 rows. Since it does not use the FTS index, the query is</i>
<i>-- passed directly through to table 'tbl', which contains no data.</i>
SELECT * FROM t1;
<i>-- Returns 1 row. The "rowid" field of the returned row is 2, and</i>
<i>-- the "t" field set to NULL. "t" is set to NULL because when the external</i>
<i>-- content table "tbl" was queried for the data associated with the row</i>
<i>-- with a=2 ("a" is the content_rowid column), none could be found.</i>
SELECT rowid, t FROM t1('gold')
</pre></div>
<p>As described in the previous section, triggers on the content table are
a good way to ensure that an FTS5 external content table is kept consistent.
However, triggers are only fired when rows are inserted, updated or deleted
in the content table. This means that if, for example, a database is created
as follows:
</p><div class="codeblock"><pre><i>-- Create and populate a table. </i>
CREATE TABLE tbl(a INTEGER PRIMARY KEY, t TEXT);
INSERT INTO tbl VALUES(1, 'all that glitters');
INSERT INTO tbl VALUES(2, 'is not gold');
<i>-- Create an external content FTS5 table </i>
CREATE VIRTUAL TABLE ft USING fts5(t, content='tbl', content_rowid='a');
<i>-- Create triggers to keep the FTS5 table up to date</i>
CREATE TRIGGER tbl_ai AFTER INSERT ON tbl BEGIN
INSERT INTO ft(rowid, t) VALUES (new.a, new.t);
END;
<similar triggers for update + delete>
</pre></div>
<p>then the content table and external content FTS5 table are inconsistent, as
creating the triggers does not copy existing rows from the content table
into the FTS index. The triggers are only able to ensure that updates made to
the content table after they are created are reflected in the FTS index.
</p><p>In this, and any other situation where the FTS index and its content table
have become inconsistent, the <a href="#the_rebuild_command">'rebuild'</a>
command may be used to completely discard the contents of the FTS index and
rebuild it based on the current contents of the content table.
</p><a name="the_columnsize_option"></a>
<h2 tags="FTS5 columnsize option" id="the_columnsize_option"><span>4.5. </span>The Columnsize Option</h2>
<p>Normally, FTS5 maintains a special backing table within the database that
stores the size of each column value in tokens inserted into the main FTS5
table in a separate table. This backing table is used by the
<a href="#xColumnSize">xColumnSize</a><a> API function, which is in turn used by
the built-in <a href="fts5.html#the_bm25_function">bm25 ranking function</a> (and is likely to be useful
to other ranking functions as well).
</a></p><p>In order to save space, this backing table may be omitted by setting the
columnsize option to zero. For example:
</p><div class="codeblock"><pre><i>-- A table without the xColumnSize() values stored on disk:</i>
CREATE VIRTUAL TABLE ft USING fts5(a, b, c, columnsize=0);
<i>-- Three equivalent ways of creating a table that does store the</i>
<i>-- xColumnSize() values on disk:</i>
CREATE VIRTUAL TABLE ft USING fts5(a, b, c);
CREATE VIRTUAL TABLE ft USING fts5(a, b, c, columnsize=1);
CREATE VIRTUAL TABLE ft USING fts5(a, b, columnsize='1', c);
</pre></div>
<p> It is an error to set the columnsize option to any value other than
0 or 1.
</p><p> If an FTS5 table is configured with columnsize=0 but is not a
<a href="fts5.html#contentless_tables">contentless table</a>, the xColumnSize API function
still works, but runs much more slowly. In this case, instead of reading
the value to return directly from the database, it reads the text value
itself and count the tokens within it on demand.
</p><p>Or, if the table is also a <a href="fts5.html#contentless_tables">contentless table</a>,
then the following apply:
</p><ul>
<li> <p>The xColumnSize API always returns -1. There is no way to determine
the number of tokens in a value stored within a contentless FTS5 table
configured with columnsize=0.
</p></li><li> <p>Each inserted row must be accompanied by an explicitly specified rowid
value. If a contentless table is configured with columnsize=0,
attempting to insert a NULL value into the rowid is an SQLITE_MISMATCH
error.
</p></li><li> <p>All queries on the table must be full-text queries. In other words,
they must use the MATCH or = operator with the table-name column as the
left-hand operand, or else use the table-valued function syntax. Any
query that is not a full-text query results in an error.
</p></li></ul>
<p> The name of the table in which the xColumnSize values are stored
(unless columnsize=0 is specified) is "<name>_docsize", where
<name> is the name of the FTS5 table itself. The
<a href="https://www.sqlite.org/download.html">sqlite3_analyzer</a>
tool may be used on an existing database in order to determine how much
space might be saved by recreating an FTS5 table using columnsize=0.
</p><a name="the_detail_option"></a>
<h2 tags="FTS5 detail option" id="the_detail_option"><span>4.6. </span>The Detail Option</h2>
<p> For each term in a document, the FTS index maintained by FTS5
stores the rowid of the document, the column number of the column that contains
the term and the offset of the term within the column value. The "detail"
option may be used to omit some of this information. This reduces the space
that the index consumes within the database file, but also reduces the
capability and efficiency of the system.
</p><p> The detail option may be set to "full" (the default value), "column" or
"none". For example:
</p><div class="codeblock"><pre><i>-- The following two lines are equivalent (because the default value</i>
<i>-- of "detail" is "full". </i>
CREATE VIRTUAL TABLE ft1 USING fts5(a, b, c);
CREATE VIRTUAL TABLE ft1 USING fts5(a, b, c, detail=full);
CREATE VIRTUAL TABLE ft2 USING fts5(a, b, c, detail=column);
CREATE VIRTUAL TABLE ft3 USING fts5(a, b, c, detail=none);
</pre></div>
<p>If the detail option is set to <b>column</b>, then for each term the FTS
index records the rowid and column number only, omitting the term offset
information. This results in the following restrictions:
</p><ul>
<li> NEAR queries are not available.
</li><li> Phrase queries are not available.
</li><li> Assuming the table is not also a
<a href="fts5.html#contentless_tables">contentless table</a>, the
<a href="#xInstCount">xInstCount</a>, <a href="#xInst">xInst</a>,
<a href="#xPhraseFirst">xPhraseFirst</a> and <a href="#xPhraseNext">xPhraseNext</a>
are slower than usual. This is because instead of reading the required data
directly from the FTS index they have to load and tokenize the document text
on demand.
</li><li> If the table is also a contentless table, the xInstCount, xInst,
xPhraseFirst and xPhraseNext APIs behave as if the current row contains no
phrase matches at all (i.e. xInstCount() returns 0).
</li></ul>
<p>If the detail option is set to <b>none</b>, then for each term the FTS
index records just the rowid is stored. Both column and offset information
are omitted. As well as the restrictions itemized above for detail=column
mode, this imposes the following extra limitations:
</p><ul>
<li> Column filter queries are not available.
</li><li> Assuming the table is not also a contentless table, the
<a href="#xPhraseFirstColumn">xPhraseFirstColumn</a> and
<a href="#xPhraseNextColumn">xPhraseNextColumn</a> are slower than usual.
</li><li> If the table is also a contentless table, the xPhraseFirstColumn and
xPhraseNextColumn APIs behave as if the current row contains no phrase
matches at all (i.e. xPhraseFirstColumn() sets the iterator to EOF).
</li></ul>
<p> In one test that indexed a large set of emails (1636 MiB on disk), the FTS
index was 743 MiB on disk with detail=full, 340 MiB with detail=column and 134
MiB with detail=none.
</p><a name="the_tokendata_option"></a>
<h2 tags="FTS5 tokendata option" id="the_tokendata_option"><span>4.7. </span>The Tokendata Option</h2>
<p>This option is only useful to applications that implement
<a href="#custom_tokenizers">custom tokenizers</a>. Usually, tokenizers may
return tokens that consist of any sequence of bytes, including 0x00 bytes.
However, if the table specifies the tokendata=1 option, then fts5 ignores
the first 0x00 byte and any trailing data in the token for the purposes
of matching. It still stores the entire token as returned by the tokenizer,
but it is ignored by the fts5 core.
</p><p>The full version of the token, including any 0x00 byte and trailing data,
is available to <a href="#custom_auxiliary_functions">custom auxiliary
functions</a> via the <a href="#xQueryToken">xQueryToken</a> and
<a href="#xInstToken">xInstToken</a> APIs.
</p><p>This may be useful for ranking functions. A custom tokenizer may
add extra data to some document tokens allowing a ranking function to give
more weight to hits of some tokens (e.g. those in document headings).
</p><p>Alternatively, the combination of a custom tokenizer and a custom auxiliary
function may be used to implement
<a href="https://www.unicode.org/reports/tr10/tr10-41.html#Asymmetric_Search">
asymmetric search</a>. The tokenizer could (say) for each document token return
the case-normalized and unmarked version of the token, followed by an 0x00
byte, followed by the full text of the token from the document. When queried,
fts5 would provide results as if all characters in the query were
case-normalized and unmarked. The custom auxiliary function could then be used
in the WHERE clause of the query to filter out any rows that do not match based
on secondary or tertiary markings in the document or query terms.
</p><a name="_auxiliary_functions_"></a>
<h1 tags="FTS5 auxiliary functions" id="_auxiliary_functions_"><span>5. </span> Auxiliary Functions </h1>
<p> Auxiliary functions are similar to <a href="lang_corefunc.html">SQL scalar functions</a>,
except that they may only be used within full-text queries (those that use
the MATCH operator, or LIKE/GLOB with the trigram tokenizer) on an FTS5 table.
Their results are calculated based not only on the arguments passed to them,
but also on the current match and matched row. For example, an auxiliary
function may return a numeric value indicating the accuracy of the match (see
the <a href="fts5.html#the_bm25_function">bm25()</a> function), or a fragment of text from the matched row
that contains one or more instances of the search terms (see the <a href="fts5.html#the_snippet_function">snippet()</a> function).
</p><p>To invoke an auxiliary function, the name of the FTS5 table should be
specified as the first argument. Other arguments may follow the first,
depending on the specific auxiliary function being invoked. For example, to
invoke the "highlight" function:
</p><div class="codeblock"><pre>SELECT highlight(email, 2, '<b>', '</b>') FROM email WHERE email MATCH 'fts5'
</pre></div>
<p>The built-in auxiliary functions provided as part of FTS5 are described in
the following section. Applications may also implement
<a href="fts5.html#custom_auxiliary_functions">custom auxiliary functions in C</a>.
</p><h2 id="built_in_auxiliary_functions"><span>5.1. </span>Built-in Auxiliary Functions</h2>
<p> FTS5 provides three built-in auxiliary functions:
</p><ul>
<li> The <a href="fts5.html#the_bm25_function">bm25() auxiliary function</a> returns a real value
reflecting the accuracy of the current match. Better matches are
assigned numerically lower values.
</li><li> The <a href="fts5.html#the_highlight_function">highlight() auxiliary function</a> returns a copy
of the text from one of the columns of the current match with each
instance of a queried term within the result surrounded by specified
markup (for example "<b>" and "</b>").
</li><li> The <a href="fts5.html#the_snippet_function">snippet() auxiliary function</a> selects a short
fragment of text from one of the columns of the matched row and returns
it with each instance of a queried term surrounded by markup in
the same manner as the highlight() function. The fragment of text is
selected so as to maximize the number of distinct queried terms it
contains. Higher weight is given to snippets that occur at the start
of a column value, or that immediately follow "." or ":" characters
in the text.
</li></ul>
<a name="the_bm25_function"></a>
<h3 tags="FTS5 bm25" id="the_bm25_function"><span>5.1.1. </span>The bm25() function</h3>
<p> The built-in auxiliary function bm25() returns a real value indicating
how well the current row matches the full-text query. The better the match,
the numerically smaller the value returned. A query such as the following may
be used to return matches in order from best to worst match:
</p><div class="codeblock"><pre>SELECT * FROM fts WHERE fts MATCH ? ORDER BY bm25(fts)
</pre></div>
<p> In order to calculate a documents score, the full-text query is separated
into its component phrases. The bm25 score for document <i>D</i> and
query <i>Q</i> is then calculated as follows:
</p><p> <img src="images/fts5_formula1.png" style="width:75ex;margin-left:5ex">
</p><p> In the above, <i>nPhrase</i> is the number of phrases in the query.
<i>|D|</i> is the number of tokens in the current document, and
<i>avgdl</i> is the average number of tokens in all documents within the
FTS5 table. <i>k<sub>1</sub></i> and <i>b</i> are both constants,
hard-coded at 1.2 and 0.75 respectively.
</p><p> The "-1" term at the start of the formula is not found in most
implementations of the BM25 algorithm. Without it, a better match is assigned
a numerically higher BM25 score. Since the default sorting order is
"ascending", this means that appending "ORDER BY bm25(fts)" to a query would
cause results to be returned in order from worst to best. The "DESC" keyword
would be required in order to return the best matches first. In order to
avoid this pitfall, the FTS5 implementation of BM25 multiplies the result
by -1 before returning it, ensuring that better matches are assigned
numerically lower scores.
</p><p> <i>IDF(q<sub>i</sub>)</i> is the inverse-document-frequency of query
phrase <i>i</i>. It is calculated as follows, where <i>N</i> is the total
number of rows in the FTS5 table and <i>n(q<sub>i</sub>)</i> is the total
number of rows that contain at least one instance of phrase <i>i</i>:
</p><p> <img src="images/fts5_formula2.png" style="width:75ex;margin-left:5ex">
</p><p> Finally, <i>f(q<sub>i</sub>,D)</i> is the phrase frequency of phrase
<i>i</i>. By default, this is simply the number of occurrences of the phrase
within the current row. However, by passing extra real value arguments to
the bm25() SQL function, each column of the table may be assigned a different
weight and the phrase frequency calculated as follows:
</p><p> <img src="images/fts5_formula3.png" style="width:75ex;margin-left:5ex">
</p><p> where <i>w<sub>c</sub></i> is the weight assigned to column <i>c</i> and
<i>n(q<sub>i</sub>,c)</i> is the number of occurrences of phrase <i>i</i> in
column <i>c</i> of the current row. The first argument passed to bm25()
following the table name is the weight assigned to the leftmost column of
the FTS5 table. The second is the weight assigned to the second leftmost
column, and so on. If there are not enough arguments for all table columns,
remaining columns are assigned a weight of 1.0. If there are too many
trailing arguments, the extras are ignored. For example:
</p><div class="codeblock"><pre><i>-- Assuming the following schema:</i>
CREATE VIRTUAL TABLE email USING fts5(sender, title, body);
<i>-- Return results in bm25 order, with each phrase hit in the "sender"</i>
<i>-- column considered the equal of 10 hits in the "body" column, and</i>
<i>-- each hit in the "title" column considered as valuable as 5 hits in</i>
<i>-- the "body" column.</i>
SELECT * FROM email WHERE email MATCH ? ORDER BY bm25(email, 10.0, 5.0);
</pre></div>
<p>Refer to wikipedia for
<a href="https://en.wikipedia.org/wiki/Okapi_BM25">more information regarding
BM25</a> and its variants.
</p><a name="the_highlight_function"></a>
<h3 tags="FTS5 highlight" id="the_highlight_function"><span>5.1.2. </span>The highlight() function</h3>
<p> The highlight() function returns a copy of the text from a specified
column of the current row with extra markup text inserted to mark the start
and end of phrase matches.
</p><p>The highlight() must be invoked with exactly three arguments following
the table name. To be interpreted as follows:
</p><ol>
<li> An integer indicating the index of the FTS table column to read the
text from. Columns are numbered from left to right starting at zero.
</li><li> The text to insert before each phrase match.
</li><li> The text to insert after each phrase match.
</li></ol>
<p>For example:
</p><div class="codeblock"><pre><i>-- Return a copy of the text from the leftmost column of the current</i>
<i>-- row, with phrase matches marked using html "b" tags.</i>
SELECT highlight(fts, 0, '<b>', '</b>') FROM fts WHERE fts MATCH ?
</pre></div>
<p>In cases where two or more phrase instances overlap (share one or more
tokens in common), a single open and close marker is inserted for each set
of overlapping phrases. For example:
</p><div class="codeblock"><pre><i>-- Assuming this:</i>
CREATE VIRTUAL TABLE ft USING fts5(a);
INSERT INTO ft VALUES('a b c x c d e');
INSERT INTO ft VALUES('a b c c d e');
INSERT INTO ft VALUES('a b c d e');
<i>-- The following SELECT statement returns these three rows:</i>
<i>-- '[a b c] x [c d e]'</i>
<i>-- '[a b c] [c d e]'</i>
<i>-- '[a b c d e]'</i>
SELECT highlight(ft, 0, '[', ']') FROM ft WHERE ft MATCH 'a+b+c AND c+d+e';
</pre></div>
<a name="the_snippet_function"></a>
<h3 tags="FTS5 snippet" id="the_snippet_function"><span>5.1.3. </span>The snippet() function</h3>
<p>The snippet() function is similar to highlight(), except that instead of
returning entire column values, it automatically selects and extracts a
short fragment of document text to process and return. The snippet() function
must be passed five parameters following the table name argument:
</p><ol>
<li> An integer indicating the index of the FTS table column to select
the returned text from. Columns are numbered from left to right
starting at zero. A negative value indicates that the column should
be automatically selected.
</li><li> The text to insert before each phrase match within the returned text.
</li><li> The text to insert after each phrase match within the returned text.
</li><li> The text to add to the start or end of the selected text to indicate
that the returned text does not occur at the start or end of its column,
respectively.
</li><li> The maximum number of tokens in the returned text. This must be greater
than zero and equal to or less than 64.
</li></ol>
<a name="sorting_by_auxiliary_function_results"></a>
<h2 tags="auxiliary function mapping" id="sorting_by_auxiliary_function_results"><span>5.2. </span>Sorting by Auxiliary Function Results</h2>
<p> All FTS5 tables feature a special hidden column named "rank". If the
current query is not a full-text query (i.e. if it does not include a MATCH
operator), the value of the "rank" column is always NULL. Otherwise, in a
full-text query, column rank contains by default the same value as would be
returned by executing the bm25() auxiliary function with no trailing
arguments.
</p><p> The difference between reading from the rank column and using the bm25()
function directly within the query is only significant when sorting by the
returned value. In this case, using "rank" is faster than using bm25().
</p><div class="codeblock"><pre><i>-- The following queries are logically equivalent. But the second may</i>
<i>-- be faster, particularly if the caller abandons the query before</i>
<i>-- all rows have been returned (or if the queries were modified to </i>
<i>-- include LIMIT clauses).</i>
SELECT * FROM fts WHERE fts MATCH ? ORDER BY bm25(fts);
SELECT * FROM fts WHERE fts MATCH ? ORDER BY rank;
</pre></div>
<p> Instead of using bm25() with no trailing arguments, the specific auxiliary
function mapped to the rank column may be configured either on a per-query
basis, or by setting a different persistent default for the FTS table.
</p><p> In order to change the mapping of the rank column for a single query,
a term similar to either of the following is added to the WHERE clause of a
query:
</p><div class="codeblock"><pre>rank MATCH 'auxiliary-function-name(arg1, arg2, ...)'
rank = 'auxiliary-function-name(arg1, arg2, ...)'
</pre></div>
<p> The right-hand-side of the MATCH or = operator must be a constant
expression that evaluates to a string consisting of the auxiliary function to
invoke, followed by zero or more comma separated arguments within parenthesis.
Arguments must be SQL literals. For example:
</p><div class="codeblock"><pre><i>-- The following queries are logically equivalent. But the second may</i>
<i>-- be faster. See above. </i>
SELECT * FROM fts WHERE fts MATCH ? ORDER BY bm25(fts, 10.0, 5.0);
SELECT * FROM fts WHERE fts MATCH ? AND rank MATCH 'bm25(10.0, 5.0)' ORDER BY rank;
</pre></div>
<p> The table-valued function syntax may also be used to specify an alternative
ranking function. In this case the text describing the ranking function should
be specified as the second table-valued function argument. The following three
queries are equivalent:
</p><div class="codeblock"><pre>SELECT * FROM fts WHERE fts MATCH ? AND rank MATCH 'bm25(10.0, 5.0)' ORDER BY rank;
SELECT * FROM fts WHERE fts = ? AND rank = 'bm25(10.0, 5.0)' ORDER BY rank;
SELECT * FROM fts WHERE fts(?, 'bm25(10.0, 5.0)') ORDER BY rank;
</pre></div>
<p> The default mapping of the rank column for a table may be modified
using the <a href="fts5.html#the_rank_configuration_option">FTS5 rank configuration option</a>.
</p><h1 id="special_insert_commands"><span>6. </span>Special INSERT Commands</h1>
<a name="the_automerge_configuration_option"></a>
<h2 tags="FTS5 automerge option" id="the_automerge_configuration_option"><span>6.1. </span>The 'automerge' Configuration Option</h2>
<p>
Instead of using a single data structure on disk to store the full-text
index, FTS5 uses a series of b-trees. Each time a new transaction is
committed, a new b-tree containing the contents of the committed transaction
is written into the database file. When the full-text index is queried, each
b-tree must be queried individually and the results merged before being
returned to the user.
</p><p>
In order to prevent the number of b-trees in the database from becoming too
large (slowing down queries), smaller b-trees are periodically merged into
single larger b-trees containing the same data. By default, this happens
automatically within INSERT, UPDATE or DELETE statements that modify the
full-text index. The 'automerge' parameter determines how many smaller
b-trees are merged together at a time. Setting it to a small value can
speed up queries (as they have to query and merge the results from fewer
b-trees), but can also slow down writing to the database (as each INSERT,
UPDATE or DELETE statement has to do more work as part of the automatic
merging process).
</p><p>
Each of the b-trees that make up the full-text index is assigned to a "level"
based on its size. Level-0 b-trees are the smallest, as they contain the
contents of a single transaction. Higher level b-trees are the result of
merging two or more level-0 b-trees together and so they are larger. FTS5
begins to merge b-trees together once there exist <i>M</i> or more b-trees
with the same level, where <i>M</i> is the value of the 'automerge'
parameter.
</p><p>
The maximum allowed value for the 'automerge' parameter is 16. The default
value is 4. Setting the 'automerge' parameter to 0 disables the automatic
incremental merging of b-trees altogether.
</p><div class="codeblock"><pre>INSERT INTO ft(ft, rank) VALUES('automerge', 8);
</pre></div>
<h2 id="the_crisismerge_configuration_option"><span>6.2. </span>The 'crisismerge' Configuration Option</h2>
<p>The 'crisismerge' option is similar to 'automerge', in that it determines
how and how often the component b-trees that make up the full-text index are
merged together. Once there exist <i>C</i> or more b-trees on a single level
within the full-text index, where <i>C</i> is the value of the 'crisismerge'
option, all b-trees on the level are immediately merged into a single b-tree.
</p><p>The difference between this option and the 'automerge' option is that when
the 'automerge' limit is reached FTS5 only begins to merge the b-trees
together. Most of the work is performed as part of subsequent INSERT,
UPDATE or DELETE operations. Whereas when the 'crisismerge' limit is reached,
the offending b-trees are all merged immediately. This means that an INSERT,
UPDATE or DELETE that triggers a crisis-merge may take a long time to
complete.
</p><p>The default 'crisismerge' value is 16. There is no maximum limit. Attempting
to set the 'crisismerge' parameter to a value of 0 or 1 is equivalent to
setting it to the default value (16). It is an error to attempt to set the
'crisismerge' option to a negative value.
</p><div class="codeblock"><pre>INSERT INTO ft(ft, rank) VALUES('crisismerge', 16);
</pre></div>
<a name="the_delete_command"></a>
<h2 tags="FTS5 delete command" id="the_delete_command"><span>6.3. </span>The 'delete' Command</h2>
<p> This command is only available with <a href="fts5.html#external_content_tables">external content</a> and <a href="fts5.html#contentless_tables">contentless</a> tables. It
is used to delete the index entries associated with a single row from the
full-text index. This command and the <a href="fts5.html#the_delete_all_command">delete-all</a>
command are the only ways to remove entries from the full-text index of a
contentless table.
</p><p> In order to use this command to delete a row, the text value 'delete'
must be inserted into the special column with the same name as the table.
The rowid of the row to delete is inserted into the rowid column. The
values inserted into the other columns must match the values currently
stored in the table. For example:
</p><div class="codeblock"><pre><i>-- Insert a row with rowid=14 into the fts5 table.</i>
INSERT INTO ft(rowid, a, b, c) VALUES(14, $a, $b, $c);
<i>-- Remove the same row from the fts5 table.</i>
INSERT INTO ft(ft, rowid, a, b, c) VALUES('delete', 14, $a, $b, $c);
</pre></div>
<p> If the values "inserted" into the text columns as part of a 'delete'
command are not the same as those currently stored within the table, the
results may be unpredictable.
</p><p> The reason for this is easy to understand: When a document is inserted
into the FTS5 table, an entry is added to the full-text index to record the
position of each token within the new document. When a document is removed,
the original data is required in order to determine the set of entries that
need to be removed from the full-text index. So if the data supplied to FTS5
when a row is deleted using this command is different from that used to
determine the set of token instances when it was inserted, some full-text
index entries may not be correctly deleted, or FTS5 may try to remove index
entries that do not exist. This can leave the full-text index in an
unpredictable state, making future query results unreliable.
</p><a name="the_delete_all_command"></a>
<h2 tags="FTS5 delete-all command" id="the_delete_all_command"><span>6.4. </span>The 'delete-all' Command</h2>
<p> This command is only available with <a href="fts5.html#external_content_tables">external content</a> and <a href="fts5.html#contentless_tables">contentless</a> tables (including
<a href="fts5.html#contentless_delete_tables">contentless-delete</a> tables. It deletes all
entries from the full-text index.
</p><div class="codeblock"><pre>INSERT INTO ft(ft) VALUES('delete-all');
</pre></div>
<a name="the_deletemerge_configuration_option"></a>
<h2 tags="deletemerge" id="the_deletemerge_configuration_option"><span>6.5. </span>The 'deletemerge' Configuration Option</h2>
<p> The 'deletemerge' option is only used by
<a href="fts5.html#contentless_delete_tables">contentless-delete</a> tables.
</p><p> When a row is deleted from a contentless-delete table, the entries
associated with its tokens are not immediately removed from the FTS index.
Instead, a "tombstone" marker containing the rowid of the deleted row is
attached to the b-tree that contains the row's FTS index entries. When the
b-tree is queried, any query result rows for which there exist tombstone
markers are omitted from the results. When the b-tree is merged with other
b-trees, both the deleted rows and their tombstone markers are discarded.
</p><p> This option specifies a minimum percentage of rows in a b-tree that must
have tombstone markers before the b-tree is made eligible for merging -
either by <a href="#the_automerge_configuration_option">automatic</a> merges or
explicit user <a href="the_merge_command">'merge'</a> commands - even if it
does not meet the usual criteria as determined by the 'automerge' and
<a href="#the_usermerge_configuration_option">'usermerge'</a> options.
</p><p> For example, to specify that FTS5 should consider merging a component
b-tree after 15% of its rows have associated tombstone markers:
</p><div class="codeblock"><pre>INSERT INTO ft(ft, rank) VALUES('deletemerge', 15);
</pre></div>
<p> The default value of this option is 10. Attempting to set it to less than
zero restores the default value. Setting this option to 0 or to greater than
100 ensures that b-trees are never made eligible for merging due to tombstone
markers.
</p><h2 id="the_integrity_check_command"><span>6.6. </span>The 'integrity-check' Command</h2>
<p> This command is used to verify that the full-text index is internally
consistent, and, optionally, that it is consistent with any
<a href="fts5.html#external_content_tables">external content</a> table.
</p><p>The integrity-check command is invoked by inserting the text value
'integrity-check' into the special column with the same name as the FTS5
table. If a value is supplied for the "rank" column, it must be either
0 or 1. For example:
</p><div class="codeblock"><pre>INSERT INTO ft(ft) VALUES('integrity-check');
INSERT INTO ft(ft, rank) VALUES('integrity-check', 0);
INSERT INTO ft(ft, rank) VALUES('integrity-check', 1);
</pre></div>
<p>The three forms above are equivalent for all FTS tables that are
not external content tables. They check that the index data structures are
not corrupt, and, if the FTS table is not contentless, that the contents of
the index match the contents of the table itself.
</p><p>For an external content table, the contents of the index are only
compared to the contents of the external content table if the value
specified for the rank column is 1.
</p><p>In all cases, if any discrepancies are found, the command fails
with an <a href="rescode.html#corrupt_vtab">SQLITE_CORRUPT_VTAB</a> error.
</p><a name="the_merge_command"></a>
<h2 tags="FTS5 merge command" id="the_merge_command"><span>6.7. </span>The 'merge' Command</h2>
<div class="codeblock"><pre>INSERT INTO ft(ft, rank) VALUES('merge', 500);
</pre></div>
<p> This command merges b-tree structures together until roughly N pages
of merged data have been written to the database, where N is the absolute
value of the parameter specified as part of the 'merge' command. The size of
each page is as configured by the <a href="fts5.html#the_pgsz_configuration_option">FTS5 pgsz option</a>.
</p><p> If the parameter is a positive value, B-tree structures are only eligible
for merging if one of the following is true:
</p><ul>
<li> There are U or more such b-trees on a
single level (see the documentation for the <a href="fts5.html#the_automerge_configuration_option">FTS5 automerge option</a>
for an explanation of b-tree levels), where U is the value assigned
to the <a href="fts5.html#the_usermerge_configuration_option">FTS5 usermerge option</a> option.
</li><li> A merge has already been started (perhaps by a 'merge' command that
specified a negative parameter).
</li></ul>
<p> It is possible to tell whether or not the 'merge' command found any
b-trees to merge together by checking the value returned by the
<a href="c3ref/total_changes.html">sqlite3_total_changes()</a> API before and after the command is executed. If
the difference between the two values is 2 or greater, then work was performed.
If the difference is less than 2, then the 'merge' command was a no-op. In this
case there is no reason to execute the same 'merge' command again, at least
until after the FTS table is next updated.
</p><p> If the parameter is negative, and there are B-tree structures on more than
one level within the FTS index, all B-tree structures are assigned to the same
level before the merge operation is commenced. Additionally, if the parameter
is negative, the value of the usermerge configuration option is not
respected - as few as two b-trees from the same level may be merged together.
</p><p> The above means that executing the 'merge' command with a negative
parameter until the before and after difference in the return value of
<a href="c3ref/total_changes.html">sqlite3_total_changes()</a> is less than two optimizes the FTS index in the
same way as the <a href="fts5.html#the_optimize_command">FTS5 optimize command</a>. However, if a new b-tree is added
to the FTS index while this process is ongoing, FTS5 will move the new
b-tree to the same level as the existing b-trees and restart the merge. To
avoid this, only the first call to 'merge' should specify a negative parameter.
Each subsequent call to 'merge' should specify a positive value so that the
merge started by the first call is run to completion even if new b-trees are
added to the FTS index.
</p><a name="the_optimize_command"></a>
<h2 tags="FTS5 optimize command" id="the_optimize_command"><span>6.8. </span>The 'optimize' Command</h2>
<p>This command merges all individual b-trees that currently make up the
full-text index into a single large b-tree structure. This ensures that the
full-text index consumes the minimum space within the database and is in the
fastest form to query.
</p><p>Refer to the documentation for the <a href="fts5.html#the_automerge_configuration_option">FTS5 automerge option</a> for more details
regarding the relationship between the full-text index and its component
b-trees.
</p><div class="codeblock"><pre>INSERT INTO ft(ft) VALUES('optimize');
</pre></div>
<p>Because it reorganizes the entire FTS index, the optimize command can
take a long time to run. The <a href="fts5.html#the_merge_command">FTS5 merge command</a> can be used to divide
the work of optimizing the FTS index into multiple steps. To do this:
</p><ul>
<li> Invoke the 'merge' command once with the parameter set to -N, then
</li><li> Invoke the 'merge' command zero or more times with the parameter set to N.
</li></ul>
<p>where N is the number of pages of data to merge within each invocation of
the merge command. The application should stop invoking merge when the
difference in the value returned by the sqlite3_total_changes() function before
and after the merge command drops to below two. The merge commands may be
issued as part of the same or separate transactions, and by the same or
different database clients. Refer to the documentation for the
<a href="fts5.html#the_merge_command">merge command</a> for further details.
</p><a name="the_pgsz_configuration_option"></a>
<h2 tags="FTS5 pgsz option" id="the_pgsz_configuration_option"><span>6.9. </span>The 'pgsz' Configuration Option</h2>
<p> This command is used to set the persistent "pgsz" option.
</p><p> The full-text index maintained by FTS5 is stored as a series of fixed-size
blobs in a database table. It is not strictly necessary for all blobs that make
up a full-text index to be the same size. The pgsz option determines the size
of all blobs created by subsequent index writers. The default value is 1000.
</p><div class="codeblock"><pre>INSERT INTO ft(ft, rank) VALUES('pgsz', 4072);
</pre></div>
<a name="the_rank_configuration_option"></a>
<h2 tags="FTS5 rank configuration option" id="the_rank_configuration_option"><span>6.10. </span>The 'rank' Configuration Option</h2>
<p> This command is used to set the persistent "rank" option.
</p><p> The rank option is used to change the default auxiliary function mapping
for the rank column. The option should be set to a text value in the same
format as described for <a href="fts5.html#sorting_by_auxiliary_function_results">"rank MATCH ?"</a> terms
above. For example:
</p><div class="codeblock"><pre>INSERT INTO ft(ft, rank) VALUES('rank', 'bm25(10.0, 5.0)');
</pre></div>
<a name="the_rebuild_command"></a>
<h2 tags="FTS5 rebuild command" id="the_rebuild_command"><span>6.11. </span>The 'rebuild' Command</h2>
<p> This command first deletes the entire full-text index, then rebuilds it
based on the contents of the table or <a href="fts5.html#external_content_tables">content
table</a>. It is not available with <a href="fts5.html#contentless_tables">contentless
tables</a>.
</p><div class="codeblock"><pre>INSERT INTO ft(ft) VALUES('rebuild');
</pre></div>
<a name="the_secure_delete_configuration_option"></a>
<h2 tags="FTS5 secure-delete command" id="the_secure_delete_configuration_option"><span>6.12. </span>The 'secure-delete' Configuration Option</h2>
<p> This command is used to set the persistent boolean "secure-delete" option.
For example:
</p><div class="codeblock"><pre>INSERT INTO ft(ft, rank) VALUES('secure-delete', 1);
</pre></div>
<p> Normally, when an entry in an fts5 table is updated or deleted, instead
of removing entries from the full-text index, delete-keys are added to the <a href="#the_automerge_configuration_option">new b-tree</a> created by the
transaction. This is efficient, but it means that the old full-text index
entries remain in the database file until they are eventually removed
by merge operations on the full-text index. Anyone with access to the
database can use these entries to trivially reconstruct the contents of
deleted FTS5 table rows. However, if the 'secure-delete' option is set
to 1, then full-text entries are actually removed from the database when
existing FTS5 table rows are updated or deleted. This is slower, but
it prevents old full-text entries from being used to reconstruct deleted
table rows.
</p><p> This option ensures that old full-text entries are not available to
attackers with SQL access to the database. To also ensure that they may
not be recovered by attackers with access to the SQLite database file
itself, the application must also enable the SQLite core secure-delete
option with a command like <a href="pragma.html#pragma_secure_delete">
"PRAGMA secure_delete = 1"</a>.
</p><p> <b>Warning:</b> Once one or more table rows have been updated or
deleted with this option set, the FTS5 table may no longer be read or
written by any version of FTS5 earlier than 3.42.0 (the first version
in which this option was available). Attempting to do so results in
an error, with an error message like "invalid fts5 file format (found 5,
expected 4) - run 'rebuild'". The FTS5 file format may be reverted, so
that it may be read by earlier versions of FTS5, by running the
<a href="#the_rebuild_command">'rebuild' command</a> on the table using
version 3.42.0 or later.
</p><p>The default value of the secure-delete option is 0.
</p><a name="the_usermerge_configuration_option"></a>
<h2 tags="FTS5 usermerge option" id="the_usermerge_configuration_option"><span>6.13. </span>The 'usermerge' Configuration Option</h2>
<p> This command is used to set the persistent "usermerge" option.
</p><p> The usermerge option is similar to the automerge and crisismerge options.
It is the minimum number of b-tree segments that will be merged together by
a 'merge' command with a positive parameter. For example:
</p><div class="codeblock"><pre>INSERT INTO ft(ft, rank) VALUES('usermerge', 4);
</pre></div>
<p> The default value of the usermerge option is 4. The minimum allowed value
is 2, and the maximum 16.
</p><a name="extending_fts5"></a>
<h1 tags="Extending FTS5" id="extending_fts5"><span>7. </span>Extending FTS5</h1>
<p>FTS5 features APIs allowing it to be extended by:
</p><ul>
<li> Adding new auxiliary functions implemented in C, and
</li><li> Adding new tokenizers, also implemented in C.
</li></ul>
<p> The built-in tokenizers and auxiliary functions described in this
document are all implemented using the publicly available API described
below.
</p><p> Before a new auxiliary function or tokenizer implementation may be
registered with FTS5, an application must obtain a pointer to the "fts5_api"
structure. There is one fts5_api structure for each database connection with
which the FTS5 extension is registered. To obtain the pointer, the application
invokes the SQL user-defined function fts5() with a single argument. That
argument must be set to a pointer to a pointer to an fts5_api object
using the <a href="c3ref/bind_blob.html">sqlite3_bind_pointer()</a> interface.
The following example code demonstrates the technique:
</p><div class="codeblock"><pre><i>/*
** Return a pointer to the fts5_api pointer for database connection db.
** If an error occurs, return NULL and leave an error in the database
** handle (accessible using sqlite3_errcode()/errmsg()).
*/</i>
fts5_api *fts5_api_from_db(sqlite3 *db){
fts5_api *pRet = 0;
sqlite3_stmt *pStmt = 0;
if( SQLITE_OK==sqlite3_prepare(db, "SELECT fts5(?1)", -1, &pStmt, 0) ){
sqlite3_bind_pointer(pStmt, 1, (void*)&pRet, "fts5_api_ptr", NULL);
sqlite3_step(pStmt);
}
sqlite3_finalize(pStmt);
return pRet;
}
</pre></div>
<p><b>Backwards Compatibility Warning:</b>
Prior to SQLite version 3.20.0 (2017-08-01), the fts5() worked slightly
differently. Older applications that extend FTS5 must be revised to use
the new technique shown above.
</p><p> The fts5_api structure is defined as follows. It exposes three methods,
one each for registering new auxiliary functions and tokenizers, and one for
retrieving existing tokenizer. The latter is intended to facilitate the
implementation of "tokenizer wrappers" similar to the built-in
porter tokenizer.
</p><div class="codeblock"><pre>typedef struct fts5_api fts5_api;
struct fts5_api {
int iVersion; <i>/* Currently always set to 2 */</i>
<i>/* Create a new tokenizer */</i>
int (*xCreateTokenizer)(
fts5_api *pApi,
const char *zName,
void *pUserData,
fts5_tokenizer *pTokenizer,
void (*xDestroy)(void*)
);
<i>/* Find an existing tokenizer */</i>
int (*xFindTokenizer)(
fts5_api *pApi,
const char *zName,
void **ppUserData,
fts5_tokenizer *pTokenizer
);
<i>/* Create a new auxiliary function */</i>
int (*xCreateFunction)(
fts5_api *pApi,
const char *zName,
void *pUserData,
fts5_extension_function xFunction,
void (*xDestroy)(void*)
);
};
</pre></div>
<p> To invoke a method of the fts5_api object, the fts5_api pointer itself
should be passed as the methods first argument followed by the other, method
specific, arguments. For example:
</p><div class="codeblock"><pre>rc = pFts5Api->xCreateTokenizer(pFts5Api, ... other args ...);
</pre></div>
<p> The fts5_api structure methods are described individually in the following
sections.
</p><a name="custom_tokenizers"></a>
<h2 tags="custom tokenizers" id="custom_tokenizers"><span>7.1. </span>Custom Tokenizers</h2>
<p> To create a custom tokenizer, an application must implement three
functions: a tokenizer constructor (xCreate), a destructor (xDelete) and a
function to do the actual tokenization (xTokenize). The type of each
function is as for the member variables of the fts5_tokenizer struct:
</p><div class="codeblock"><pre>typedef struct Fts5Tokenizer Fts5Tokenizer;
typedef struct fts5_tokenizer fts5_tokenizer;
struct fts5_tokenizer {
int (*xCreate)(void*, const char **azArg, int nArg, Fts5Tokenizer **ppOut);
void (*xDelete)(Fts5Tokenizer*);
int (*xTokenize)(Fts5Tokenizer*,
void *pCtx,
int flags, <i>/* Mask of FTS5_TOKENIZE_* flags */</i>
const char *pText, int nText,
int (*xToken)(
void *pCtx, <i>/* Copy of 2nd argument to xTokenize() */</i>
int tflags, <i>/* Mask of FTS5_TOKEN_* flags */</i>
const char *pToken, <i>/* Pointer to buffer containing token */</i>
int nToken, <i>/* Size of token in bytes */</i>
int iStart, <i>/* Byte offset of token within input text */</i>
int iEnd <i>/* Byte offset of end of token within input text */</i>
)
);
};
<i>/* Flags that may be passed as the third argument to xTokenize() */</i>
#define FTS5_TOKENIZE_QUERY 0x0001
#define FTS5_TOKENIZE_PREFIX 0x0002
#define FTS5_TOKENIZE_DOCUMENT 0x0004
#define FTS5_TOKENIZE_AUX 0x0008
<i>/* Flags that may be passed by the tokenizer implementation back to FTS5
** as the third argument to the supplied xToken callback. */</i>
#define FTS5_TOKEN_COLOCATED 0x0001 <i>/* Same position as prev. token */</i>
</pre></div>
<p> The implementation is registered with the FTS5 module by calling the
xCreateTokenizer() method of the fts5_api object. If there is already a
tokenizer with the same name, it is replaced. If a non-NULL xDestroy parameter
is passed to xCreateTokenizer(), it is invoked with a copy of the pUserData
pointer passed as the only argument when the database handle is closed or when
the tokenizer is replaced.
</p><p> If successful, xCreateTokenizer() returns SQLITE_OK. Otherwise, it
returns an SQLite error code. In this case the xDestroy function is <b>not</b>
invoked.
</p><p> When an FTS5 table uses the custom tokenizer, the FTS5 core calls xCreate()
once to create a tokenizer, then xTokenize() zero or more times to tokenize
strings, then xDelete() to free any resources allocated by xCreate(). More
specifically:
</p><dl>
<dt><b>xCreate:</b></dt><dd><p style="margin-top:0">
This function is used to allocate and initialize a tokenizer instance.
A tokenizer instance is required to actually tokenize text.
</p><p>
The first argument passed to this function is a copy of the (void*)
pointer provided by the application when the fts5_tokenizer object
was registered with FTS5 (the third argument to xCreateTokenizer()).
The second and third arguments are an array of nul-terminated strings
containing the tokenizer arguments, if any, specified following the
tokenizer name as part of the CREATE VIRTUAL TABLE statement used
to create the FTS5 table.
</p><p>
The final argument is an output variable. If successful, (*ppOut)
should be set to point to the new tokenizer handle and SQLITE_OK
returned. If an error occurs, some value other than SQLITE_OK should
be returned. In this case, fts5 assumes that the final value of *ppOut
is undefined.
</p><p>
</p></dd><dt><b> xDelete:</b></dt><dd><p style="margin-top:0">
This function is invoked to delete a tokenizer handle previously
allocated using xCreate(). Fts5 guarantees that this function will
be invoked exactly once for each successful call to xCreate().
</p><p>
</p></dd><dt><b> xTokenize:</b></dt><dd><p style="margin-top:0">
This function is expected to tokenize the nText byte string indicated
by argument pText. pText may or may not be nul-terminated. The first
argument passed to this function is a pointer to an Fts5Tokenizer object
returned by an earlier call to xCreate().
</p><p>
The second argument indicates the reason that FTS5 is requesting
tokenization of the supplied text. This is always one of the following
four values:
</p><p>
</p><ul><li> <b>FTS5_TOKENIZE_DOCUMENT</b> - A document is being inserted into
or removed from the FTS table. The tokenizer is being invoked to
determine the set of tokens to add to (or delete from) the
FTS index.
<p>
</p></li><li> <b>FTS5_TOKENIZE_QUERY</b> - A MATCH query is being executed
against the FTS index. The tokenizer is being called to tokenize
a bareword or quoted string specified as part of the query.
<p>
</p></li><li> <b>(FTS5_TOKENIZE_QUERY | FTS5_TOKENIZE_PREFIX)</b> - Same as
FTS5_TOKENIZE_QUERY, except that the bareword or quoted string is
followed by a "*" character, indicating that the last token
returned by the tokenizer will be treated as a token prefix.
<p>
</p></li><li> <b>FTS5_TOKENIZE_AUX</b> - The tokenizer is being invoked to
satisfy an fts5_api.xTokenize() request made by an auxiliary
function. Or an fts5_api.xColumnSize() request made by the same
on a columnsize=0 database.
</li></ul>
<p>
For each token in the input string, the supplied callback xToken() must
be invoked. The first argument to it should be a copy of the pointer
passed as the second argument to xTokenize(). The third and fourth
arguments are a pointer to a buffer containing the token text, and the
size of the token in bytes. The 4th and 5th arguments are the byte offsets
of the first byte of and first byte immediately following the text from
which the token is derived within the input.
</p><p>
The second argument passed to the xToken() callback ("tflags") should
normally be set to 0. The exception is if the tokenizer supports
synonyms. In this case see the discussion below for details.
</p><p>
FTS5 assumes the xToken() callback is invoked for each token in the
order that they occur within the input text.
</p><p>
If an xToken() callback returns any value other than SQLITE_OK, then
the tokenization should be abandoned and the xTokenize() method should
immediately return a copy of the xToken() return value. Or, if the
input buffer is exhausted, xTokenize() should return SQLITE_OK. Finally,
if an error occurs with the xTokenize() implementation itself, it
may abandon the tokenization and return any error code other than
SQLITE_OK or SQLITE_DONE.
</p><p>
</p></dd></dl><h3 id="synonym_support"><span>7.1.1. </span>Synonym Support</h3>
<p>
Custom tokenizers may also support synonyms. Consider a case in which a
user wishes to query for a phrase such as "first place". Using the
built-in tokenizers, the FTS5 query 'first + place' will match instances
of "first place" within the document set, but not alternative forms
such as "1st place". In some applications, it would be better to match
all instances of "first place" or "1st place" regardless of which form
the user specified in the MATCH query text.
</p><p>
There are several ways to approach this in FTS5:
</p><p>
</p><ol><li> By mapping all synonyms to a single token. In this case, using
the above example, this means that the tokenizer returns the
same token for inputs "first" and "1st". Say that token is in
fact "first", so that when the user inserts the document "I won
1st place" entries are added to the index for tokens "i", "won",
"first" and "place". If the user then queries for '1st + place',
the tokenizer substitutes "first" for "1st" and the query works
as expected.
<p>
</p></li><li> By querying the index for all synonyms of each query term
separately. In this case, when tokenizing query text, the
tokenizer may provide multiple synonyms for a single term
within the document. FTS5 then queries the index for each
synonym individually. For example, faced with the query:
<p>
</p><div class="codeblock"><pre>... MATCH 'first place'
</pre></div>
<p>
the tokenizer offers both "1st" and "first" as synonyms for the
first token in the MATCH query and FTS5 effectively runs a query
similar to:
</p><p>
</p><div class="codeblock"><pre>... MATCH '(first OR 1st) place'
</pre></div>
<p>
except that, for the purposes of auxiliary functions, the query
still appears to contain just two phrases - "(first OR 1st)"
being treated as a single phrase.
</p><p>
</p></li><li> By adding multiple synonyms for a single term to the FTS index.
Using this method, when tokenizing document text, the tokenizer
provides multiple synonyms for each token. So that when a
document such as "I won first place" is tokenized, entries are
added to the FTS index for "i", "won", "first", "1st" and
"place".
<p>
This way, even if the tokenizer does not provide synonyms
when tokenizing query text (it should not - to do so would be
inefficient), it doesn't matter if the user queries for
'first + place' or '1st + place', as there are entries in the
FTS index corresponding to both forms of the first token.
</p></li></ol>
<p>
Whether it is parsing document or query text, any call to xToken that
specifies a <i>tflags</i> argument with the FTS5_TOKEN_COLOCATED bit
is considered to supply a synonym for the previous token. For example,
when parsing the document "I won first place", a tokenizer that supports
synonyms would call xToken() 5 times, as follows:
</p><p>
</p><div class="codeblock"><pre>xToken(pCtx, 0, "i", 1, 0, 1);
xToken(pCtx, 0, "won", 3, 2, 5);
xToken(pCtx, 0, "first", 5, 6, 11);
xToken(pCtx, FTS5_TOKEN_COLOCATED, "1st", 3, 6, 11);
xToken(pCtx, 0, "place", 5, 12, 17);
</pre></div>
<p>
It is an error to specify the FTS5_TOKEN_COLOCATED flag the first time
xToken() is called. Multiple synonyms may be specified for a single token
by making multiple calls to xToken(FTS5_TOKEN_COLOCATED) in sequence.
There is no limit to the number of synonyms that may be provided for a
single token.
</p><p>
In many cases, method (1) above is the best approach. It does not add
extra data to the FTS index or require FTS5 to query for multiple terms,
so it is efficient in terms of disk space and query speed. However, it
does not support prefix queries very well. If, as suggested above, the
token "first" is substituted for "1st" by the tokenizer, then the query:
</p><p>
</p><div class="codeblock"><pre>... MATCH '1s*'
</pre></div>
<p>
will not match documents that contain the token "1st" (as the tokenizer
will probably not map "1s" to any prefix of "first").
</p><p>
For full prefix support, method (3) may be preferred. In this case,
because the index contains entries for both "first" and "1st", prefix
queries such as 'fi*' or '1s*' will match correctly. However, because
extra entries are added to the FTS index, this method uses more space
within the database.
</p><p>
Method (2) offers a midpoint between (1) and (3). Using this method,
a query such as '1s*' will match documents that contain the literal
token "1st", but not "first" (assuming the tokenizer is not able to
provide synonyms for prefixes). However, a non-prefix query like '1st'
will match against "1st" and "first". This method does not require
extra disk space, as no extra entries are added to the FTS index.
On the other hand, it may require more CPU cycles to run MATCH queries,
as separate queries of the FTS index are required for each synonym.
</p><p>
When using methods (2) or (3), it is important that the tokenizer only
provide synonyms when tokenizing document text (method (3)) or query
text (method (2)), not both. Doing so will not cause any errors, but is
inefficient.
</p><a name="custom_auxiliary_functions"></a>
<h2 tags="FTS5 custom auxiliary functions" id="custom_auxiliary_functions"><span>7.2. </span>Custom Auxiliary Functions</h2>
<p> Implementing a custom auxiliary function is similar to implementing a
<a href="appfunc.html">scalar SQL function</a>. The implementation
should be a C function of type fts5_extension_function, defined as follows:
</p><div class="codeblock"><pre>typedef struct Fts5ExtensionApi Fts5ExtensionApi;
typedef struct Fts5Context Fts5Context;
typedef struct Fts5PhraseIter Fts5PhraseIter;
typedef void (*fts5_extension_function)(
const Fts5ExtensionApi *pApi, <i>/* API offered by current FTS version */</i>
Fts5Context *pFts, <i>/* First arg to pass to pApi functions */</i>
sqlite3_context *pCtx, <i>/* Context for returning result/error */</i>
int nVal, <i>/* Number of values in apVal[] array */</i>
sqlite3_value **apVal <i>/* Array of trailing arguments */</i>
);
</pre></div>
<p> The implementation is registered with the FTS5 module by calling the
xCreateFunction() method of the fts5_api object. If there is already an
auxiliary function with the same name, it is replaced by the new function.
If a non-NULL xDestroy parameter is passed to xCreateFunction(), it is invoked
with a copy of the pUserData pointer passed as the only argument when the
database handle is closed or when the registered auxiliary function is
replaced.
</p><p> If successful, xCreateFunction() returns SQLITE_OK. Otherwise, it
returns an SQLite error code. In this case the xDestroy function is <b>not</b>
invoked.
</p><p> The final three arguments passed to the auxiliary function callback
(pCtx, nVal and apVal above) are similar to the three arguments passed to the
implementation of a scalar SQL function. The apVal[] array contains all
SQL arguments except the first passed to the auxiliary function. The
implementation should return a result or error via the content handle pCtx.
</p><p> The first argument passed to an auxiliary function callback is a pointer
to a structure (pApi above) containing methods that may be invoked
in order to obtain information regarding the current query or row. The second
argument is an opaque handle (pFts above) that should be passed as the
first argument to any such method invocation. For example, the following
auxiliary function returns the total number of tokens in all columns of the
current row:
</p><div class="codeblock"><pre><i>/*
** Implementation of an auxiliary function that returns the number
** of tokens in the current row (including all columns).
*/</i>
static void column_size_imp(
const Fts5ExtensionApi *pApi,
Fts5Context *pFts,
sqlite3_context *pCtx,
int nVal,
sqlite3_value **apVal
){
int rc;
int nToken;
rc = pApi->xColumnSize(pFts, -1, &nToken);
if( rc==SQLITE_OK ){
sqlite3_result_int(pCtx, nToken);
}else{
sqlite3_result_error_code(pCtx, rc);
}
}
</pre></div>
<p>The following section describes the API offered to auxiliary function
implementations in detail. Further examples may be found in the "fts5_aux.c"
file of the source code.
</p><a name="custom_auxiliary_functions_api_overview"></a>
<h3 tags="custom auxiliary overview" id="custom_auxiliary_functions_api_overview"><span>7.2.1. </span>Custom Auxiliary Functions API Overview</h3>
<p>This section provides an overview of the capabilities of the auxiliary
function API. It does not describe every function. Refer to the <a href="#custom_auxiliary_functions_api_reference">reference text</a> below for a
complete description.
</p><p>When invoked, an auxiliary function implementation has access to APIs that
allow it to query FTS5 for various information. Some of these APIs return
information relating to the current row of the FTS5 table being visited,
some relating to the entire set of rows that will be visited by the FTS5
query, and some relating to the FTS5 table. Given an FTS5 table populated as
follows:
</p><div class="codeblock"><pre>CREATE VIRTUAL TABLE ft USING fts5(a, b);
INSERT INTO ft(rowid, a, b) VALUES
(1, 'ab cd', 'cd de one'),
(2, 'de fg', 'fg gh'),
(3, 'gh ij', 'ij ab three four');
</pre></div>
<p>and the query:
</p><div class="codeblock"><pre>SELECT my_aux_function(ft) FROM ft('ab')
</pre></div>
<p>then the custom auxiliary function will be invoked for rows 1 and 3 (all
rows that contain the token "ab" and therefore match the query).
</p><p><b>Number of rows/columns in table: xRowCount, xColumnCount
</b></p><p>The system may be queried for the total number of rows in the FTS5 table
using the <a href="#xRowCount">xRowCount</a> API. This provides the total number
of rows in the table, not the number that match the current query.
</p><p>Table columns are numbered from left to right starting from 0. The
"rowid" column does not count - only user declared columns - so in the example
above column "a" is column 0 and column "b" is column 1. From within an
auxiliary function implementation, the <a href="#xColumnCount">xColumnCount</a>
API may be used to determine how many columns the table being queried has. If
the xColumnCount() API is invoked from within the implementation of the
auxiliary function my_aux_function in the example above, it returns 2.
</p><p><b>Data From Current Row: xColumnText, xRowid
</b></p><p>The <a href="#xRowid">xRowid</a> API may be used to find the rowid value
for the current row. The <a href="#xColumnText">xColumnText</a> may be used
to obtain the text stored in a specified column of the current row.
</p><p><b>Token Counts: xColumnSize, xColumnTotalSize
</b></p><p>FTS5 divides documents inserted into an fts table into tokens. These are
usually just words, perhaps folded to either upper or lower case and with any
punctuation removed. For example, the default
<a href="#unicode61_tokenizer">unicode61 tokenizer</a> tokenizes the text "The
tokenizer is case-insensitive" to a list of 5 tokens - "the", "tokenizer", is",
"case" and "insensitive". Exactly how tokens are extracted from text is
determined by the <a href="#tokenizers">tokenizer</a>.
</p><p>The auxiliary functions API provides functions to query for both the number
of tokens in a specified column of the current row (the
<a href="#xColumnSize">xColumnSize</a> API), or for the number of tokens in a
specified column of all rows of the table (the <a href="#xColumnTotalSize">xColumnTotalSize</a> API). For the example at the
top of this section, when visiting row 1, xColumnSize returns 2 for column 0
and 3 for column 1. xColumnTotalSize returns 6 for column 0 and 9 for column 1
regardless of the current row.
</p><p><b>The Current Full-Text Query: xPhraseCount, xPhraseSize, xQueryToken
</b></p><p>An FTS5 query contains one or more <a href="#fts5_phrases">phrases</a>. The
<a href="#xPhraseCount">xPhraseCount</a>, <a href="#xPhraseSize">xPhraseSize</a>
and <a href="#xQueryToken">xQueryToken</a> APIs allow an auxiliary function
implementation to query the system for details of the current query. The
xPhraseCount API returns the number of phrases in the current query. For
example, if an FTS5 table is queried as follows:
</p><div class="codeblock"><pre>SELECT my_aux_function(ft) FROM ft('ab AND "cd ef gh" OR ij + kl')
</pre></div>
<p>and the xPhraseCount() API invoked from within the implementation of the
auxiliary function, it returns 3 (the three phrases being "ab", "ce ef gh" and
"ij kl").
</p><p>Phrases are numbered in order of appearance within a query starting from 0.
The xPhraseSize() API may be used to query for the number of tokens in a
specified phrase of the query. In the example above, phrase 0 contains 1 token,
phrase 1 contains 3 tokens, and phrase 2 contains 2.
</p><p>The xQueryToken API may be used to access the text of a specified token
within a specified phrase of the query. Tokens are numbered within their
phrases from left to right starting from 0. For example, if the xQueryToken
API is used to request token 1 of phrase 2 in the example above, it returns
the text "kl". Token 0 of phrase 0 is "ab".
</p><p><b>Phrase Hits in the Current Row: xPhraseFirst, xPhraseNext
</b></p><p>These two API functions may be used to iterate through the matches for
a specified phrase of the query within the current row. Phrase matches are
identified by the column and token offset within the current row. For
example, say the following example table:
</p><div class="codeblock"><pre>CREATE VIRTUAL TABLE ft2 USING fts5(x, y);
INSERT INTO ft2(rowid, x, y) VALUES
(1, 'xxx one two xxx five xxx six', 'seven four'),
(2, 'five four four xxx six', 'three four five six four five six');
</pre></div>
<p>is queried with:
</p><div class="codeblock"><pre>SELECT my_aux_function(ft2) FROM ft2(
'("one two" OR "three") AND y:four NEAR(five six, 2)'
);
</pre></div>
<p>The query above contains 5 phrases - "one two", "three", "four",
"five" and "six". It matches all rows of the table, so the auxiliary
function is invoked for each row.
</p><p>In row 1, for phrase 0, "one two", there is exactly one match to iterate
through - at column 0 token offset 1. The column number is 0 because the
match appears in the left most column. The token offset is 1 because there
is exactly one token ("xxx") before the phrase match in the column value.
For phrase 1, "three", there are no matches. Phrase 2, "four", has one
match, at column 1, token offset 0. Phrase 3, "five", has one match at
column 0, token offset 4, and phrase 4, "six", has one match at column 0
token offset 6.
</p><p>The set of matches for each phrase in each row of the example is presented
in the table below. Each match is notated as (column-number, token-offset):
</p><table striped="1" style="margin:1em auto; width:80%; border-spacing:0">
<tr style="text-align:left"><th>Row</th><th>Phrase 0</th><th>Phrase 1</th><th>Phrase 2</th><th>Phrase 3</th><th>Phrase 4
</th></tr><tr style="text-align:left;background-color:#DDDDDD"><td>1</td><td>(0, 1) </td><td></td><td>(1, 1)</td><td>(0, 4)</td><td>(0, 6)
</td></tr><tr style="text-align:left"><td>2</td><td></td><td>(1,0)</td><td>(1, 1), (1,4)</td><td>(1, 2), (1, 5)</td><td>(1, 3), (1, 6)
</td></tr></table>
<p>The second row is slightly more complicated. There were no occurrences of
phrase 0. Phrase 1 ("three") appears once, at column 1 token offset 0. Although
there are instances of phrase 2 ("four") in column 0, none of them are reported
by the API, as phrase 4 has a <a href="#fts5_column_filters">column filter</a> -
"y:". Matches that are filtered out by column filters do not count. Similarly,
although phrases 3 and 4 do occur in column "x" of row 2, they are filtered
out by the <a href="#fts5_near_queries">NEAR filter</a>. Matches that are
filtered out by NEAR filters do not count either.
</p><p><b>Phrase Hits in the Current Row (2): xInstCount, xInst
</b></p><p>The <a href="#xInstCount">xInstCount</a> and <a href="#xInst">xInst</a> APIs
provide access to the same information as the xPhraseFirst and xPhraseNext
described above. The difference is that instead of iterating through the
matches for a single, specified phrase, the xInstCount/xInst APIs collate
all matches into a single flat array, sorted in order of occurrence within
the current row. Elements of this array may then be accessed randomly.
</p><p>Each array element consists of three values:
</p><ul>
<li> A phrase number,
</li><li> A column number, and
</li><li> A token offset
</li></ul>
<p>Using the same example data and query as for xPhraseFirst/xPhraseNext
above, the array accessible via xInstCount/xInst consists of the following
entries for each row:
</p><table striped="1" style="margin:1em auto; width:80%; border-spacing:0">
<tr style="text-align:left"><th>Row</th><th>xInstCount/xInst array
</th></tr><tr style="text-align:left;background-color:#DDDDDD"><td>1</td><td>(0, 0, 1), (3, 0, 4), (4, 0, 6), (2, 1, 1)
</td></tr><tr style="text-align:left"><td>2</td><td>(1, 1, 0), (2, 1, 1), (3, 1, 2), (4, 1, 3), (2, 1, 4), (3, 1, 5), (4, 1, 6)
</td></tr></table>
<p>Each entry of the array is called a phrase match. Phrase matches are
numbered in order, starting from 0. So, in the example above, in row 2, phrase
match 3 is (4, 1, 3) - phrase 4 of the query matches at column 1, token offset
3.
</p><a name="custom_auxiliary_functions_api_reference"></a>
<h3 tags="custom auxiliary functions" id="custom_auxiliary_functions_api_reference"><span>7.2.2. </span>Custom Auxiliary Functions API Reference</h3>
<div class="codeblock"><pre>struct Fts5ExtensionApi {
int iVersion; <i>/* Currently always set to 3 */</i>
void *(*<a href="#xUserData">xUserData</a>)(Fts5Context*);
int (*<a href="#xColumnCount">xColumnCount</a>)(Fts5Context*);
int (*<a href="#xRowCount">xRowCount</a>)(Fts5Context*, sqlite3_int64 *pnRow);
int (*<a href="#xColumnTotalSize">xColumnTotalSize</a>)(Fts5Context*, int iCol, sqlite3_int64 *pnToken);
int (*<a href="#xTokenize">xTokenize</a>)(Fts5Context*,
const char *pText, int nText, <i>/* Text to tokenize */</i>
void *pCtx, <i>/* Context passed to xToken() */</i>
int (*xToken)(void*, int, const char*, int, int, int) <i>/* Callback */</i>
);
int (*<a href="#xPhraseCount">xPhraseCount</a>)(Fts5Context*);
int (*<a href="#xPhraseSize">xPhraseSize</a>)(Fts5Context*, int iPhrase);
int (*<a href="#xInstCount">xInstCount</a>)(Fts5Context*, int *pnInst);
int (*<a href="#xInst">xInst</a>)(Fts5Context*, int iIdx, int *piPhrase, int *piCol, int *piOff);
sqlite3_int64 (*<a href="#xRowid">xRowid</a>)(Fts5Context*);
int (*<a href="#xColumnText">xColumnText</a>)(Fts5Context*, int iCol, const char **pz, int *pn);
int (*<a href="#xColumnSize">xColumnSize</a>)(Fts5Context*, int iCol, int *pnToken);
int (*<a href="#xQueryPhrase">xQueryPhrase</a>)(Fts5Context*, int iPhrase, void *pUserData,
int(*)(const Fts5ExtensionApi*,Fts5Context*,void*)
);
int (*<a href="#xSetAuxdata">xSetAuxdata</a>)(Fts5Context*, void *pAux, void(*xDelete)(void*));
void *(*<a href="#xGetAuxdata">xGetAuxdata</a>)(Fts5Context*, int bClear);
int (*<a href="#xPhraseFirst">xPhraseFirst</a>)(Fts5Context*, int iPhrase, Fts5PhraseIter*, int*, int*);
void (*<a href="#xPhraseNext">xPhraseNext</a>)(Fts5Context*, Fts5PhraseIter*, int *piCol, int *piOff);
int (*<a href="#xPhraseFirstColumn">xPhraseFirstColumn</a>)(Fts5Context*, int iPhrase, Fts5PhraseIter*, int*);
void (*<a href="#xPhraseNextColumn">xPhraseNextColumn</a>)(Fts5Context*, Fts5PhraseIter*, int *piCol);
<i>/* Below this point are iVersion>=3 only */</i>
int (*<a href="#xQueryToken">xQueryToken</a>)(Fts5Context*,
int iPhrase, int iToken,
const char **ppToken, int *pnToken
);
int (*<a href="#xInstToken">xInstToken</a>)(Fts5Context*, int iIdx, int iToken, const char**, int*);
};
</pre></div>
<dl>
<dt style="white-space:pre;font-family:monospace;font-size:120%" id="xUserData">
<b>void *(*xUserData)(Fts5Context*)</b></dt><dd>
<p style="margin-top:0.1em">
Return a copy of the pUserData pointer passed to the xCreateFunction()
API when the extension function was registered.
</p>
</dd>
<dt style="white-space:pre;font-family:monospace;font-size:120%" id="xColumnTotalSize">
<b>int (*xColumnTotalSize)(Fts5Context*, int iCol, sqlite3_int64 *pnToken)</b></dt><dd>
<p style="margin-top:0.1em">
If parameter iCol is less than zero, set output variable *pnToken
to the total number of tokens in the FTS5 table. Or, if iCol is
non-negative but less than the number of columns in the table, return
the total number of tokens in column iCol, considering all rows in
the FTS5 table.
</p>
<p>
If parameter iCol is greater than or equal to the number of columns
in the table, SQLITE_RANGE is returned. Or, if an error occurs (e.g.
an OOM condition or IO error), an appropriate SQLite error code is
returned.
</p>
</dd>
<dt style="white-space:pre;font-family:monospace;font-size:120%" id="xColumnCount">
<b>int (*xColumnCount)(Fts5Context*)</b></dt><dd>
<p style="margin-top:0.1em">
Return the number of columns in the table.
</p>
</dd>
<dt style="white-space:pre;font-family:monospace;font-size:120%" id="xColumnSize">
<b>int (*xColumnSize)(Fts5Context*, int iCol, int *pnToken)</b></dt><dd>
<p style="margin-top:0.1em">
If parameter iCol is less than zero, set output variable *pnToken
to the total number of tokens in the current row. Or, if iCol is
non-negative but less than the number of columns in the table, set
*pnToken to the number of tokens in column iCol of the current row.
</p>
<p>
If parameter iCol is greater than or equal to the number of columns
in the table, SQLITE_RANGE is returned. Or, if an error occurs (e.g.
an OOM condition or IO error), an appropriate SQLite error code is
returned.
</p>
<p>
This function may be quite inefficient if used with an FTS5 table
created with the "columnsize=0" option.
</p>
</dd>
<dt style="white-space:pre;font-family:monospace;font-size:120%" id="xColumnText">
<b>int (*xColumnText)(Fts5Context*, int iCol, const char **pz, int *pn)</b></dt><dd>
<p style="margin-top:0.1em">
If parameter iCol is less than zero, or greater than or equal to the
number of columns in the table, SQLITE_RANGE is returned.
</p>
<p>
Otherwise, this function attempts to retrieve the text of column iCol of
the current document. If successful, (*pz) is set to point to a buffer
containing the text in utf-8 encoding, (*pn) is set to the size in bytes
(not characters) of the buffer and SQLITE_OK is returned. Otherwise,
if an error occurs, an SQLite error code is returned and the final values
of (*pz) and (*pn) are undefined.
</p>
</dd>
<dt style="white-space:pre;font-family:monospace;font-size:120%" id="xPhraseCount">
<b>int (*xPhraseCount)(Fts5Context*)</b></dt><dd>
<p style="margin-top:0.1em">
Returns the number of phrases in the current query expression.
</p>
</dd>
<dt style="white-space:pre;font-family:monospace;font-size:120%" id="xPhraseSize">
<b>int (*xPhraseSize)(Fts5Context*, int iPhrase)</b></dt><dd>
<p style="margin-top:0.1em">
If parameter iCol is less than zero, or greater than or equal to the
number of phrases in the current query, as returned by xPhraseCount,
0 is returned. Otherwise, this function returns the number of tokens in
phrase iPhrase of the query. Phrases are numbered starting from zero.
</p>
</dd>
<dt style="white-space:pre;font-family:monospace;font-size:120%" id="xInstCount">
<b>int (*xInstCount)(Fts5Context*, int *pnInst)</b></dt><dd>
<p style="margin-top:0.1em">
Set *pnInst to the total number of occurrences of all phrases within
the query within the current row. Return SQLITE_OK if successful, or
an error code (i.e. SQLITE_NOMEM) if an error occurs.
</p>
<p>
This API can be quite slow if used with an FTS5 table created with the
"detail=none" or "detail=column" option. If the FTS5 table is created
with either "detail=none" or "detail=column" and "content=" option
(i.e. if it is a contentless table), then this API always returns 0.
</p>
</dd>
<dt style="white-space:pre;font-family:monospace;font-size:120%" id="xInst">
<b>int (*xInst)(Fts5Context*, int iIdx, int *piPhrase, int *piCol, int *piOff)</b></dt><dd>
<p style="margin-top:0.1em">
Query for the details of phrase match iIdx within the current row.
Phrase matches are numbered starting from zero, so the iIdx argument
should be greater than or equal to zero and smaller than the value
output by xInstCount(). If iIdx is less than zero or greater than
or equal to the value returned by xInstCount(), SQLITE_RANGE is returned.
</p>
<p>
Otherwise, output parameter *piPhrase is set to the phrase number, *piCol
to the column in which it occurs and *piOff the token offset of the
first token of the phrase. SQLITE_OK is returned if successful, or an
error code (i.e. SQLITE_NOMEM) if an error occurs.
</p>
<p>
This API can be quite slow if used with an FTS5 table created with the
"detail=none" or "detail=column" option.
</p>
</dd>
<dt style="white-space:pre;font-family:monospace;font-size:120%" id="xRowid">
<b>sqlite3_int64 (*xRowid)(Fts5Context*)</b></dt><dd>
<p style="margin-top:0.1em">
Returns the rowid of the current row.
</p>
</dd>
<dt style="white-space:pre;font-family:monospace;font-size:120%" id="xTokenize">
<b>int (*xTokenize)(Fts5Context*,
const char *pText, int nText,
void *pCtx,
int (*xToken)(void*, int, const char*, int, int, int)
)</b></dt><dd>
<p style="margin-top:0.1em">
Tokenize text using the tokenizer belonging to the FTS5 table.
</p>
</dd>
<dt style="white-space:pre;font-family:monospace;font-size:120%" id="xQueryPhrase">
<b>int (*xQueryPhrase)(Fts5Context*, int iPhrase, void *pUserData,
int(*)(const Fts5ExtensionApi*,Fts5Context*,void*)
)</b></dt><dd>
<p style="margin-top:0.1em">
This API function is used to query the FTS table for phrase iPhrase
of the current query. Specifically, a query equivalent to:
</p>
<div class="codeblock"><pre>... FROM ftstable WHERE ftstable MATCH $p ORDER BY rowid
</pre></div>
<p>
with $p set to a phrase equivalent to the phrase iPhrase of the
current query is executed. Any column filter that applies to
phrase iPhrase of the current query is included in $p. For each
row visited, the callback function passed as the fourth argument
is invoked. The context and API objects passed to the callback
function may be used to access the properties of each matched row.
Invoking Api.xUserData() returns a copy of the pointer passed as
the third argument to pUserData.
</p>
<p>
If parameter iPhrase is less than zero, or greater than or equal to
the number of phrases in the query, as returned by xPhraseCount(),
this function returns SQLITE_RANGE.
</p>
<p>
If the callback function returns any value other than SQLITE_OK, the
query is abandoned and the xQueryPhrase function returns immediately.
If the returned value is SQLITE_DONE, xQueryPhrase returns SQLITE_OK.
Otherwise, the error code is propagated upwards.
</p>
<p>
If the query runs to completion without incident, SQLITE_OK is returned.
Or, if some error occurs before the query completes or is aborted by
the callback, an SQLite error code is returned.
</p>
</dd>
<dt style="white-space:pre;font-family:monospace;font-size:120%" id="xSetAuxdata">
<b>int (*xSetAuxdata)(Fts5Context*, void *pAux, void(*xDelete)(void*))</b></dt><dd>
<p style="margin-top:0.1em">
Save the pointer passed as the second argument as the extension function's
"auxiliary data". The pointer may then be retrieved by the current or any
future invocation of the same fts5 extension function made as part of
the same MATCH query using the xGetAuxdata() API.
</p>
<p>
Each extension function is allocated a single auxiliary data slot for
each FTS query (MATCH expression). If the extension function is invoked
more than once for a single FTS query, then all invocations share a
single auxiliary data context.
</p>
<p>
If there is already an auxiliary data pointer when this function is
invoked, then it is replaced by the new pointer. If an xDelete callback
was specified along with the original pointer, it is invoked at this
point.
</p>
<p>
The xDelete callback, if one is specified, is also invoked on the
auxiliary data pointer after the FTS5 query has finished.
</p>
<p>
If an error (e.g. an OOM condition) occurs within this function,
the auxiliary data is set to NULL and an error code returned. If the
xDelete parameter was not NULL, it is invoked on the auxiliary data
pointer before returning.
</p>
</dd>
<dt style="white-space:pre;font-family:monospace;font-size:120%" id="xGetAuxdata">
<b>void *(*xGetAuxdata)(Fts5Context*, int bClear)</b></dt><dd>
<p style="margin-top:0.1em">
Returns the current auxiliary data pointer for the fts5 extension
function. See the xSetAuxdata() method for details.
</p>
<p>
If the bClear argument is non-zero, then the auxiliary data is cleared
(set to NULL) before this function returns. In this case the xDelete,
if any, is not invoked.
</p>
</dd>
<dt style="white-space:pre;font-family:monospace;font-size:120%" id="xRowCount">
<b>int (*xRowCount)(Fts5Context*, sqlite3_int64 *pnRow)</b></dt><dd>
<p style="margin-top:0.1em">
This function is used to retrieve the total number of rows in the table.
In other words, the same value that would be returned by:
</p>
<div class="codeblock"><pre>SELECT count(*) FROM ftstable;
</pre></div>
</dd>
<dt style="white-space:pre;font-family:monospace;font-size:120%" id="xPhraseFirst">
<b>int (*xPhraseFirst)(Fts5Context*, int iPhrase, Fts5PhraseIter*, int*, int*)</b></dt><dd>
<p style="margin-top:0.1em">
This function is used, along with type Fts5PhraseIter and the xPhraseNext
method, to iterate through all instances of a single query phrase within
the current row. This is the same information as is accessible via the
xInstCount/xInst APIs. While the xInstCount/xInst APIs are more convenient
to use, this API may be faster under some circumstances. To iterate
through instances of phrase iPhrase, use the following code:
</p>
<div class="codeblock"><pre>Fts5PhraseIter iter;
int iCol, iOff;
for(pApi->xPhraseFirst(pFts, iPhrase, &iter, &iCol, &iOff);
iCol>=0;
pApi->xPhraseNext(pFts, &iter, &iCol, &iOff)
){
// An instance of phrase iPhrase at offset iOff of column iCol
}
</pre></div>
<p>
The Fts5PhraseIter structure is defined above. Applications should not
modify this structure directly - it should only be used as shown above
with the xPhraseFirst() and xPhraseNext() API methods (and by
xPhraseFirstColumn() and xPhraseNextColumn() as illustrated below).
</p>
<p>
This API can be quite slow if used with an FTS5 table created with the
"detail=none" or "detail=column" option. If the FTS5 table is created
with either "detail=none" or "detail=column" and "content=" option
(i.e. if it is a contentless table), then this API always iterates
through an empty set (all calls to xPhraseFirst() set iCol to -1).
</p>
</dd>
<dt style="white-space:pre;font-family:monospace;font-size:120%" id="xPhraseNext">
<b>void (*xPhraseNext)(Fts5Context*, Fts5PhraseIter*, int *piCol, int *piOff)</b></dt><dd>
<p style="margin-top:0.1em">
See xPhraseFirst above.
</p>
</dd>
<dt style="white-space:pre;font-family:monospace;font-size:120%" id="xPhraseFirstColumn">
<b>int (*xPhraseFirstColumn)(Fts5Context*, int iPhrase, Fts5PhraseIter*, int*)</b></dt><dd>
<p style="margin-top:0.1em">
This function and xPhraseNextColumn() are similar to the xPhraseFirst()
and xPhraseNext() APIs described above. The difference is that instead
of iterating through all instances of a phrase in the current row, these
APIs are used to iterate through the set of columns in the current row
that contain one or more instances of a specified phrase. For example:
</p>
<div class="codeblock"><pre>Fts5PhraseIter iter;
int iCol;
for(pApi->xPhraseFirstColumn(pFts, iPhrase, &iter, &iCol);
iCol>=0;
pApi->xPhraseNextColumn(pFts, &iter, &iCol)
){
// Column iCol contains at least one instance of phrase iPhrase
}
</pre></div>
<p>
This API can be quite slow if used with an FTS5 table created with the
"detail=none" option. If the FTS5 table is created with either
"detail=none" "content=" option (i.e. if it is a contentless table),
then this API always iterates through an empty set (all calls to
xPhraseFirstColumn() set iCol to -1).
</p>
<p>
The information accessed using this API and its companion
xPhraseFirstColumn() may also be obtained using xPhraseFirst/xPhraseNext
(or xInst/xInstCount). The chief advantage of this API is that it is
significantly more efficient than those alternatives when used with
"detail=column" tables.
</p>
</dd>
<dt style="white-space:pre;font-family:monospace;font-size:120%" id="xPhraseNextColumn">
<b>void (*xPhraseNextColumn)(Fts5Context*, Fts5PhraseIter*, int *piCol)</b></dt><dd>
<p style="margin-top:0.1em">
See xPhraseFirstColumn above.
</p>
</dd>
<dt style="white-space:pre;font-family:monospace;font-size:120%" id="xQueryToken">
<b>int (*xQueryToken)(Fts5Context*,
int iPhrase, int iToken,
const char **ppToken, int *pnToken
)</b></dt><dd>
<p style="margin-top:0.1em">
This is used to access token iToken of phrase iPhrase of the current
query. Before returning, output parameter *ppToken is set to point
to a buffer containing the requested token, and *pnToken to the
size of this buffer in bytes.
</p>
<p>
If iPhrase or iToken are less than zero, or if iPhrase is greater than
or equal to the number of phrases in the query as reported by
xPhraseCount(), or if iToken is equal to or greater than the number of
tokens in the phrase, SQLITE_RANGE is returned and *ppToken and *pnToken
are both zeroed.
</p>
<p>
The output text is not a copy of the query text that specified the
token. It is the output of the tokenizer module. For tokendata=1
tables, this includes any embedded 0x00 and trailing data.
</p>
</dd>
<dt style="white-space:pre;font-family:monospace;font-size:120%" id="xInstToken">
<b>int (*xInstToken)(Fts5Context*, int iIdx, int iToken, const char**, int*)</b></dt><dd>
<p style="margin-top:0.1em">
This is used to access token iToken of phrase hit iIdx within the
current row. If iIdx is less than zero or greater than or equal to the
value returned by xInstCount(), SQLITE_RANGE is returned. Otherwise,
output variable (*ppToken) is set to point to a buffer containing the
matching document token, and (*pnToken) to the size of that buffer in
bytes. This API is not available if the specified token matches a
prefix query term. In that case both output variables are always set
to 0.
</p>
<p>
The output text is not a copy of the document text that was tokenized.
It is the output of the tokenizer module. For tokendata=1 tables, this
includes any embedded 0x00 and trailing data.
</p>
<p>
This API can be quite slow if used with an FTS5 table created with the
"detail=none" or "detail=column" option.
</p>
</dd>
</dl>
<a name="the_fts5vocab_virtual_table_module"></a>
<h1 tags="fts5vocab" id="the_fts5vocab_virtual_table_module"><span>8. </span>The fts5vocab Virtual Table Module</h1>
<p> The fts5vocab virtual table module allows users to extract information from
an FTS5 full-text index directly. The fts5vocab module is a part of FTS5 - it
is available whenever FTS5 is.
</p><p> Each fts5vocab table is associated with a single FTS5 table. An fts5vocab
table is usually created by specifying two arguments in place of column names
in the CREATE VIRTUAL TABLE statement - the name of the associated FTS5 table
and the type of fts5vocab table. Currently there are three types of fts5vocab
table; "row", "col" and "instance". Unless the fts5vocab table is created
within the "temp" database, it must be part of the same database as the
associated FTS5 table.
</p><div class="codeblock"><pre><i>-- Create an fts5vocab "row" table to query the full-text index belonging
-- to FTS5 table "ft1".</i>
CREATE VIRTUAL TABLE ft1_v USING fts5vocab('ft1', 'row');
<i>-- Create an fts5vocab "col" table to query the full-text index belonging
-- to FTS5 table "ft2".</i>
CREATE VIRTUAL TABLE ft2_v USING fts5vocab(ft2, col);
<i>-- Create an fts5vocab "instance" table to query the full-text index
-- belonging to FTS5 table "ft3".</i>
CREATE VIRTUAL TABLE ft3_v USING fts5vocab(ft3, instance);
</pre></div>
<p> If an fts5vocab table is created in the temp database, it may be associated
with an FTS5 table in any attached database. In order to attach the fts5vocab
table to an FTS5 table located in a database other than "temp", the name of the
database is inserted before the FTS5 table name in the CREATE VIRTUAL TABLE
arguments. For example:
</p><div class="codeblock"><pre><i>-- Create an fts5vocab "row" table to query the full-text index belonging
-- to FTS5 table "ft1" in database "main".</i>
CREATE VIRTUAL TABLE temp.ft1_v USING fts5vocab(main, 'ft1', 'row');
<i>-- Create an fts5vocab "col" table to query the full-text index belonging
-- to FTS5 table "ft2" in attached database "aux".</i>
CREATE VIRTUAL TABLE temp.ft2_v USING fts5vocab('aux', ft2, col);
<i>-- Create an fts5vocab "instance" table to query the full-text index
-- belonging to FTS5 table "ft3" in attached database "other".</i>
CREATE VIRTUAL TABLE temp.ft2_v USING fts5vocab('aux', ft3, 'instance');
</pre></div>
<p> Specifying three arguments when creating an fts5vocab table in any database
other than "temp" results in an error.
</p><p> An fts5vocab table of type "row" contains one row for each distinct term
in the associated FTS5 table. The table columns are as follows:
</p><table striped="1" style="margin:1em auto; width:80%; border-spacing:0">
<tr style="text-align:left"><th>Column</th><th>Contents
</th></tr><tr style="text-align:left;background-color:#DDDDDD"><td>term</td><td> The term, as stored in the FTS5 index.
</td></tr><tr style="text-align:left"><td>doc</td><td> The number of rows that contain at least one instance of the term.
</td></tr><tr style="text-align:left;background-color:#DDDDDD"><td>cnt</td><td> The total number of instances of the term in the entire FTS5 table.
</td></tr></table>
<p> An fts5vocab table of type "col" contains one row for each distinct term/column
combination in the associated FTS5 table. Table columns are as follows:
</p><table striped="1" style="margin:1em auto; width:80%; border-spacing:0">
<tr style="text-align:left"><th>Column</th><th>Contents
</th></tr><tr style="text-align:left;background-color:#DDDDDD"><td>term</td><td> The term, as stored in the FTS5 index.
</td></tr><tr style="text-align:left"><td>col</td><td> The name of the FTS5 table column that contains the term.
</td></tr><tr style="text-align:left;background-color:#DDDDDD"><td>doc</td><td> The number of rows in the FTS5 table for which column $col
contains at least one instance of the term.
</td></tr><tr style="text-align:left"><td>cnt</td><td> The total number of instances of the term that appear in
column $col of the FTS5 table (considering all rows).
</td></tr></table>
<p> An fts5vocab table of type "instance" contains one row for each term
instance stored in the associated FTS index. Assuming the FTS5 table is
created with the 'detail' option set to 'full', table columns are as follows:
</p><table striped="1" style="margin:1em auto; width:80%; border-spacing:0">
<tr style="text-align:left"><th>Column</th><th>Contents
</th></tr><tr style="text-align:left;background-color:#DDDDDD"><td>term</td><td> The term, as stored in the FTS5 index.
</td></tr><tr style="text-align:left"><td>doc</td><td> The rowid of the document that contains the term instance.
</td></tr><tr style="text-align:left;background-color:#DDDDDD"><td>col</td><td> The name of the column that contains the term instance.
</td></tr><tr style="text-align:left"><td>offset</td><td> The index of the term instance within its column. Terms
are numbered in order of occurrence starting from 0.
</td></tr></table>
<p> If the FTS5 table is created with the 'detail' option set to 'col', then
the <i>offset</i> column of an instance virtual table always contains NULL.
In this case there is one row in the table for each unique term/doc/col
combination. Or, if the FTS5 table is created with 'detail' set to 'none',
then both <i>offset</i> and <i>col</i> always contain NULL values. For
detail=none FTS5 tables, there is one row in the fts5vocab table for each
unique term/doc combination.
</p><p>Example:
</p><div class="codeblock"><pre><i>-- Assuming a database created using:</i>
CREATE VIRTUAL TABLE ft1 USING fts5(c1, c2);
INSERT INTO ft1 VALUES('apple banana cherry', 'banana banana cherry');
INSERT INTO ft1 VALUES('cherry cherry cherry', 'date date date');
<i>-- Then querying the following fts5vocab table (type "col") returns:
--
-- apple | c1 | 1 | 1
-- banana | c1 | 1 | 1
-- banana | c2 | 1 | 2
-- cherry | c1 | 2 | 4
-- cherry | c2 | 1 | 1
-- date | c3 | 1 | 3
--</i>
CREATE VIRTUAL TABLE ft1_v_col USING fts5vocab(ft1, col);
<i>-- Querying an fts5vocab table of type "row" returns:
--
-- apple | 1 | 1
-- banana | 1 | 3
-- cherry | 2 | 5
-- date | 1 | 3
--</i>
CREATE VIRTUAL TABLE ft1_v_row USING fts5vocab(ft1, row);
<i>-- And, for type "instance"
INSERT INTO ft1 VALUES('apple banana cherry', 'banana banana cherry');
INSERT INTO ft1 VALUES('cherry cherry cherry', 'date date date');
--
-- apple | 1 | c1 | 0
-- banana | 1 | c1 | 1
-- banana | 1 | c2 | 0
-- banana | 1 | c2 | 1
-- cherry | 1 | c1 | 2
-- cherry | 1 | c2 | 2
-- cherry | 2 | c1 | 0
-- cherry | 2 | c1 | 1
-- cherry | 2 | c1 | 2
-- date | 2 | c2 | 0
-- date | 2 | c2 | 1
-- date | 2 | c2 | 2
--</i>
CREATE VIRTUAL TABLE ft1_v_instance USING fts5vocab(ft1, instance);
</pre></div>
<a name="fts5shadowtables"></a>
<h1 id="fts5_data_structures"><span>9. </span>FTS5 Data Structures</h1>
<p>This section describes at a high-level the way the FTS module stores its
index and content in the database. It is not necessary to read or understand
the material in this section in order to use FTS in an application. However, it
may be useful to application developers attempting to analyze and understand
FTS performance characteristics, or to developers contemplating enhancements to
the existing FTS feature set.
</p><p>
When an FTS5 virtual table is created in a database, between 3 and 5 real
tables are created in the database. These are known as "<a href="vtab.html#xshadowname">shadow tables</a>", and
are used by the virtual table module to store persistent data. They should not
be accessed directly by the user. Many other virtual table modules, including
<a href="fts3.html">FTS3</a> and <a href="rtree.html">rtree</a>, also create and use shadow tables.
</p><p>FTS5 creates the following shadow tables. In each case the actual table name
is based on the name of the FTS5 virtual table (in the following, replace
% with the name of the virtual table to find the actual shadow table name).
</p><div class="codeblock"><pre><i>-- This table contains most of the full-text index data. </i>
CREATE TABLE %_data(id INTEGER PRIMARY KEY, block BLOB);
<i>-- This table contains the remainder of the full-text index data. </i>
<i>-- It is almost always much smaller than the %_data table. </i>
CREATE TABLE %_idx(segid, term, pgno, PRIMARY KEY(segid, term)) WITHOUT ROWID;
<i>-- Contains the values of persistent configuration parameters.</i>
CREATE TABLE %_config(k PRIMARY KEY, v) WITHOUT ROWID;
<i>-- Contains the size of each column of each row in the virtual table</i>
<i>-- in tokens. This shadow table is not present if the "columnsize"</i>
<i>-- option is set to 0.</i>
CREATE TABLE %_docsize(id INTEGER PRIMARY KEY, sz BLOB);
<i>-- Contains the actual data inserted into the FTS5 table. There</i>
<i>-- is one "cN" column for each indexed column in the FTS5 table.</i>
<i>-- This shadow table is not present for contentless or external </i>
<i>-- content FTS5 tables. </i>
CREATE TABLE %_content(id INTEGER PRIMARY KEY, c0, c1...);
</pre></div>
<p>The following sections describe in more detail how these five tables are
used to store FTS5 data.
</p><h2 id="varint_format"><span>9.1. </span>Varint Format</h2>
<p>The sections below refer to 64-bit signed integers stored in "varint" form.
FTS5 uses the same varint format as used in various places by the SQLite core.
</p><p>A varint is between 1 and 9 bytes in length. The varint consists of either
zero or more bytes which have the high-order bit set followed by a single byte
with the high-order bit clear, or nine bytes, whichever is shorter. The lower
seven bits of each of the first eight bytes and all 8 bits of the ninth byte
are used to reconstruct the 64-bit twos-complement integer. Varints are
big-endian: bits taken from the earlier byte of the varint are more significant
than bits taken from the later bytes.
</p><h2 id="the_fts_index_idx_and_data_tables_"><span>9.2. </span>The FTS Index (%_idx and %_data tables)</h2>
<p>The FTS index is an ordered key-value store where the keys are document
terms or term prefixes and the associated values are "doclists". A doclist is a
packed array of varints that encodes the position of each instance of the term
within the FTS5 table. The position of a single term instance is defined as the
combination of:
</p><ul>
<li> The rowid of the FTS5 table row it appears in,
</li><li> The index of the column the term instance appears in (columns are
numbered from left to right starting from zero), and
</li><li> The offset of the term within the column value (i.e. the number of
tokens that appear within the column value before this one).
</li></ul>
<p>The FTS index contains up to (nPrefix+1) entries for each token in the
data set, where nPrefix is the number of defined <a href="#prefix_indexes">
prefix indexes</a>.
</p><p>Keys associated with the main FTS index (the one that is not a prefix
index) are prefixed with the character "0". Keys for the first prefix
index are prefixed with "1". Keys for the second prefix index are
prefixed with "2", and so on. For example, if the token "document" is
inserted into an FTS5 table with <a href="#prefix_indexes">prefix indexes</a>
specified by prefix="2 4", then the keys added to the FTS index would be
"0document", "1do" and "2docu".
</p><p>The FTS index entries are not stored in a single tree or hash table
structure. Instead, they are stored in a series of immutable b-tree like
structures referred to as "segment b-trees". Each time a write to the FTS5
table is committed, one or more (but usually just one) new segment b-trees
are added containing both the new entries and tombstones for any deleted
entries. When the FTS index is queried, the reader queries each segment
b-tree in turn and merges the results, giving priority to newer data.
</p><p>Each segment b-tree is assigned a numerical level. When a new segment
b-tree is written to the database as part of committing a transaction,
it is assigned to level 0. Segment b-trees belonging to a single level are
periodically merged together to create a single, larger segment b-tree
that is assigned to the next level (i.e. level 0 segment b-trees are
merged to become a single level 1 segment b-tree). Thus the numerically
larger levels contain older data in (usually) larger segment b-trees.
Refer to the
<a href="#the_automerge_configuration_option">'automerge'</a>,
<a href="#the_crisismerge_configuration_option">'crisismerge'</a> and
<a href="#the_usermerge_configuration_option">'usermerge'</a> options, along
with the
<a href="#the_merge_command">'merge'</a> and
<a href="#the_optimize_command">'optimize'</a> commands for details on how to
control the merging.
</p><p>In cases where the doclist associated with a term or term prefix is very
large, there may be an associated <a href="#doclist_index_format">doclist
index</a>. A doclist index is similar to the set of internal nodes of a b-tree.
It allows a large doclist to be efficiently queried for rowids or ranges of
rowids. For example, when processing a query like:
</p><div class="codeblock"><pre>SELECT ... FROM fts_table('term') WHERE rowid BETWEEN ? AND ?
</pre></div>
<p>FTS5 uses the segment b-tree index to locate the doclist for term "term",
then uses its doclist index (assuming it is present) to efficiently identify
the subset of matches with rowids in the required range.
</p><h3 id="data_structure"><span>9.2.1. </span>The %_data Table Rowid Space</h3>
<div class="codeblock"><pre>CREATE TABLE %_data(
id INTEGER PRIMARY KEY,
block BLOB
);
</pre></div>
<p>The %_data table is used to store three types of records:
</p><ul>
<li> The special <a href="#structure_record_format">structure record</a>,
stored with id=10.
</li><li> The special <a href="#averages_record_format">averages record</a>,
stored with id=1.
</li><li> A record to store each <a href="#segment_b_tree_format">segment b-tree</a>
leaf and <a href="#doclist_index_format">doclist index</a> leaf and
internal node. See below for how id values are calculated for these
records.
</li></ul>
<p>Each segment b-tree in the system is assigned a unique 16-bit segment id.
Segment ids may only be reused after the original owner segment b-tree is
completely merged into a higher level segment b-tree. Within a segment b-tree,
each leaf page is assigned a unique page number - 1 for the first leaf page, 2
for the second, and so on.
</p><p>Each doclist index leaf page is also assigned a page number. The first
(leftmost) leaf page in a doclist index is assigned the same page number as
the segment b-tree leaf page on which its term appears (because doclist indexes
are only created for terms with very long doclists, at most one term per
segment b-tree leaf has an associated doclist index). Call this page number P.
If the doclist is so large that it requires a second leaf, the second leaf is
assigned page number P+1. The third leaf P+2. Each tier of a doclist index
b-tree (leaves, parents of leaves, grandparents etc.) is assigned page numbers
in this fashion, starting with page number P.
</p><p>The "id" value used in the %_data table to store any given segment b-tree
leaf or doclist index leaf or node is composed as follows:
</p><table striped="1" style="margin:1em auto; width:80%; border-spacing:0">
<tr style="text-align:left"><th>Rowid Bits </th><th>Contents
</th></tr><tr style="text-align:left;background-color:#DDDDDD"><td>38..43 </td><td> (16 bit) Segment b-tree id value.
</td></tr><tr style="text-align:left"><td>37 </td><td> (1 bit) Doclist index flag. Set for doclist index pages, clear
for segment b-tree leaves.
</td></tr><tr style="text-align:left;background-color:#DDDDDD"><td>32..36 </td><td> (5 bits) Height in tree. This is set to 0 for segment b-tree
and doclist index leaves, to 1 for the parents of doclist
index leaves, 2 for the grandparents, etc.
</td></tr><tr style="text-align:left"><td>0..31 </td><td> (32 bits) Page number
</td></tr></table>
<h3 id="structure_record_format"><span>9.2.2. </span>Structure Record Format</h3>
<p>The structure record identifies the set of segment b-trees that make up the
current FTS index, along with details of any ongoing incremental merge
operations. It is stored in the %_data table with id=10.
A structure record begins with a single 32-bit unsigned value - the cookie
value. This value is incremented each time the structure is modified.
Following the cookie value are three varint values, as follows:
</p><ul>
<li> The number of levels in the index (i.e. the maximum level associated
with any segment b-tree plus one).
</li><li> The total number of segment b-trees in the index.
</li><li> The total number of segment b-tree leaves written to level 0 trees
since the FTS5 table was created.
</li></ul>
<p>Then, for each level from 0 to nLevel:
</p><ul>
<li> The number of input segments from the previous level being used as
inputs for the current incremental merge, or zero if there is no
ongoing incremental merge to create a new segment b-tree for this level.
</li><li> The total number of segment b-trees on the level.
</li><li> Then, for each segment b-tree, from oldest to newest:
<ul>
<li> The segment id.
</li><li> Page number of first leaf (often 1, always >0).
</li><li> Page number of last leaf (always >0).
</li></ul>
</li></ul>
<h3 id="averages_record_format"><span>9.2.3. </span>Averages Record Format</h3>
<p>The averages record, which is always stored with id=1 in the %_data table,
does not store the average of anything. Instead, it contains a vector of
(nCol+1) packed varint values, where nCol is the number of columns in the FTS5
table, including unindexed columns. The first varint contains the total
number of rows in the FTS5 table. The second contains the total number of
tokens in all values stored in the leftmost FTS5 table column. The third the
number of tokens in all values for the next leftmost, and so on. The value for
unindexed columns is always zero.
</p><h3 id="segment_b_tree_format"><span>9.2.4. </span>Segment B-Tree Format</h3>
<h4 id="the_key_doclist_format"><span>9.2.4.1. </span>The Key/Doclist Format</h4>
<p>The key/doclist format is a format used to store a series of keys (document
terms or term prefixes prefixed by a single character to indentify the specific
index to which they belong) in sorted order, each with their associated
doclist. The format consists of alternating keys and doclists packed together.
</p><p>The first key is stored as:
</p><ul>
<li> A varint indicating the number of bytes in the key (N), followed by
</li><li> The key data itself (N bytes).
</li></ul>
<p>Each subsequent key is stored as:
</p><ul>
<li> A varint indicating the size of the prefix that the key has in common
with the previous key in bytes,
</li><li> A varint indicating the number of bytes in the key following the
common prefix (N), followed by
</li><li> The key suffix data itself (N bytes).
</li></ul>
<p>For example, if the first two keys in an FTS5 key/doclist record are
"0challenger" and "0chandelier", then the first key is stored as varint 11
followed by the 11 bytes "0challenger", and the second key is stored as varints
4 and 7, followed by the 7 bytes "ndelier".
</p><center><div style="max-width:956px;"><svg xmlns='http://www.w3.org/2000/svg' class="pikchr" viewBox="0 0 956.102 126.72">
<path d="M62,77L91,77L91,48L62,48Z" style="fill:none;stroke-width:2.16;stroke:rgb(0,0,0);" />
<path d="M91,77L149,77L149,48L91,48Z" style="fill:none;stroke-width:2.16;stroke:rgb(0,0,0);" />
<path d="M149,77L437,77L437,48L149,48Z" style="fill:none;stroke-width:2.16;stroke:rgb(0,0,0);" />
<text x="293" y="63" text-anchor="middle" fill="rgb(0,0,0)" dominant-baseline="central">doclist 0</text>
<path d="M437,77L466,77L466,48L437,48Z" style="fill:none;stroke-width:2.16;stroke:rgb(0,0,0);" />
<path d="M466,77L494,77L494,48L466,48Z" style="fill:none;stroke-width:2.16;stroke:rgb(0,0,0);" />
<path d="M494,77L538,77L538,48L494,48Z" style="fill:none;stroke-width:2.16;stroke:rgb(0,0,0);" />
<path d="M538,77L826,77L826,48L538,48Z" style="fill:none;stroke-width:2.16;stroke:rgb(0,0,0);" />
<text x="682" y="63" text-anchor="middle" fill="rgb(0,0,0)" dominant-baseline="central">doclist 1</text>
<text x="892" y="63" text-anchor="middle" fill="rgb(0,0,0)" dominant-baseline="central">key/doclist 2...</text>
<path d="M826,48L883,48" style="fill:none;stroke-width:2.16;stroke:rgb(0,0,0);" />
<path d="M826,77L883,77" style="fill:none;stroke-width:2.16;stroke:rgb(0,0,0);" />
<text x="120" y="114" text-anchor="middle" fill="rgb(0,0,0)" dominant-baseline="central">key 0 data</text>
<polygon points="120,77 124,89 116,89" style="fill:rgb(0,0,0)"/>
<path d="M120,99L120,83" style="fill:none;stroke-width:2.16;stroke:rgb(0,0,0);" />
<text x="77" y="12" text-anchor="middle" fill="rgb(0,0,0)" dominant-baseline="central">key 0 size (varint)</text>
<polygon points="77,48 72,37 81,37" style="fill:rgb(0,0,0)"/>
<path d="M77,27L77,43" style="fill:none;stroke-width:2.16;stroke:rgb(0,0,0);" />
<text x="358" y="114" text-anchor="middle" fill="rgb(0,0,0)" dominant-baseline="central">key 1 prefix size (varint)</text>
<polygon points="451,77 441,84 439,76" style="fill:rgb(0,0,0)"/>
<path d="M358,99L446,79" style="fill:none;stroke-width:2.16;stroke:rgb(0,0,0);" />
<text x="480" y="12" text-anchor="middle" fill="rgb(0,0,0)" dominant-baseline="central">key 1 suffix size (varint)</text>
<polygon points="480,48 476,37 484,37" style="fill:rgb(0,0,0)"/>
<path d="M480,27L480,43" style="fill:none;stroke-width:2.16;stroke:rgb(0,0,0);" />
<text x="585" y="114" text-anchor="middle" fill="rgb(0,0,0)" dominant-baseline="central">key 1 prefix data</text>
<polygon points="516,77 528,77 526,85" style="fill:rgb(0,0,0)"/>
<path d="M585,99L521,79" style="fill:none;stroke-width:2.16;stroke:rgb(0,0,0);" />
</svg>
</div>
<p><i>Figure 1 - Term/Doclist Format
</i></p></center>
<p>Each doclist identifies the rows (by their rowid values) that contain at
least one instance of the term or term prefix and an associated position list,
or "poslist" enumerating the position of each term instance within the row. In
this sense a "position" is defined as a column number and term offset within
the column value.
</p><p>Within a doclist, documents are always stored in order sorted by rowid. The
first rowid in a doclist is stored as is, as a varint. It is immediately
followed by its associated position list. Following this, the difference
between the first rowid and the second, as a varint, followed by the doclist
associated with the second rowid in the doclist. And so on.
</p><p>There is no way to determine the size of a doclist by parsing it. This must
be stored externally. See the <a href="#data_pagination">section below</a> for
details of how this is accomplished in FTS5.
</p><center><div style="max-width:889px;"><svg xmlns='http://www.w3.org/2000/svg' class="pikchr" viewBox="0 0 889.459 79.92">
<path d="M46,77L89,77L89,48L46,48Z" style="fill:none;stroke-width:2.16;stroke:rgb(0,0,0);" />
<path d="M89,77L377,77L377,48L89,48Z" style="fill:none;stroke-width:2.16;stroke:rgb(0,0,0);" />
<text x="233" y="63" text-anchor="middle" fill="rgb(0,0,0)" dominant-baseline="central">position list 0</text>
<path d="M377,77L420,77L420,48L377,48Z" style="fill:none;stroke-width:2.16;stroke:rgb(0,0,0);" />
<path d="M420,77L708,77L708,48L420,48Z" style="fill:none;stroke-width:2.16;stroke:rgb(0,0,0);" />
<text x="564" y="63" text-anchor="middle" fill="rgb(0,0,0)" dominant-baseline="central">position list 1</text>
<path d="M708,77L751,77L751,48L708,48Z" style="fill:none;stroke-width:2.16;stroke:rgb(0,0,0);" />
<text x="822" y="63" text-anchor="middle" fill="rgb(0,0,0)" dominant-baseline="central">position list 2...</text>
<path d="M751,48L809,48" style="fill:none;stroke-width:2.16;stroke:rgb(0,0,0);" />
<path d="M751,77L809,77" style="fill:none;stroke-width:2.16;stroke:rgb(0,0,0);" />
<text x="67" y="12" text-anchor="middle" fill="rgb(0,0,0)" dominant-baseline="central">rowid 0 (varint)</text>
<polygon points="67,48 63,37 72,37" style="fill:rgb(0,0,0)"/>
<path d="M67,27L67,43" style="fill:none;stroke-width:2.16;stroke:rgb(0,0,0);" />
<text x="398" y="12" text-anchor="middle" fill="rgb(0,0,0)" dominant-baseline="central">rowid 1 (delta-encoded varint)</text>
<polygon points="398,48 394,37 403,37" style="fill:rgb(0,0,0)"/>
<path d="M398,27L398,43" style="fill:none;stroke-width:2.16;stroke:rgb(0,0,0);" />
<text x="730" y="12" text-anchor="middle" fill="rgb(0,0,0)" dominant-baseline="central">rowid 3 (delta-encoded varint)</text>
<polygon points="730,48 725,37 734,37" style="fill:rgb(0,0,0)"/>
<path d="M730,27L730,43" style="fill:none;stroke-width:2.16;stroke:rgb(0,0,0);" />
</svg>
</div>
<p><i>Figure 2 - Doclist Format
</i></p></center>
<p>A position list - often shortened to "poslist" - identifies the column
and token offset within the row of each instance of the token in question.
The format of a poslist is:
</p><ul>
<li> Varint set to twice the size of the poslist, not including this field,
plus one if the "delete" flag is set on the entry.
</li><li> A (possibly empty) list of offsets for column 0 (the leftmost column) of
the row. Each offset is stored as a varint. The first varint contains
the value of the first offset, plus 2. The second variant contains the
difference between the second and first offsets, plus 2. etc. For
example, if the offset list is to contain offsets 0, 10, 15 and 16, it
is encoded by packing the following values, encoded as varints, end to
end:
<pre>
2, 12, 7, 3
</pre>
</li><li> For each column other than column 0 that contains one of more instances
of the token:
<ul>
<li> Byte value 0x01.
</li><li> The column number, as a varint.
</li><li> An offset list, in the same format as the offset list for column 0.
</li></ul>
</li></ul>
<center><div style="max-width:785px;"><svg xmlns='http://www.w3.org/2000/svg' class="pikchr" viewBox="0 0 785.606 133.92">
<path d="M77,77L121,77L121,48L77,48Z" style="fill:none;stroke-width:2.16;stroke:rgb(0,0,0);" />
<path d="M121,77L409,77L409,48L121,48Z" style="fill:none;stroke-width:2.16;stroke:rgb(0,0,0);" />
<text x="265" y="63" text-anchor="middle" fill="rgb(0,0,0)" dominant-baseline="central">col 0 offset-list</text>
<path d="M409,77L452,77L452,48L409,48Z" style="fill:none;stroke-width:2.16;stroke:rgb(0,0,0);" />
<text x="430" y="63" text-anchor="middle" fill="rgb(0,0,0)" dominant-baseline="central">0x01</text>
<path d="M452,77L495,77L495,48L452,48Z" style="fill:none;stroke-width:2.16;stroke:rgb(0,0,0);" />
<path d="M495,77L783,77L783,48L495,48Z" style="fill:none;stroke-width:2.16;stroke:rgb(0,0,0);" />
<text x="639" y="63" text-anchor="middle" fill="rgb(0,0,0)" dominant-baseline="central">col i offset-list</text>
<text x="99" y="12" text-anchor="middle" fill="rgb(0,0,0)" dominant-baseline="central">nSize*2 + bDel (varint)</text>
<polygon points="99,48 95,37 103,37" style="fill:rgb(0,0,0)"/>
<path d="M99,27L99,43" style="fill:none;stroke-width:2.16;stroke:rgb(0,0,0);" />
<text x="473" y="12" text-anchor="middle" fill="rgb(0,0,0)" dominant-baseline="central">column number (i)</text>
<polygon points="473,48 469,37 478,37" style="fill:rgb(0,0,0)"/>
<path d="M473,27L473,43" style="fill:none;stroke-width:2.16;stroke:rgb(0,0,0);" />
<path d="M121,92L121,106L783,106L783,92" style="fill:none;stroke-width:2.16;stroke-linejoin:round;stroke:rgb(0,0,0);" />
<text x="452" y="121" text-anchor="middle" fill="rgb(0,0,0)" dominant-baseline="central">nSize bytes</text>
</svg>
</div>
<p><i>Figure 3 - Position List (poslist) With Offsets in Columns 0 and i
</i></p></center>
<h4 id="data_pagination"><span>9.2.4.2. </span>Pagination</h4>
<p>If it is small enough (by default this means smaller than 4000 bytes), the
entire contents of a segment b-tree may be stored in the key/doclist format
described in the previous section as a single blob within the %_data table.
Otherwise, the key/doclist is split into pages (by default, of approximately
4000 bytes each) and stored in a contiguous set of entries in the %_data table
(<a href="#data_structure">see above</a> for details).
</p><p>When a key/doclist is divided into pages, the following modifications are
made to the format:
</p><ul>
<li> A single varint or key data field never spans two pages.
</li><li> The first key on each page is not prefix-compressed. It is stored in
the format described above for the first key of a doclist - its size as
a varint followed by the key data.
</li><li> If there are one or more rowids on a page before the first key, then
the first of them is not delta compressed. It is stored as is, just as
if it were the first rowid of its doclist (which it may or may not be).
</li></ul>
<p>Each page also has fixed-size 4-byte header and a variably-sized footer.
The header is divided into 2 16-bit big-endian integer fields. They
contain:
</p><ul>
<li> The byte offset of the first rowid value on the page, if it occurs
before the first key, or 0 otherwise.
</li><li> The byte offset of the page footer.
</li></ul>
<p>The page footer consists of a series of varints containing the byte offset
of each key that appears on the page. The page footer is zero bytes in size
if there are no keys on the page.
</p><center><div style="max-width:721px;"><svg xmlns='http://www.w3.org/2000/svg' class="pikchr" viewBox="0 0 721.267 79.92">
<path d="M10,30L54,30L54,2L10,2Z" style="fill:none;stroke-width:2.16;stroke:rgb(0,0,0);" />
<text x="32" y="16" text-anchor="middle" fill="rgb(0,0,0)" dominant-baseline="central">hdr</text>
<path d="M54,30L630,30L630,2L54,2Z" style="fill:none;stroke-width:2.16;stroke:rgb(0,0,0);" />
<text x="342" y="16" text-anchor="middle" fill="rgb(0,0,0)" dominant-baseline="central">modified key/doclist data</text>
<path d="M630,30L702,30L702,2L630,2Z" style="fill:none;stroke-width:2.16;stroke:rgb(0,0,0);" />
<text x="666" y="16" text-anchor="middle" fill="rgb(0,0,0)" dominant-baseline="central">footer</text>
<text x="32" y="67" text-anchor="middle" fill="rgb(0,0,0)" dominant-baseline="central">4 bytes</text>
<text x="666" y="67" text-anchor="middle" fill="rgb(0,0,0)" dominant-baseline="central">variable size</text>
<polygon points="32,30 36,42 28,42" style="fill:rgb(0,0,0)"/>
<path d="M32,52L32,36" style="fill:none;stroke-width:2.16;stroke:rgb(0,0,0);" />
<polygon points="666,30 670,42 661,42" style="fill:rgb(0,0,0)"/>
<path d="M666,52L666,36" style="fill:none;stroke-width:2.16;stroke:rgb(0,0,0);" />
</svg>
</div>
<p><i>Figure 4 - Page Format
</i></p></center>
<h4 id="data_term_index"><span>9.2.4.3. </span>Segment Index Format</h4>
<p>The result of formatting the contents of the segment b-tree in the
key/doclist format and then splitting it into pages is something very
similar to the leaves of a b+tree. Instead of creating a format for
the internal nodes of this b+tree and storing them in the %_data table
alongside the leaves, the keys that would have been stored on such nodes are
added to the %_idx table, defined as:
</p><div class="codeblock"><pre>CREATE TABLE %_idx(
segid INTEGER, <i>-- segment id</i>
term TEXT, <i>-- prefix of first key on page</i>
pgno INTEGER, <i>-- (2*pgno + bDoclistIndex)</i>
PRIMARY KEY(segid, term)
);
</pre></div>
<p>For each "leaf" page that contains at least one key, an entry is added
to the %_idx table. Fields are set as follows:
</p><table striped="1" style="margin:1em auto; width:80%; border-spacing:0">
<tr style="text-align:left"><th>Column</th><th>Contents
</th></tr><tr style="text-align:left;background-color:#DDDDDD"><td>segid</td><td> The integer segment id.
</td></tr><tr style="text-align:left"><td>term</td><td> The smallest prefix of the first key on the page that
is larger than all keys on the previous page. For the
first page in a segment, this prefix is zero bytes in
size.
</td></tr><tr style="text-align:left;background-color:#DDDDDD"><td>pgno</td><td> This field encodes both the page number (within the
segment - starting from 1) and the doclist index flag.
The doclist index flag is set if the final key on the
page has an <a href="#doclist_index_format">associated
doclist index</a>. The value of this field is:
<pre>
(pgno*2 + bDoclistIndexFlag)
</pre>
</td></tr></table>
<p>Then, to find the leaf for segment i that may contain term t, instead of
searching through internal nodes, FTS5 runs the query:
</p><div class="codeblock"><pre>SELECT pgno FROM %_idx WHERE segid=$i AND term>=$t ORDER BY term LIMIT 1
</pre></div>
<h4 id="doclist_index_format"><span>9.2.4.4. </span>Doclist Index Format</h4>
<p>The segment index described in the <a href="#data_term_index">previous
section</a> allows a segment b-tree to be efficiently queried by term or,
assuming there is a prefix index of the required size, a term prefix. The data
structure described in this section, doclist indexes, allows FTS5 to
efficiently search for a rowid or range or rowids within the doclist associated
with a single term or term prefix.
</p><p>Not all keys have associated doclists indexes. By default, a doclist index
is only added for a key if its doclist spans more than 4 segment b-tree leaf
pages. Doclist indexes are themselves b-trees, with both leaves and internal
nodes stored as entries in the %_data table, but in practice most doclists are
small enough to fit on a single leaf. FTS5 uses the same rough size for doclist
index node and leaves as it does for segment b-tree leaves (by default 4000
bytes).
</p><p>Doclist index leaves and internal nodes use the same page format. The first
byte is a "flags" byte. This is set to 0x00 for the root page of the doclist
index b-tree, and 0x01 for all other pages. The remainder of the page is a
series of tightly packed varints, as follows:
</p><ul>
<li> page number of leftmost child page, followed by
</li><li> the smallest rowid value on the left most child page, followed by
</li><li> one varint for each subsequent child page, containing the value:
<ul>
<li> 0x00 if there are no rowids on the child page (this can only happen
when the "child" page is actually a segment b-tree leaf), or
</li><li> the difference between the smallest rowid on the child page and
the previous rowid value stored on the doclist index page.
</li></ul>
</li></ul>
<p>For the leftmost doclist index leaf in a doclist index, the leftmost child
page is the first segment b-tree leaf after the one that contains the key
itself.
</p><h2 id="document_sizes_table_docsize_table_"><span>9.3. </span>Document Sizes Table (%_docsize table)</h2>
<div class="codeblock"><pre>CREATE TABLE %_docsize(
id INTEGER PRIMARY KEY, -- id of FTS5 row this record pertains to
sz BLOB -- blob containing nCol packed varints
);
</pre></div>
<p>Many common search result ranking functions require as an input the size
in tokens of the result document (as a search term hit in a short document is
considered more significant than one in a long document). To provide fast
access to this information, for each row in the FTS5 table there exists a
corresponding record (with the same rowid) in the %_docsize shadow table
that contains the size of each column value in the row, in tokens.
</p><p>The column value sizes are stored in a blob containing one packed varint for
each column of the FTS5 table, from left to right. The varint contains, of
course, the total number of tokens in the corresponding column value. Unindexed
columns are included in this vector of varints; for them the value is always
set to zero.
</p><p>This table is used by the <a href="#xColumnSize">xColumnSize</a> API. It can
be omitted altogether by specifying the
<a href="#the_columnsize_option">columnsize=0</a> option. In that case the
xColumnSize API is still available to auxiliary functions, but runs much more
slowly.
</p><h2 id="the_table_contents_content_table_"><span>9.4. </span>The Table Contents (%_content table)</h2>
<div class="codeblock"><pre>CREATE TABLE %_content(id INTEGER PRIMARY KEY, c0, c1...);
</pre></div>
<p>The actual table content - the values inserted into the FTS5 table, is
stored in the %_content table. This table is created with one "c*" column for
each column of the FTS5 table, including any unindexed columns. The values for
the leftmost FTS5 table column are stored in column "c0" of the %_content
table, the values from the next FTS5 table column in column "c1", and so on.
</p><p>This table is omitted completely for
<a href="#external_content_and_contentless_tables">
external content or contentless</a> FTS5 tables.
tables.
</p><h2 id="configuration_options_config_table_"><span>9.5. </span>Configuration Options (%_config table)</h2>
<div class="codeblock"><pre>CREATE TABLE %_config(k PRIMARY KEY, v) WITHOUT ROWID;
</pre></div>
<p>This table stores the values of any persistent configuration options.
Column "k" stores the name of the option (text) and column "v" the value.
Example contents:
</p><div class="codeblock"><pre>sqlite> SELECT * FROM fts_tbl_config;
┌─────────────┬──────┐
│ k │ v │
├─────────────┼──────┤
│ crisismerge │ 8 │
│ pgsz │ 8000 │
│ usermerge │ 4 │
│ version │ 4 │
└─────────────┴──────┘
</pre></div>
<a name="appendix_a"></a>
<h1 id="appendix_a" nonumber="1" tags="comparison with fts4">
Appendix A: Comparison with FTS3/4
</h1>
<p> Also available is the similar but more mature <a href="fts3.html">FTS3/4</a> module.
FTS5 is a new version of FTS4 that includes various fixes and solutions for
problems that could not be fixed in FTS4 without sacrificing backwards
compatibility. Some of these problems are
<a href="fts5.html#_summary_of_technical_differences_">described below</a>.
</p><h2 nonumber="1" id="_application_porting_guide_"> Application Porting Guide </h2>
<p> In order to use FTS5 instead of FTS3 or FTS4, applications usually require
minimal modifications. Most of these fall into three categories - changes
required to the CREATE VIRTUAL TABLE statement used to create the FTS table,
changes required to SELECT queries used to execute queries against the table,
and changes required to applications that use <a href="fts3.html#snippet">FTS auxiliary functions</a>.
</p><h3 nonumber="1" id="_changes_to_create_virtual_table_statements_"> Changes to CREATE VIRTUAL TABLE statements </h3>
<ol>
<li> <p>The module name must be changed from "fts3" or "fts4" to "fts5".
</p></li><li> <p>All type information or constraint specifications must be removed from
column definitions. FTS3/4 ignores everything following the column name in
a column definition, FTS5 attempts to parse it (and will report an error
if it fails to).
</p></li><li> <p>The "matchinfo=fts3" option is not available. The
<a href="fts5.html#the_columnsize_option">"columnsize=0"</a> option is equivalent.
</p></li><li> <p>The notindexed= option is not available. Adding <a href="fts5.html#the_unindexed_column_option">UNINDEXED</a>
to the column definition is equivalent.
</p></li><li> <p>The ICU tokenizer is not available.
</p></li><li> <p>The compress=, uncompress= and languageid= options are not available.
There is as of yet no equivalent for their functionality.
</p></li></ol>
<div class="codeblock"><pre><i> -- FTS3/4 statement </i>
CREATE VIRTUAL TABLE t1 USING fts4(
linkid INTEGER,
header CHAR(20),
text VARCHAR,
notindexed=linkid,
matchinfo=fts3,
tokenizer=unicode61
);
<i> -- FTS5 equivalent (note - the "tokenizer=unicode61" option is not</i>
<i> -- required as this is the default for FTS5 anyway)</i>
CREATE VIRTUAL TABLE t1 USING fts5(
linkid UNINDEXED,
header,
text,
columnsize=0
);
</pre></div>
<h3 nonumber="1" id="_changes_to_select_statements_"> Changes to SELECT statements </h3>
<ol>
<li> <p>The "docid" alias does not exist. Applications must use "rowid"
instead.
</p></li><li> <p>The behaviour of queries when a column-filter is specified both as
part of the FTS query and by using a column as the LHS of a MATCH
operator is slightly different. For a table with columns "a" and "b"
and a query similar to:
</p><div class="codeblock"><pre>... a MATCH 'b: string'
</pre></div>
<p>FTS3/4 searches for matches in column "b". However, FTS5 always
returns zero rows, as results are first filtered for column "b", then
for column "a", leaving no results. In other words, in FTS3/4 the
inner filter overrides the outer, in FTS5 both filters are applied.
</p></li><li> <p>The FTS query syntax (right hand side of the MATCH operator) has
changed in some ways. The FTS5 syntax is quite close to the FTS4
"enhanced syntax". The main difference is that FTS5 is fussier
about unrecognized punctuation characters and similar within query
strings. Most queries that work with FTS3/4 should also work with
FTS5, and those that do not should return parse errors.
</p></li></ol>
<h3 nonumber="1" id="_auxiliary_function_changes_"> Auxiliary Function Changes </h3>
<p> FTS5 has no matchinfo() or offsets() function, and the snippet() function
is not as fully-featured as in FTS3/4. However, since FTS5 does provide
an API allowing applications to create <a href="fts5.html#custom_auxiliary_functions_api_reference">custom auxiliary functions</a>, any
required functionality may be implemented within the application code.
</p><p> The set of built-in auxiliary functions provided by FTS5 may be
improved upon in the future.
</p><h3 nonumber="1" id="_other_issues"> Other Issues</h3>
<ol>
<li><p> The functionality provided by the fts4aux module is now provided
by <a href="fts5.html#the_fts5vocab_virtual_table_module">fts5vocab</a>. The schema of these two tables is slightly different.
</p></li><li><p> The FTS3/4 "merge=X,Y" command has been replaced by the
<a href="fts5.html#the_merge_command">FTS5 merge command</a>.
</p></li><li><p> The FTS3/4 "automerge=X" command has been replaced by the
<a href="fts5.html#the_automerge_configuration_option">FTS5 automerge option</a>.
</p></li></ol>
<a name="_summary_of_technical_differences_"></a>
<h2 nonumber="1" tags="fts5 technical differences" id="_summary_of_technical_differences_">
Summary of Technical Differences
</h2>
<p>FTS5 is similar to FTS3/4 in that the primary task of each is to maintain
an index mapping from each unique token to a list of instances of that token
within a set of documents, where each instance is identified by the document
in which it appears and its position within that document. For example:
</p><div class="codeblock"><pre><i>-- Given the following SQL:</i>
CREATE VIRTUAL TABLE ft USING fts5(a, b);
INSERT INTO ft(rowid, a, b) VALUES(1, 'X Y', 'Y Z');
INSERT INTO ft(rowid, a, b) VALUES(2, 'A Z', 'Y Y');
<i>-- The FTS5 module creates the following mapping on disk:</i>
A --> (2, 0, 0)
X --> (1, 0, 0)
Y --> (1, 0, 1) (1, 1, 0) (2, 1, 0) (2, 1, 1)
Z --> (1, 1, 1) (2, 0, 1)
</pre></div>
<p>In the example above, each triple identifies the location of a token
instance by rowid, column number (columns are numbered sequentially
starting at 0 from left to right) and position within the column value (the
first token in a column value is 0, the second is 1, and so on). Using this
index, FTS5 is able to provide timely answers to queries such as "the set
of all documents that contain the token 'A'", or "the set of all documents
that contain the sequence 'Y Z'". The list of instances associated with a
single token is called an "instance-list".
</p><p>The principle difference between FTS3/4 and FTS5 is that in FTS3/4,
each instance-list is stored as a single large database record, whereas
in FTS5 large instance-lists are divided between multiple database records.
This has the following implications for dealing with large databases that
contain large lists:
</p><ul>
<li> <p>FTS5 is able to load instance-lists into memory incrementally in
order to reduce memory usage and peak allocation size. FTS3/4 very
often loads entire instance-lists into memory.
</p></li><li> <p>When processing queries that feature more than one token, FTS5 is
sometimes able to determine that the query can be answered by
inspecting a subset of a large instance-list. FTS3/4 almost always
has to traverse entire instance-lists.
</p></li><li> If an instance-list grows so large that it exceeds
the <a href="limits.html#max_length">SQLITE_MAX_LENGTH</a> limit, FTS3/4 is unable to handle it. FTS5
does not have this problem.
</li></ul>
<p>For these reasons, many complex queries may use less memory and run faster
using FTS5.
</p><p>Some other ways in which FTS5 differs from FTS3/4 are:
</p><ul>
<li> <p>FTS5 supports "ORDER BY rank" for returning results in order of
decreasing relevancy.
</p></li><li> <p>FTS5 features an API allowing users to create custom auxiliary
functions for advanced ranking and text processing applications. The
special "rank" column may be mapped to a custom auxiliary function
so that adding "ORDER BY rank" to a query works as expected.
</p></li><li> <p>FTS5 recognizes unicode separator characters and case equivalence by
default. This is also possible using FTS3/4, but must be explicitly
enabled.
</p></li><li> <p>The query syntax has been revised where necessary to remove
ambiguities and to make it possible to escape special characters
in query terms.
</p></li><li> <p>By default, FTS3/4 occasionally merges together two or more of the
b-trees that make up its full-text index within an INSERT, UPDATE or
DELETE statement executed by the user. This means that any operation
on an FTS3/4 table may turn out to be surprisingly slow, as FTS3/4
may unpredictably choose to merge together two or more large b-trees
within it. FTS5 uses incremental merging by default, which limits
the amount of processing that may take place within any given
INSERT, UPDATE or DELETE operation.
</p></li></ul>
<p align="center"><small><i>This page last modified on <a href="https://sqlite.org/docsrc/honeypot" id="mtimelink" data-href="https://sqlite.org/docsrc/finfo/pages/fts5.in?m=36e82d9638">2024-07-24 12:26:57</a> UTC </small></i></p>
|