summaryrefslogtreecommitdiffstats
path: root/www/invalidutf.html
blob: 0c6c2d3efa67e0ffeff902f9025dc55596463c41 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
<!DOCTYPE html>
<html><head>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<link href="sqlite.css" rel="stylesheet">
<title>Invalid UTF Policy</title>
<!-- path= -->
</head>
<body>
<div class=nosearch>
<a href="index.html">
<img class="logo" src="images/sqlite370_banner.gif" alt="SQLite" border="0">
</a>
<div><!-- IE hack to prevent disappearing logo --></div>
<div class="tagline desktoponly">
Small. Fast. Reliable.<br>Choose any three.
</div>
<div class="menu mainmenu">
<ul>
<li><a href="index.html">Home</a>
<li class='mobileonly'><a href="javascript:void(0)" onclick='toggle_div("submenu")'>Menu</a>
<li class='wideonly'><a href='about.html'>About</a>
<li class='desktoponly'><a href="docs.html">Documentation</a>
<li class='desktoponly'><a href="download.html">Download</a>
<li class='wideonly'><a href='copyright.html'>License</a>
<li class='desktoponly'><a href="support.html">Support</a>
<li class='desktoponly'><a href="prosupport.html">Purchase</a>
<li class='search' id='search_menubutton'>
<a href="javascript:void(0)" onclick='toggle_search()'>Search</a>
</ul>
</div>
<div class="menu submenu" id="submenu">
<ul>
<li><a href='about.html'>About</a>
<li><a href='docs.html'>Documentation</a>
<li><a href='download.html'>Download</a>
<li><a href='support.html'>Support</a>
<li><a href='prosupport.html'>Purchase</a>
</ul>
</div>
<div class="searchmenu" id="searchmenu">
<form method="GET" action="search">
<select name="s" id="searchtype">
<option value="d">Search Documentation</option>
<option value="c">Search Changelog</option>
</select>
<input type="text" name="q" id="searchbox" value="">
<input type="submit" value="Go">
</form>
</div>
</div>
<script>
function toggle_div(nm) {
var w = document.getElementById(nm);
if( w.style.display=="block" ){
w.style.display = "none";
}else{
w.style.display = "block";
}
}
function toggle_search() {
var w = document.getElementById("searchmenu");
if( w.style.display=="block" ){
w.style.display = "none";
} else {
w.style.display = "block";
setTimeout(function(){
document.getElementById("searchbox").focus()
}, 30);
}
}
function div_off(nm){document.getElementById(nm).style.display="none";}
window.onbeforeunload = function(e){div_off("submenu");}
/* Disable the Search feature if we are not operating from CGI, since */
/* Search is accomplished using CGI and will not work without it. */
if( !location.origin || !location.origin.match || !location.origin.match(/http/) ){
document.getElementById("search_menubutton").style.display = "none";
}
/* Used by the Hide/Show button beside syntax diagrams, to toggle the */
function hideorshow(btn,obj){
var x = document.getElementById(obj);
var b = document.getElementById(btn);
if( x.style.display!='none' ){
x.style.display = 'none';
b.innerHTML='show';
}else{
x.style.display = '';
b.innerHTML='hide';
}
return false;
}
var antiRobot = 0;
function antiRobotGo(){
if( antiRobot!=3 ) return;
antiRobot = 7;
var j = document.getElementById("mtimelink");
if(j && j.hasAttribute("data-href")) j.href=j.getAttribute("data-href");
}
function antiRobotDefense(){
document.body.onmousedown=function(){
antiRobot |= 2;
antiRobotGo();
document.body.onmousedown=null;
}
document.body.onmousemove=function(){
antiRobot |= 2;
antiRobotGo();
document.body.onmousemove=null;
}
setTimeout(function(){
antiRobot |= 1;
antiRobotGo();
}, 100)
antiRobotGo();
}
antiRobotDefense();
</script>
<div class=fancy>
<div class=nosearch>
<div class="fancy_title">
Invalid UTF Policy
</div>
</div>





<h1 id="garbage_in_garbage_out"><span>1. </span>Garbage In, Garbage Out</h1>

<p>
With regard to invalid UTF, SQLite follows a policy of
Garbage-In, Garbage-Out (GIGO). If you insert invalid UTF
into an SQLite database, then try to query that data, what you get back out
might not be exactly what you put in.  If you put garbage in, then you
may not complain if you get different garbage back out.

</p><p>
For the purposes of this discussion, "invalid UTF" can mean any of
the following circumstances:

</p><ul>
<li><p>
Invalid surrogate pairs in UTF-16.

</p></li><li><p>
Invalid multi-byte sequences in UTF-8.

</p></li><li><p>
Using more bytes of UTF-8 than necessary to represent a single
code point.  (Example:  encoding 'A' as the two-byte sequence
0xc1, 0x01 instead of just a single 0x41 byte.)

</p></li><li><p>
NUL characters (U+0000) embedded in strings.

</p></li><li><p>
Invalid sequences of combining characters.

</p></li><li><p>
UTF-8 or UTF-16 bytes sequences that encode numbers that are not
defined Unicode characters.

</p></li></ul>

<h2 id="invalid_utf_will_never_cause_memory_errors"><span>1.1. </span>Invalid UTF will never cause memory errors</h2>

<p>
If you insert invalid UTF into an SQLite database, then SQLite makes
no guarantees about what text you might get back out.  But it does
promise that invalid UTF will never cause memory errors
(array overruns, reads or writes of uninitialized memory, etc), at
least for the built-in processing of SQLite.
In other words, invalid UTF will not cause SQLite to crash.

</p><p>
This promise only applies to the core SQLite components, not
application-provided extensions, of course.
If an application adds new application-defined SQL functions or
virtual tables or collating sequences or other extensions, and a
database contains invalid UTF, then invalid UTF might get passed
into those extensions.  If the invalid UTF causes one of those
extensions to crash, then that is a problem with the extension,
not with SQLite.

</p><h1 id="no_enforcement_of_text_formatting_rules"><span>2. </span>No enforcement of text formatting rules</h1>

<p>
SQLite does not try to enforce UTF formatting rules.  You can
insert invalid UTF into a TEXT field and SQLite will not complain
about this.  It stores the invalid TEXT as best it can.  SQLite
sees its role in the world as a storage engine, not a text format
validation engine.

</p><h1 id="best_effort_to_preserve_text"><span>3. </span>Best effort to preserve text</h1>

<p>
SQLite does not promise to always preserve invalid UTF, but it does
make an effort.  Generally speaking, if you insert invalid UTF into
SQLite, you will get the exact same byte sequence back out, as long
as you do not ask SQLite to transform the text in any way.

</p><p>
For example, if you insert some UTF-16LE with invalid surrogates into
a TEXT column of a table of a database that has <a href="pragma.html#pragma_encoding">PRAGMA encoding=UTF16LE</a>,
then later query that column using <a href="c3ref/column_blob.html">sqlite3_column_text16()</a>, you will 
probably get back the same exact invalid UTF-16.  But if you insert the
same invalid UTF-16LE content in a <a href="pragma.html#pragma_encoding">PRAGMA encoding=UTF8</a> database,
the content must be converted into UTF8 when it is stored, which could
cause irreversible changes to the content.  Or if you insert that
same invalid UTF-16LE content into a <a href="pragma.html#pragma_encoding">PRAGMA encoding=UTF16LE</a> database
but then read it out using <a href="c3ref/column_blob.html">sqlite3_column_text()</a>, then a UTF16 to
UTF8 conversion must occur during the read-out and that conversion might
introduce irreversible changes.

</p><p>
Or, suppose you are doing everything using UTF-8 (the most common case).
Invalid UTF-8 will normally pass through the database without any change
in its byte sequence.  However, if you try to transform the invalid
UTF-8 with SQL function like <a href="lang_corefunc.html#substr">substr()</a> or <a href="lang_corefunc.html#replace">replace()</a>
or if you try to do string matching with the <a href="lang_expr.html#like">LIKE</a> operator, then
you might get unexpected results.

</p><p>
So, in other words, SQLite does not actively try to subvert your
invalid text.  But when you ask SQLite to make transformations of invalid
UTF, there are no guarantees that those transformations will be reversible
or even sensible.

</p><h1 id="invalid_utf_in_the_database_schema"><span>4. </span>Invalid UTF in the database schema</h1>

<p>
If a database schema contains names (table names, column names, index
names, and so forth) that are invalid UTF, SQLite will continue to
operate normally.  As far as SQLite is concerned, those names are just
byte sequences.  SQLite does not care whether they are valid UTF or not.

</p><p>
When generating error messages (using, for example, <a href="c3ref/errcode.html">sqlite3_errmsg()</a>),
sometimes SQLite will embedded parts of the database schema into the
error message.  If those embedded schema elements
are invalid UTF, then the resulting error message might also be
invalid UTF.
Similarly, the output from the <a href="pragma.html#pragma_integrity_check">PRAGMA integrity_check</a> and similar
statements will sometimes embed the names of schema elements.  If those
schema element names are invalid UTF, then the output of the command
will also be invalid UTF.
</p><p align="center"><small><i>This page last modified on  <a href="https://sqlite.org/docsrc/honeypot" id="mtimelink"  data-href="https://sqlite.org/docsrc/finfo/pages/invalidutf.in?m=33df79fd55">2023-12-05 14:43:20</a> UTC </small></i></p>