|
--- Relevant portions of RFC2616 ---
OCTET = <any 8-bit sequence of data>
CHAR = <any US-ASCII character (octets 0 - 127)>
UPALPHA = <any US-ASCII uppercase letter "A".."Z">
LOALPHA = <any US-ASCII lowercase letter "a".."z">
ALPHA = UPALPHA | LOALPHA
DIGIT = <any US-ASCII digit "0".."9">
CTL = <any US-ASCII control character (octets 0 - 31) and DEL (127)>
CR = <US-ASCII CR, carriage return (13)>
LF = <US-ASCII LF, linefeed (10)>
SP = <US-ASCII SP, space (32)>
HT = <US-ASCII HT, horizontal-tab (9)>
<"> = <US-ASCII double-quote mark (34)>
CRLF = CR LF
LWS = [CRLF] 1*( SP | HT )
TEXT = <any OCTET except CTLs, but including LWS>
HEX = "A" | "B" | "C" | "D" | "E" | "F"
| "a" | "b" | "c" | "d" | "e" | "f" | DIGIT
separators = "(" | ")" | "<" | ">" | "@"
| "," | ";" | ":" | "\" | <">
| "/" | "[" | "]" | "?" | "="
| "{" | "}" | SP | HT
token = 1*<any CHAR except CTLs or separators>
quoted-pair = "\" CHAR
ctext = <any TEXT excluding "(" and ")">
qdtext = <any TEXT except <">>
quoted-string = ( <"> *(qdtext | quoted-pair ) <"> )
comment = "(" *( ctext | quoted-pair | comment ) ")"
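The basic rules above translate fairly directly into lookup sets. The following is only a sketch in Python; the names CTL, SEPARATORS, is_token_char and is_token are illustrative and do not come from the RFC or from any existing code.

```python
# Character classes taken from the RFC 2616 basic rules quoted above.
# All names here are illustrative assumptions.
CTL = frozenset(range(0, 32)) | {127}
SEPARATORS = frozenset(b'()<>@,;:\\"/[]?={} \t')


def is_token_char(octet: int) -> bool:
    """True if the octet is a CHAR (0-127) that is neither a CTL nor a separator."""
    return octet < 128 and octet not in CTL and octet not in SEPARATORS


def is_token(data: bytes) -> bool:
    """token = 1*<any CHAR except CTLs or separators>"""
    return len(data) > 0 and all(is_token_char(o) for o in data)
```

With this, is_token(b"Content-Length") holds, while b"Bad Name" (contains SP) and b"a:b" (contains a separator) are rejected.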
4 HTTP Message
4.1 Message Types
HTTP messages consist of requests from client to server and responses from
server to client. Request (section 5) and Response (section 6) messages use the
generic message format of RFC 822 [9] for transferring entities (the payload of
the message). Both types of message consist of :
- a start-line
- zero or more header fields (also known as "headers")
- an empty line (i.e., a line with nothing preceding the CRLF) indicating the
end of the header fields
- and possibly a message-body.
HTTP-message = Request | Response
start-line = Request-Line | Status-Line
generic-message = start-line
*(message-header CRLF)
CRLF
[ message-body ]
In the interest of robustness, servers SHOULD ignore any empty line(s) received
where a Request-Line is expected. In other words, if the server is reading the
protocol stream at the beginning of a message and receives a CRLF first, it
should ignore the CRLF.
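As an illustration of the robustness rule above, a parser can skip empty lines before looking for the Request-Line. This is a sketch only: the function name is hypothetical, and accepting a bare LF as an empty line is an assumption borrowed from the tolerance recommended in 19.3.

```python
def skip_leading_empty_lines(buf: bytes) -> int:
    """Return the offset of the first octet that is not part of a leading empty line."""
    pos = 0
    while True:
        if buf.startswith(b"\r\n", pos):
            pos += 2          # regular empty line (CRLF)
        elif buf.startswith(b"\n", pos):
            pos += 1          # assumption: also tolerate a bare LF
        else:
            return pos
```

The returned offset is where parsing of the Request-Line would begin.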
4.2 Message headers
- Each header field consists of a name followed by a colon (":") and the field
value.
- Field names are case-insensitive.
- The field value MAY be preceded by any amount of LWS, though a single SP is
preferred.
- Header fields can be extended over multiple lines by preceding each extra
line with at least one SP or HT.
message-header = field-name ":" [ field-value ]
field-name = token
field-value = *( field-content | LWS )
field-content = <the OCTETs making up the field-value and consisting of
either *TEXT or combinations of token, separators, and
quoted-string>
The field-content does not include any leading or trailing LWS occurring before
the first non-whitespace character of the field-value or after the last
non-whitespace character of the field-value. Such leading or trailing LWS MAY
be removed without changing the semantics of the field value. Any LWS that
occurs between field-content MAY be replaced with a single SP before
interpreting the field value or forwarding the message downstream.
=> header format = 1*(CHAR & !ctl & !sep) ":" *(OCTET & (!ctl | LWS))
=> header-matching regexes apply to field-content, and may use field-value as
a work area (but preferably after the first SP).
(19.3) The line terminator for message-header fields is the sequence CRLF.
However, we recommend that applications, when parsing such headers, recognize
a single LF as a line terminator and ignore the leading CR.
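The header rules above (name, colon, optional LWS, continuation lines, tolerance for a bare LF) can be sketched as follows. This is a minimal illustration, not an actual implementation: parse_header_block is a hypothetical name, field names are not checked against the token rule, and a folded line is joined to the previous value with a single SP, as the RFC text permits.

```python
import re
from typing import List, Tuple

# A line ends with LF, optionally preceded by CR (tolerance from 19.3).
_LINE_END = re.compile(rb"\r?\n")


def parse_header_block(block: bytes) -> List[Tuple[bytes, bytes]]:
    """Parse 'name: value' lines, joining continuation lines that start with SP or HT."""
    headers: List[Tuple[bytes, bytes]] = []
    for line in _LINE_END.split(block):
        if not line:
            break  # the empty line ends the header block
        if line[:1] in (b" ", b"\t"):
            if not headers:
                raise ValueError("continuation line before any header")
            name, value = headers[-1]
            # Replace the folding LWS with a single SP, then trim the result.
            joined = (value + b" " + line.strip(b" \t")).strip(b" \t")
            headers[-1] = (name, joined)
            continue
        name, sep, value = line.partition(b":")
        if not sep or not name:
            raise ValueError("malformed header line: %r" % line)
        # Leading and trailing LWS are not part of field-content.
        headers.append((name, value.strip(b" \t")))
    return headers
```

Since field names are case-insensitive, a caller would typically compare name.lower() rather than the raw name.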
message-body = entity-body
| <entity-body encoded as per Transfer-Encoding>
5 Request
Request = Request-Line
*(( general-header
| request-header
| entity-header ) CRLF)
CRLF
[ message-body ]
5.1 Request line
The elements are separated by SP characters. No CR or LF is allowed except in
the final CRLF sequence.
Request-Line = Method SP Request-URI SP HTTP-Version CRLF
(19.3) Clients SHOULD be tolerant in parsing the Status-Line and servers
tolerant when parsing the Request-Line. In particular, they SHOULD accept any
amount of SP or HT characters between fields, even though only a single SP is
required.
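A tolerant Request-Line parser along the lines of 19.3 might look like the sketch below. The function name and the regular expression are illustrative assumptions; the line is expected without its final CRLF, and the method and URI are not validated beyond being free of SP and HT.

```python
import re
from typing import Tuple

# Method SP Request-URI SP HTTP-Version, accepting runs of SP or HT between fields.
_REQUEST_LINE = re.compile(rb"([^ \t]+)[ \t]+([^ \t]+)[ \t]+(HTTP/\d+\.\d+)")


def parse_request_line(line: bytes) -> Tuple[bytes, bytes, bytes]:
    """Split a request line (final CRLF already removed) into method, URI and version."""
    if b"\r" in line or b"\n" in line:
        raise ValueError("CR or LF inside the request line")
    m = _REQUEST_LINE.fullmatch(line)
    if m is None:
        raise ValueError("malformed request line")
    return m.group(1), m.group(2), m.group(3)
```

Both b"GET /index.html HTTP/1.1" and b"GET \t /  HTTP/1.0" are accepted; a line with an embedded CR or LF is rejected.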
4.5 General headers
Apply to MESSAGE.
general-header = Cache-Control
| Connection
| Date
| Pragma
| Trailer
| Transfer-Encoding
| Upgrade
| Via
| Warning
General-header field names can be extended reliably only in combination with a
change in the protocol version. However, new or experimental header fields may
be given the semantics of general header fields if all parties in the
communication recognize them to be general-header fields. Unrecognized header
fields are treated as entity-header fields.
5.3 Request Header Fields
The request-header fields allow the client to pass additional information about
the request, and about the client itself, to the server. These fields act as
request modifiers, with semantics equivalent to the parameters on a programming
language method invocation.
request-header = Accept
| Accept-Charset
| Accept-Encoding
| Accept-Language
| Authorization
| Expect
| From
| Host
| If-Match
| If-Modified-Since
| If-None-Match
| If-Range
| If-Unmodified-Since
| Max-Forwards
| Proxy-Authorization
| Range
| Referer
| TE
| User-Agent
Request-header field names can be extended reliably only in combination with a
change in the protocol version. However, new or experimental header fields MAY
be given the semantics of request-header fields if all parties in the
communication recognize them to be request-header fields. Unrecognized header
fields are treated as entity-header fields.
7.1 Entity header fields
Entity-header fields define metainformation about the entity-body or, if no
body is present, about the resource identified by the request. Some of this
metainformation is OPTIONAL; some might be REQUIRED by portions of this
specification.
entity-header = Allow
| Content-Encoding
| Content-Language
| Content-Length
| Content-Location
| Content-MD5
| Content-Range
| Content-Type
| Expires
| Last-Modified
| extension-header
extension-header = message-header
The extension-header mechanism allows additional entity-header fields to be
defined without changing the protocol, but these fields cannot be assumed to be
recognizable by the recipient. Unrecognized header fields SHOULD be ignored by
the recipient and MUST be forwarded by transparent proxies.
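The three header families listed above can be captured in a small case-insensitive lookup. This is a sketch; the set names and classify_header are hypothetical. Unrecognized names fall into the entity family, as the quoted text prescribes.

```python
# Header families as listed in the RFC 2616 excerpts above, stored lowercase
# because field names are case-insensitive. All names here are illustrative.
GENERAL_HEADERS = frozenset(h.lower() for h in (
    "Cache-Control", "Connection", "Date", "Pragma", "Trailer",
    "Transfer-Encoding", "Upgrade", "Via", "Warning",
))
REQUEST_HEADERS = frozenset(h.lower() for h in (
    "Accept", "Accept-Charset", "Accept-Encoding", "Accept-Language",
    "Authorization", "Expect", "From", "Host", "If-Match",
    "If-Modified-Since", "If-None-Match", "If-Range", "If-Unmodified-Since",
    "Max-Forwards", "Proxy-Authorization", "Range", "Referer", "TE",
    "User-Agent",
))


def classify_header(name: str) -> str:
    """Return 'general', 'request' or 'entity' for a header field name."""
    key = name.lower()
    if key in GENERAL_HEADERS:
        return "general"
    if key in REQUEST_HEADERS:
        return "request"
    # Known entity headers and unrecognized (extension) headers alike.
    return "entity"
```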
----------------------------------
The format of Request-URI is defined by RFC3986 :
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
hier-part = "//" authority path-abempty
/ path-absolute
/ path-rootless
/ path-empty
URI-reference = URI / relative-ref
absolute-URI = scheme ":" hier-part [ "?" query ]
relative-ref = relative-part [ "?" query ] [ "#" fragment ]
relative-part = "//" authority path-abempty
/ path-absolute
/ path-noscheme
/ path-empty
scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
authority = [ userinfo "@" ] host [ ":" port ]
userinfo = *( unreserved / pct-encoded / sub-delims / ":" )
host = IP-literal / IPv4address / reg-name
port = *DIGIT
IP-literal = "[" ( IPv6address / IPvFuture ) "]"
IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
IPv6address = 6( h16 ":" ) ls32
/ "::" 5( h16 ":" ) ls32
/ [ h16 ] "::" 4( h16 ":" ) ls32
/ [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
/ [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
/ [ *3( h16 ":" ) h16 ] "::" h16 ":" ls32
/ [ *4( h16 ":" ) h16 ] "::" ls32
/ [ *5( h16 ":" ) h16 ] "::" h16
/ [ *6( h16 ":" ) h16 ] "::"
h16 = 1*4HEXDIG
ls32 = ( h16 ":" h16 ) / IPv4address
IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
dec-octet = DIGIT ; 0-9
/ %x31-39 DIGIT ; 10-99
/ "1" 2DIGIT ; 100-199
/ "2" %x30-34 DIGIT ; 200-249
/ "25" %x30-35 ; 250-255
reg-name = *( unreserved / pct-encoded / sub-delims )
path = path-abempty ; begins with "/" or is empty
/ path-absolute ; begins with "/" but not "//"
/ path-noscheme ; begins with a non-colon segment
/ path-rootless ; begins with a segment
/ path-empty ; zero characters
path-abempty = *( "/" segment )
path-absolute = "/" [ segment-nz *( "/" segment ) ]
path-noscheme = segment-nz-nc *( "/" segment )
path-rootless = segment-nz *( "/" segment )
path-empty = 0<pchar>
segment = *pchar
segment-nz = 1*pchar
segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
; non-zero-length segment without any colon ":"
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
query = *( pchar / "/" / "?" )
fragment = *( pchar / "/" / "?" )
pct-encoded = "%" HEXDIG HEXDIG
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
=> so the list of allowed characters in a URI is :
uri-char = unreserved / gen-delims / sub-delims / "%"
= ALPHA / DIGIT / "-" / "." / "_" / "~"
/ ":" / "/" / "?" / "#" / "[" / "]" / "@"
/ "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "=" / "%"
Note that non-ASCII characters are forbidden! Spaces and CTLs are forbidden too.
Unfortunately, some products such as Apache allow such characters :-/
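The uri-char list above can be checked with a simple membership test. The sketch below only verifies the character set, not the URI grammar (for instance it does not check that "%" is followed by two hex digits). URI_CHARS and first_invalid_uri_octet are illustrative names.

```python
import string

# uri-char = unreserved / gen-delims / sub-delims / "%"
URI_CHARS = frozenset(
    (
        string.ascii_letters + string.digits
        + "-._~"            # remaining unreserved characters
        + ":/?#[]@"         # gen-delims
        + "!$&'()*+,;="     # sub-delims
        + "%"               # introduces pct-encoded
    ).encode("ascii")
)


def first_invalid_uri_octet(uri: bytes) -> int:
    """Return the index of the first octet outside uri-char, or -1 if there is none."""
    for i, octet in enumerate(uri):
        if octet not in URI_CHARS:
            return i
    return -1
```

A space, a control character or any octet above 127 is reported at its position, which is the case the note about Apache refers to.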
---- The correct way to do it ----
- one http_session
It is basically any transport session on which we talk HTTP. It may be TCP,
SSL over TCP, etc. It knows a way to talk to the client, either the socket
file descriptor or a direct access to the client-side buffer. It should hold
information about the last accessed server so that we can guarantee that the
same server can be used during a whole session if needed. A first version
without optimal support for HTTP pipelining will have the client buffers tied
to the http_session. This may not be sufficient for full pipelining, but that
will need further study. The link from the buffers to the backend should be
managed by the http transaction (http_txn), provided that they are
serialized. Each http_session has 0 to N http_txn. Each http_txn belongs to
one and only one http_session.
- each http_txn has 1 request message (http_req), and 0 or 1 response message
(http_rtr). Each message belongs to one and only one http_txn. An http_txn
holds information such as the HTTP method, the URI, the HTTP version, the
transfer-encoding, the HTTP status, the authorization, the req and rtr
content-length, the timers, logs, etc. The backend and server which process
the request are also known from the http_txn.
- both request and response messages hold header and parsing information, such
as the parsing state, start of headers, start of message, captures, etc...
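The relationships described above (one session, 0 to N transactions, one request and at most one response per transaction) can be sketched with plain data classes. Only the entity names http_session, http_txn, http_req and http_rtr come from the text; every field name and type below is an assumption chosen for illustration, and only a few of the listed attributes are shown.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class HttpMsg:
    """Header and parsing information for one request or response."""
    parse_state: str = "start"   # hypothetical state label
    msg_start: int = 0           # offset of the start of the message
    hdr_start: int = 0           # offset of the first header
    captures: List[bytes] = field(default_factory=list)


@dataclass
class HttpTxn:
    """One request and at most one response, owned by exactly one session."""
    session: "HttpSession" = field(repr=False, compare=False)
    req: HttpMsg = field(default_factory=HttpMsg)
    rtr: Optional[HttpMsg] = None        # no response message yet
    method: Optional[bytes] = None
    uri: Optional[bytes] = None
    version: Optional[bytes] = None
    status: Optional[int] = None
    backend: Optional[str] = None
    server: Optional[str] = None


@dataclass
class HttpSession:
    """A transport session carrying HTTP, holding 0 to N transactions."""
    client_fd: int
    last_server: Optional[str] = None
    txns: List[HttpTxn] = field(default_factory=list)

    def new_txn(self) -> HttpTxn:
        """Create a transaction bound to this session and register it."""
        txn = HttpTxn(session=self)
        self.txns.append(txn)
        return txn
```

Creating transactions only through new_txn() keeps the "one and only one http_session" invariant in a single place.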
|