Changes for 1.3.0 'Tundra Peregrine Falcon (Calidus)':
------------------------------------------------------
1.3.0 is a medium release of dav1d, focusing on new APIs and memory usage reduction.
- Reduce memory usage in numerous places
- ABI break in Dav1dSequenceHeader, Dav1dFrameHeader, Dav1dContentLightLevel structures
- New API function to check the API version: dav1d_version_api()
- Rewrite of the SGR functions for ARM64 to be faster
- NEON implementation of save_tmvs for ARM32 and ARM64
- x86 palette DSP for pal_idx_finish function
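
As an illustration of the new version check, here is a minimal sketch of how a caller might unpack the value returned by dav1d_version_api(). It assumes the value is packed as 0x00XXYYZZ (major, minor, patch); the helper names and the compatibility policy below are hypothetical and are not part of dav1d.

```c
#include <stdio.h>

/* Assumption: dav1d_version_api() returns the API version packed as
 * 0x00XXYYZZ, where XX is the major, YY the minor and ZZ the patch version.
 * The helpers below are illustrative names, not dav1d functions. */
static inline unsigned api_major(unsigned v) { return (v >> 16) & 0xFFu; }
static inline unsigned api_minor(unsigned v) { return (v >> 8) & 0xFFu; }
static inline unsigned api_patch(unsigned v) { return v & 0xFFu; }

/* Hypothetical policy: the runtime library is usable if it has the same
 * major version and at least the required minor version. */
static inline int api_compatible(unsigned runtime,
                                 unsigned req_major, unsigned req_minor)
{
    return api_major(runtime) == req_major && api_minor(runtime) >= req_minor;
}

/* Print a packed version value as "major.minor.patch". */
static inline void api_print(unsigned v)
{
    printf("%u.%u.%u\n", api_major(v), api_minor(v), api_patch(v));
}
```

In a real program the packed value would come from dav1d_version_api() at run time, which lets an application detect a mismatch between the headers it was built against and the library it is linked to.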
Changes for 1.2.1 'Arctic Peregrine Falcon':
-------------------------------------------
1.2.1 is a small release of dav1d, adding more SIMD and fixes
- Fix a threading race on task_thread.init_done
- NEON z2 8bpc and high bit-depth optimizations
- SSSE3 z2 high bit-depth optimizations
- Fix a desynced luma/chroma planes issue with Film Grain
- Reduce memory consumption
- Improve dav1d_parse_sequence_header() speed
- OBU: Improve header parsing and fix potential overflows
- OBU: Improve ITU-T T.35 parsing speed
- Misc buildsystems, CI and headers fixes
Changes for 1.2.0 'Arctic Peregrine Falcon':
-------------------------------------------
1.2.0 is a small release of dav1d, adding more SIMD and fixes
- Improvements on attachments of props and T.35 entries on output pictures
- NEON z1/z3 high bit-depth optimizations and improvements for 8bpc
- SSSE3 z2/z3 8bpc and SSSE3 z1/z3 high bit-depth optimizations
- refmvs.save_tmvs optimizations in SSSE3/AVX2/AVX-512
- AVX-512 optimizations for high bit-depth itx (16x64, 32x64, 64x16, 64x32, 64x64)
- AVX2 optimizations for 12bpc for 16x32, 32x16, 32x32 itx
Changes for 1.1.0 'Arctic Peregrine Falcon':
-------------------------------------------
1.1.0 is an important release of dav1d, fixing numerous bugs, and adding SIMD
- New function dav1d_get_frame_delay to query the decoder frame delay
- Numerous fixes for strict conformity to the specs and samples
- NEON and AVX-512 misc fixes and improvements
- Partial AVX2 12bpc transform implementations
- AVX-512 high bit-depth cdef_filter, loopfilter, itx
- NEON z1/z3 optimization for 8bpc
- SSSE3 z1 optimization for 8bpc
"From VideoLAN with love"
Changes for 1.0.0 'Peregrine Falcon':
-------------------------------------
1.0.0 is a major release of dav1d, adding important features and bug fixes.
It notably changes the way threading works by adding automatic thread
management.
It also adds support for AVX-512 acceleration, and adds speedups to existing x86
code (from SSE2 to AVX2).
1.0.0 adds a new grain API to ease acceleration on the GPU, and adds an API call
to report which frame failed to decode in error cases.
Finally, 1.0.0 fixes numerous small bugs that were reported since the beginning
of the project to have a proper release.
.''.
.''. . *''* :_\/_: .
:_\/_: _\(/_ .:.*_\/_* : /\ : .'.:.'.
.''.: /\ : ./)\ ':'* /\ * : '..'. -=:o:=-
:_\/_:'.:::. ' *''* * '.\'/.' _\(/_'.':'.'
: /\ : ::::: *_\/_* -= o =- /)\ ' *
'..' ':::' * /\ * .'/.\'. '
* *..* :
* :
* 1.0.0
Changes for 0.9.2 'Golden Eagle':
---------------------------------
0.9.2 is a small update of dav1d on the 0.9.x branch:
- x86: SSE4 optimizations of inverse transforms for 10bit for all sizes
- x86: mc.resize optimizations with AVX2/SSSE3 for 10/12b
- x86: SSSE3 optimizations for cdef_filter in 10/12b and mc_w_mask_422/444 in 8b
- ARM NEON optimizations for FilmGrain Gen_grain functions
- Optimizations for splat_mv in SSE2/AVX2 and NEON
- x86: SGR improvements for SSSE3 CPUs
- x86: AVX2 optimizations for cfl_ac
Changes for 0.9.1 'Golden Eagle':
---------------------------------
0.9.1 is a middle-size revision of dav1d, adding notably 10b acceleration for SSSE3:
- 10/12b SSSE3 optimizations for mc (avg, w_avg, mask, w_mask, emu_edge),
prep/put_bilin, prep/put_8tap, ipred (dc/h/v, paeth, smooth, pal, filter), wiener,
sgr (10b), warp8x8, deblock, film_grain, cfl_ac/pred for 32bit and 64bit x86 processors
- Film grain NEON for fguv 10/12b, fgy/fguv 8b and fgy/fguv 10/12 arm32
- Fixes for filmgrain on ARM
- itx 10bit optimizations for 4x4/x8/x16, 8x4/x8/x16 for SSE4
- Misc improvements on SSE2, SSE4
Changes for 0.9.0 'Golden Eagle':
---------------------------------
0.9.0 is a major version of dav1d, adding notably 10b acceleration on x64.
Details:
- x86 (64bit) AVX2 implementation of most 10b/12b functions, which should provide
a large boost for high-bitdepth decoding on modern x86 computers and servers.
- ARM64 NEON implementation of FilmGrain (4:2:0/4:2:2/4:4:4 8bit)
- New API to signal events happening during the decoding process
Changes for 0.8.2 'Eurasian Hobby':
-----------------------------------
0.8.2 is a middle-size update of the 0.8.0 branch:
- ARM32 optimizations for ipred and itx in 10/12bits,
completing the 10b/12b work on ARM64 and ARM32
- Give the post-filters their own threads
- ARM64: rewrite the wiener functions
- Speed up coefficient decoding, 0.5%-3% global decoding gain
- x86 optimizations for CDEF_filter and wiener in 10/12bit
- x86: rewrite the SGR AVX2 asm
- x86: improve msac speed on SSE2+ machines
- ARM32: improve speed of ipred and warp
- ARM64: improve speed of ipred, cdef_dir, cdef_filter, warp_motion and itx16
- ARM32/64: improve speed of looprestoration
- Add seeking, pausing to the player
- Update the player for rendering of 10b/12b
- Misc speed improvements and fixes on all platforms
- Add a xxh3 muxer in the dav1d application
Changes for 0.8.1 'Eurasian Hobby':
-----------------------------------
0.8.1 is a minor update on 0.8.0:
- Keep references to buffers valid after dav1d_close(). Fixes a regression
caused by the picture buffer pool added in 0.8.0.
- ARM32 optimizations for 10bit bitdepth for SGR
- ARM32 optimizations for 16bit bitdepth for blend/w_mask/emu_edge
- ARM64 optimizations for 10bit bitdepth for SGR
- x86 optimizations for wiener in SSE2/SSSE3/AVX2
Changes for 0.8.0 'Eurasian Hobby':
-----------------------------------
0.8.0 is a major update for dav1d:
- Improve the performance by using a picture buffer pool;
The improvements can reach 10% in some cases on Windows.
- Support for Apple ARM Silicon
- ARM32 optimizations for 8bit bitdepth for ipred paeth, smooth, cfl
- ARM32 optimizations for 10/12/16bit bitdepth for mc_avg/mask/w_avg,
put/prep 8tap/bilin, wiener and CDEF filters
- ARM64 optimizations for cfl_ac 444 for all bitdepths
- x86 optimizations for MC 8-tap, mc_scaled in AVX2
- x86 optimizations for CDEF in SSE and {put/prep}_{8tap/bilin} in SSSE3
Changes for 0.7.1 'Frigatebird':
------------------------------
0.7.1 is a minor update on 0.7.0:
- ARM32 NEON optimizations for itxfm, which can give up to 28% speedup, and MSAC
- SSE2 optimizations for prep_bilin and prep_8tap
- AVX2 optimizations for MC scaled
- Fix a clamping issue in motion vector projection
- Fix an issue on some specific Haswell CPUs on ipred_z AVX2 functions
- Improvements on the dav1dplay utility player to support resizing
Changes for 0.7.0 'Frigatebird':
------------------------------
0.7.0 is a major release for dav1d:
- Faster refmv implementation, gaining up to 12% in speed while using 25% less RAM (single thread)
- 10b/12b ARM64 optimizations are mostly complete:
  - ipred (paeth, smooth, dc, pal, filter, cfl)
  - itxfm (only 10b)
- AVX2/SSSE3 for non-4:2:0 film grain and for mc.resize
- AVX2 for cfl 4:4:4
- AVX-512 CDEF filter
- ARM64 8b improvements for cfl_ac and itxfm
- ARM64 implementation for emu_edge in 8b/10b/12b
- ARM32 implementation for emu_edge in 8b
- Improvements on the dav1dplay utility player to support 10 bit,
non-4:2:0 pixel formats and film grain on the GPU
Changes for 0.6.0 'Gyrfalcon':
------------------------------
0.6.0 is a major release for dav1d:
- New ARM64 optimizations for the 10/12bit depth:
  - mc_avg, mc_w_avg, mc_mask
  - mc_put/mc_prep 8tap/bilin
  - mc_warp_8x8
  - mc_w_mask
  - mc_blend
  - wiener
  - SGR
  - loopfilter
  - cdef
- New AVX-512 optimizations for prep_bilin, prep_8tap, cdef_filter, mc_avg/w_avg/mask
- New SSSE3 optimizations for film grain
- New AVX2 optimizations for msac_adapt16
- Fix rare mismatches against the reference decoder, notably because of clipping
- Improvements on ARM64 on msac, cdef and looprestoration optimizations
- Improvements on AVX2 optimizations for cdef_filter
- Improvements in the C version for itxfm, cdef_filter
Changes for 0.5.2 'Asiatic Cheetah':
------------------------------------
0.5.2 is a small release improving speed for ARM32 and adding minor features:
- ARM32 optimizations for loopfilter, ipred_dc|h|v
- Add section-5 raw OBU demuxer
- Improve the speed by reducing the L2 cache collisions
- Fix minor issues
Changes for 0.5.1 'Asiatic Cheetah':
------------------------------------
0.5.1 is a small release improving speeds and fixing minor issues
compared to 0.5.0:
- SSE2 optimizations for CDEF, wiener and warp_affine
- NEON optimizations for SGR on ARM32
- Fix mismatch issue in x86 asm in inverse identity transforms
- Fix build issue in ARM64 assembly if debug info was enabled
- Add a workaround for Xcode 11 -fstack-check bug
Changes for 0.5.0 'Asiatic Cheetah':
------------------------------------
0.5.0 is a medium release fixing regressions and minor issues,
and improving speed significantly:
- Export ITU T.35 metadata
- Speed improvements on blend_ on ARM
- Speed improvements on decode_coef and MSAC
- NEON optimizations for blend*, w_mask_, ipred functions for ARM64
- NEON optimizations for CDEF and warp on ARM32
- SSE2 optimizations for MSAC hi_tok decoding
- SSSE3 optimizations for deblocking loopfilters and warp_affine
- AVX2 optimizations for film grain and ipred_z2
- SSE4 optimizations for warp_affine
- VSX optimizations for wiener
- Fix inverse transform overflows in x86 and NEON asm
- Fix integer overflows with large frames
- Improve film grain generation to match reference code
- Improve compatibility with older binutils for ARM
- More advanced Player example in tools
Changes for 0.4.0 'Cheetah':
----------------------------
- Fix playback with unknown OBUs
- Add an option to limit the maximum frame size
- SSE2 and ARM64 optimizations for MSAC
- Improve speed on 32-bit systems
- Optimization in obmc blend
- Reduce RAM usage significantly
- The initial PPC SIMD code, cdef_filter
- NEON optimizations for blend functions on ARM
- NEON optimizations for w_mask functions on ARM
- NEON optimizations for inverse transforms on ARM64
- VSX optimizations for CDEF filter
- Improve handling of malloc failures
- Simple Player example in tools
Changes for 0.3.1 'Sailfish':
------------------------------
- Fix a buffer overflow in frame-threading mode on SSSE3 CPUs
- Reduce binary size, notably on Windows
- SSSE3 optimizations for ipred_filter
- ARM optimizations for MSAC
Changes for 0.3.0 'Sailfish':
------------------------------
This is the final release for the numerous speed improvements of 0.3.0-rc.
It mostly:
- Fixes an annoying crash on SSSE3 that happened in the itx functions
Changes for 0.2.2 (0.3.0-rc) 'Antelope':
-----------------------------
- Large improvement on MSAC decoding with SSE, bringing 4-6% speed increase
The impact is important on SSSE3, SSE4 and AVX2 cpus
- SSSE3 optimizations for all block sizes in itx
- SSSE3 optimizations for ipred_paeth and ipred_cfl (420, 422 and 444)
- Speed improvements on CDEF for SSE4 CPUs
- NEON optimizations for SGR and loop filter
- Minor crash fixes, improvements and build changes
Changes for 0.2.1 'Antelope':
----------------------------
- SSSE3 optimization for cdef_dir
- AVX2 improvements of the existing CDEF optimizations
- NEON improvements of the existing CDEF and wiener optimizations
- Clarification about the numbering/versioning scheme
Changes for 0.2.0 'Antelope':
----------------------------
- ARM64 and ARM optimizations using NEON instructions
- SSSE3 optimizations for both 32 and 64bits
- More AVX2 assembly, reaching almost completion
- Fix installation of includes
- Rewrite inverse transforms to avoid overflows
- Snap packaging for Linux
- Updated API (ABI and API break)
- Fixes for un-decodable samples
Changes for 0.1.0 'Gazelle':
----------------------------
Initial release of dav1d, the fast and small AV1 decoder.
- Support for all features of the AV1 bitstream
- Support for all bit depths: 8, 10 and 12 bits
- Support for all chroma subsamplings 4:2:0, 4:2:2, 4:4:4 *and* grayscale
- Full acceleration for AVX2 64bits processors, making it the fastest decoder
- Partial acceleration for SSSE3 processors
- Partial acceleration for NEON processors