/* SPDX-License-Identifier: MIT */
/*
* Copyright © 2022 Intel Corporation
*/
#ifndef _XE_VM_DOC_H_
#define _XE_VM_DOC_H_
/**
* DOC: XE VM (user address space)
*
* VM creation
* ===========
*
 * Allocate a physical page for the root of the page table structure, create a
 * default bind engine, and return a handle to the user.
*
* Scratch page
* ------------
*
 * If the VM is created with the flag DRM_XE_VM_CREATE_FLAG_SCRATCH_PAGE set,
 * the entire page table structure defaults to pointing at a blank page
 * allocated by the VM. Invalid memory accesses then read from / write to this
 * page rather than faulting.
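 *
 * A minimal userspace sketch of creating such a VM, assuming the structure,
 * flag, and ioctl names from the drm/xe_drm.h uAPI header and that the header
 * is available as <drm/xe_drm.h> (error handling omitted, fd is an open Xe
 * device file descriptor):
 *
 * .. code-block:: c
 *
 *	#include <sys/ioctl.h>
 *	#include <drm/xe_drm.h>
 *
 *	static __u32 create_vm(int fd)
 *	{
 *		struct drm_xe_vm_create create = {
 *			// Unmapped VA reads / writes hit the blank page
 *			.flags = DRM_XE_VM_CREATE_FLAG_SCRATCH_PAGE,
 *		};
 *
 *		ioctl(fd, DRM_IOCTL_XE_VM_CREATE, &create);
 *		return create.vm_id;	// handle used by later binds / execs
 *	}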
*
* VM bind (create GPU mapping for a BO or userptr)
* ================================================
*
 * Creates GPU mappings for a BO or userptr within a VM. VM binds use the same
* in / out fence interface (struct drm_xe_sync) as execs which allows users to
* think of binds and execs as more or less the same operation.
*
* Operations
* ----------
*
* DRM_XE_VM_BIND_OP_MAP - Create mapping for a BO
* DRM_XE_VM_BIND_OP_UNMAP - Destroy mapping for a BO / userptr
* DRM_XE_VM_BIND_OP_MAP_USERPTR - Create mapping for userptr
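 *
 * A hedged userspace sketch of a single map with an out-fence, assuming the
 * structure and flag names from the current drm/xe_drm.h uAPI header (the
 * exact field layout is an assumption and newer uAPI revisions also require a
 * valid pat_index; error handling omitted):
 *
 * .. code-block:: c
 *
 *	#include <stdint.h>
 *	#include <sys/ioctl.h>
 *	#include <drm/xe_drm.h>
 *
 *	// Map GEM handle bo_handle at gpu_va, signaling syncobj_handle when
 *	// the bind completes.
 *	static void map_bo(int fd, __u32 vm_id, __u32 bo_handle, __u64 bo_size,
 *			   __u64 gpu_va, __u32 syncobj_handle)
 *	{
 *		struct drm_xe_sync sync = {
 *			.type = DRM_XE_SYNC_TYPE_SYNCOBJ,
 *			.flags = DRM_XE_SYNC_FLAG_SIGNAL,
 *			.handle = syncobj_handle,
 *		};
 *		struct drm_xe_vm_bind bind = {
 *			.vm_id = vm_id,
 *			.num_binds = 1,
 *			.bind = {
 *				.obj = bo_handle,
 *				.range = bo_size,	// min page size aligned
 *				.addr = gpu_va,		// min page size aligned
 *				.op = DRM_XE_VM_BIND_OP_MAP,
 *			},
 *			.num_syncs = 1,
 *			.syncs = (uintptr_t)&sync,
 *		};
 *
 *		ioctl(fd, DRM_IOCTL_XE_VM_BIND, &bind);
 *	}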
*
* Implementation details
* ~~~~~~~~~~~~~~~~~~~~~~
*
 * All bind operations are implemented via a hybrid approach of using the CPU
 * and GPU to modify page tables. If a new physical page is allocated in the
 * page table structure, we populate that page via the CPU and insert that new
 * page into the existing page table structure via a GPU job. Any existing
 * pages in the page table structure that need to be modified are also updated
 * via the GPU job. As the root physical page is preallocated on VM creation,
 * our GPU job will always have at least one update. The in / out fences are
 * passed to this job, so again this is conceptually the same as an exec.
*
 * A very simple example of a few binds on an empty VM with 48 bits of address
 * space, and the resulting operations:
*
* .. code-block::
*
* bind BO0 0x0-0x1000
* alloc page level 3a, program PTE[0] to BO0 phys address (CPU)
* alloc page level 2, program PDE[0] page level 3a phys address (CPU)
* alloc page level 1, program PDE[0] page level 2 phys address (CPU)
* update root PDE[0] to page level 1 phys address (GPU)
*
* bind BO1 0x201000-0x202000
* alloc page level 3b, program PTE[1] to BO1 phys address (CPU)
* update page level 2 PDE[1] to page level 3b phys address (GPU)
*
* bind BO2 0x1ff000-0x201000
 *	update page level 3a PTE[511] to BO2 phys address (GPU)
 *	update page level 3b PTE[0] to BO2 phys address + 0x1000 (GPU)
*
* GPU bypass
* ~~~~~~~~~~
*
 * In the above example, the steps using the GPU can be converted to CPU
 * updates if the bind can be done immediately (all in-fences are satisfied and
 * the VM dma-resv kernel slot is idle).
*
* Address space
* -------------
*
 * Depending on the platform, either 48 or 57 bits of address space are
 * supported.
*
* Page sizes
* ----------
*
* The minimum page size is either 4k or 64k depending on platform and memory
* placement (sysmem vs. VRAM). We enforce that binds must be aligned to the
* minimum page size.
*
 * Larger pages (2M or 1G) can be used for BOs in VRAM if both the BO physical
 * address and the VA are aligned to the larger page size. Larger pages for
 * userptrs / BOs in sysmem should be possible but are not yet implemented.
*
* Sync error handling mode
* ------------------------
*
 * In both modes, the user input is validated during the bind IOCTL. In sync
 * error handling mode, the newly bound BO is validated (potentially moved back
 * to a region of memory where it can be used), page tables are updated by the
 * CPU, and the job to do the GPU binds is created in the IOCTL itself. This
 * step can fail due to memory pressure. The user can recover by freeing memory
 * and trying this operation again.
*
* Async error handling mode
* -------------------------
*
 * In async error handling, the steps of validating the BO, updating page
 * tables, and generating a job are deferred to an async worker. As these steps
 * can now fail after the IOCTL has reported success, we need an error handling
 * flow from which the user can recover.
 *
 * The solution is for the user to register a user address with the VM which
 * the VM uses to report errors to. The ufence wait interface can be used to
 * wait on a VM going into an error state. Once an error is reported, the VM's
 * async worker is paused. While the VM's async worker is paused, sync
 * DRM_XE_VM_BIND_OP_UNMAP operations are still allowed (these can free
 * memory). Once the user believes the error state is fixed, the async worker
 * can be resumed via the XE_VM_BIND_OP_RESTART operation. When VM async bind
 * work is restarted, the first operation processed is the operation that
 * caused the original error.
*
* Bind queues / engines
* ---------------------
*
 * Think of the case where two bind operations, A and B, are submitted in that
 * order. A has in-fences while B has none. If a single bind queue is used, B
 * is blocked on A's in-fences even though it is ready to run. This example is
 * a real use case for VK sparse binding. We work around this limitation by
 * implementing bind engines.
 *
 * In the bind IOCTL the user can optionally pass in an engine ID which must
 * map to an engine of the special class DRM_XE_ENGINE_CLASS_VM_BIND.
 * Underneath, this is really a virtual engine that can run on any of the copy
 * hardware engines. The job(s) created by each IOCTL are inserted into this
 * engine's ring. In the example above, if A and B use different bind engines,
 * B is free to pass A. If the engine ID field is omitted, the default bind
 * queue for the VM is used.
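 *
 * In the current uAPI a bind engine is exposed as an exec queue of class
 * DRM_XE_ENGINE_CLASS_VM_BIND. A hedged sketch of creating one, assuming the
 * structure and ioctl names from drm/xe_drm.h (error handling omitted):
 *
 * .. code-block:: c
 *
 *	#include <stdint.h>
 *	#include <sys/ioctl.h>
 *	#include <drm/xe_drm.h>
 *
 *	// Create a separate bind queue so binds submitted on it are ordered
 *	// independently of the VM's default bind queue.
 *	static __u32 create_bind_queue(int fd, __u32 vm_id)
 *	{
 *		struct drm_xe_engine_class_instance instance = {
 *			.engine_class = DRM_XE_ENGINE_CLASS_VM_BIND,
 *		};
 *		struct drm_xe_exec_queue_create create = {
 *			.width = 1,
 *			.num_placements = 1,
 *			.vm_id = vm_id,
 *			.instances = (uintptr_t)&instance,
 *		};
 *
 *		ioctl(fd, DRM_IOCTL_XE_EXEC_QUEUE_CREATE, &create);
 *		return create.exec_queue_id;	// pass in later bind IOCTLs
 *	}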
*
* TODO: Explain race in issue 41 and how we solve it
*
* Array of bind operations
* ------------------------
*
 * The uAPI allows multiple bind operations to be passed in via a user array of
 * struct drm_xe_vm_bind_op in a single VM bind IOCTL. This interface matches
 * the VK sparse binding API. The implementation is rather simple: parse the
 * array into a list of operations, pass the in-fences to the first operation,
 * and attach the out-fences to the last operation. The ordered nature of a
 * bind engine makes this possible.
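 *
 * A hedged sketch of such an array bind, again assuming the structure names
 * from drm/xe_drm.h (error handling omitted):
 *
 * .. code-block:: c
 *
 *	#include <stdint.h>
 *	#include <sys/ioctl.h>
 *	#include <drm/xe_drm.h>
 *
 *	// Two binds submitted in one IOCTL on an optional bind queue
 *	// (bind_queue == 0 selects the VM's default bind queue).
 *	static void map_two(int fd, __u32 vm_id, __u32 bind_queue,
 *			    __u32 bo0, __u64 size0, __u64 va0,
 *			    __u32 bo1, __u64 size1, __u64 va1)
 *	{
 *		struct drm_xe_vm_bind_op ops[2] = {
 *			{ .obj = bo0, .range = size0, .addr = va0,
 *			  .op = DRM_XE_VM_BIND_OP_MAP },
 *			{ .obj = bo1, .range = size1, .addr = va1,
 *			  .op = DRM_XE_VM_BIND_OP_MAP },
 *		};
 *		struct drm_xe_vm_bind bind = {
 *			.vm_id = vm_id,
 *			.exec_queue_id = bind_queue,
 *			.num_binds = 2,
 *			.vector_of_binds = (uintptr_t)ops,
 *		};
 *
 *		ioctl(fd, DRM_IOCTL_XE_VM_BIND, &bind);
 *	}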
*
* Munmap semantics for unbinds
* ----------------------------
*
* Munmap allows things like:
*
* .. code-block::
*
* 0x0000-0x2000 and 0x3000-0x5000 have mappings
* Munmap 0x1000-0x4000, results in mappings 0x0000-0x1000 and 0x4000-0x5000
*
 * To support this semantic, we decompose the above example into 4 operations:
*
* .. code-block::
*
* unbind 0x0000-0x2000
* unbind 0x3000-0x5000
* rebind 0x0000-0x1000
* rebind 0x4000-0x5000
*
 * Why not just do a partial unbind of 0x1000-0x2000 and 0x3000-0x4000? This
 * falls apart when large pages are used at the edges and the unbind forces us
 * to use a smaller page size. For simplicity we always issue a set of unbinds
 * unmapping anything in the range and at most 2 rebinds on the edges.
*
* Similar to an array of binds, in fences are passed to the first operation and
* out fences are signaled on the last operation.
*
 * In this example there is a window of time where 0x0000-0x1000 and
 * 0x4000-0x5000 are invalid, but the user didn't ask for these addresses to be
 * removed from the mapping. To work around this we treat any munmap style
 * unbinds which require a rebind as kernel operations (like BO eviction or
 * userptr invalidation). The first operation waits on the VM's
 * DMA_RESV_USAGE_PREEMPT_FENCE slots (waits for all pending jobs on the VM to
 * complete / triggers preempt fences) and the last operation is installed in
 * the VM's DMA_RESV_USAGE_KERNEL slot (blocks future jobs / the resume of a
 * compute mode VM). The caveat is that all dma-resv slots must be updated
 * atomically with respect to execs and the compute mode rebind worker. To
 * accomplish this, hold the vm->lock in write mode from the first operation
 * until the last.
*
* Deferred binds in fault mode
* ----------------------------
*
 * If a VM is in fault mode (TODO: link to fault mode), new bind operations
 * that create mappings are by default deferred to the page fault handler
 * (first use). This behavior can be overridden by setting the
 * DRM_XE_VM_BIND_FLAG_IMMEDIATE flag, which indicates that the mapping should
 * be created immediately.
*
* User pointer
* ============
*
 * User pointers are user allocated memory (malloc'd, mmap'd, etc.) for which
 * the user wants to create a GPU mapping. Typically other DRM drivers create a
 * dummy BO and then create a binding from it. XE bypasses creating a dummy BO
 * and simply creates a binding directly from the userptr.
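 *
 * A hedged sketch of such a bind, assuming the drm/xe_drm.h convention that a
 * userptr bind passes obj = 0 and the CPU address in obj_offset (this layout
 * is an assumption; error handling omitted):
 *
 * .. code-block:: c
 *
 *	#include <stdint.h>
 *	#include <sys/ioctl.h>
 *	#include <drm/xe_drm.h>
 *
 *	// Map size bytes of anonymous memory at ptr to gpu_va, no dummy BO.
 *	static void map_userptr(int fd, __u32 vm_id, void *ptr, __u64 size,
 *				__u64 gpu_va)
 *	{
 *		struct drm_xe_vm_bind bind = {
 *			.vm_id = vm_id,
 *			.num_binds = 1,
 *			.bind = {
 *				.obj = 0,
 *				.obj_offset = (uintptr_t)ptr,
 *				.range = size,
 *				.addr = gpu_va,
 *				.op = DRM_XE_VM_BIND_OP_MAP_USERPTR,
 *			},
 *		};
 *
 *		ioctl(fd, DRM_IOCTL_XE_VM_BIND, &bind);
 *	}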
*
* Invalidation
* ------------
*
 * Since this is core kernel managed memory, the kernel can move this memory
 * whenever it wants. We register an invalidation MMU notifier to alert XE when
 * a user pointer is about to move. The invalidation notifier needs to block
 * until all pending users (jobs or compute mode engines) of the userptr are
 * idle to ensure no faults. This is done by waiting on all of the VM's
 * dma-resv slots.
*
* Rebinds
* -------
*
* Either the next exec (non-compute) or rebind worker (compute mode) will
* rebind the userptr. The invalidation MMU notifier kicks the rebind worker
* after the VM dma-resv wait if the VM is in compute mode.
*
* Compute mode
* ============
*
 * A VM in compute mode enables long running workloads and ultra low latency
 * submission (ULLS). ULLS is implemented via a continuously running batch +
 * semaphores. This enables the user to insert jump-to-new-batch commands into
 * the continuously running batch. In both cases these batches exceed the time
 * a dma fence is allowed to exist for before signaling; as such, dma fences
 * are not used when a VM is in compute mode. User fences (TODO: link user
 * fence doc) are used instead to signal an operation's completion.
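 *
 * A hedged sketch of what a user fence looks like on the uAPI side, assuming
 * the DRM_XE_SYNC_TYPE_USER_FENCE sync type from the current drm/xe_drm.h
 * (the field names are an assumption): the kernel writes the given value to
 * the 8-byte location at addr when the operation completes, and the user can
 * poll that location or wait on it via the user fence wait interface.
 *
 * .. code-block:: c
 *
 *	#include <stdint.h>
 *	#include <drm/xe_drm.h>
 *
 *	// Build a user fence sync entry: *slot is set to value on completion.
 *	static struct drm_xe_sync make_user_fence(__u64 *slot, __u64 value)
 *	{
 *		struct drm_xe_sync sync = {
 *			.type = DRM_XE_SYNC_TYPE_USER_FENCE,
 *			.flags = DRM_XE_SYNC_FLAG_SIGNAL,
 *			.addr = (uintptr_t)slot,	// 8-byte aligned
 *			.timeline_value = value,
 *		};
 *
 *		return sync;
 *	}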
*
* Preempt fences
* --------------
*
 * If the kernel decides to move memory around (either userptr invalidation, BO
 * eviction, or a munmap style unbind which results in a rebind) and a batch is
 * running on an engine, that batch can fault or cause memory corruption as the
 * page tables for the moved memory are no longer valid. To work around this we
 * introduce the concept of preempt fences. When sw signaling is enabled on a
 * preempt fence, it tells the submission backend to kick that engine off the
 * hardware, and the preempt fence signals when the engine is off the hardware.
 * Once all preempt fences are signaled for a VM, the kernel can safely move
 * the memory and kick the rebind worker which resumes all the engines'
 * execution.
 *
 * A preempt fence, for every engine using the VM, is installed in the VM's
 * dma-resv DMA_RESV_USAGE_PREEMPT_FENCE slot. The same preempt fence, for
 * every engine using the VM, is also installed into the same dma-resv slot of
 * every external BO mapped in the VM.
*
* Rebind worker
* -------------
*
 * The rebind worker is very similar to an exec. It is responsible for
 * rebinding evicted BOs or userptrs, waiting on those operations, installing
 * new preempt fences, and finally resuming execution of the engines in the VM.
*
* Flow
* ~~~~
*
* .. code-block::
*
* <----------------------------------------------------------------------|
* Check if VM is closed, if so bail out |
* Lock VM global lock in read mode |
* Pin userptrs (also finds userptr invalidated since last rebind worker) |
* Lock VM dma-resv and external BOs dma-resv |
* Validate BOs that have been evicted |
* Wait on and allocate new preempt fences for every engine using the VM |
* Rebind invalidated userptrs + evicted BOs |
* Wait on last rebind fence |
* Wait VM's DMA_RESV_USAGE_KERNEL dma-resv slot |
 *	Install preempt fences and issue resume for every engine using the VM  |
* Check if any userptrs invalidated since pin |
* Squash resume for all engines |
* Unlock all |
* Wait all VM's dma-resv slots |
* Retry ----------------------------------------------------------
* Release all engines waiting to resume
* Unlock all
*
* Timeslicing
* -----------
*
 * In order to prevent an engine from continuously being kicked off the
 * hardware and making no forward progress, an engine has a period of time it
 * is allowed to run after a resume before it can be kicked off again. This
 * effectively gives each engine a timeslice.
*
* Handling multiple GTs
* =====================
*
 * If a GT has slower access to some regions and the page table structure is in
 * a slow region, performance on that GT could be adversely affected. To work
 * around this we allow a VM's page tables to be shadowed in multiple GTs. When
 * a VM is created, a default bind engine and PT table structure are created on
 * each GT.
 *
 * Binds can optionally pass in a mask of GTs where a mapping should be
 * created; if this mask is zero, we default to all the GTs where the VM has
 * page tables.
 *
 * The implementation for this breaks down into a bunch of for_each_gt loops in
 * various places plus exporting a composite fence for multi-GT binds to the
 * user.
*
* Fault mode (unified shared memory)
* ==================================
*
 * A VM in fault mode can be enabled on devices that support page faults. If
 * page faults are enabled, using dma fences can potentially induce a deadlock:
 * a pending page fault can hold up the GPU work, which holds up the dma fence
 * signaling, and memory allocation is usually required to resolve a page
 * fault, but memory allocation is not allowed to gate dma fence signaling. As
 * such, dma fences are not allowed when a VM is in fault mode. Because dma
 * fences are not allowed, long running workloads and ULLS are enabled on a
 * faulting VM.
*
 * Deferred VM binds
 * -----------------
*
 * By default, binds on a faulting VM just allocate the VMA, and the actual
 * updating of the page tables is deferred to the page fault handler. This
 * behavior can be overridden by setting the DRM_XE_VM_BIND_FLAG_IMMEDIATE flag
 * in the VM bind, which will then do the bind immediately.
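 *
 * Continuing the earlier bind sketches, this is just an extra flag on the
 * bind operation (treat the snippet as illustrative):
 *
 * .. code-block:: c
 *
 *	// Force the mapping to be created at bind time on a faulting VM,
 *	// instead of on first use.
 *	bind.bind.flags |= DRM_XE_VM_BIND_FLAG_IMMEDIATE;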
*
* Page fault handler
* ------------------
*
 * Page faults are received in the G2H worker under the CT lock, which is in
 * the path of dma fences (no memory allocations are allowed there, while
 * faults require memory allocations); thus we cannot process faults under the
 * CT lock. Another issue is that faults issue TLB invalidations, which require
 * G2H credits, and we cannot allocate G2H credits in the G2H handlers without
 * deadlocking. Lastly, we do not want the CT lock to be an outer lock of the
 * VM global lock (the VM global lock is required for fault processing).
*
 * To work around the above issues with processing faults in the G2H worker, we
 * sink faults to a buffer which is large enough to hold all possible faults on
 * the GT (1 per hardware engine) and kick a worker to process the faults.
 * Since the page fault G2H is already received in a worker, kicking another
 * worker adds more latency to a critical performance path. To mitigate this,
 * we add a fast path in the G2H IRQ handler which looks at the first G2H and,
 * if it is a page fault, sinks the fault to the buffer and kicks the worker to
 * process the fault. TLB invalidation responses are also in the critical path,
 * so these can also be processed in this fast path.
 *
 * Multiple buffers and workers are used and hashed over based on the ASID so
 * that faults from different VMs can be processed in parallel.
 *
 * The page fault handler itself is rather simple; the flow is below.
*
* .. code-block::
*
* Lookup VM from ASID in page fault G2H
* Lock VM global lock in read mode
* Lookup VMA from address in page fault G2H
* Check if VMA is valid, if not bail
* Check if VMA's BO has backing store, if not allocate
* <----------------------------------------------------------------------|
* If userptr, pin pages |
* Lock VM & BO dma-resv locks |
* If atomic fault, migrate to VRAM, else validate BO location |
* Issue rebind |
* Wait on rebind to complete |
* Check if userptr invalidated since pin |
* Drop VM & BO dma-resv locks |
* Retry ----------------------------------------------------------
* Unlock all
 *	Issue blocking TLB invalidation
* Send page fault response to GuC
*
* Access counters
* ---------------
*
 * Access counters can be configured to trigger a G2H indicating that the
 * device is accessing VMAs in system memory frequently, as a hint to migrate
 * those VMAs to VRAM.
 *
 * As with the page fault handler, access counter G2H messages cannot be
 * processed in the G2H worker under the CT lock. Again we use a buffer to sink
 * access counter G2H messages. Unlike page faults, there is no upper bound on
 * access counter notifications, so if the buffer is full we simply drop the
 * G2H. Access counters are a best-case optimization and it is safe to drop
 * these, unlike page faults.
 *
 * The access counter handler itself is rather simple; the flow is below.
*
* .. code-block::
*
* Lookup VM from ASID in access counter G2H
* Lock VM global lock in read mode
* Lookup VMA from address in access counter G2H
* If userptr, bail nothing to do
* Lock VM & BO dma-resv locks
* Issue migration to VRAM
* Unlock all
*
 * Notice that no rebind is issued in the access counter handler, as the rebind
 * will be issued on the next page fault.
*
 * Caveats with eviction / user pointer invalidation
 * --------------------------------------------------
*
 * In the case of eviction and user pointer invalidation on a faulting VM,
 * there is no need to issue a rebind; rather, we just need to blow away the
 * page tables for the VMAs and the page fault handler will rebind the VMAs
 * when they fault. The caveat is that the VM global lock is needed to update /
 * read the page table structure. In both the eviction and the user pointer
 * invalidation case, locks are held which make acquiring the VM global lock
 * impossible. To work around this, every VMA maintains a list of leaf page
 * table entries which should be written to zero to blow away the VMA's page
 * tables. After writing zero to these entries, a blocking TLB invalidation is
 * issued. At this point it is safe for the kernel to move the VMA's memory
 * around. This is a necessarily lockless algorithm and is safe as leaves
 * cannot be changed while either an eviction or a userptr invalidation is
 * occurring.
*
* Locking
* =======
*
* VM locking protects all of the core data paths (bind operations, execs,
* evictions, and compute mode rebind worker) in XE.
*
* Locks
* -----
*
 * VM global lock (vm->lock) - rw semaphore lock. Outermost lock which protects
 * the list of userptrs mapped in the VM, the list of engines using this VM,
 * and the array of external BOs mapped in the VM. Adding or removing any of
 * the aforementioned state from the VM requires acquiring this lock in write
 * mode. The VM bind path also acquires this lock in write mode, while execs
 * and the compute mode rebind worker acquire it in read mode.
*
 * VM dma-resv lock (vm->ttm.base.resv->lock) - WW lock. Protects the VM
 * dma-resv slots, which are shared with any private BO in the VM. Expected to
 * be acquired during VM binds, execs, and the compute mode rebind worker. This
 * lock is also held when private BOs are being evicted.
*
 * External BO dma-resv lock (bo->ttm.base.resv->lock) - WW lock. Protects
 * external BO dma-resv slots. Expected to be acquired during VM binds (in
 * addition to the VM dma-resv lock). All external BO dma-resv locks within a
 * VM are expected to be acquired (in addition to the VM dma-resv lock) during
 * execs and the compute mode rebind worker. This lock is also held when an
 * external BO is being evicted.
*
* Putting it all together
* -----------------------
*
* 1. An exec and bind operation with the same VM can't be executing at the same
* time (vm->lock).
*
* 2. A compute mode rebind worker and bind operation with the same VM can't be
* executing at the same time (vm->lock).
*
* 3. We can't add / remove userptrs or external BOs to a VM while an exec with
* the same VM is executing (vm->lock).
*
* 4. We can't add / remove userptrs, external BOs, or engines to a VM while a
* compute mode rebind worker with the same VM is executing (vm->lock).
*
 * 5. Evictions within a VM can't happen while an exec with the same VM is
 * executing (dma-resv locks).
 *
 * 6. Evictions within a VM can't happen while a compute mode rebind worker
 * with the same VM is executing (dma-resv locks).
*
* dma-resv usage
* ==============
*
 * As previously stated, to enforce the ordering of kernel ops (eviction,
 * userptr invalidation, munmap style unbinds which result in a rebind),
 * rebinds during execs, execs, and resumes in the rebind worker, we use both
 * the VM's and the external BOs' dma-resv slots. Let's try to make this as
 * clear as possible.
*
* Slot installation
* -----------------
*
 * 1. Jobs from kernel ops install themselves into the DMA_RESV_USAGE_KERNEL
 * slot of either an external BO or the VM (depending on whether the kernel op
 * is operating on an external or private BO)
 *
 * 2. In non-compute mode, jobs from execs install themselves into the
 * DMA_RESV_USAGE_BOOKKEEP slot of the VM
 *
 * 3. In non-compute mode, jobs from execs install themselves into the
 * DMA_RESV_USAGE_WRITE slot of all external BOs in the VM
 *
 * 4. Jobs from binds install themselves into the DMA_RESV_USAGE_BOOKKEEP slot
 * of the VM
 *
 * 5. Jobs from binds install themselves into the DMA_RESV_USAGE_BOOKKEEP slot
 * of the external BO (if the bind is to an external BO, this is in addition to
 * #4)
 *
 * 6. Every engine using a compute mode VM has a preempt fence installed into
 * the DMA_RESV_USAGE_PREEMPT_FENCE slot of the VM
 *
 * 7. Every engine using a compute mode VM has a preempt fence installed into
 * the DMA_RESV_USAGE_PREEMPT_FENCE slot of all the external BOs in the VM
*
* Slot waiting
* ------------
*
 * 1. The execution of all jobs from kernel ops shall wait on all slots
 * (DMA_RESV_USAGE_PREEMPT_FENCE) of either an external BO or the VM (depending
 * on whether the kernel op is operating on an external or private BO)
 *
 * 2. In non-compute mode, the execution of all jobs from rebinds in execs
 * shall wait on the DMA_RESV_USAGE_KERNEL slot of either an external BO or the
 * VM (depending on whether the rebind is operating on an external or private
 * BO)
 *
 * 3. In non-compute mode, the execution of all jobs from execs shall wait on
 * the last rebind job
 *
 * 4. In compute mode, the execution of all jobs from rebinds in the rebind
 * worker shall wait on the DMA_RESV_USAGE_KERNEL slot of either an external BO
 * or the VM (depending on whether the rebind is operating on an external or
 * private BO)
 *
 * 5. In compute mode, resumes in the rebind worker shall wait on the last
 * rebind fence
 *
 * 6. In compute mode, resumes in the rebind worker shall wait on the
 * DMA_RESV_USAGE_KERNEL slot of the VM
*
* Putting it all together
* -----------------------
*
* 1. New jobs from kernel ops are blocked behind any existing jobs from
* non-compute mode execs
*
* 2. New jobs from non-compute mode execs are blocked behind any existing jobs
* from kernel ops and rebinds
*
* 3. New jobs from kernel ops are blocked behind all preempt fences signaling in
* compute mode
*
* 4. Compute mode engine resumes are blocked behind any existing jobs from
* kernel ops and rebinds
*
* Future work
* ===========
*
 * Support large pages for sysmem and userptr.
 *
 * Update page fault handling to work at BO page-level granularity (e.g. part
 * of a BO could be in system memory while another part could be in VRAM).
 *
 * The page fault handler will likely be optimized a bit more (e.g. rebinds
 * always wait on the dma-resv kernel slots of the VM or BO, but technically we
 * only have to wait on the BO moving; if using a job to do the rebind, we
 * could avoid blocking in the page fault handler and instead attach a callback
 * to the rebind job's fence to signal page fault completion; our handling of
 * short circuiting atomic faults for already-bound VMAs could be better;
 * etc...). We can tune all of this once we have benchmarks / performance
 * numbers from workloads up and running.
*/
#endif