.. SPDX-License-Identifier: GPL-2.0-only

===============================
 Qualcomm Cloud AI 100 (AIC100)
===============================

Overview
========

The Qualcomm Cloud AI 100/AIC100 family of products (including SA9000P - part of
Snapdragon Ride) are PCIe adapter cards which contain a dedicated SoC ASIC for
the purpose of efficiently running Artificial Intelligence (AI) Deep Learning
inference workloads. They are AI accelerators.

The PCIe interface of AIC100 is capable of PCIe Gen4 speeds over eight lanes
(x8). An individual SoC on a card can have up to 16 NSPs for running workloads.
Each SoC has an A53 management CPU. On card, there can be up to 32 GB of DDR.

Multiple AIC100 cards can be hosted in a single system to scale overall
performance. AIC100 cards are multi-user capable and able to execute workloads
from multiple users in a concurrent manner.

Hardware Description
====================

An AIC100 card consists of an AIC100 SoC, on-card DDR, and a set of
miscellaneous peripherals (PMICs, etc.).

An AIC100 card can either be a PCIe HHHL form factor (a traditional PCIe card),
or a Dual M.2 card. Both use PCIe to connect to the host system.

As a PCIe endpoint/adapter, AIC100 uses the standard VendorID(VID)/
DeviceID(DID) combination to uniquely identify itself to the host. AIC100
uses the standard Qualcomm VID (0x17cb). All AIC100 SKUs use the same
AIC100 DID (0xa100).
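
As a minimal sketch of how host software might recognize the device from the
IDs given above, the helper below compares a VID/DID pair against those values.
The macro and function names are hypothetical and chosen only for illustration;
they are not taken from the qaic driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* IDs as stated in the text above. Macro names are illustrative. */
#define QCOM_PCI_VENDOR_ID	0x17cb
#define AIC100_PCI_DEVICE_ID	0xa100

/* Hypothetical helper: true if the VID/DID pair identifies an AIC100. */
static bool is_aic100(uint16_t vendor, uint16_t device)
{
	return vendor == QCOM_PCI_VENDOR_ID &&
	       device == AIC100_PCI_DEVICE_ID;
}
```

Because all SKUs share one DID, a match on this pair alone cannot distinguish
between card variants.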

AIC100 does not implement FLR (function level reset).

AIC100 implements MSI but does not implement MSI-X. AIC100 prefers 17 MSIs to
operate (1 for MHI, 16 for the DMA Bridge). Because MSI vectors are allocated
in powers of two, this means reserving 32 MSIs. Falling back to 1 MSI is
possible in scenarios where reserving 32 MSIs isn't feasible.
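
The gap between the 17 MSIs the device uses and the 32 that must be reserved
comes from multi-message MSI granting vectors in power-of-two counts. A small
sketch of that arithmetic follows; the names are hypothetical and this is not
the driver's allocation code.

```c
#include <stdint.h>

#define AIC100_MSI_FOR_MHI	1	/* from the text above */
#define AIC100_MSI_FOR_DBC	16	/* one per DMA Bridge channel */

/* Round n (n >= 1) up to the next power of two. */
static uint32_t msi_round_up_pow2(uint32_t n)
{
	uint32_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* Number of MSI vectors to reserve for the preferred configuration. */
static uint32_t aic100_msi_to_reserve(void)
{
	return msi_round_up_pow2(AIC100_MSI_FOR_MHI + AIC100_MSI_FOR_DBC);
}
```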

As a PCIe device, AIC100 utilizes BARs to provide host interfaces to the device
hardware. AIC100 provides three 64-bit BARs.

* The first BAR is 4K in size, and exposes the MHI interface to the host.

* The second BAR is 2M in size, and exposes the DMA Bridge interface to the
  host.

* The third BAR is variable in size based on an individual AIC100's
  configuration, but defaults to 64K. This BAR currently has no purpose.

From the host perspective, AIC100 has several key hardware components:

* MHI (Modem Host Interface)
* QSM (QAIC Service Manager)
* NSPs (Neural Signal Processors)
* DMA Bridge
* DDR

MHI
---

AIC100 has one MHI interface over PCIe. MHI itself is documented at
Documentation/mhi/index.rst. MHI is the mechanism the host uses to communicate
with the QSM. Except for workload data via the DMA Bridge, all interaction with
the device occurs via MHI.

QSM
---

QAIC Service Manager. This is an ARM A53 CPU that runs the primary
firmware of the card and performs on-card management tasks. It also
communicates with the host via MHI. Each AIC100 has one of
these.

NSP
---

Neural Signal Processor. Each AIC100 has up to 16 of these. These are
the processors that run the workloads on AIC100. Each NSP is a Qualcomm Hexagon
(Q6) DSP with HVX and HMX. Each NSP can only run one workload at a time, but
multiple NSPs may be assigned to a single workload. Since each NSP can only run
one workload, AIC100 is limited to 16 concurrent workloads. Workload
"scheduling" is under the purview of the host. AIC100 does not automatically
timeslice.

DMA Bridge
----------

The DMA Bridge is a custom DMA engine that manages the flow of data
in and out of workloads. AIC100 has one of these. The DMA Bridge has 16
channels, each consisting of a set of request/response FIFOs. Each active
workload is assigned a single DMA Bridge channel. The DMA Bridge exposes
hardware registers to manage the FIFOs (head/tail pointers), but requires host
memory to store the FIFOs.

DDR
---

AIC100 has on-card DDR. In total, an AIC100 can have up to 32 GB of DDR.
This DDR is used to store workloads, data for the workloads, and is used by the
QSM for managing the device. NSPs are granted access to sections of the DDR by
the QSM. The host does not have direct access to the DDR, and must make
requests to the QSM to transfer data to the DDR.

High-level Use Flow
===================

AIC100 is a multi-user, programmable accelerator typically used for running
neural networks in inferencing mode to efficiently perform AI operations.
AIC100 is not intended for training neural networks. AIC100 can be utilized
for generic compute workloads.

Assuming a user wants to utilize AIC100, they would follow these steps:

1. Compile the workload into an ELF targeting the NSP(s)
2. Make requests to the QSM to load the workload and related artifacts into the
   device DDR
3. Make a request to the QSM to activate the workload onto a set of idle NSPs
4. Make requests to the DMA Bridge to send input data to the workload to be
   processed, and other requests to receive processed output data from the
   workload.
5. Once the workload is no longer required, make a request to the QSM to
   deactivate the workload, thus putting the NSPs back into an idle state.
6. Once the workload and related artifacts are no longer needed for future
   sessions, make requests to the QSM to unload the data from DDR. This frees
   the DDR to be used by other users.
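
The ordering of the steps above can be sketched as a small state machine. This
is only an illustration of the lifecycle as described here; the state and
operation names are hypothetical, and the real QSM tracks considerably more
state than this.

```c
#include <stdbool.h>

/* Hypothetical lifecycle states of one workload on the device. */
enum workload_state { WL_UNLOADED, WL_LOADED, WL_ACTIVE };

/* Hypothetical requests, matching steps 2, 3, 5 and 6 above. */
enum workload_op { OP_LOAD, OP_ACTIVATE, OP_DEACTIVATE, OP_UNLOAD };

/*
 * Apply op to *state. Returns false and leaves *state unchanged if the
 * operation is out of order (for example, activating before loading).
 */
static bool workload_apply(enum workload_state *state, enum workload_op op)
{
	static const struct {
		enum workload_op op;
		enum workload_state from, to;
	} rules[] = {
		{ OP_LOAD,	 WL_UNLOADED, WL_LOADED   },
		{ OP_ACTIVATE,	 WL_LOADED,   WL_ACTIVE   },
		{ OP_DEACTIVATE, WL_ACTIVE,   WL_LOADED   },
		{ OP_UNLOAD,	 WL_LOADED,   WL_UNLOADED },
	};

	for (unsigned int i = 0; i < sizeof(rules) / sizeof(rules[0]); i++) {
		if (rules[i].op == op && rules[i].from == *state) {
			*state = rules[i].to;
			return true;
		}
	}
	return false;
}

/* Step 4: data only flows through the DMA Bridge while active. */
static bool workload_can_transfer(enum workload_state state)
{
	return state == WL_ACTIVE;
}
```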


Boot Flow
=========

AIC100 uses a flashless boot flow, derived from Qualcomm MSMs.

When AIC100 is first powered on, it begins executing PBL (Primary Bootloader)
from ROM. PBL enumerates the PCIe link, and initializes the BHI (Boot Host
Interface) component of MHI.

Using BHI, the host points PBL to the location of the SBL (Secondary Bootloader)
image. The PBL pulls the image from the host, validates it, and begins
execution of SBL.

SBL initializes MHI, and uses MHI to notify the host that the device has entered
the SBL stage. SBL performs a number of operations:

* SBL initializes the majority of hardware (anything PBL left uninitialized),
  including DDR.
* SBL offloads the bootlog to the host.
* SBL synchronizes timestamps with the host for future logging.
* SBL uses the Sahara protocol to obtain the runtime firmware images from the
  host.

Once SBL has obtained and validated the runtime firmware, it brings the NSPs out
of reset, and jumps into the QSM.

The QSM uses MHI to notify the host that the device has entered the QSM stage
(AMSS in MHI terms). At this point, the AIC100 device is fully functional, and
ready to process workloads.

Userspace components
====================

Compiler
--------

An open compiler for AIC100 based on upstream LLVM can be found at:
https://github.com/quic/software-kit-for-qualcomm-cloud-ai-100-cc

Usermode Driver (UMD)
---------------------

An open UMD that interfaces with the qaic kernel driver can be found at:
https://github.com/quic/software-kit-for-qualcomm-cloud-ai-100

Sahara loader
-------------

An open implementation of the Sahara protocol called kickstart can be found at:
https://github.com/andersson/qdl

MHI Channels
============

AIC100 defines a number of MHI channels for different purposes. This is a list
of the defined channels, and their uses.

+----------------+---------+----------+----------------------------------------+
| Channel name   | IDs     | EEs      | Purpose                                |
+================+=========+==========+========================================+
| QAIC_LOOPBACK  | 0 & 1   | AMSS     | Any data sent to the device on this    |
|                |         |          | channel is sent back to the host.      |
+----------------+---------+----------+----------------------------------------+
| QAIC_SAHARA    | 2 & 3   | SBL      | Used by SBL to obtain the runtime      |
|                |         |          | firmware from the host.                |
+----------------+---------+----------+----------------------------------------+
| QAIC_DIAG      | 4 & 5   | AMSS     | Used to communicate with QSM via the   |
|                |         |          | DIAG protocol.                         |
+----------------+---------+----------+----------------------------------------+
| QAIC_SSR       | 6 & 7   | AMSS     | Used to notify the host of subsystem   |
|                |         |          | restart events, and to offload SSR     |
|                |         |          | crashdumps.                            |
+----------------+---------+----------+----------------------------------------+
| QAIC_QDSS      | 8 & 9   | AMSS     | Used for the Qualcomm Debug Subsystem. |
+----------------+---------+----------+----------------------------------------+
| QAIC_CONTROL   | 10 & 11 | AMSS     | Used for the Neural Network Control    |
|                |         |          | (NNC) protocol. This is the primary    |
|                |         |          | channel between host and QSM for       |
|                |         |          | managing workloads.                    |
+----------------+---------+----------+----------------------------------------+
| QAIC_LOGGING   | 12 & 13 | SBL      | Used by the SBL to send the bootlog to |
|                |         |          | the host.                              |
+----------------+---------+----------+----------------------------------------+
| QAIC_STATUS    | 14 & 15 | AMSS     | Used to notify the host of Reliability,|
|                |         |          | Availability, Serviceability (RAS)     |
|                |         |          | events.                                |
+----------------+---------+----------+----------------------------------------+
| QAIC_TELEMETRY | 16 & 17 | AMSS     | Used to get/set power/thermal/etc      |
|                |         |          | attributes.                            |
+----------------+---------+----------+----------------------------------------+
| QAIC_DEBUG     | 18 & 19 | AMSS     | Not used.                              |
+----------------+---------+----------+----------------------------------------+
| QAIC_TIMESYNC  | 20 & 21 | SBL      | Used to synchronize timestamps in the  |
|                |         |          | device side logs with the host time    |
|                |         |          | source.                                |
+----------------+---------+----------+----------------------------------------+
| QAIC_TIMESYNC  | 22 & 23 | AMSS     | Used to periodically synchronize       |
| _PERIODIC      |         |          | timestamps in the device side logs with|
|                |         |          | the host time source.                  |
+----------------+---------+----------+----------------------------------------+
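
The table above can be expressed as a lookup structure. The sketch below is
one possible encoding, not the driver's MHI channel configuration; the type and
function names are hypothetical. It relies on the observation that the channel
ID pairs in the table are dense, so the table index is the ID divided by two.

```c
#include <stddef.h>

/* Execution environments named in the table. */
enum mhi_ee { EE_SBL, EE_AMSS };

struct qaic_chan {
	const char *name;
	unsigned int first_id;	/* the pair is first_id and first_id + 1 */
	enum mhi_ee ee;
};

/* Rows copied from the table above, in ID order. */
static const struct qaic_chan qaic_chans[] = {
	{ "QAIC_LOOPBACK",	    0,	EE_AMSS },
	{ "QAIC_SAHARA",	    2,	EE_SBL	},
	{ "QAIC_DIAG",		    4,	EE_AMSS },
	{ "QAIC_SSR",		    6,	EE_AMSS },
	{ "QAIC_QDSS",		    8,	EE_AMSS },
	{ "QAIC_CONTROL",	    10, EE_AMSS },
	{ "QAIC_LOGGING",	    12, EE_SBL	},
	{ "QAIC_STATUS",	    14, EE_AMSS },
	{ "QAIC_TELEMETRY",	    16, EE_AMSS },
	{ "QAIC_DEBUG",		    18, EE_AMSS },
	{ "QAIC_TIMESYNC",	    20, EE_SBL	},
	{ "QAIC_TIMESYNC_PERIODIC", 22, EE_AMSS },
};

/* Return the table row for a channel ID, or NULL if the ID is undefined. */
static const struct qaic_chan *qaic_chan_by_id(unsigned int id)
{
	size_t idx = id / 2;

	if (idx >= sizeof(qaic_chans) / sizeof(qaic_chans[0]))
		return NULL;
	return &qaic_chans[idx];
}
```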

DMA Bridge
==========

Overview
--------

The DMA Bridge is one of the main interfaces to the host from the device
(the other being MHI). As part of activating a workload to run on NSPs, the QSM
assigns that workload a DMA Bridge channel. A workload's DMA Bridge channel
(DBC for short) is solely for the use of that workload and is not shared with
other workloads.

Each DBC is a pair of FIFOs that manage data in and out of the workload. One
FIFO is the request FIFO. The other FIFO is the response FIFO.

Each DBC contains 4 registers in hardware:

* Request FIFO head pointer (offset 0x0). Read only by the host. Indicates the
  latest item in the FIFO the device has consumed.
* Request FIFO tail pointer (offset 0x4). Read/write by the host. Host
  increments this register to add new items to the FIFO.
* Response FIFO head pointer (offset 0x8). Read/write by the host. Indicates
  the latest item in the FIFO the host has consumed.
* Response FIFO tail pointer (offset 0xc). Read only by the host. Device
  increments this register to add new items to the FIFO.

The values in each register are indexes in the FIFO. To get the location of the
FIFO element pointed to by the register: FIFO base address + register * element
size.
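
The address calculation above, written out as a helper. The function name is
hypothetical; the element size is passed in because request and response
elements differ in size.

```c
#include <stdint.h>

/*
 * Address of the FIFO element selected by a head/tail register value:
 * FIFO base address + register * element size.
 */
static uint64_t fifo_elem_addr(uint64_t fifo_base, uint32_t reg_value,
			       uint32_t elem_size)
{
	/* Widen before multiplying so large FIFOs cannot overflow 32 bits. */
	return fifo_base + (uint64_t)reg_value * elem_size;
}
```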

DBC registers are exposed to the host via the second BAR. Each DBC consumes
4KB of space in the BAR.
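
A sketch of locating one DBC register inside the second BAR, using the register
offsets and the 4KB per-DBC stride given above. Where the first DBC begins
within the BAR is not stated in this document, so it is left as a parameter
(``region_base``); that parameter and all names here are assumptions made for
illustration.

```c
#include <stdint.h>

#define DBC_STRIDE	0x1000u	/* each DBC consumes 4KB of BAR space */

/* Register offsets within one DBC, from the list above. */
enum dbc_reg {
	DBC_REQ_HEAD = 0x0,
	DBC_REQ_TAIL = 0x4,
	DBC_RSP_HEAD = 0x8,
	DBC_RSP_TAIL = 0xc,
};

/*
 * Byte offset of a DBC register from the start of the BAR.
 * region_base: assumed offset of DBC 0 within the BAR (not given here).
 */
static uint32_t dbc_reg_offset(uint32_t region_base, uint32_t dbc_id,
			       enum dbc_reg reg)
{
	return region_base + dbc_id * DBC_STRIDE + (uint32_t)reg;
}
```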

The actual FIFOs are backed by host memory. When sending a request to the QSM
to activate a workload, the host must donate memory to be used for the FIFOs.
Due to internal mapping limitations of the device, a single contiguous chunk of
memory must be provided per DBC, which hosts both FIFOs. The request FIFO will
consume the beginning of the memory chunk, and the response FIFO will consume
the end of the memory chunk.

Request FIFO
------------

A request FIFO element has the following structure:

.. code-block:: c

  struct request_elem {
	u16 req_id;
	u8  seq_id;
	u8  pcie_dma_cmd;
	u32 reserved0;
	u64 pcie_dma_source_addr;
	u64 pcie_dma_dest_addr;
	u32 pcie_dma_len;
	u32 reserved1;
	u64 doorbell_addr;
	u8  doorbell_attr;
	u8  reserved2;
	u16 reserved3;
	u32 doorbell_data;
	u32 sem_cmd0;
	u32 sem_cmd1;
	u32 sem_cmd2;
	u32 sem_cmd3;
  };
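
As a sanity check on the layout, the same fields can be written with C99
fixed-width types and verified at compile time. Summing the field sizes gives
64 bytes per element with every field naturally aligned. The struct name and
the numbered reserved field names below are hypothetical, and the four
semaphore commands are folded into an array for brevity.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative userspace mirror of the request element layout. */
struct request_elem_layout {
	uint16_t req_id;
	uint8_t  seq_id;
	uint8_t  pcie_dma_cmd;
	uint32_t reserved0;
	uint64_t pcie_dma_source_addr;
	uint64_t pcie_dma_dest_addr;
	uint32_t pcie_dma_len;
	uint32_t reserved1;
	uint64_t doorbell_addr;
	uint8_t  doorbell_attr;
	uint8_t  reserved2;
	uint16_t reserved3;
	uint32_t doorbell_data;
	uint32_t sem_cmd[4];	/* sem_cmd0 .. sem_cmd3 */
};

/* Compile-time checks: no padding is inserted anywhere in the element. */
_Static_assert(sizeof(struct request_elem_layout) == 64,
	       "request element should be 64 bytes");
_Static_assert(offsetof(struct request_elem_layout, pcie_dma_source_addr) == 8,
	       "source address at byte 8");
_Static_assert(offsetof(struct request_elem_layout, doorbell_addr) == 32,
	       "doorbell address at byte 32");
_Static_assert(offsetof(struct request_elem_layout, sem_cmd) == 48,
	       "semaphore commands at byte 48");
```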

Request field descriptions:

req_id
	request ID. A request FIFO element and a response FIFO element with
	the same request ID refer to the same command.

seq_id
	sequence ID within a request. Ignored by the DMA Bridge.

pcie_dma_cmd
	describes the DMA element of this request.

	* Bit(7) is the force MSI flag. It overrides the DMA Bridge MSI logic
	  and generates a MSI when this request is complete, provided QSM
	  has configured the DMA Bridge to look at this bit.
	* Bits(6:5) are reserved.
	* Bit(4) is the completion code flag, and indicates that the DMA Bridge
	  shall generate a response FIFO element when this request is
	  complete.
	* Bit(3) indicates if this request is a linked list transfer(0) or a bulk
	  transfer(1).
	* Bit(2) is reserved.
	* Bits(1:0) indicate the type of transfer. No transfer(0), to device(1),
	  from device(2). Value 3 is illegal.

pcie_dma_source_addr
	source address for a bulk transfer, or the address of the linked list.

pcie_dma_dest_addr
	destination address for a bulk transfer.

pcie_dma_len
	length of the bulk transfer. Note that the size of this field
	limits transfers to 4G in size.

doorbell_addr
	address of the doorbell to ring when this request is complete.

doorbell_attr
	doorbell attributes.

	* Bit(7) indicates if a write to a doorbell is to occur.
	* Bits(6:2) are reserved.
	* Bits(1:0) contain the encoding of the doorbell length. 0 is 32-bit,
	  1 is 16-bit, 2 is 8-bit, 3 is reserved. The doorbell address
	  must be naturally aligned to the specified length.

doorbell_data
	data to write to the doorbell. Only the bits corresponding to
	the doorbell length are valid.

sem_cmdN
	semaphore command.

	* Bit(31) indicates this semaphore command is enabled.
	* Bit(30) is the to-device DMA fence. Block this request until all
	  to-device DMA transfers are complete.
	* Bit(29) is the from-device DMA fence. Block this request until all
	  from-device DMA transfers are complete.
	* Bits(28:27) are reserved.
	* Bits(26:24) are the semaphore command. 0 is NOP. 1 is init with the
	  specified value. 2 is increment. 3 is decrement. 4 is wait
	  until the semaphore is equal to the specified value. 5 is wait
	  until the semaphore is greater or equal to the specified value.
	  6 is "P", wait until semaphore is greater than 0, then
	  decrement by 1. 7 is reserved.
	* Bit(23) is reserved.
	* Bit(22) is the semaphore sync. 0 is postsync, which means that the
	  semaphore operation is done after the DMA transfer. 1 is
	  presync, which gates the DMA transfer. Only one presync is
	  allowed per request.
	* Bit(21) is reserved.
	* Bits(20:16) are the index of the semaphore to operate on.
	* Bits(15:12) are reserved.
	* Bits(11:0) are the semaphore value to use in operations.
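
The bit layouts of pcie_dma_cmd, doorbell_attr and sem_cmdN described above
can be sketched as encoder helpers. The bit positions come from the field
descriptions; every type and function name is hypothetical, and reserved bits
are simply left as zero, which is an assumption rather than a stated
requirement.

```c
#include <stdbool.h>
#include <stdint.h>

enum xfer_dir { XFER_NONE = 0, XFER_TO_DEVICE = 1, XFER_FROM_DEVICE = 2 };
enum db_len { DB_LEN_32 = 0, DB_LEN_16 = 1, DB_LEN_8 = 2 };
enum sem_op {
	SEM_NOP = 0, SEM_INIT = 1, SEM_INC = 2, SEM_DEC = 3,
	SEM_WAIT_EQ = 4, SEM_WAIT_GE = 5, SEM_P = 6,
};

/* pcie_dma_cmd: bits 1:0 direction, bit 3 bulk, bit 4 response, bit 7 MSI. */
static uint8_t encode_pcie_dma_cmd(enum xfer_dir dir, bool bulk,
				   bool want_response, bool force_msi)
{
	unsigned int cmd = (unsigned int)dir & 0x3u;

	cmd |= bulk ? 1u << 3 : 0u;
	cmd |= want_response ? 1u << 4 : 0u;
	cmd |= force_msi ? 1u << 7 : 0u;
	return (uint8_t)cmd;
}

/* doorbell_attr: bit 7 enables the write, bits 1:0 encode the length. */
static uint8_t encode_doorbell_attr(bool enable, enum db_len len)
{
	return (uint8_t)((enable ? 0x80u : 0u) | ((unsigned int)len & 0x3u));
}

struct sem_cmd_fields {
	bool enable;		/* bit 31 */
	bool fence_to_dev;	/* bit 30 */
	bool fence_from_dev;	/* bit 29 */
	enum sem_op op;		/* bits 26:24 */
	bool presync;		/* bit 22: 1 = presync, 0 = postsync */
	unsigned int index;	/* bits 20:16 */
	unsigned int value;	/* bits 11:0 */
};

static uint32_t encode_sem_cmd(const struct sem_cmd_fields *f)
{
	uint32_t v = 0;

	v |= (uint32_t)f->enable << 31;
	v |= (uint32_t)f->fence_to_dev << 30;
	v |= (uint32_t)f->fence_from_dev << 29;
	v |= ((uint32_t)f->op & 0x7u) << 24;
	v |= (uint32_t)f->presync << 22;
	v |= ((uint32_t)f->index & 0x1fu) << 16;
	v |= (uint32_t)f->value & 0xfffu;
	return v;
}
```

Out-of-range index or value inputs are masked to the field width here; a
stricter implementation might reject them instead.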

Overall, a request is processed in 4 steps:

1. If specified, the presync semaphore condition must be true
2. If enabled, the DMA transfer occurs
3. If specified, the postsync semaphore conditions must be true
4. If enabled, the doorbell is written

By using the semaphores in conjunction with the workload running on the NSPs,
the data pipeline can be synchronized such that the host can queue multiple
requests of data for the workload to process, but the DMA Bridge will only copy
the data into the memory of the workload when the workload is ready to process
the next input.

Response FIFO
-------------

Once a request is fully processed, a response FIFO element is generated if
specified in pcie_dma_cmd. The structure of a response FIFO element:

.. code-block:: c

  struct response_elem {
	u16 req_id;
	u16 completion_code;
  };

req_id
	matches the req_id of the request that generated this element.

completion_code
	status of this request. 0 is success. Non-zero is an error.

The DMA Bridge will generate a MSI to the host as a reaction to activity in the
response FIFO of a DBC. The DMA Bridge hardware has an IRQ storm mitigation
algorithm, where it will only generate a MSI when the response FIFO transitions
from empty to non-empty (unless force MSI is enabled and triggered). In
response to this MSI, the host is expected to drain the response FIFO, and must
take care to handle any race conditions between draining the FIFO, and the
device inserting elements into the FIFO.
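
One way to handle the race described above is to re-read the tail pointer after
each drain pass, so that elements the device appended while the FIFO was
non-empty (and which therefore raised no new MSI) are not left behind. The
sketch below models the registers as plain memory. It assumes that head equal
to tail means the FIFO is empty and that indexes wrap modulo the FIFO depth;
neither assumption is stated in this document, and all names are hypothetical.

```c
#include <stdint.h>

struct response_elem_sim {
	uint16_t req_id;
	uint16_t completion_code;
};

/* Simulated view of one DBC's response side. */
struct dbc_sim {
	volatile uint32_t rsp_head;	/* written by the host */
	volatile uint32_t rsp_tail;	/* written by the device */
	uint32_t nelem;			/* FIFO depth in elements */
	const struct response_elem_sim *rsp_fifo;
};

typedef void (*rsp_handler)(const struct response_elem_sim *e, void *ctx);

/* Drain until the FIFO is observed empty; returns elements handled. */
static unsigned int drain_responses(struct dbc_sim *dbc, rsp_handler fn,
				    void *ctx)
{
	unsigned int handled = 0;
	uint32_t head = dbc->rsp_head;

	for (;;) {
		uint32_t tail = dbc->rsp_tail;	/* re-read on every pass */

		if (head == tail)
			break;
		while (head != tail) {
			fn(&dbc->rsp_fifo[head], ctx);
			head = (head + 1) % dbc->nelem;
			handled++;
		}
		dbc->rsp_head = head;	/* publish progress to the device */
	}
	return handled;
}

/* Example handler: counts responses with a non-zero completion code. */
static void count_errors(const struct response_elem_sim *e, void *ctx)
{
	if (e->completion_code != 0)
		(*(unsigned int *)ctx)++;
}
```

On real hardware the register accesses would be MMIO reads and writes with the
appropriate barriers, which this simulation omits.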

Neural Network Control (NNC) Protocol
=====================================

The NNC protocol is how the host makes requests to the QSM to manage workloads.
It uses the QAIC_CONTROL MHI channel.

Each NNC request is packaged into a message. Each message is a series of
transactions. A passthrough type transaction can contain elements known as
commands.

QSM requires NNC messages be little endian encoded and the fields be naturally
aligned. Since there are 64-bit elements in some NNC messages, 64-bit alignment
must be maintained.
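
The encoding rules above can be illustrated with helpers that write
little-endian fields into a byte buffer at naturally aligned offsets,
independent of host endianness. These helpers are hypothetical and do not
describe the actual NNC message layout; the caller is assumed to supply a
zero-initialized buffer that is large enough, so that alignment padding reads
as zero.

```c
#include <stddef.h>
#include <stdint.h>

/* Round off up to a multiple of align (align must be a power of two). */
static size_t align_up(size_t off, size_t align)
{
	return (off + align - 1) & ~(align - 1);
}

/* Write v as 4 little-endian bytes at the next 4-byte aligned offset. */
static size_t put_le32(uint8_t *buf, size_t off, uint32_t v)
{
	off = align_up(off, 4);
	for (int i = 0; i < 4; i++)
		buf[off + i] = (uint8_t)(v >> (8 * i));
	return off + 4;
}

/* Write v as 8 little-endian bytes at the next 8-byte aligned offset. */
static size_t put_le64(uint8_t *buf, size_t off, uint64_t v)
{
	off = align_up(off, 8);
	for (int i = 0; i < 8; i++)
		buf[off + i] = (uint8_t)(v >> (8 * i));
	return off + 8;
}
```

Each helper returns the offset just past the field it wrote, so calls can be
chained while building a message.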

A message contains a header and then a series of transactions. A message may be
at most 4K in size from QSM to the host. From the host to the QSM, a message
can be at most 64K (maximum size of a single MHI packet), but there is a
continuation feature where message N+1 can be marked as a continuation of
message N. This is used for exceedingly large DMA xfer transactions.

Transaction descriptions
------------------------

passthrough
	Allows userspace to send an opaque payload directly to the QSM.
	This is used for NNC commands. Userspace is responsible for managing
	the QSM message requirements in the payload.

dma_xfer
	DMA transfer. Describes an object that the QSM should DMA into the
	device via address and size tuples.

activate
	Activate a workload onto NSPs. The host must provide memory to be
	used by the DBC.

deactivate
	Deactivate an active workload and return the NSPs to idle.

status
	Query the QSM about its NNC implementation. Returns the NNC version,
	and if CRC is used.

terminate
	Release a user's resources.

dma_xfer_cont
	Continuation of a previous DMA transfer. If a DMA transfer
	cannot be specified in a single message (highly fragmented), this
	transaction can be used to specify more ranges.

validate_partition
	Query the QSM to determine if a partition identifier is valid.

Each message is tagged with a user id, and a partition id. The user id allows
QSM to track resources, and release them when the user goes away (e.g. the
process crashes). A partition id identifies the resource partition that QSM
manages, which this message applies to.

Messages may have CRCs. Messages should have CRCs applied until the QSM
reports via the status transaction that CRCs are not needed. The QSM on the
SA9000P requires CRCs for black channel safing.

Subsystem Restart (SSR)
=======================

SSR is the concept of limiting the impact of an error. An AIC100 device may
have multiple users, each with their own workload running. If the workload of
one user crashes, the fallout of that should be limited to that workload and not
impact other workloads. SSR accomplishes this.

If a particular workload crashes, QSM notifies the host via the QAIC_SSR MHI
channel. This notification identifies the workload by its assigned DBC. A
multi-stage recovery process is then used to clean up both sides, and get the
DBC/NSPs into a working state.

When SSR occurs, any state in the workload is lost. Any inputs that were in
process, or queued but not yet serviced, are lost. The loaded artifacts will
remain in on-card DDR, but the host will need to re-activate the workload if
it desires to recover the workload.

Reliability, Availability, Serviceability (RAS)
===============================================

AIC100 is expected to be deployed in server systems where RAS ideology is
applied. Simply put, RAS is the concept of detecting, classifying, and
reporting errors. While PCIe has AER (Advanced Error Reporting) which factors
into RAS, AER does not allow for a device to report details about internal
errors. Therefore, AIC100 implements a custom RAS mechanism. When a RAS event
occurs, QSM will report the event with appropriate details via the QAIC_STATUS
MHI channel. A sysadmin may determine that a particular device needs
additional service based on RAS reports.

Telemetry
=========

QSM has the ability to report various physical attributes of the device, and in
some cases, to allow the host to control them. Examples include thermal limits,
thermal readings, and power readings. These items are communicated via the
QAIC_TELEMETRY MHI channel.