==============
NVMe Emulation
==============
QEMU provides NVMe emulation through the ``nvme``, ``nvme-ns`` and
``nvme-subsys`` devices.
See the following sections for specific information on
* `Adding NVMe Devices`_, `additional namespaces`_ and `NVM subsystems`_.
* Configuration of `Optional Features`_ such as `Controller Memory Buffer`_,
`Simple Copy`_, `Zoned Namespaces`_, `metadata`_ and `End-to-End Data
Protection`_.
Adding NVMe Devices
===================
Controller Emulation
--------------------
The QEMU emulated NVMe controller implements version 1.4 of the NVM Express
specification. All mandatory features are implemented, with a couple of
exceptions and limitations:
* Accounting numbers in the SMART/Health log page are reset when the device
is power cycled.
* Interrupt Coalescing is not supported and is disabled by default.
The simplest way to attach an NVMe controller on the QEMU PCI bus is to add the
following parameters:
.. code-block:: console
-drive file=nvm.img,if=none,id=nvm
-device nvme,serial=deadbeef,drive=nvm
There are a number of optional general parameters for the ``nvme`` device. Some
are mentioned here, but see ``-device nvme,help`` to list all possible
parameters.
``max_ioqpairs=UINT32`` (default: ``64``)
Set the maximum number of allowed I/O queue pairs. This replaces the
deprecated ``num_queues`` parameter.
``msix_qsize=UINT16`` (default: ``65``)
The number of MSI-X vectors that the device should support.
``mdts=UINT8`` (default: ``7``)
Set the Maximum Data Transfer Size of the device.
``use-intel-id`` (default: ``off``)
Since QEMU 5.2, the device uses a QEMU allocated "Red Hat" PCI Device and
Vendor ID. Set this to ``on`` to revert to the unallocated Intel ID
previously used.
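As a rough sketch of how these parameters combine, the controller from the
example above could be given a different number of queue pairs and MSI-X
vectors. The values below are arbitrary; ``msix_qsize`` is simply set one
higher than ``max_ioqpairs``, mirroring the relationship between the defaults.
.. code-block:: console
-drive file=nvm.img,if=none,id=nvm
-device nvme,serial=deadbeef,drive=nvm,max_ioqpairs=8,msix_qsize=9,mdts=7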
Additional Namespaces
---------------------
In the simplest possible invocation sketched above, the device only supports a
single namespace with the namespace identifier ``1``. To support multiple
namespaces and additional features, the ``nvme-ns`` device must be used.
.. code-block:: console
-device nvme,id=nvme-ctrl-0,serial=deadbeef
-drive file=nvm-1.img,if=none,id=nvm-1
-device nvme-ns,drive=nvm-1
-drive file=nvm-2.img,if=none,id=nvm-2
-device nvme-ns,drive=nvm-2
The namespaces defined by the ``nvme-ns`` device will attach to the most
recently defined ``nvme-bus`` that is created by the ``nvme`` device. Namespace
identifiers are allocated automatically, starting from ``1``.
There are a number of parameters available:
``nsid`` (default: ``0``)
Explicitly set the namespace identifier.
``uuid`` (default: *autogenerated*)
Set the UUID of the namespace. This will be reported as a "Namespace UUID"
descriptor in the Namespace Identification Descriptor List.
``eui64``
Set the EUI-64 of the namespace. This will be reported as an "IEEE Extended
Unique Identifier" descriptor in the Namespace Identification Descriptor List.
Since machine type 6.1 a non-zero default value is used if the parameter
is not provided. For earlier machine types the field defaults to 0.
``bus``
If there are more ``nvme`` devices defined, this parameter may be used to
attach the namespace to a specific ``nvme`` device (identified by an ``id``
parameter on the controller device).
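For example, assuming a controller with ``id=nvme-ctrl-0`` as above, a
namespace can be pinned to that controller with an explicit namespace
identifier and UUID (the UUID below is an arbitrary example value):
.. code-block:: console
-device nvme,id=nvme-ctrl-0,serial=deadbeef
-drive file=nvm-1.img,if=none,id=nvm-1
-device nvme-ns,drive=nvm-1,bus=nvme-ctrl-0,nsid=1,uuid=7f0a3aa2-6d8c-4b6b-9d2e-0f3e8a1c5b42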
NVM Subsystems
--------------
Additional features become available if the controller device (``nvme``) is
linked to an NVM Subsystem device (``nvme-subsys``).
The NVM Subsystem emulation allows features such as shared namespaces and
multipath I/O.
.. code-block:: console
-device nvme-subsys,id=nvme-subsys-0,nqn=subsys0
-device nvme,serial=deadbeef,subsys=nvme-subsys-0
-device nvme,serial=deadbeef,subsys=nvme-subsys-0
This will create an NVM subsystem with two controllers. Having controllers
linked to an ``nvme-subsys`` device allows additional ``nvme-ns`` parameters:
``shared`` (default: ``on`` since 6.2)
Specifies that the namespace will be attached to all controllers in the
subsystem. If set to ``off``, the namespace will remain a private namespace
and may only be attached to a single controller at a time. Shared namespaces
are always automatically attached to all controllers (also when controllers
are hotplugged).
``detached`` (default: ``off``)
If set to ``on``, the namespace will be available in the subsystem, but
not attached to any controllers initially. A shared namespace with this set
to ``on`` will never be automatically attached to controllers.
Thus, adding
.. code-block:: console
-drive file=nvm-1.img,if=none,id=nvm-1
-device nvme-ns,drive=nvm-1,nsid=1
-drive file=nvm-2.img,if=none,id=nvm-2
-device nvme-ns,drive=nvm-2,nsid=3,shared=off,detached=on
will cause NSID 1 to be a shared namespace that is initially attached to both
controllers. NSID 3 will be a private namespace (due to ``shared=off``) that is
only attachable to a single controller at a time. Additionally, because of
``detached=on``, it will not be attached to any controller initially, nor to
hotplugged controllers.
Optional Features
=================
Controller Memory Buffer
------------------------
``nvme`` device parameters related to the Controller Memory Buffer support:
``cmb_size_mb=UINT32`` (default: ``0``)
This adds a Controller Memory Buffer of the given size at offset zero in BAR
2.
``legacy-cmb`` (default: ``off``)
By default, the device uses the "v1.4 scheme" for the Controller Memory
Buffer support (i.e., the CMB is initially disabled and must be explicitly
enabled by the host). Set this to ``on`` to behave as a v1.3 device with
respect to the CMB.
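For instance, a 64 MiB CMB (the size is chosen arbitrarily) could be added to
the controller from the earlier examples:
.. code-block:: console
-drive file=nvm.img,if=none,id=nvm
-device nvme,serial=deadbeef,drive=nvm,cmb_size_mb=64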
Simple Copy
-----------
The device includes support for TP 4065 ("Simple Copy Command"). A number of
additional ``nvme-ns`` device parameters may be used to control the Copy
command limits:
``mssrl=UINT16`` (default: ``128``)
Set the Maximum Single Source Range Length (``MSSRL``). This is the maximum
number of logical blocks that may be specified in each source range.
``mcl=UINT32`` (default: ``128``)
Set the Maximum Copy Length (``MCL``). This is the maximum number of logical
blocks that may be specified in a Copy command (the total for all source
ranges).
``msrc=UINT8`` (default: ``127``)
Set the Maximum Source Range Count (``MSRC``). This is the maximum number of
source ranges that may be used in a Copy command. This is a 0's based value.
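As an illustrative example, the copy limits could be changed from their
defaults like this (the specific values carry no special meaning):
.. code-block:: console
-device nvme-ns,drive=nvm-1,mssrl=64,mcl=256,msrc=15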
Zoned Namespaces
----------------
A namespace may be "Zoned" as defined by TP 4053 ("Zoned Namespaces"). Set
``zoned=on`` on an ``nvme-ns`` device to configure it as a zoned namespace.
The namespace may be configured with additional parameters:
``zoned.zone_size=SIZE`` (default: ``128MiB``)
Define the zone size (``ZSZE``).
``zoned.zone_capacity=SIZE`` (default: ``0``)
Define the zone capacity (``ZCAP``). If left at the default (``0``), the zone
capacity will equal the zone size.
``zoned.descr_ext_size=UINT32`` (default: ``0``)
Set the Zone Descriptor Extension Size (``ZDES``). Must be a multiple of 64
bytes.
``zoned.cross_read=BOOL`` (default: ``off``)
Set to ``on`` to allow reads to cross zone boundaries.
``zoned.max_active=UINT32`` (default: ``0``)
Set the maximum number of active resources (``MAR``). The default (``0``)
allows all zones to be active.
``zoned.max_open=UINT32`` (default: ``0``)
Set the maximum number of open resources (``MOR``). The default (``0``)
allows all zones to be open. If ``zoned.max_active`` is specified, this value
must be less than or equal to that.
``zoned.zasl=UINT8`` (default: ``0``)
Set the maximum data transfer size for the Zone Append command. Like
``mdts``, the value is specified as a power of two (2^n) and is in units of
the minimum memory page size (CAP.MPSMIN). The default value (``0``)
has this property inherit the ``mdts`` value.
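Putting the above together, a zoned namespace with explicit zone geometry and
resource limits might look as follows (all values are chosen purely for
illustration):
.. code-block:: console
-device nvme,id=nvme-ctrl-0,serial=deadbeef
-drive file=zns.img,if=none,id=zns
-device nvme-ns,drive=zns,zoned=on,zoned.zone_size=64M,zoned.zone_capacity=48M,zoned.max_active=16,zoned.max_open=8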
Flexible Data Placement
-----------------------
The device may be configured to support TP 4146 ("Flexible Data Placement") by
enabling it (``fdp=on``) on the subsystem::
-device nvme-subsys,id=nvme-subsys-0,nqn=subsys0,fdp=on,fdp.nruh=16
The subsystem emulates a single Endurance Group, on which Flexible Data
Placement will be supported. Also note that the device emulation deviates
slightly from the specification, by always enabling the "FDP Mode" feature on
the controller if the subsystem is configured for Flexible Data Placement.
Enabling Flexible Data Placement on the subsystem enables the following
parameters:
``fdp.nrg`` (default: ``1``)
Set the number of Reclaim Groups.
``fdp.nruh`` (default: ``0``)
Set the number of Reclaim Unit Handles. This is a mandatory parameter and
must be non-zero.
``fdp.runs`` (default: ``96M``)
Set the Reclaim Unit Nominal Size. Defaults to 96 MiB.
Namespaces within this subsystem may request Reclaim Unit Handles::
-device nvme-ns,drive=nvm-1,fdp.ruhs=RUHLIST
The ``RUHLIST`` is a semicolon-separated list (e.g. ``0;1;2;3``) and may
include ranges (e.g. ``0;8-15``). If no reclaim unit handle list is specified,
the controller will assign the controller-specified reclaim unit handle to
placement handle identifier 0.
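Tying the pieces together, a minimal Flexible Data Placement configuration
could look like the sketch below. The handle count and handle list are
arbitrary example values; note that the semicolons in the handle list usually
need to be quoted so the shell does not interpret them as command separators.
.. code-block:: console
-device nvme-subsys,id=nvme-subsys-0,nqn=subsys0,fdp=on,fdp.nruh=16
-device nvme,serial=deadbeef,subsys=nvme-subsys-0
-drive file=nvm-1.img,if=none,id=nvm-1
-device "nvme-ns,drive=nvm-1,fdp.ruhs=0;8-15"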
Metadata
--------
The virtual namespace device supports LBA metadata in the form of separate
metadata (``MPTR``-based) and extended LBAs.
``ms=UINT16`` (default: ``0``)
Defines the number of metadata bytes per LBA.
``mset=UINT8`` (default: ``0``)
Set to ``1`` to enable extended LBAs.
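For example, a namespace with 8 bytes of metadata per LBA, transferred as
extended LBAs, could be created like this (the metadata size is just an
example):
.. code-block:: console
-device nvme-ns,drive=nvm-1,ms=8,mset=1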
End-to-End Data Protection
--------------------------
The virtual namespace device supports DIF- and DIX-based protection information
(depending on ``mset``).
``pi=UINT8`` (default: ``0``)
Enable protection information of the specified type (type ``1``, ``2`` or
``3``).
``pil=UINT8`` (default: ``0``)
Controls the location of the protection information within the metadata. Set
to ``1`` to transfer protection information as the first eight bytes of
metadata. Otherwise, the protection information is transferred as the last
eight bytes.
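Building on the metadata parameters above, the following is a sketch of a
namespace using Type 1 protection information placed in the first eight bytes
of the metadata (again, the metadata size is an arbitrary example):
.. code-block:: console
-device nvme-ns,drive=nvm-1,ms=8,mset=1,pi=1,pil=1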
Virtualization Enhancements and SR-IOV (Experimental Support)
-------------------------------------------------------------
The ``nvme`` device supports Single Root I/O Virtualization and Sharing
along with Virtualization Enhancements. The controller has to be linked to
an NVM Subsystem device (``nvme-subsys``) for use with SR-IOV.
A number of parameters are present (**please note that they may be
subject to change**):
``sriov_max_vfs`` (default: ``0``)
Indicates the maximum number of PCIe virtual functions supported
by the controller. Specifying a non-zero value enables reporting of both
SR-IOV and ARI (Alternative Routing-ID Interpretation) capabilities
by the NVMe device. Virtual function controllers will not report SR-IOV.
``sriov_vq_flexible``
Indicates the total number of flexible queue resources assignable to all
the secondary controllers. Implicitly sets the number of primary
controller's private resources to ``(max_ioqpairs - sriov_vq_flexible)``.
``sriov_vi_flexible``
Indicates the total number of flexible interrupt resources assignable to
all the secondary controllers. Implicitly sets the number of primary
controller's private resources to ``(msix_qsize - sriov_vi_flexible)``.
``sriov_max_vi_per_vf`` (default: ``0``)
Indicates the maximum number of virtual interrupt resources assignable
to a secondary controller. The default ``0`` resolves to
``(sriov_vi_flexible / sriov_max_vfs)``
``sriov_max_vq_per_vf`` (default: ``0``)
Indicates the maximum number of virtual queue resources assignable to
a secondary controller. The default ``0`` resolves to
``(sriov_vq_flexible / sriov_max_vfs)``
The simplest possible invocation enables the capability to set up one VF
controller and assign an admin queue, an I/O queue, and an MSI-X interrupt.
.. code-block:: console
-device nvme-subsys,id=subsys0
-device nvme,serial=deadbeef,subsys=subsys0,sriov_max_vfs=1,
sriov_vq_flexible=2,sriov_vi_flexible=1
The minimum steps required to configure a functional NVMe secondary
controller are:
* unbind flexible resources from the primary controller
.. code-block:: console
nvme virt-mgmt /dev/nvme0 -c 0 -r 1 -a 1 -n 0
nvme virt-mgmt /dev/nvme0 -c 0 -r 0 -a 1 -n 0
* perform a Function Level Reset on the primary controller to actually
release the resources
.. code-block:: console
echo 1 > /sys/bus/pci/devices/0000:01:00.0/reset
* enable VF
.. code-block:: console
echo 1 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs
* assign the flexible resources to the VF and set it ONLINE
.. code-block:: console
nvme virt-mgmt /dev/nvme0 -c 1 -r 1 -a 8 -n 1
nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 8 -n 2
nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 9 -n 0
* bind the NVMe driver to the VF
.. code-block:: console
echo 0000:01:00.1 > /sys/bus/pci/drivers/nvme/bind