1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
|
= Device Specification for Inter-VM shared memory device =
The Inter-VM shared memory device (ivshmem) is designed to share a
memory region between multiple QEMU processes running different guests
and the host. In order for all guests to be able to pick up the
shared memory area, it is modeled by QEMU as a PCI device exposing
said memory to the guest as a PCI BAR.
The device can use a shared memory object on the host directly, or it
can obtain one from an ivshmem server.
In the latter case, the device can additionally interrupt its peers, and
get interrupted by its peers.
== Configuring the ivshmem PCI device ==
There are two basic configurations:
- Just shared memory: -device ivshmem,shm=NAME,...
This uses shared memory object NAME.
- Shared memory plus interrupts: -device ivshmem,chardev=CHR,vectors=N,...
An ivshmem server must already be running on the host. The device
connects to the server's UNIX domain socket via character device
CHR.
Each peer gets assigned a unique ID by the server. IDs must be
between 0 and 65535.
Interrupts are message-signaled by default (MSI-X). With msi=off
the device has no MSI-X capability, and uses legacy INTx instead.
vectors=N configures the number of vectors to use.
For more details on ivshmem device properties, see The QEMU Emulator
User Documentation (qemu-doc.*).
== The ivshmem PCI device's guest interface ==
The device has vendor ID 1af4, device ID 1110, revision 0.
=== PCI BARs ===
The ivshmem PCI device has two or three BARs:
- BAR0 holds device registers (256 Byte MMIO)
- BAR1 holds MSI-X table and PBA (only when using MSI-X)
- BAR2 maps the shared memory object
There are two ways to use this device:
- If you only need the shared memory part, BAR2 suffices. This way,
you have access to the shared memory in the guest and can use it as
you see fit. Memnic, for example, uses ivshmem this way from guest
user space (see http://dpdk.org/browse/memnic).
- If you additionally need the capability for peers to interrupt each
other, you need BAR0 and, if using MSI-X, BAR1. You will most
likely want to write a kernel driver to handle interrupts. Requires
the device to be configured for interrupts, obviously.
If the device is configured for interrupts, BAR2 is initially invalid.
It becomes safely accessible only after the ivshmem server provided
the shared memory. Guest software should wait for the IVPosition
register (described below) to become non-negative before accessing
BAR2.
The device is not capable to tell guest software whether it is
configured for interrupts.
=== PCI device registers ===
BAR 0 contains the following registers:
Offset Size Access On reset Function
0 4 read/write 0 Interrupt Mask
bit 0: peer interrupt
bit 1..31: reserved
4 4 read/write 0 Interrupt Status
bit 0: peer interrupt
bit 1..31: reserved
8 4 read-only 0 or -1 IVPosition
12 4 write-only N/A Doorbell
bit 0..15: vector
bit 16..31: peer ID
16 240 none N/A reserved
Software should only access the registers as specified in column
"Access". Reserved bits should be ignored on read, and preserved on
write.
Interrupt Status and Mask Register together control the legacy INTx
interrupt when the device has no MSI-X capability: INTx is asserted
when the bit-wise AND of Status and Mask is non-zero and the device
has no MSI-X capability. Interrupt Status Register bit 0 becomes 1
when an interrupt request from a peer is received. Reading the
register clears it.
IVPosition Register: if the device is not configured for interrupts,
this is zero. Else, it's -1 for a short while after reset, then
changes to the device's ID (between 0 and 65535).
There is no good way for software to find out whether the device is
configured for interrupts. A positive IVPosition means interrupts,
but zero could be either. The initial -1 cannot be reliably observed.
Doorbell Register: writing this register requests to interrupt a peer.
The written value's high 16 bits are the ID of the peer to interrupt,
and its low 16 bits select an interrupt vector.
If the device is not configured for interrupts, the write is ignored.
If the interrupt hasn't completed setup, the write is ignored. The
device is not capable to tell guest software whether setup is
complete. Interrupts can regress to this state on migration.
If the peer with the requested ID isn't connected, or it has fewer
interrupt vectors connected, the write is ignored. The device is not
capable to tell guest software what peers are connected, or how many
interrupt vectors are connected.
If the peer doesn't use MSI-X, its Interrupt Status register is set to
1. This asserts INTx unless masked by the Interrupt Mask register.
The device is not capable to communicate the interrupt vector to guest
software then.
If the peer uses MSI-X, the interrupt for this vector becomes pending.
There is no way for software to clear the pending bit, and a polling
mode of operation is therefore impossible with MSI-X.
With multiple MSI-X vectors, different vectors can be used to indicate
different events have occurred. The semantics of interrupt vectors
are left to the application.
== Interrupt infrastructure ==
When configured for interrupts, the peers share eventfd objects in
addition to shared memory. The shared resources are managed by an
ivshmem server.
=== The ivshmem server ===
The server listens on a UNIX domain socket.
For each new client that connects to the server, the server
- picks an ID,
- creates eventfd file descriptors for the interrupt vectors,
- sends the ID and the file descriptor for the shared memory to the
new client,
- sends connect notifications for the new client to the other clients
(these contain file descriptors for sending interrupts),
- sends connect notifications for the other clients to the new client,
and
- sends interrupt setup messages to the new client (these contain file
descriptors for receiving interrupts).
When a client disconnects from the server, the server sends disconnect
notifications to the other clients.
The next section describes the protocol in detail.
If the server terminates without sending disconnect notifications for
its connected clients, the clients can elect to continue. They can
communicate with each other normally, but won't receive disconnect
notification on disconnect, and no new clients can connect. There is
no way for the clients to connect to a restarted server. The device
is not capable to tell guest software whether the server is still up.
Example server code is in contrib/ivshmem-server/. Not to be used in
production. It assumes all clients use the same number of interrupt
vectors.
A standalone client is in contrib/ivshmem-client/. It can be useful
for debugging.
=== The ivshmem Client-Server Protocol ===
An ivshmem device configured for interrupts connects to an ivshmem
server. This section details the protocol between the two.
The connection is one-way: the server sends messages to the client.
Each message consists of a single 8 byte little-endian signed number,
and may be accompanied by a file descriptor via SCM_RIGHTS. Both
client and server close the connection on error.
Note: QEMU currently doesn't close the connection right on error, but
only when the character device is destroyed.
On connect, the server sends the following messages in order:
1. The protocol version number, currently zero. The client should
close the connection on receipt of versions it can't handle.
2. The client's ID. This is unique among all clients of this server.
IDs must be between 0 and 65535, because the Doorbell register
provides only 16 bits for them.
3. The number -1, accompanied by the file descriptor for the shared
memory.
4. Connect notifications for existing other clients, if any. This is
a peer ID (number between 0 and 65535 other than the client's ID),
repeated N times. Each repetition is accompanied by one file
descriptor. These are for interrupting the peer with that ID using
vector 0,..,N-1, in order. If the client is configured for fewer
vectors, it closes the extra file descriptors. If it is configured
for more, the extra vectors remain unconnected.
5. Interrupt setup. This is the client's own ID, repeated N times.
Each repetition is accompanied by one file descriptor. These are
for receiving interrupts from peers using vector 0,..,N-1, in
order. If the client is configured for fewer vectors, it closes
the extra file descriptors. If it is configured for more, the
extra vectors remain unconnected.
From then on, the server sends these kinds of messages:
6. Connection / disconnection notification. This is a peer ID.
- If the number comes with a file descriptor, it's a connection
notification, exactly like in step 4.
- Else, it's a disconnection notification for the peer with that ID.
Known bugs:
* The protocol changed incompatibly in QEMU 2.5. Before, messages
were native endian long, and there was no version number.
* The protocol is poorly designed.
=== The ivshmem Client-Client Protocol ===
An ivshmem device configured for interrupts receives eventfd file
descriptors for interrupting peers and getting interrupted by peers
from the server, as explained in the previous section.
To interrupt a peer, the device writes the 8-byte integer 1 in native
byte order to the respective file descriptor.
To receive an interrupt, the device reads and discards as many 8-byte
integers as it can.
|