author Markus Armbruster <armbru@redhat.com> 2016-03-15 19:34:25 +0100
committer Markus Armbruster <armbru@redhat.com> 2016-03-21 21:28:59 +0100
commit fdee2025dd690b35099dd4d04eeef27c2bc1bc9c
tree 1d9ff8cff5a3b832e299b52685df1ee548e8cf43 /docs/specs/ivshmem-spec.txt
parent 41b65e5eda4364fa966cb7bbf693a1d0cb4e8e1e
ivshmem: Rewrite specification document
This started as an attempt to update ivshmem_device_spec.txt for
clarity, accuracy and completeness while working on its code, and
quickly became a full rewrite. Since the diff would be useless
anyway, I'm using the opportunity to rename the file to
ivshmem-spec.txt.

I tried hard to ensure the new text contradicts neither the old text
nor the code. If the new text contradicts the old text but not the
code, it's probably a bug in the old text. If the new text
contradicts both, it's probably a bug in the new text.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-Id: <1458066895-20632-11-git-send-email-armbru@redhat.com>
Diffstat (limited to 'docs/specs/ivshmem-spec.txt')
-rw-r--r-- docs/specs/ivshmem-spec.txt 243
1 files changed, 243 insertions, 0 deletions
diff --git a/docs/specs/ivshmem-spec.txt b/docs/specs/ivshmem-spec.txt
new file mode 100644
index 0000000000..0e9185a04b
--- /dev/null
+++ b/docs/specs/ivshmem-spec.txt
@@ -0,0 +1,243 @@
+= Device Specification for Inter-VM shared memory device =
+
+The Inter-VM shared memory device (ivshmem) is designed to share a
+memory region between multiple QEMU processes running different guests
+and the host. In order for all guests to be able to pick up the
+shared memory area, it is modeled by QEMU as a PCI device exposing
+said memory to the guest as a PCI BAR.
+
+The device can use a shared memory object on the host directly, or it
+can obtain one from an ivshmem server.
+
+In the latter case, the device can additionally interrupt its peers, and
+get interrupted by its peers.
+
+
+== Configuring the ivshmem PCI device ==
+
+There are two basic configurations:
+
+- Just shared memory: -device ivshmem,shm=NAME,...
+
+ This uses shared memory object NAME.
+
+- Shared memory plus interrupts: -device ivshmem,chardev=CHR,vectors=N,...
+
+ An ivshmem server must already be running on the host. The device
+ connects to the server's UNIX domain socket via character device
+ CHR.
+
+ Each peer gets assigned a unique ID by the server. IDs must be
+ between 0 and 65535.
+
+ Interrupts are message-signaled by default (MSI-X). With msi=off
+ the device has no MSI-X capability, and uses legacy INTx instead.
+ vectors=N configures the number of vectors to use.
+
+For more details on ivshmem device properties, see The QEMU Emulator
+User Documentation (qemu-doc.*).
+
+
+== The ivshmem PCI device's guest interface ==
+
+The device has vendor ID 1af4, device ID 1110, revision 0.
+
+=== PCI BARs ===
+
+The ivshmem PCI device has two or three BARs:
+
+- BAR0 holds device registers (256 Byte MMIO)
+- BAR1 holds MSI-X table and PBA (only when using MSI-X)
+- BAR2 maps the shared memory object
+
+There are two ways to use this device:
+
+- If you only need the shared memory part, BAR2 suffices. This way,
+ you have access to the shared memory in the guest and can use it as
+ you see fit. Memnic, for example, uses ivshmem this way from guest
+ user space (see http://dpdk.org/browse/memnic).
+
+- If you additionally need the capability for peers to interrupt each
+ other, you need BAR0 and, if using MSI-X, BAR1. You will most
+ likely want to write a kernel driver to handle interrupts. This
+ requires the device to be configured for interrupts.
+
+If the device is configured for interrupts, BAR2 is initially invalid.
+It becomes safely accessible only after the ivshmem server has
+provided the shared memory. Guest software should wait for the
+IVPosition register (described below) to become non-negative before
+accessing BAR2.
+
+The device cannot tell guest software whether it is configured for
+interrupts.
+
+=== PCI device registers ===
+
+BAR 0 contains the following registers:
+
+  Offset  Size  Access      On reset  Function
+      0     4   read/write        0   Interrupt Mask
+                                      bit 0: peer interrupt
+                                      bit 1..31: reserved
+      4     4   read/write        0   Interrupt Status
+                                      bit 0: peer interrupt
+                                      bit 1..31: reserved
+      8     4   read-only   0 or -1   IVPosition
+     12     4   write-only      N/A   Doorbell
+                                      bit 0..15: vector
+                                      bit 16..31: peer ID
+     16   240   none            N/A   reserved
+
+Software should only access the registers as specified in column
+"Access". Reserved bits should be ignored on read, and preserved on
+write.
+
+The Interrupt Status and Interrupt Mask registers together control the
+legacy INTx interrupt when the device has no MSI-X capability: INTx is
+then asserted while the bit-wise AND of Status and Mask is non-zero.
+Interrupt Status Register bit 0 becomes 1 when an interrupt request
+from a peer is received. Reading the register clears it.
+
+IVPosition Register: if the device is not configured for interrupts,
+this is zero. Else, it's -1 for a short while after reset, then
+changes to the device's ID (between 0 and 65535).
+
+There is no good way for software to find out whether the device is
+configured for interrupts. A positive IVPosition means interrupts,
+but zero could be either. The initial -1 cannot be reliably observed.
+
+Doorbell Register: writing this register requests to interrupt a peer.
+The written value's high 16 bits are the ID of the peer to interrupt,
+and its low 16 bits select an interrupt vector.
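As an illustration of the Doorbell layout just described, here is a small C sketch that composes the 32-bit value to write. The names IVSHMEM_REG_DOORBELL and ivshmem_doorbell_value are hypothetical; only the offset (12) and the bit layout come from the register table.

```c
#include <assert.h>
#include <stdint.h>

/* Offset of the Doorbell register within BAR0 (from the table above). */
enum { IVSHMEM_REG_DOORBELL = 12 };

/*
 * Hypothetical helper: build the Doorbell value.
 * Bits 16..31 carry the peer ID, bits 0..15 the interrupt vector.
 */
static uint32_t ivshmem_doorbell_value(uint32_t peer_id, uint32_t vector)
{
    assert(peer_id <= 0xffffu && vector <= 0xffffu);
    return (peer_id << 16) | vector;
}
```

A driver would write the result as a single 32-bit store to offset IVSHMEM_REG_DOORBELL of BAR0; mapping BAR0 is outside the scope of this sketch.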
+
+If the device is not configured for interrupts, the write is ignored.
+
+If interrupt setup hasn't completed, the write is ignored. The
+device cannot tell guest software whether setup is complete.
+Interrupts can regress to this state on migration.
+
+If the peer with the requested ID isn't connected, or it has fewer
+interrupt vectors connected, the write is ignored. The device cannot
+tell guest software what peers are connected, or how many interrupt
+vectors are connected.
+
+If the peer doesn't use MSI-X, its Interrupt Status register is set to
+1. This asserts INTx unless masked by the Interrupt Mask register.
+The device cannot communicate the interrupt vector to guest software
+then.
+
+If the peer uses MSI-X, the interrupt for this vector becomes pending.
+There is no way for software to clear the pending bit, and a polling
+mode of operation is therefore impossible with MSI-X.
+
+With multiple MSI-X vectors, different vectors can be used to indicate
+different events have occurred. The semantics of interrupt vectors
+are left to the application.
+
+
+== Interrupt infrastructure ==
+
+When configured for interrupts, the peers share eventfd objects in
+addition to shared memory. The shared resources are managed by an
+ivshmem server.
+
+=== The ivshmem server ===
+
+The server listens on a UNIX domain socket.
+
+For each new client that connects to the server, the server
+- picks an ID,
+- creates eventfd file descriptors for the interrupt vectors,
+- sends the ID and the file descriptor for the shared memory to the
+ new client,
+- sends connect notifications for the new client to the other clients
+ (these contain file descriptors for sending interrupts),
+- sends connect notifications for the other clients to the new client,
+ and
+- sends interrupt setup messages to the new client (these contain file
+ descriptors for receiving interrupts).
+
+When a client disconnects from the server, the server sends disconnect
+notifications to the other clients.
+
+The next section describes the protocol in detail.
+
+If the server terminates without sending disconnect notifications for
+its connected clients, the clients can elect to continue. They can
+communicate with each other normally, but won't receive disconnect
+notifications when a peer disconnects, and no new clients can connect.
+There is no way for the clients to connect to a restarted server. The
+device cannot tell guest software whether the server is still up.
+
+Example server code is in contrib/ivshmem-server/. Not to be used in
+production. It assumes all clients use the same number of interrupt
+vectors.
+
+A standalone client is in contrib/ivshmem-client/. It can be useful
+for debugging.
+
+=== The ivshmem Client-Server Protocol ===
+
+An ivshmem device configured for interrupts connects to an ivshmem
+server. This section details the protocol between the two.
+
+The connection is one-way: the server sends messages to the client.
+Each message consists of a single 8 byte little-endian signed number,
+and may be accompanied by a file descriptor via SCM_RIGHTS. Both
+client and server close the connection on error.
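To make the wire format concrete, the following C sketch decodes one message payload. It assumes the 8 bytes have already been received from the socket (receiving the optional file descriptor via SCM_RIGHTS is not shown). The function name ivshmem_decode_msg is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: decode an 8-byte little-endian signed number,
 * independent of the host's byte order.
 */
static int64_t ivshmem_decode_msg(const unsigned char buf[8])
{
    uint64_t u = 0;
    int64_t v;

    assert(buf != NULL);
    /* Assemble from the most significant byte (last on the wire) down. */
    for (int i = 7; i >= 0; i--) {
        u = (u << 8) | buf[i];
    }
    /* Reinterpret the bit pattern as signed without a narrowing cast. */
    memcpy(&v, &u, sizeof(v));
    return v;
}
```

Assembling the value byte by byte keeps the decoder correct on big-endian hosts as well, which matters because the format is fixed as little-endian.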
+
+On connect, the server sends the following messages in order:
+
+1. The protocol version number, currently zero. The client should
+ close the connection on receipt of versions it can't handle.
+
+2. The client's ID. This is unique among all clients of this server.
+ IDs must be between 0 and 65535, because the Doorbell register
+ provides only 16 bits for them.
+
+3. The number -1, accompanied by the file descriptor for the shared
+ memory.
+
+4. Connect notifications for existing other clients, if any. This is
+ a peer ID (number between 0 and 65535 other than the client's ID),
+ repeated N times. Each repetition is accompanied by one file
+ descriptor. These are for interrupting the peer with that ID using
+ vector 0,..,N-1, in order. If the client is configured for fewer
+ vectors, it closes the extra file descriptors. If it is configured
+ for more, the extra vectors remain unconnected.
+
+5. Interrupt setup. This is the client's own ID, repeated N times.
+ Each repetition is accompanied by one file descriptor. These are
+ for receiving interrupts from peers using vector 0,..,N-1, in
+ order. If the client is configured for fewer vectors, it closes
+ the extra file descriptors. If it is configured for more, the
+ extra vectors remain unconnected.
+
+From then on, the server sends these kinds of messages:
+
+6. Connection / disconnection notification. This is a peer ID.
+
+ - If the number comes with a file descriptor, it's a connection
+ notification, exactly like in step 4.
+
+ - Else, it's a disconnection notification for the peer with that ID.
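The message kinds in steps 3 to 6 can be told apart from the decoded number, the presence of a file descriptor, and the client's own ID. The C sketch below shows one way to do that; it applies only after the version and the client's ID (steps 1 and 2) have been consumed. All names are hypothetical, and treating combinations the protocol text does not describe (for example -1 without a file descriptor) as invalid is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum ivshmem_msg_kind {
    IVSHMEM_MSG_SHMEM,           /* -1 plus fd: shared memory (step 3) */
    IVSHMEM_MSG_PEER_CONNECT,    /* peer ID plus fd (steps 4 and 6) */
    IVSHMEM_MSG_IRQ_SETUP,       /* own ID plus fd (step 5) */
    IVSHMEM_MSG_PEER_DISCONNECT, /* peer ID without fd (step 6) */
    IVSHMEM_MSG_INVALID          /* anything else (assumption) */
};

/* Hypothetical classifier for messages following the client's ID. */
static enum ivshmem_msg_kind
ivshmem_classify(int64_t value, bool has_fd, int64_t own_id)
{
    assert(own_id >= 0 && own_id <= 65535);
    if (value == -1) {
        return has_fd ? IVSHMEM_MSG_SHMEM : IVSHMEM_MSG_INVALID;
    }
    if (value < 0 || value > 65535) {
        return IVSHMEM_MSG_INVALID;
    }
    if (value == own_id) {
        return has_fd ? IVSHMEM_MSG_IRQ_SETUP : IVSHMEM_MSG_INVALID;
    }
    return has_fd ? IVSHMEM_MSG_PEER_CONNECT : IVSHMEM_MSG_PEER_DISCONNECT;
}
```

The vector number is not carried in the message; a client would track it by counting how many file descriptors it has received so far for a given ID.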
+
+Known bugs:
+
+* The protocol changed incompatibly in QEMU 2.5. Before, messages
+ were native endian long, and there was no version number.
+
+* The protocol is poorly designed.
+
+=== The ivshmem Client-Client Protocol ===
+
+An ivshmem device configured for interrupts receives eventfd file
+descriptors for interrupting peers and getting interrupted by peers
+from the server, as explained in the previous section.
+
+To interrupt a peer, the device writes the 8-byte integer 1 in native
+byte order to the respective file descriptor.
+
+To receive an interrupt, the device reads and discards as many 8-byte
+integers as it can.
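The two operations above map directly onto write(2) and read(2) on an eventfd. The following C sketch is Linux-specific and illustrative only: the function names are hypothetical, and the receiving descriptor is assumed to be non-blocking so that the drain loop stops once nothing is pending.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical helper: interrupt a peer by writing the 8-byte integer 1. */
static int ivshmem_notify(int fd)
{
    uint64_t one = 1;

    assert(fd >= 0);
    return write(fd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Hypothetical helper: read and discard pending 8-byte integers.
 * Returns the sum of the values read, i.e. the number of notifications
 * consumed.  The descriptor must be non-blocking, otherwise the loop
 * would block once the counter is empty.
 */
static uint64_t ivshmem_drain(int fd)
{
    uint64_t total = 0;
    uint64_t count;

    assert(fd >= 0);
    while (read(fd, &count, sizeof(count)) == (ssize_t)sizeof(count)) {
        total += count;
    }
    return total;
}
```

Because an eventfd in its default mode accumulates writes into a single counter, several notifications sent before the receiver runs are typically collapsed into one read.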