author | Markus Armbruster <armbru@redhat.com> | 2016-03-15 19:34:25 +0100 |
---|---|---|
committer | Markus Armbruster <armbru@redhat.com> | 2016-03-21 21:28:59 +0100 |
commit | fdee2025dd690b35099dd4d04eeef27c2bc1bc9c (patch) | |
tree | 1d9ff8cff5a3b832e299b52685df1ee548e8cf43 /docs/specs/ivshmem-spec.txt | |
parent | 41b65e5eda4364fa966cb7bbf693a1d0cb4e8e1e (diff) |
ivshmem: Rewrite specification document

This started as an attempt to update ivshmem_device_spec.txt for
clarity, accuracy and completeness while working on its code, and
quickly became a full rewrite. Since the diff would be useless
anyway, I'm using the opportunity to rename the file to
ivshmem-spec.txt.

I tried hard to ensure the new text contradicts neither the old text
nor the code. If the new text contradicts the old text but not the
code, it's probably a bug in the old text. If the new text
contradicts both, it's probably a bug in the new text.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-Id: <1458066895-20632-11-git-send-email-armbru@redhat.com>

Diffstat (limited to 'docs/specs/ivshmem-spec.txt')
-rw-r--r-- | docs/specs/ivshmem-spec.txt | 243 |
1 files changed, 243 insertions, 0 deletions
diff --git a/docs/specs/ivshmem-spec.txt b/docs/specs/ivshmem-spec.txt
new file mode 100644
index 0000000000..0e9185a04b
--- /dev/null
+++ b/docs/specs/ivshmem-spec.txt
@@ -0,0 +1,243 @@
+= Device Specification for Inter-VM shared memory device =
+
+The Inter-VM shared memory device (ivshmem) is designed to share a
+memory region between multiple QEMU processes running different guests
+and the host. In order for all guests to be able to pick up the
+shared memory area, it is modeled by QEMU as a PCI device exposing
+said memory to the guest as a PCI BAR.
+
+The device can use a shared memory object on the host directly, or it
+can obtain one from an ivshmem server.
+
+In the latter case, the device can additionally interrupt its peers, and
+get interrupted by its peers.
+
+
+== Configuring the ivshmem PCI device ==
+
+There are two basic configurations:
+
+- Just shared memory: -device ivshmem,shm=NAME,...
+
+  This uses shared memory object NAME.
+
+- Shared memory plus interrupts: -device ivshmem,chardev=CHR,vectors=N,...
+
+  An ivshmem server must already be running on the host. The device
+  connects to the server's UNIX domain socket via character device
+  CHR.
+
+  Each peer gets assigned a unique ID by the server. IDs must be
+  between 0 and 65535.
+
+  Interrupts are message-signaled by default (MSI-X). With msi=off
+  the device has no MSI-X capability, and uses legacy INTx instead.
+  vectors=N configures the number of vectors to use.
+
+For more details on ivshmem device properties, see The QEMU Emulator
+User Documentation (qemu-doc.*).
+
+
+== The ivshmem PCI device's guest interface ==
+
+The device has vendor ID 1af4, device ID 1110, revision 0.
+
+=== PCI BARs ===
+
+The ivshmem PCI device has two or three BARs:
+
+- BAR0 holds device registers (256 Byte MMIO)
+- BAR1 holds MSI-X table and PBA (only when using MSI-X)
+- BAR2 maps the shared memory object
+
+There are two ways to use this device:
+
+- If you only need the shared memory part, BAR2 suffices. This way,
+  you have access to the shared memory in the guest and can use it as
+  you see fit. Memnic, for example, uses ivshmem this way from guest
+  user space (see http://dpdk.org/browse/memnic).
+
+- If you additionally need the capability for peers to interrupt each
+  other, you need BAR0 and, if using MSI-X, BAR1. You will most
+  likely want to write a kernel driver to handle interrupts. Requires
+  the device to be configured for interrupts, obviously.
+
+If the device is configured for interrupts, BAR2 is initially invalid.
+It becomes safely accessible only after the ivshmem server provided
+the shared memory. Guest software should wait for the IVPosition
+register (described below) to become non-negative before accessing
+BAR2.
+
+The device is not capable to tell guest software whether it is
+configured for interrupts.
+
+=== PCI device registers ===
+
+BAR 0 contains the following registers:
+
+    Offset  Size  Access      On reset  Function
+        0     4   read/write        0   Interrupt Mask
+                                        bit 0: peer interrupt
+                                        bit 1..31: reserved
+        4     4   read/write        0   Interrupt Status
+                                        bit 0: peer interrupt
+                                        bit 1..31: reserved
+        8     4   read-only   0 or -1   IVPosition
+       12     4   write-only      N/A   Doorbell
+                                        bit 0..15: vector
+                                        bit 16..31: peer ID
+       16   240   none            N/A   reserved
+
+Software should only access the registers as specified in column
+"Access". Reserved bits should be ignored on read, and preserved on
+write.
+
+Interrupt Status and Mask Register together control the legacy INTx
+interrupt when the device has no MSI-X capability: INTx is asserted
+when the bit-wise AND of Status and Mask is non-zero and the device
+has no MSI-X capability. Interrupt Status Register bit 0 becomes 1
+when an interrupt request from a peer is received. Reading the
+register clears it.
+
+IVPosition Register: if the device is not configured for interrupts,
+this is zero. Else, it's -1 for a short while after reset, then
+changes to the device's ID (between 0 and 65535).
+
+There is no good way for software to find out whether the device is
+configured for interrupts. A positive IVPosition means interrupts,
+but zero could be either. The initial -1 cannot be reliably observed.
+
+Doorbell Register: writing this register requests to interrupt a peer.
+The written value's high 16 bits are the ID of the peer to interrupt,
+and its low 16 bits select an interrupt vector.
+
+If the device is not configured for interrupts, the write is ignored.
+
+If the interrupt hasn't completed setup, the write is ignored. The
+device is not capable to tell guest software whether setup is
+complete. Interrupts can regress to this state on migration.
+
+If the peer with the requested ID isn't connected, or it has fewer
+interrupt vectors connected, the write is ignored. The device is not
+capable to tell guest software what peers are connected, or how many
+interrupt vectors are connected.
+
+If the peer doesn't use MSI-X, its Interrupt Status register is set to
+1. This asserts INTx unless masked by the Interrupt Mask register.
+The device is not capable to communicate the interrupt vector to guest
+software then.
+
+If the peer uses MSI-X, the interrupt for this vector becomes pending.
+There is no way for software to clear the pending bit, and a polling
+mode of operation is therefore impossible with MSI-X.
+
+With multiple MSI-X vectors, different vectors can be used to indicate
+different events have occurred. The semantics of interrupt vectors
+are left to the application.
+
+
+== Interrupt infrastructure ==
+
+When configured for interrupts, the peers share eventfd objects in
+addition to shared memory. The shared resources are managed by an
+ivshmem server.
+
+=== The ivshmem server ===
+
+The server listens on a UNIX domain socket.
+
+For each new client that connects to the server, the server
+- picks an ID,
+- creates eventfd file descriptors for the interrupt vectors,
+- sends the ID and the file descriptor for the shared memory to the
+  new client,
+- sends connect notifications for the new client to the other clients
+  (these contain file descriptors for sending interrupts),
+- sends connect notifications for the other clients to the new client,
+  and
+- sends interrupt setup messages to the new client (these contain file
+  descriptors for receiving interrupts).
+
+When a client disconnects from the server, the server sends disconnect
+notifications to the other clients.
+
+The next section describes the protocol in detail.
+
+If the server terminates without sending disconnect notifications for
+its connected clients, the clients can elect to continue. They can
+communicate with each other normally, but won't receive disconnect
+notification on disconnect, and no new clients can connect. There is
+no way for the clients to connect to a restarted server. The device
+is not capable to tell guest software whether the server is still up.
+
+Example server code is in contrib/ivshmem-server/. Not to be used in
+production. It assumes all clients use the same number of interrupt
+vectors.
+
+A standalone client is in contrib/ivshmem-client/. It can be useful
+for debugging.
+
+=== The ivshmem Client-Server Protocol ===
+
+An ivshmem device configured for interrupts connects to an ivshmem
+server. This section details the protocol between the two.
+
+The connection is one-way: the server sends messages to the client.
+Each message consists of a single 8 byte little-endian signed number,
+and may be accompanied by a file descriptor via SCM_RIGHTS. Both
+client and server close the connection on error.
+
+On connect, the server sends the following messages in order:
+
+1. The protocol version number, currently zero. The client should
+   close the connection on receipt of versions it can't handle.
+
+2. The client's ID. This is unique among all clients of this server.
+   IDs must be between 0 and 65535, because the Doorbell register
+   provides only 16 bits for them.
+
+3. The number -1, accompanied by the file descriptor for the shared
+   memory.
+
+4. Connect notifications for existing other clients, if any. This is
+   a peer ID (number between 0 and 65535 other than the client's ID),
+   repeated N times. Each repetition is accompanied by one file
+   descriptor. These are for interrupting the peer with that ID using
+   vector 0,..,N-1, in order. If the client is configured for fewer
+   vectors, it closes the extra file descriptors. If it is configured
+   for more, the extra vectors remain unconnected.
+
+5. Interrupt setup. This is the client's own ID, repeated N times.
+   Each repetition is accompanied by one file descriptor. These are
+   for receiving interrupts from peers using vector 0,..,N-1, in
+   order. If the client is configured for fewer vectors, it closes
+   the extra file descriptors. If it is configured for more, the
+   extra vectors remain unconnected.
+
+From then on, the server sends these kinds of messages:
+
+6. Connection / disconnection notification. This is a peer ID.
+
+   - If the number comes with a file descriptor, it's a connection
+     notification, exactly like in step 4.
+
+   - Else, it's a disconnection notification for the peer with that ID.
+
+Known bugs:
+
+* The protocol changed incompatibly in QEMU 2.5. Before, messages
+  were native endian long, and there was no version number.
+
+* The protocol is poorly designed.
+
+=== The ivshmem Client-Client Protocol ===
+
+An ivshmem device configured for interrupts receives eventfd file
+descriptors for interrupting peers and getting interrupted by peers
+from the server, as explained in the previous section.
+
+To interrupt a peer, the device writes the 8-byte integer 1 in native
+byte order to the respective file descriptor.
+
+To receive an interrupt, the device reads and discards as many 8-byte
+integers as it can.
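
Illustration only, not part of the patch: the encodings the new specification describes can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not QEMU's implementation. All function names (decode_server_message, make_doorbell, split_doorbell, notify, drain) are hypothetical, and the file descriptors are assumed to have been received from the server already and, for drain, set non-blocking.

```python
import os
import struct


def decode_server_message(payload):
    """Decode one server message: a single 8-byte little-endian signed number."""
    if len(payload) != 8:
        raise ValueError("an ivshmem server message is exactly 8 bytes")
    (value,) = struct.unpack("<q", payload)
    return value


def make_doorbell(peer_id, vector):
    """Build a Doorbell value: peer ID in bits 16..31, vector in bits 0..15."""
    if not 0 <= peer_id <= 0xFFFF or not 0 <= vector <= 0xFFFF:
        raise ValueError("peer ID and vector must each fit in 16 bits")
    return (peer_id << 16) | vector


def split_doorbell(value):
    """Split a Doorbell value into (peer ID, vector)."""
    return (value >> 16) & 0xFFFF, value & 0xFFFF


def notify(fd):
    """Interrupt a peer: write the 8-byte integer 1 in native byte order."""
    os.write(fd, struct.pack("=Q", 1))


def drain(fd):
    """Read and discard 8-byte integers until a read would block.

    Assumes fd is non-blocking. Returns the number of successful reads.
    On a real eventfd one read returns the accumulated counter and resets
    it, so this is usually 1 however many notifications were pending.
    """
    reads = 0
    while True:
        try:
            chunk = os.read(fd, 8)
        except BlockingIOError:
            return reads
        if not chunk:  # end of file on a stand-in descriptor such as a pipe
            return reads
        reads += 1
```

A device model following the text would split the value written to the Doorbell register with split_doorbell, look up the file descriptor it holds for that peer and vector, and call notify on it; if no such descriptor is connected, the write is ignored. The receiving side calls drain when its descriptor becomes readable.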