diff options
Diffstat (limited to 'docs/interop/vhost-user.txt')
-rw-r--r-- | docs/interop/vhost-user.txt | 620 |
1 files changed, 620 insertions, 0 deletions
diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt new file mode 100644 index 0000000000..481ab56e35 --- /dev/null +++ b/docs/interop/vhost-user.txt @@ -0,0 +1,620 @@ +Vhost-user Protocol +=================== + +Copyright (c) 2014 Virtual Open Systems Sarl. + +This work is licensed under the terms of the GNU GPL, version 2 or later. +See the COPYING file in the top-level directory. +=================== + +This protocol is aiming to complement the ioctl interface used to control the +vhost implementation in the Linux kernel. It implements the control plane needed +to establish virtqueue sharing with a user space process on the same host. It +uses communication over a Unix domain socket to share file descriptors in the +ancillary data of the message. + +The protocol defines 2 sides of the communication, master and slave. Master is +the application that shares its virtqueues, in our case QEMU. Slave is the +consumer of the virtqueues. + +In the current implementation QEMU is the Master, and the Slave is intended to +be a software Ethernet switch running in user space, such as Snabbswitch. + +Master and slave can be either a client (i.e. connecting) or server (listening) +in the socket communication. + +Message Specification +--------------------- + +Note that all numbers are in the machine native byte order. A vhost-user message +consists of 3 header fields and a payload: + +------------------------------------ +| request | flags | size | payload | +------------------------------------ + + * Request: 32-bit type of the request + * Flags: 32-bit bit field: + - Lower 2 bits are the version (currently 0x01) + - Bit 2 is the reply flag - needs to be sent on each reply from the slave + - Bit 3 is the need_reply flag - see VHOST_USER_PROTOCOL_F_REPLY_ACK for + details. + * Size - 32-bit size of the payload + + +Depending on the request type, payload can be: + + * A single 64-bit integer + ------- + | u64 | + ------- + + u64: a 64-bit unsigned integer + + * A vring state description + --------------- + | index | num | + --------------- + + Index: a 32-bit index + Num: a 32-bit number + + * A vring address description + -------------------------------------------------------------- + | index | flags | size | descriptor | used | available | log | + -------------------------------------------------------------- + + Index: a 32-bit vring index + Flags: a 32-bit vring flags + Descriptor: a 64-bit user address of the vring descriptor table + Used: a 64-bit user address of the vring used ring + Available: a 64-bit user address of the vring available ring + Log: a 64-bit guest address for logging + + * Memory regions description + --------------------------------------------------- + | num regions | padding | region0 | ... | region7 | + --------------------------------------------------- + + Num regions: a 32-bit number of regions + Padding: 32-bit + + A region is: + ----------------------------------------------------- + | guest address | size | user address | mmap offset | + ----------------------------------------------------- + + Guest address: a 64-bit guest address of the region + Size: a 64-bit size + User address: a 64-bit user address + mmap offset: 64-bit offset where region starts in the mapped memory + +* Log description + --------------------------- + | log size | log offset | + --------------------------- + log size: size of area used for logging + log offset: offset from start of supplied file descriptor + where logging starts (i.e. where guest address 0 would be logged) + + * An IOTLB message + --------------------------------------------------------- + | iova | size | user address | permissions flags | type | + --------------------------------------------------------- + + IOVA: a 64-bit I/O virtual address programmed by the guest + Size: a 64-bit size + User address: a 64-bit user address + Permissions: a 8-bit value: + - 0: No access + - 1: Read access + - 2: Write access + - 3: Read/Write access + Type: a 8-bit IOTLB message type: + - 1: IOTLB miss + - 2: IOTLB update + - 3: IOTLB invalidate + - 4: IOTLB access fail + +In QEMU the vhost-user message is implemented with the following struct: + +typedef struct VhostUserMsg { + VhostUserRequest request; + uint32_t flags; + uint32_t size; + union { + uint64_t u64; + struct vhost_vring_state state; + struct vhost_vring_addr addr; + VhostUserMemory memory; + VhostUserLog log; + struct vhost_iotlb_msg iotlb; + }; +} QEMU_PACKED VhostUserMsg; + +Communication +------------- + +The protocol for vhost-user is based on the existing implementation of vhost +for the Linux Kernel. Most messages that can be sent via the Unix domain socket +implementing vhost-user have an equivalent ioctl to the kernel implementation. + +The communication consists of master sending message requests and slave sending +message replies. Most of the requests don't require replies. Here is a list of +the ones that do: + + * VHOST_USER_GET_FEATURES + * VHOST_USER_GET_PROTOCOL_FEATURES + * VHOST_USER_GET_VRING_BASE + * VHOST_USER_SET_LOG_BASE (if VHOST_USER_PROTOCOL_F_LOG_SHMFD) + +[ Also see the section on REPLY_ACK protocol extension. ] + +There are several messages that the master sends with file descriptors passed +in the ancillary data: + + * VHOST_USER_SET_MEM_TABLE + * VHOST_USER_SET_LOG_BASE (if VHOST_USER_PROTOCOL_F_LOG_SHMFD) + * VHOST_USER_SET_LOG_FD + * VHOST_USER_SET_VRING_KICK + * VHOST_USER_SET_VRING_CALL + * VHOST_USER_SET_VRING_ERR + * VHOST_USER_SET_SLAVE_REQ_FD + +If Master is unable to send the full message or receives a wrong reply it will +close the connection. An optional reconnection mechanism can be implemented. + +Any protocol extensions are gated by protocol feature bits, +which allows full backwards compatibility on both master +and slave. +As older slaves don't support negotiating protocol features, +a feature bit was dedicated for this purpose: +#define VHOST_USER_F_PROTOCOL_FEATURES 30 + +Starting and stopping rings +---------------------- +Client must only process each ring when it is started. + +Client must only pass data between the ring and the +backend, when the ring is enabled. + +If ring is started but disabled, client must process the +ring without talking to the backend. + +For example, for a networking device, in the disabled state +client must not supply any new RX packets, but must process +and discard any TX packets. + +If VHOST_USER_F_PROTOCOL_FEATURES has not been negotiated, the ring is initialized +in an enabled state. + +If VHOST_USER_F_PROTOCOL_FEATURES has been negotiated, the ring is initialized +in a disabled state. Client must not pass data to/from the backend until ring is enabled by +VHOST_USER_SET_VRING_ENABLE with parameter 1, or after it has been disabled by +VHOST_USER_SET_VRING_ENABLE with parameter 0. + +Each ring is initialized in a stopped state, client must not process it until +ring is started, or after it has been stopped. + +Client must start ring upon receiving a kick (that is, detecting that file +descriptor is readable) on the descriptor specified by +VHOST_USER_SET_VRING_KICK, and stop ring upon receiving +VHOST_USER_GET_VRING_BASE. + +While processing the rings (whether they are enabled or not), client must +support changing some configuration aspects on the fly. + +Multiple queue support +---------------------- + +Multiple queue is treated as a protocol extension, hence the slave has to +implement protocol features first. The multiple queues feature is supported +only when the protocol feature VHOST_USER_PROTOCOL_F_MQ (bit 0) is set. + +The max number of queues the slave supports can be queried with message +VHOST_USER_GET_PROTOCOL_FEATURES. Master should stop when the number of +requested queues is bigger than that. + +As all queues share one connection, the master uses a unique index for each +queue in the sent message to identify a specified queue. One queue pair +is enabled initially. More queues are enabled dynamically, by sending +message VHOST_USER_SET_VRING_ENABLE. + +Migration +--------- + +During live migration, the master may need to track the modifications +the slave makes to the memory mapped regions. The client should mark +the dirty pages in a log. Once it complies to this logging, it may +declare the VHOST_F_LOG_ALL vhost feature. + +To start/stop logging of data/used ring writes, server may send messages +VHOST_USER_SET_FEATURES with VHOST_F_LOG_ALL and VHOST_USER_SET_VRING_ADDR with +VHOST_VRING_F_LOG in ring's flags set to 1/0, respectively. + +All the modifications to memory pointed by vring "descriptor" should +be marked. Modifications to "used" vring should be marked if +VHOST_VRING_F_LOG is part of ring's flags. + +Dirty pages are of size: +#define VHOST_LOG_PAGE 0x1000 + +The log memory fd is provided in the ancillary data of +VHOST_USER_SET_LOG_BASE message when the slave has +VHOST_USER_PROTOCOL_F_LOG_SHMFD protocol feature. + +The size of the log is supplied as part of VhostUserMsg +which should be large enough to cover all known guest +addresses. Log starts at the supplied offset in the +supplied file descriptor. +The log covers from address 0 to the maximum of guest +regions. In pseudo-code, to mark page at "addr" as dirty: + +page = addr / VHOST_LOG_PAGE +log[page / 8] |= 1 << page % 8 + +Where addr is the guest physical address. + +Use atomic operations, as the log may be concurrently manipulated. + +Note that when logging modifications to the used ring (when VHOST_VRING_F_LOG +is set for this ring), log_guest_addr should be used to calculate the log +offset: the write to first byte of the used ring is logged at this offset from +log start. Also note that this value might be outside the legal guest physical +address range (i.e. does not have to be covered by the VhostUserMemory table), +but the bit offset of the last byte of the ring must fall within +the size supplied by VhostUserLog. + +VHOST_USER_SET_LOG_FD is an optional message with an eventfd in +ancillary data, it may be used to inform the master that the log has +been modified. + +Once the source has finished migration, rings will be stopped by +the source. No further update must be done before rings are +restarted. + +IOMMU support +------------- + +When the VIRTIO_F_IOMMU_PLATFORM feature has been negotiated, the master +sends IOTLB entries update & invalidation by sending VHOST_USER_IOTLB_MSG +requests to the slave with a struct vhost_iotlb_msg as payload. For update +events, the iotlb payload has to be filled with the update message type (2), +the I/O virtual address, the size, the user virtual address, and the +permissions flags. Addresses and size must be within vhost memory regions set +via the VHOST_USER_SET_MEM_TABLE request. For invalidation events, the iotlb +payload has to be filled with the invalidation message type (3), the I/O virtual +address and the size. On success, the slave is expected to reply with a zero +payload, non-zero otherwise. + +The slave relies on the slave communcation channel (see "Slave communication" +section below) to send IOTLB miss and access failure events, by sending +VHOST_USER_SLAVE_IOTLB_MSG requests to the master with a struct vhost_iotlb_msg +as payload. For miss events, the iotlb payload has to be filled with the miss +message type (1), the I/O virtual address and the permissions flags. For access +failure event, the iotlb payload has to be filled with the access failure +message type (4), the I/O virtual address and the permissions flags. +For synchronization purpose, the slave may rely on the reply-ack feature, +so the master may send a reply when operation is completed if the reply-ack +feature is negotiated and slaves requests a reply. For miss events, completed +operation means either master sent an update message containing the IOTLB entry +containing requested address and permission, or master sent nothing if the IOTLB +miss message is invalid (invalid IOVA or permission). + +The master isn't expected to take the initiative to send IOTLB update messages, +as the slave sends IOTLB miss messages for the guest virtual memory areas it +needs to access. + +Slave communication +------------------- + +An optional communication channel is provided if the slave declares +VHOST_USER_PROTOCOL_F_SLAVE_REQ protocol feature, to allow the slave to make +requests to the master. + +The fd is provided via VHOST_USER_SET_SLAVE_REQ_FD ancillary data. + +A slave may then send VHOST_USER_SLAVE_* messages to the master +using this fd communication channel. + +Protocol features +----------------- + +#define VHOST_USER_PROTOCOL_F_MQ 0 +#define VHOST_USER_PROTOCOL_F_LOG_SHMFD 1 +#define VHOST_USER_PROTOCOL_F_RARP 2 +#define VHOST_USER_PROTOCOL_F_REPLY_ACK 3 +#define VHOST_USER_PROTOCOL_F_MTU 4 +#define VHOST_USER_PROTOCOL_F_SLAVE_REQ 5 + +Master message types +-------------------- + + * VHOST_USER_GET_FEATURES + + Id: 1 + Equivalent ioctl: VHOST_GET_FEATURES + Master payload: N/A + Slave payload: u64 + + Get from the underlying vhost implementation the features bitmask. + Feature bit VHOST_USER_F_PROTOCOL_FEATURES signals slave support for + VHOST_USER_GET_PROTOCOL_FEATURES and VHOST_USER_SET_PROTOCOL_FEATURES. + + * VHOST_USER_SET_FEATURES + + Id: 2 + Ioctl: VHOST_SET_FEATURES + Master payload: u64 + + Enable features in the underlying vhost implementation using a bitmask. + Feature bit VHOST_USER_F_PROTOCOL_FEATURES signals slave support for + VHOST_USER_GET_PROTOCOL_FEATURES and VHOST_USER_SET_PROTOCOL_FEATURES. + + * VHOST_USER_GET_PROTOCOL_FEATURES + + Id: 15 + Equivalent ioctl: VHOST_GET_FEATURES + Master payload: N/A + Slave payload: u64 + + Get the protocol feature bitmask from the underlying vhost implementation. + Only legal if feature bit VHOST_USER_F_PROTOCOL_FEATURES is present in + VHOST_USER_GET_FEATURES. + Note: slave that reported VHOST_USER_F_PROTOCOL_FEATURES must support + this message even before VHOST_USER_SET_FEATURES was called. + + * VHOST_USER_SET_PROTOCOL_FEATURES + + Id: 16 + Ioctl: VHOST_SET_FEATURES + Master payload: u64 + + Enable protocol features in the underlying vhost implementation. + Only legal if feature bit VHOST_USER_F_PROTOCOL_FEATURES is present in + VHOST_USER_GET_FEATURES. + Note: slave that reported VHOST_USER_F_PROTOCOL_FEATURES must support + this message even before VHOST_USER_SET_FEATURES was called. + + * VHOST_USER_SET_OWNER + + Id: 3 + Equivalent ioctl: VHOST_SET_OWNER + Master payload: N/A + + Issued when a new connection is established. It sets the current Master + as an owner of the session. This can be used on the Slave as a + "session start" flag. + + * VHOST_USER_RESET_OWNER + + Id: 4 + Master payload: N/A + + This is no longer used. Used to be sent to request disabling + all rings, but some clients interpreted it to also discard + connection state (this interpretation would lead to bugs). + It is recommended that clients either ignore this message, + or use it to disable all rings. + + * VHOST_USER_SET_MEM_TABLE + + Id: 5 + Equivalent ioctl: VHOST_SET_MEM_TABLE + Master payload: memory regions description + + Sets the memory map regions on the slave so it can translate the vring + addresses. In the ancillary data there is an array of file descriptors + for each memory mapped region. The size and ordering of the fds matches + the number and ordering of memory regions. + + * VHOST_USER_SET_LOG_BASE + + Id: 6 + Equivalent ioctl: VHOST_SET_LOG_BASE + Master payload: u64 + Slave payload: N/A + + Sets logging shared memory space. + When slave has VHOST_USER_PROTOCOL_F_LOG_SHMFD protocol + feature, the log memory fd is provided in the ancillary data of + VHOST_USER_SET_LOG_BASE message, the size and offset of shared + memory area provided in the message. + + + * VHOST_USER_SET_LOG_FD + + Id: 7 + Equivalent ioctl: VHOST_SET_LOG_FD + Master payload: N/A + + Sets the logging file descriptor, which is passed as ancillary data. + + * VHOST_USER_SET_VRING_NUM + + Id: 8 + Equivalent ioctl: VHOST_SET_VRING_NUM + Master payload: vring state description + + Set the size of the queue. + + * VHOST_USER_SET_VRING_ADDR + + Id: 9 + Equivalent ioctl: VHOST_SET_VRING_ADDR + Master payload: vring address description + Slave payload: N/A + + Sets the addresses of the different aspects of the vring. + + * VHOST_USER_SET_VRING_BASE + + Id: 10 + Equivalent ioctl: VHOST_SET_VRING_BASE + Master payload: vring state description + + Sets the base offset in the available vring. + + * VHOST_USER_GET_VRING_BASE + + Id: 11 + Equivalent ioctl: VHOST_USER_GET_VRING_BASE + Master payload: vring state description + Slave payload: vring state description + + Get the available vring base offset. + + * VHOST_USER_SET_VRING_KICK + + Id: 12 + Equivalent ioctl: VHOST_SET_VRING_KICK + Master payload: u64 + + Set the event file descriptor for adding buffers to the vring. It + is passed in the ancillary data. + Bits (0-7) of the payload contain the vring index. Bit 8 is the + invalid FD flag. This flag is set when there is no file descriptor + in the ancillary data. This signals that polling should be used + instead of waiting for a kick. + + * VHOST_USER_SET_VRING_CALL + + Id: 13 + Equivalent ioctl: VHOST_SET_VRING_CALL + Master payload: u64 + + Set the event file descriptor to signal when buffers are used. It + is passed in the ancillary data. + Bits (0-7) of the payload contain the vring index. Bit 8 is the + invalid FD flag. This flag is set when there is no file descriptor + in the ancillary data. This signals that polling will be used + instead of waiting for the call. + + * VHOST_USER_SET_VRING_ERR + + Id: 14 + Equivalent ioctl: VHOST_SET_VRING_ERR + Master payload: u64 + + Set the event file descriptor to signal when error occurs. It + is passed in the ancillary data. + Bits (0-7) of the payload contain the vring index. Bit 8 is the + invalid FD flag. This flag is set when there is no file descriptor + in the ancillary data. + + * VHOST_USER_GET_QUEUE_NUM + + Id: 17 + Equivalent ioctl: N/A + Master payload: N/A + Slave payload: u64 + + Query how many queues the backend supports. This request should be + sent only when VHOST_USER_PROTOCOL_F_MQ is set in queried protocol + features by VHOST_USER_GET_PROTOCOL_FEATURES. + + * VHOST_USER_SET_VRING_ENABLE + + Id: 18 + Equivalent ioctl: N/A + Master payload: vring state description + + Signal slave to enable or disable corresponding vring. + This request should be sent only when VHOST_USER_F_PROTOCOL_FEATURES + has been negotiated. + + * VHOST_USER_SEND_RARP + + Id: 19 + Equivalent ioctl: N/A + Master payload: u64 + + Ask vhost user backend to broadcast a fake RARP to notify the migration + is terminated for guest that does not support GUEST_ANNOUNCE. + Only legal if feature bit VHOST_USER_F_PROTOCOL_FEATURES is present in + VHOST_USER_GET_FEATURES and protocol feature bit VHOST_USER_PROTOCOL_F_RARP + is present in VHOST_USER_GET_PROTOCOL_FEATURES. + The first 6 bytes of the payload contain the mac address of the guest to + allow the vhost user backend to construct and broadcast the fake RARP. + + * VHOST_USER_NET_SET_MTU + + Id: 20 + Equivalent ioctl: N/A + Master payload: u64 + + Set host MTU value exposed to the guest. + This request should be sent only when VIRTIO_NET_F_MTU feature has been + successfully negotiated, VHOST_USER_F_PROTOCOL_FEATURES is present in + VHOST_USER_GET_FEATURES and protocol feature bit + VHOST_USER_PROTOCOL_F_NET_MTU is present in + VHOST_USER_GET_PROTOCOL_FEATURES. + If VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated, slave must respond + with zero in case the specified MTU is valid, or non-zero otherwise. + + * VHOST_USER_SET_SLAVE_REQ_FD + + Id: 21 + Equivalent ioctl: N/A + Master payload: N/A + + Set the socket file descriptor for slave initiated requests. It is passed + in the ancillary data. + This request should be sent only when VHOST_USER_F_PROTOCOL_FEATURES + has been negotiated, and protocol feature bit VHOST_USER_PROTOCOL_F_SLAVE_REQ + bit is present in VHOST_USER_GET_PROTOCOL_FEATURES. + If VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated, slave must respond + with zero for success, non-zero otherwise. + + * VHOST_USER_IOTLB_MSG + + Id: 22 + Equivalent ioctl: N/A (equivalent to VHOST_IOTLB_MSG message type) + Master payload: struct vhost_iotlb_msg + Slave payload: u64 + + Send IOTLB messages with struct vhost_iotlb_msg as payload. + Master sends such requests to update and invalidate entries in the device + IOTLB. The slave has to acknowledge the request with sending zero as u64 + payload for success, non-zero otherwise. + This request should be send only when VIRTIO_F_IOMMU_PLATFORM feature + has been successfully negotiated. + +Slave message types +------------------- + + * VHOST_USER_SLAVE_IOTLB_MSG + + Id: 1 + Equivalent ioctl: N/A (equivalent to VHOST_IOTLB_MSG message type) + Slave payload: struct vhost_iotlb_msg + Master payload: N/A + + Send IOTLB messages with struct vhost_iotlb_msg as payload. + Slave sends such requests to notify of an IOTLB miss, or an IOTLB + access failure. If VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated, + and slave set the VHOST_USER_NEED_REPLY flag, master must respond with + zero when operation is successfully completed, or non-zero otherwise. + This request should be send only when VIRTIO_F_IOMMU_PLATFORM feature + has been successfully negotiated. + +VHOST_USER_PROTOCOL_F_REPLY_ACK: +------------------------------- +The original vhost-user specification only demands replies for certain +commands. This differs from the vhost protocol implementation where commands +are sent over an ioctl() call and block until the client has completed. + +With this protocol extension negotiated, the sender (QEMU) can set the +"need_reply" [Bit 3] flag to any command. This indicates that +the client MUST respond with a Payload VhostUserMsg indicating success or +failure. The payload should be set to zero on success or non-zero on failure, +unless the message already has an explicit reply body. + +The response payload gives QEMU a deterministic indication of the result +of the command. Today, QEMU is expected to terminate the main vhost-user +loop upon receiving such errors. In future, qemu could be taught to be more +resilient for selective requests. + +For the message types that already solicit a reply from the client, the +presence of VHOST_USER_PROTOCOL_F_REPLY_ACK or need_reply bit being set brings +no behavioural change. (See the 'Communication' section for details.) |