Diffstat (limited to 'docs')
-rw-r--r-- | docs/devel/multiple-iothreads.txt | 4
-rw-r--r-- | docs/index.rst | 1
-rw-r--r-- | docs/interop/index.rst | 1
-rw-r--r-- | docs/interop/vhost-user-gpu.rst | 242
-rw-r--r-- | docs/interop/vhost-user.rst | 9
-rw-r--r-- | docs/specs/index.rst | 13
-rw-r--r-- | docs/specs/ppc-spapr-xive.rst | 174
-rw-r--r-- | docs/specs/ppc-xive.rst | 199
8 files changed, 641 insertions, 2 deletions
diff --git a/docs/devel/multiple-iothreads.txt b/docs/devel/multiple-iothreads.txt index 4f9012d154..aeb997bed5 100644 --- a/docs/devel/multiple-iothreads.txt +++ b/docs/devel/multiple-iothreads.txt @@ -109,7 +109,7 @@ The AioContext originates from the QEMU block layer, even though nowadays AioContext is a generic event loop that can be used by any QEMU subsystem. The block layer has support for AioContext integrated. Each BlockDriverState -is associated with an AioContext using bdrv_set_aio_context() and +is associated with an AioContext using bdrv_try_set_aio_context() and bdrv_get_aio_context(). This allows block layer code to process I/O inside the right AioContext. Other subsystems may wish to follow a similar approach. @@ -134,5 +134,5 @@ Long-running jobs (usually in the form of coroutines) are best scheduled in the BlockDriverState's AioContext to avoid the need to acquire/release around each bdrv_*() call. The functions bdrv_add/remove_aio_context_notifier, or alternatively blk_add/remove_aio_context_notifier if you use BlockBackends, -can be used to get a notification whenever bdrv_set_aio_context() moves a +can be used to get a notification whenever bdrv_try_set_aio_context() moves a BlockDriverState to a different AioContext. diff --git a/docs/index.rst b/docs/index.rst index 3690955dd1..baa5791c17 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -12,4 +12,5 @@ Welcome to QEMU's documentation! 
interop/index devel/index + specs/index diff --git a/docs/interop/index.rst b/docs/interop/index.rst index a037bd67ec..b4bfcab417 100644 --- a/docs/interop/index.rst +++ b/docs/interop/index.rst @@ -16,3 +16,4 @@ Contents: live-block-operations pr-helper vhost-user + vhost-user-gpu diff --git a/docs/interop/vhost-user-gpu.rst b/docs/interop/vhost-user-gpu.rst new file mode 100644 index 0000000000..688f8b4259 --- /dev/null +++ b/docs/interop/vhost-user-gpu.rst @@ -0,0 +1,242 @@ +======================= +Vhost-user-gpu Protocol +======================= + +:Licence: This work is licensed under the terms of the GNU GPL, + version 2 or later. See the COPYING file in the top-level + directory. + +.. contents:: Table of Contents + +Introduction +============ + +The vhost-user-gpu protocol is aiming at sharing the rendering result +of a virtio-gpu, done from a vhost-user slave process to a vhost-user +master process (such as QEMU). It bears a resemblance to a display +server protocol, if you consider QEMU as the display server and the +slave as the client, but in a very limited way. Typically, it will +work by setting a scanout/display configuration, before sending flush +events for the display updates. It will also update the cursor shape +and position. + +The protocol is sent over a UNIX domain stream socket, since it uses +socket ancillary data to share opened file descriptors (DMABUF fds or +shared memory). The socket is usually obtained via +``VHOST_USER_GPU_SET_SOCKET``. + +Requests are sent by the *slave*, and the optional replies by the +*master*. + +Wire format +=========== + +Unless specified differently, numbers are in the machine native byte +order. + +A vhost-user-gpu message (request and reply) consists of 3 header +fields and a payload. 
+ ++---------+-------+------+---------+ +| request | flags | size | payload | ++---------+-------+------+---------+ + +Header +------ + +:request: ``u32``, type of the request + +:flags: ``u32``, 32-bit bit field: + + - Bit 2 is the reply flag - needs to be set on each reply + +:size: ``u32``, size of the payload + +Payload types +------------- + +Depending on the request type, **payload** can be: + +VhostUserGpuCursorPos +^^^^^^^^^^^^^^^^^^^^^ + ++------------+---+---+ +| scanout-id | x | y | ++------------+---+---+ + +:scanout-id: ``u32``, the scanout where the cursor is located + +:x/y: ``u32``, the cursor position + +VhostUserGpuCursorUpdate +^^^^^^^^^^^^^^^^^^^^^^^^ + ++-----+-------+-------+--------+ +| pos | hot_x | hot_y | cursor | ++-----+-------+-------+--------+ + +:pos: a ``VhostUserGpuCursorPos``, the cursor location + +:hot_x/hot_y: ``u32``, the cursor hot location + +:cursor: ``[u32; 64 * 64]``, 64x64 RGBA cursor data (PIXMAN_a8r8g8b8 format) + +VhostUserGpuScanout +^^^^^^^^^^^^^^^^^^^ + ++------------+---+---+ +| scanout-id | w | h | ++------------+---+---+ + +:scanout-id: ``u32``, the scanout configuration to set + +:w/h: ``u32``, the scanout width/height size + +VhostUserGpuUpdate +^^^^^^^^^^^^^^^^^^ + ++------------+---+---+---+---+------+ +| scanout-id | x | y | w | h | data | ++------------+---+---+---+---+------+ + +:scanout-id: ``u32``, the scanout content to update + +:x/y/w/h: ``u32``, region of the update + +:data: RGB data (PIXMAN_x8r8g8b8 format) + +VhostUserGpuDMABUFScanout +^^^^^^^^^^^^^^^^^^^^^^^^^ + ++------------+---+---+---+---+-----+-----+--------+-------+--------+ +| scanout-id | x | y | w | h | fdw | fdh | stride | flags | fourcc | ++------------+---+---+---+---+-----+-----+--------+-------+--------+ + +:scanout-id: ``u32``, the scanout configuration to set + +:x/y: ``u32``, the location of the scanout within the DMABUF + +:w/h: ``u32``, the scanout width/height size + +:fdw/fdh/stride/flags: ``u32``, the DMABUF
width/height/stride/flags + +:fourcc: ``i32``, the DMABUF fourcc + + +C structure +----------- + +In QEMU the vhost-user-gpu message is implemented with the following struct: + +.. code:: c + + typedef struct VhostUserGpuMsg { + uint32_t request; /* VhostUserGpuRequest */ + uint32_t flags; + uint32_t size; /* the following payload size */ + union { + VhostUserGpuCursorPos cursor_pos; + VhostUserGpuCursorUpdate cursor_update; + VhostUserGpuScanout scanout; + VhostUserGpuUpdate update; + VhostUserGpuDMABUFScanout dmabuf_scanout; + struct virtio_gpu_resp_display_info display_info; + uint64_t u64; + } payload; + } QEMU_PACKED VhostUserGpuMsg; + +Protocol features +----------------- + +None yet. + +As the protocol may need to evolve, new messages and communication +changes are negotiated thanks to preliminary +``VHOST_USER_GPU_GET_PROTOCOL_FEATURES`` and +``VHOST_USER_GPU_SET_PROTOCOL_FEATURES`` requests. + +Communication +============= + +Message types +------------- + +``VHOST_USER_GPU_GET_PROTOCOL_FEATURES`` + :id: 1 + :request payload: N/A + :reply payload: ``u64`` + + Get the supported protocol features bitmask. + +``VHOST_USER_GPU_SET_PROTOCOL_FEATURES`` + :id: 2 + :request payload: ``u64`` + :reply payload: N/A + + Enable protocol features using a bitmask. + +``VHOST_USER_GPU_GET_DISPLAY_INFO`` + :id: 3 + :request payload: N/A + :reply payload: ``struct virtio_gpu_resp_display_info`` (from virtio specification) + + Get the preferred display configuration. + +``VHOST_USER_GPU_CURSOR_POS`` + :id: 4 + :request payload: ``VhostUserGpuCursorPos`` + :reply payload: N/A + + Set/show the cursor position. + +``VHOST_USER_GPU_CURSOR_POS_HIDE`` + :id: 5 + :request payload: ``VhostUserGpuCursorPos`` + :reply payload: N/A + + Set/hide the cursor. + +``VHOST_USER_GPU_CURSOR_UPDATE`` + :id: 6 + :request payload: ``VhostUserGpuCursorUpdate`` + :reply payload: N/A + + Update the cursor shape and location. 
+ +``VHOST_USER_GPU_SCANOUT`` + :id: 7 + :request payload: ``VhostUserGpuScanout`` + :reply payload: N/A + + Set the scanout resolution. To disable a scanout, the dimensions + width/height are set to 0. + +``VHOST_USER_GPU_UPDATE`` + :id: 8 + :request payload: ``VhostUserGpuUpdate`` + :reply payload: N/A + + Update the scanout content. The data payload contains the graphical bits. + The display should be flushed and presented. + +``VHOST_USER_GPU_DMABUF_SCANOUT`` + :id: 9 + :request payload: ``VhostUserGpuDMABUFScanout`` + :reply payload: N/A + + Set the scanout resolution/configuration, and share a DMABUF file + descriptor for the scanout content, which is passed as ancillary + data. To disable a scanout, the dimensions width/height are set + to 0, there is no file descriptor passed. + +``VHOST_USER_GPU_DMABUF_UPDATE`` + :id: 10 + :request payload: ``VhostUserGpuUpdate`` + :reply payload: empty payload + + The display should be flushed and presented according to updated + region from ``VhostUserGpuUpdate``. + + Note: there is no data payload, since the scanout is shared thanks + to DMABUF, that must have been set previously with + ``VHOST_USER_GPU_DMABUF_SCANOUT``. diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst index 7f3232c798..dc0ff9211f 100644 --- a/docs/interop/vhost-user.rst +++ b/docs/interop/vhost-user.rst @@ -1163,6 +1163,15 @@ Master message types send the shared inflight buffer back to slave so that slave could get inflight I/O after a crash or restart. +``VHOST_USER_GPU_SET_SOCKET`` + :id: 33 + :equivalent ioctl: N/A + :master payload: N/A + + Sets the GPU protocol socket file descriptor, which is passed as + ancillary data. The GPU protocol is used to inform the master of + rendering state and updates. See vhost-user-gpu.rst for details. 
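Reviewer note on the vhost-user-gpu.rst hunk above: the wire format it describes (three ``u32`` header fields, with bit 2 of ``flags`` marking a reply) can be sketched in C as follows. This is a minimal illustration, not QEMU code: the ``Example*`` and ``example_*`` names are hypothetical, bits are assumed to be numbered from 0 at the least significant end, and echoing the request id in the reply is an assumption that the text does not state.

```c
#include <assert.h>
#include <stdint.h>

/* Bit 2 of the flags field marks a reply (assumes bit 0 is the LSB). */
#define EXAMPLE_GPU_FLAG_REPLY (UINT32_C(1) << 2)

/* A few request ids copied from the "Message types" list. */
enum {
    EXAMPLE_GPU_GET_PROTOCOL_FEATURES = 1,
    EXAMPLE_GPU_GET_DISPLAY_INFO = 3,
    EXAMPLE_GPU_CURSOR_POS = 4,
};

/* The three u32 header fields that precede every payload. */
typedef struct ExampleGpuHeader {
    uint32_t request; /* type of the request */
    uint32_t flags;   /* bit field, see EXAMPLE_GPU_FLAG_REPLY */
    uint32_t size;    /* size of the payload that follows */
} ExampleGpuHeader;

/* Build a request header: no flags set, payload size as given. */
static ExampleGpuHeader example_gpu_request(uint32_t request, uint32_t size)
{
    ExampleGpuHeader hdr = { request, 0, size };
    return hdr;
}

/*
 * Build a reply header for a request. Reusing the request id is an
 * assumption of this sketch; the reply flag is the documented part.
 */
static ExampleGpuHeader example_gpu_reply(const ExampleGpuHeader *req,
                                          uint32_t size)
{
    ExampleGpuHeader hdr = {
        req->request, req->flags | EXAMPLE_GPU_FLAG_REPLY, size
    };
    return hdr;
}

/* Return non-zero when the header carries the reply flag. */
static int example_gpu_is_reply(const ExampleGpuHeader *hdr)
{
    return (hdr->flags & EXAMPLE_GPU_FLAG_REPLY) != 0;
}
```

A receiver would read the 12-byte header first, then read ``size`` more bytes of payload. Numbers are in machine native byte order unless specified differently, so the sketch does no byte swapping.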
+ Slave message types ------------------- diff --git a/docs/specs/index.rst b/docs/specs/index.rst new file mode 100644 index 0000000000..2e927519c2 --- /dev/null +++ b/docs/specs/index.rst @@ -0,0 +1,13 @@ +.. This is the top level page for the 'specs' manual + + +QEMU full-system emulation guest hardware specifications +======================================================== + + +Contents: + +.. toctree:: + :maxdepth: 2 + + xive diff --git a/docs/specs/ppc-spapr-xive.rst b/docs/specs/ppc-spapr-xive.rst new file mode 100644 index 0000000000..539ce7ca4e --- /dev/null +++ b/docs/specs/ppc-spapr-xive.rst @@ -0,0 +1,174 @@ +XIVE for sPAPR (pseries machines) +================================= + +The POWER9 processor comes with a new interrupt controller +architecture, called XIVE as "eXternal Interrupt Virtualization +Engine". It supports a larger number of interrupt sources and offers +virtualization features which enable the HW to deliver interrupts +directly to virtual processors without hypervisor assistance. + +A QEMU ``pseries`` machine (which is PAPR compliant) using POWER9 +processors can run under two interrupt modes: + +- *Legacy Compatibility Mode* + + the hypervisor provides identical interfaces and similar + functionality to PAPR+ Version 2.7. This is the default mode. + + It is also referred to as *XICS* in QEMU. + +- *XIVE native exploitation mode* + + the hypervisor provides new interfaces to manage the XIVE control + structures, and provides direct control for interrupt management + through MMIO pages. + +Which interrupt modes can be used by the machine is negotiated with +the guest O/S during the Client Architecture Support negotiation +sequence. The two modes are mutually exclusive. + +Both interrupt modes share the same IRQ number space. See below for the +layout.
+ +CAS Negotiation +--------------- + +QEMU advertises the supported interrupt modes in the device tree +property "ibm,arch-vec-5-platform-support" in byte 23 and the OS +Selection for XIVE is indicated in the "ibm,architecture-vec-5" +property byte 23. + +The interrupt modes supported by the machine depend on the CPU type +(POWER9 is required for XIVE) but also on the machine property +``ic-mode`` which can be set on the command line. It can take the +following values: ``xics``, ``xive``, ``dual`` and currently ``xics`` +is the default but it may change in the future. + +The chosen interrupt mode is activated after a reconfiguration done +in a machine reset. + +XIVE Device tree properties +--------------------------- + +The properties for the PAPR interrupt controller node when the *XIVE +native exploitation mode* is selected should contain: + +- ``device_type`` + + value should be "power-ivpe". + +- ``compatible`` + + value should be "ibm,power-ivpe". + +- ``reg`` + + contains the base address and size of the thread interrupt + management areas (TIMA), for the User level and for the Guest OS + level. Only the Guest OS level is taken into account today. + +- ``ibm,xive-eq-sizes`` + + the size of the event queues. One cell per size supported, contains + log2 of size, in ascending order. + +- ``ibm,xive-lisn-ranges`` + + the IRQ interrupt number ranges assigned to the guest for the IPIs. + +The root node also exports: + +- ``ibm,plat-res-int-priorities`` + + contains a list of priorities that the hypervisor has reserved for + its own use. + +IRQ number space +---------------- + +IRQ Number space of the ``pseries`` machine is 8K wide and is the same +for both interrupt modes. The different ranges are defined as follows: + +- ``0x0000 .. 0x0FFF`` 4K CPU IPIs (only used under XIVE) +- ``0x1000 .. 0x1000`` 1 EPOW +- ``0x1001 .. 0x1001`` 1 HOTPLUG +- ``0x1100 .. 0x11FF`` 256 VIO devices +- ``0x1200 .. 0x127F`` 32 PHBs devices +- ``0x1280 ..
0x12FF`` unused +- ``0x1300 .. 0x1FFF`` PHB MSIs + +Monitoring XIVE +--------------- + +The state of the XIVE interrupt controller can be queried through the +monitor commands ``info pic``. The output comes in two parts. + +First, the state of the thread interrupt context registers is dumped +for each CPU : + +:: + + (qemu) info pic + CPU[0000]: QW NSR CPPR IPB LSMFB ACK# INC AGE PIPR W2 + CPU[0000]: USER 00 00 00 00 00 00 00 00 00000000 + CPU[0000]: OS 00 ff 00 00 ff 00 ff ff 80000400 + CPU[0000]: POOL 00 00 00 00 00 00 00 00 00000000 + CPU[0000]: PHYS 00 00 00 00 00 00 00 ff 00000000 + ... + +In the case of a ``pseries`` machine, QEMU acts as the hypervisor and only +the O/S and USER register rings make sense. ``W2`` contains the vCPU CAM +line which is set to the VP identifier. + +Then comes the routing information which aggregates the EAS and the +END configuration: + +:: + + ... + LISN PQ EISN CPU/PRIO EQ + 00000000 MSI -- 00000010 0/6 380/16384 @1fe3e0000 ^1 [ 80000010 ... ] + 00000001 MSI -- 00000010 1/6 305/16384 @1fc230000 ^1 [ 80000010 ... ] + 00000002 MSI -- 00000010 2/6 220/16384 @1fc2f0000 ^1 [ 80000010 ... ] + 00000003 MSI -- 00000010 3/6 201/16384 @1fc390000 ^1 [ 80000010 ... ] + 00000004 MSI -Q M 00000000 + 00000005 MSI -Q M 00000000 + 00000006 MSI -Q M 00000000 + 00000007 MSI -Q M 00000000 + 00001000 MSI -- 00000012 0/6 380/16384 @1fe3e0000 ^1 [ 80000010 ... ] + 00001001 MSI -- 00000013 0/6 380/16384 @1fe3e0000 ^1 [ 80000010 ... ] + 00001100 MSI -- 00000100 1/6 305/16384 @1fc230000 ^1 [ 80000010 ... ] + 00001101 MSI -Q M 00000000 + 00001200 LSI -Q M 00000000 + 00001201 LSI -Q M 00000000 + 00001202 LSI -Q M 00000000 + 00001203 LSI -Q M 00000000 + 00001300 MSI -- 00000102 1/6 305/16384 @1fc230000 ^1 [ 80000010 ... ] + 00001301 MSI -- 00000103 2/6 220/16384 @1fc2f0000 ^1 [ 80000010 ... ] + 00001302 MSI -- 00000104 3/6 201/16384 @1fc390000 ^1 [ 80000010 ... 
] + +The source information and configuration: + +- The ``LISN`` column outputs the interrupt number of the source in + range ``[ 0x0 ... 0x1FFF ]`` and its type : ``MSI`` or ``LSI`` +- The ``PQ`` column reflects the state of the PQ bits of the source : + + - ``--`` source is ready to take events + - ``P-`` an event was sent and an EOI is PENDING + - ``PQ`` an event was QUEUED + - ``-Q`` source is OFF + + a ``M`` indicates that source is *MASKED* at the EAS level, + +The targeting configuration : + +- The ``EISN`` column is the event data that will be queued in the event + queue of the O/S. +- The ``CPU/PRIO`` column is the tuple defining the CPU number and + priority queue serving the source. +- The ``EQ`` column outputs : + + - the current index of the event queue/ the max number of entries + - the O/S event queue address + - the toggle bit + - the last entries that were pushed in the event queue. diff --git a/docs/specs/ppc-xive.rst b/docs/specs/ppc-xive.rst new file mode 100644 index 0000000000..b997dc0629 --- /dev/null +++ b/docs/specs/ppc-xive.rst @@ -0,0 +1,199 @@ +================================ +POWER9 XIVE interrupt controller +================================ + +The POWER9 processor comes with a new interrupt controller +architecture, called XIVE as "eXternal Interrupt Virtualization +Engine". + +Compared to the previous architecture, the main characteristics of +XIVE are to support a larger number of interrupt sources and to +deliver interrupts directly to virtual processors without hypervisor +assistance. This removes the context switches required for the +delivery process. + + +XIVE architecture +================= + +The XIVE IC is composed of three sub-engines, each taking care of a +processing layer of external interrupts: + +- Interrupt Virtualization Source Engine (IVSE), or Source Controller + (SC). 
These are found in PCI PHBs, in the PSI host bridge + controller, but also inside the main controller for the core IPIs + and other sub-chips (NX, CAP, NPU) of the chip/processor. They are + configured to feed the IVRE with events. +- Interrupt Virtualization Routing Engine (IVRE) or Virtualization + Controller (VC). It handles event coalescing and perform interrupt + routing by matching an event source number with an Event + Notification Descriptor (END). +- Interrupt Virtualization Presentation Engine (IVPE) or Presentation + Controller (PC). It maintains the interrupt context state of each + thread and handles the delivery of the external interrupt to the + thread. + +:: + + XIVE Interrupt Controller + +------------------------------------+ IPIs + | +---------+ +---------+ +--------+ | +-------+ + | |IVRE | |Common Q | |IVPE |----> | CORES | + | | esb | | | | |----> | | + | | eas | | Bridge | | tctx |----> | | + | |SC end | | | | nvt | | | | + +------+ | +---------+ +----+----+ +--------+ | +-+-+-+-+ + | RAM | +------------------|-----------------+ | | | + | | | | | | + | | | | | | + | | +--------------------v------------------------v-v-v--+ other + | <--+ Power Bus +--> chips + | esb | +---------+-----------------------+------------------+ + | eas | | | + | end | +--|------+ | + | nvt | +----+----+ | +----+----+ + +------+ |IVSE | | |IVSE | + | | | | | + | PQ-bits | | | PQ-bits | + | local |-+ | in VC | + +---------+ +---------+ + PCIe NX,NPU,CAPI + + + PQ-bits: 2 bits source state machine (P:pending Q:queued) + esb: Event State Buffer (Array of PQ bits in an IVSE) + eas: Event Assignment Structure + end: Event Notification Descriptor + nvt: Notification Virtual Target + tctx: Thread interrupt Context registers + + + +XIVE internal tables +-------------------- + +Each of the sub-engines uses a set of tables to redirect interrupts +from event sources to CPU threads. + +:: + + +-------+ + User or O/S | EQ | + or +------>|entries| + Hypervisor | | .. 
| + Memory | +-------+ + | ^ + | | + +-------------------------------------------------+ + | | + Hypervisor +------+ +---+--+ +---+--+ +------+ + Memory | ESB | | EAT | | ENDT | | NVTT | + (skiboot) +----+-+ +----+-+ +----+-+ +------+ + ^ | ^ | ^ | ^ + | | | | | | | + +-------------------------------------------------+ + | | | | | | | + | | | | | | | + +----|--|--------|--|--------|--|-+ +-|-----+ +------+ + | | | | | | | | | | tctx| |Thread| + IPI or ---+ + v + v + v |---| + .. |-----> | + HW events | | | | | | + | IVRE | | IVPE | +------+ + +---------------------------------+ +-------+ + + +The IVSE have a 2-bits state machine, P for pending and Q for queued, +for each source that allows events to be triggered. They are stored in +an Event State Buffer (ESB) array and can be controlled by MMIOs. + +If the event is let through, the IVRE looks up in the Event Assignment +Structure (EAS) table for an Event Notification Descriptor (END) +configured for the source. Each Event Notification Descriptor defines +a notification path to a CPU and an in-memory Event Queue, in which +will be enqueued an EQ data for the O/S to pull. + +The IVPE determines if a Notification Virtual Target (NVT) can handle +the event by scanning the thread contexts of the VCPUs dispatched on +the processor HW threads. It maintains the interrupt context state of +each thread in a NVT table. + +XIVE thread interrupt context +----------------------------- + +The XIVE presenter can generate four different exceptions to its +HW threads: + +- hypervisor exception +- O/S exception +- Event-Based Branch (user level) +- msgsnd (doorbell) + +Each exception has a state independent from the others called a Thread +Interrupt Management context. This context is a set of registers which +lets the thread handle priority management and interrupt +acknowledgment among other things. 
The most important ones are: + +- Pending Interrupt Priority Register (PIPR) +- Interrupt Pending Buffer (IPB) +- Current Processor Priority Register (CPPR) +- Notification Source Register (NSR) + +TIMA +~~~~ + +The Thread Interrupt Management registers are accessible through a +specific MMIO region, called the Thread Interrupt Management Area +(TIMA), four aligned pages, each exposing a different view of the +registers. First page (page address ending in ``0b00``) gives access +to the entire context and is reserved for the ring 0 view for the +physical thread context. The second (page address ending in ``0b01``) +is for the hypervisor, ring 1 view. The third (page address ending in +``0b10``) is for the operating system, ring 2 view. The fourth (page +address ending in ``0b11``) is for user level, ring 3 view. + +Interrupt flow from an O/S perspective +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +After an event data has been enqueued in the O/S Event Queue, the IVPE +raises the bit corresponding to the priority of the pending interrupt +in the register IPB (Interrupt Pending Buffer) to indicate that an +event is pending in one of the 8 priority queues. The Pending +Interrupt Priority Register (PIPR) is also updated using the IPB. This +register represents the priority of the most favored pending +notification. + +The PIPR is then compared to the Current Processor Priority +Register (CPPR). If it is more favored (numerically less than), the +CPU interrupt line is raised and the EO bit of the Notification Source +Register (NSR) is updated to notify the presence of an exception for +the O/S. The O/S acknowledges the interrupt with a special load in the +Thread Interrupt Management Area. + +The O/S handles the interrupt and when done, performs an EOI using a +MMIO operation on the ESB management page of the associated source. + +Overview of the QEMU models for XIVE +==================================== + +The XiveSource models the IVSE in general, internal and external.
It +handles the source ESBs and the MMIO interface to control them. + +The XiveNotifier is a small helper interface interconnecting the +XiveSource to the XiveRouter. + +The XiveRouter is an abstract model acting as a combined IVRE and +IVPE. It routes event notifications using the EAS and END tables to +the IVPE sub-engine which does a CAM scan to find a CPU to deliver the +exception. Storage should be provided by the inheriting classes. + +XiveEnDSource is a special source object. It exposes the END ESB MMIOs +of the Event Queues which are used for coalescing event notifications +and for escalation. It is not used in the field, only to sync the EQ cache +in OPAL. + +Finally, the XiveTCTX contains the interrupt state context of a thread, +four sets of registers, one for each exception that can be delivered +to a CPU. These contexts are scanned by the IVPE to find a matching VP +when a notification is triggered. It also models the Thread Interrupt +Management Area (TIMA), which exposes the thread context registers to +the CPU for interrupt management.
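Reviewer note on the ppc-xive.rst hunk: the IPB, PIPR and CPPR interplay described under "Interrupt flow from an O/S perspective" can be sketched in C as below. This is a rough illustration under stated assumptions, not the QEMU model: the ``example_*`` names are hypothetical, priority 0 is assumed to map to the most significant bit of an 8-bit IPB, and ``0xff`` is assumed to mean "nothing pending" (which matches the idle ``PIPR`` value shown in the ``info pic`` sample output).

```c
#include <assert.h>
#include <stdint.h>

/* Eight priority queues, numbered 0 (most favored) to 7. */
#define EXAMPLE_PRIORITY_MAX 7

/*
 * Map a priority to its bit in the Interrupt Pending Buffer.
 * Assumption of this sketch: priority 0 uses the most significant bit.
 */
static uint8_t example_priority_to_ipb(uint8_t priority)
{
    if (priority > EXAMPLE_PRIORITY_MAX) {
        return 0;
    }
    return (uint8_t)(1u << (EXAMPLE_PRIORITY_MAX - priority));
}

/*
 * Derive the PIPR from the IPB: the most favored (numerically lowest)
 * pending priority, or 0xff when no bit is set.
 */
static uint8_t example_ipb_to_pipr(uint8_t ipb)
{
    for (uint8_t prio = 0; prio <= EXAMPLE_PRIORITY_MAX; prio++) {
        if (ipb & example_priority_to_ipb(prio)) {
            return prio;
        }
    }
    return 0xff;
}

/*
 * The interrupt line is raised only when the PIPR is more favored
 * (numerically less) than the CPPR.
 */
static int example_should_raise(uint8_t ipb, uint8_t cppr)
{
    return example_ipb_to_pipr(ipb) < cppr;
}
```

With a CPPR of ``0xff`` any pending priority raises the line. Once the O/S lowers the CPPR to the priority it is servicing, only strictly more favored events get through.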