author | Peter Maydell <peter.maydell@linaro.org> | 2022-01-18 19:43:33 +0000 |
---|---|---|
committer | Peter Maydell <peter.maydell@linaro.org> | 2022-01-18 19:43:33 +0000 |
commit | 0dabdd6b3a7ead1183d6f26eaded7d0c332e4cc7 (patch) | |
tree | 23a05f5d199c5677fa5573bd0d010f675ef7b52e /docs | |
parent | 8b846207151955a7d4de2d33d07645991824e345 (diff) | |
parent | ba49190107ee9803fb2f336b15283b457384b178 (diff) |
Merge remote-tracking branch 'remotes/legoater/tags/pull-ppc-20220118' into staging
ppc 7.0 queue:
* More documentation updates (Leonardo)
* Fixes for the 7448 CPU (Fabiano and Cedric)
* Final removal of 403 CPUs and the .load_state_old handler (Cedric)
* More cleanups of PHB4 models (Daniel and Cedric)
# gpg: Signature made Tue 18 Jan 2022 11:59:16 GMT
# gpg: using RSA key A0F66548F04895EBFE6B0B6051A343C7CFFBECA1
# gpg: Good signature from "Cédric Le Goater <clg@kaod.org>" [undefined]
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: A0F6 6548 F048 95EB FE6B 0B60 51A3 43C7 CFFB ECA1
* remotes/legoater/tags/pull-ppc-20220118: (31 commits)
ppc/pnv: Remove PHB4 version property
ppc/pnv: Add a 'rp_model' class attribute for the PHB4 PEC
ppc/pnv: Move root port allocation under pnv_pec_default_phb_realize()
ppc/pnv: rename pnv_pec_stk_update_map()
ppc/pnv: remove PnvPhb4PecStack object
ppc/pnv: make PECs create and realize PHB4s
ppc/pnv: remove PnvPhb4PecStack::stack_no
ppc/pnv: move default_phb_realize() to pec_realize()
ppc/pnv: remove stack pointer from PnvPHB4
ppc/pnv: reduce stack->stack_no usage
ppc/pnv: introduce PnvPHB4 'pec' property
ppc/pnv: move phb_regs_mr to PnvPHB4
ppc/pnv: move nest_regs_mr to PnvPHB4
ppc/pnv: change pnv_pec_stk_update_map() to use PnvPHB4
ppc/pnv: move nest_regs[] to PnvPHB4
ppc/pnv: move mmbar0/mmbar1 and friends to PnvPHB4
ppc/pnv: change pnv_phb4_update_regions() to use PnvPHB4
ppc/pnv: move intbar to PnvPHB4
ppc/pnv: move phbbar to PnvPHB4
ppc/pnv: move PCI registers to PnvPHB4
...
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Diffstat (limited to 'docs')
-rw-r--r-- | docs/specs/ppc-spapr-hotplug.rst | 510 |
-rw-r--r-- | docs/specs/ppc-spapr-hotplug.txt | 409 |
-rw-r--r-- | docs/specs/ppc-spapr-uv-hcalls.rst | 89 |
-rw-r--r-- | docs/specs/ppc-spapr-uv-hcalls.txt | 76 |
-rw-r--r-- | docs/system/ppc/pseries.rst | 8 |
5 files changed, 601 insertions, 491 deletions
diff --git a/docs/specs/ppc-spapr-hotplug.rst b/docs/specs/ppc-spapr-hotplug.rst new file mode 100644 index 0000000000..f84dc55ad9 --- /dev/null +++ b/docs/specs/ppc-spapr-hotplug.rst @@ -0,0 +1,510 @@ +============================= +sPAPR Dynamic Reconfiguration +============================= + +sPAPR or pSeries guests make use of a facility called dynamic reconfiguration +to handle hot plugging of dynamic "physical" resources like PCI cards, or +"logical"/para-virtual resources like memory, CPUs, and "physical" +host-bridges, which are generally managed by the host/hypervisor and provided +to guests as virtualized resources. The specifics of dynamic reconfiguration +are documented extensively in section 13 of the Linux on Power Architecture +Reference document ([LoPAR]_). This document provides a summary of that +information as it applies to the implementation within QEMU. + +Dynamic-reconfiguration Connectors +================================== + +To manage hot plug/unplug of these resources, a firmware abstraction known as +a Dynamic Resource Connector (DRC) is used to assign a particular dynamic +resource to the guest, and provide an interface for the guest to manage +configuration/removal of the resource associated with it. + +Device tree description of DRCs +=============================== + +A set of four Open Firmware device tree array properties are used to describe +the name/index/power-domain/type of each DRC allocated to a guest at +boot time. There may be multiple sets of these arrays, rooted at different +paths in the device tree depending on the type of resource the DRCs manage. + +In some cases, the DRCs themselves may be provided by a dynamic resource, +such as the DRCs managing PCI slots on a hot plugged PHB. In this case the +arrays would be fetched as part of the device tree retrieval interfaces +for hot plugged resources described under :ref:`guest-host-interface`. + +The array properties are described below. 
Each entry/element in an array +describes the DRC identified by the element in the corresponding position +of ``ibm,drc-indexes``: + +``ibm,drc-names`` +----------------- + + First 4-bytes: big-endian (BE) encoded integer denoting the number of entries. + + Each entry: a NULL-terminated ``<name>`` string encoded as a byte array. + + ``<name>`` values for logical/virtual resources are defined in the Linux on + Power Architecture Reference ([LoPAR]_) section 13.5.2.4, and basically + consist of the type of the resource followed by a space and a numerical + value that's unique across resources of that type. + + ``<name>`` values for "physical" resources such as PCI or VIO devices are + defined as being "location codes", which are the "location labels" of each + encapsulating device, starting from the chassis down to the individual slot + for the device, concatenated by a hyphen. This provides a mapping of + resources to a physical location in a chassis for debugging purposes. For + QEMU, this mapping is less important, so we assign a location code that + conforms to naming specifications, but is simply a location label for the + slot by itself to simplify the implementation. The naming convention for + location labels is documented in detail in the [LoPAR]_ section 12.3.1.5, + and in our case amounts to using ``C<n>`` for PCI/VIO device slots, where + ``<n>`` is unique across all PCI/VIO device slots. + +``ibm,drc-indexes`` +------------------- + + First 4-bytes: BE-encoded integer denoting the number of entries. + + Each 4-byte entry: BE-encoded ``<index>`` integer that is unique across all + DRCs in the machine. + + ``<index>`` is arbitrary, but in the case of QEMU we try to maintain the + convention used to assign them to pSeries guests on pHyp (the hypervisor + portion of PowerVM): + + ``bit[31:28]``: integer encoding of ``<type>``, where ``<type>`` is: + + ``1`` for CPU resource. + + ``2`` for PHB resource. + + ``3`` for VIO resource. + + ``4`` for PCI resource. 
+ + ``8`` for memory resource. + + ``bit[27:0]``: integer encoding of ``<id>``, where ``<id>`` is unique + across all resources of specified type. + +``ibm,drc-power-domains`` +------------------------- + + First 4-bytes: BE-encoded integer denoting the number of entries. + + Each 4-byte entry: 32-bit, BE-encoded ``<index>`` integer that specifies the + power domain the resource will be assigned to. In the case of QEMU we + associated all resources with a "live insertion" domain, where the power is + assumed to be managed automatically. The integer value for this domain is a + special value of ``-1``. + + +``ibm,drc-types`` +----------------- + + First 4-bytes: BE-encoded integer denoting the number of entries. + + Each entry: a NULL-terminated ``<type>`` string encoded as a byte array. + ``<type>`` is assigned as follows: + + "CPU" for a CPU. + + "PHB" for a physical host-bridge. + + "SLOT" for a VIO slot. + + "28" for a PCI slot. + + "MEM" for memory resource. + +.. _guest-host-interface: + +Guest->Host interface to manage dynamic resources +================================================= + +Each DRC is given a globally unique DRC index, and resources associated with a +particular DRC are configured/managed by the guest via a number of RTAS calls +which reference individual DRCs based on the DRC index. This can be considered +the guest->host interface. + +``rtas-set-power-level`` +------------------------ + +Set the power level for a specified power domain. + + ``arg[0]``: integer identifying power domain. + + ``arg[1]``: new power level for the domain, ``0-100``. + + ``output[0]``: status, ``0`` on success. + + ``output[1]``: power level after command. + +``rtas-get-power-level`` +------------------------ + +Get the power level for a specified power domain. + + ``arg[0]``: integer identifying power domain. + + ``output[0]``: status, ``0`` on success. + + ``output[1]``: current power level. 
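The ``ibm,drc-indexes`` convention described above (``<type>`` in ``bit[31:28]``, ``<id>`` in ``bit[27:0]``, each entry stored big-endian) can be illustrated with a minimal C sketch. The enum and helper names below (``drc_index_make``, ``drc_index_type``, ``drc_index_id``, ``put_be32``) are hypothetical names chosen for this illustration and are not QEMU APIs.

```c
#include <assert.h>
#include <stdint.h>

/* <type> codes carried in bits 31:28 of a DRC index, as listed above. */
enum drc_type {
    DRC_TYPE_CPU = 1,
    DRC_TYPE_PHB = 2,
    DRC_TYPE_VIO = 3,
    DRC_TYPE_PCI = 4,
    DRC_TYPE_MEM = 8,
};

#define DRC_TYPE_SHIFT 28
#define DRC_ID_MASK    0x0fffffffu

/* Hypothetical helper: pack <type> and <id> into a 32-bit DRC index. */
static inline uint32_t drc_index_make(enum drc_type type, uint32_t id)
{
    assert((id & ~DRC_ID_MASK) == 0); /* <id> must fit in 28 bits */
    return ((uint32_t)type << DRC_TYPE_SHIFT) | id;
}

/* Hypothetical helper: extract <type> (bits 31:28). */
static inline uint32_t drc_index_type(uint32_t index)
{
    return index >> DRC_TYPE_SHIFT;
}

/* Hypothetical helper: extract <id> (bits 27:0). */
static inline uint32_t drc_index_id(uint32_t index)
{
    return index & DRC_ID_MASK;
}

/* Store a 32-bit value in big-endian byte order, as used for each entry. */
static inline void put_be32(uint8_t out[4], uint32_t v)
{
    out[0] = (uint8_t)(v >> 24);
    out[1] = (uint8_t)(v >> 16);
    out[2] = (uint8_t)(v >> 8);
    out[3] = (uint8_t)v;
}
```

Under this layout, PCI slot number 5 would get index ``0x40000005`` and memory resource number 1 would get ``0x80000001``.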
+ +``rtas-set-indicator`` +---------------------- + +Set the state of an indicator or sensor. + + ``arg[0]``: integer identifying sensor/indicator type. + + ``arg[1]``: index of sensor, for DR-related sensors this is generally the DRC + index. + + ``arg[2]``: desired sensor value. + + ``output[0]``: status, ``0`` on success. + +For the purpose of this document we focus on the indicator/sensor types +associated with a DRC. The types are: + +* ``9001``: ``isolation-state``, controls/indicates whether a device has been + made accessible to a guest. Supported sensor values: + + ``0``: ``isolate``, device is made inaccessible by guest OS. + + ``1``: ``unisolate``, device is made available to guest OS. + +* ``9002``: ``dr-indicator``, controls "visual" indicator associated with + device. Supported sensor values: + + ``0``: ``inactive``, resource may be safely removed. + + ``1``: ``active``, resource is in use and cannot be safely removed. + + ``2``: ``identify``, used to visually identify slot for interactive hot plug. + + ``3``: ``action``, in most cases, used in the same manner as identify. + +* ``9003``: ``allocation-state``, generally only used for "logical" DR resources + to request the allocation/deallocation of a resource prior to acquiring it via + ``isolation-state->unisolate``, or after releasing it via + ``isolation-state->isolate``, respectively. For "physical" DR (like PCI + hot plug/unplug) the pre-allocation of the resource is implied and this sensor + is unused. Supported sensor values: + + ``0``: ``unusable``, tell firmware/system the resource can be + unallocated/reclaimed and added back to the system resource pool. + + ``1``: ``usable``, request the resource be allocated/reserved for use by + guest OS. + + ``2``: ``exchange``, used to allocate a spare resource to use for fail-over + in certain situations. Unused in QEMU. + + ``3``: ``recover``, used to reclaim a previously allocated resource that's + not currently allocated to the guest OS. 
Unused in QEMU. + +``rtas-get-sensor-state:`` +-------------------------- + +Used to read an indicator or sensor value. + + ``arg[0]``: integer identifying sensor/indicator type. + + ``arg[1]``: index of sensor, for DR-related sensors this is generally the DRC + index + + ``output[0]``: status, 0 on success + +For DR-related operations, the only noteworthy sensor is ``dr-entity-sense``, +which has a type value of ``9003``, as ``allocation-state`` does in the case of +``rtas-set-indicator``. The semantics/encodings of the sensor values are +distinct however. + +Supported sensor values for ``dr-entity-sense`` (``9003``) sensor: + + ``0``: empty. + + For physical resources: DRC/slot is empty. + + For logical resources: unused. + + ``1``: present. + + For physical resources: DRC/slot is populated with a device/resource. + + For logical resources: resource has been allocated to the DRC. + + ``2``: unusable. + + For physical resources: unused. + + For logical resources: DRC has no resource allocated to it. + + ``3``: exchange. + + For physical resources: unused. + + For logical resources: resource available for exchange (see + ``allocation-state`` sensor semantics above). + + ``4``: recovery. + + For physical resources: unused. + + For logical resources: resource available for recovery (see + ``allocation-state`` sensor semantics above). + +``rtas-ibm-configure-connector`` +-------------------------------- + +Used to fetch an OpenFirmware device tree description of the resource associated +with a particular DRC. + + ``arg[0]``: guest physical address of 4096-byte work area buffer. + + ``arg[1]``: 0, or address of additional 4096-byte work area buffer; only + non-zero if a prior RTAS response indicated a need for additional memory. + + ``output[0]``: status: + + ``0``: completed transmittal of device tree node. + + ``1``: instruct guest to prepare for next device tree sibling node. + + ``2``: instruct guest to prepare for next device tree child node. 
+ + ``3``: instruct guest to prepare for next device tree property. + + ``4``: instruct guest to ascend to parent device tree node. + + ``5``: instruct guest to provide additional work-area buffer via ``arg[1]``. + + ``990x``: instruct guest that operation took too long and to try again + later. + +The DRC index is encoded in the first 4-bytes of the first work area buffer. +Work area (``wa``) layout, using 4-byte offsets: + + ``wa[0]``: DRC index of the DRC to fetch device tree nodes from. + + ``wa[1]``: ``0`` (hard-coded). + + ``wa[2]``: + + For next-sibling/next-child response: + + ``wa`` offset of null-terminated string denoting the new node's name. + + For next-property response: + + ``wa`` offset of null-terminated string denoting new property's name. + + ``wa[3]``: for next-property response (unused otherwise): + + Byte-length of new property's value. + + ``wa[4]``: for next-property response (unused otherwise): + + New property's value, encoded as an OFDT-compatible byte array. + +Hot plug/unplug events +====================== + +For most DR operations, the hypervisor will issue host->guest add/remove events +using the EPOW/check-exception notification framework, where the host issues a +check-exception interrupt, then provides an RTAS event log via an +rtas-check-exception call issued by the guest in response. This framework is +documented by PAPR+ v2.7, and already use in by QEMU for generating powerdown +requests via EPOW events. + +For DR, this framework has been extended to include hotplug events, which were +previously unneeded due to direct manipulation of DR-related guest userspace +tools by host-level management such as an HMC. This level of management is not +applicable to KVM on Power, hence the reason for extending the notification +framework to support hotplug events. + +The format for these EPOW-signalled events is described below under +:ref:`hot-plug-unplug-event-structure`. 
Note that these events are not formally +part of the PAPR+ specification, and have been superseded by a newer format, +also described below under :ref:`hot-plug-unplug-event-structure`, and so are +now deemed a "legacy" format. The formats are similar, but the "modern" format +contains additional fields/flags, which are denoted for the purposes of this +documentation with ``#ifdef GUEST_SUPPORTS_MODERN`` guards. + +QEMU should assume support only for "legacy" fields/flags unless the guest +advertises support for the "modern" format via +``ibm,client-architecture-support`` hcall by setting byte 5, bit 6 of it's +``ibm,architecture-vec-5`` option vector structure (as described by [LoPAR]_, +section B.5.2.3). As with "legacy" format events, "modern" format events are +surfaced to the guest via check-exception RTAS calls, but use a dedicated event +source to signal the guest. This event source is advertised to the guest by the +addition of a ``hot-plug-events`` node under ``/event-sources`` node of the +guest's device tree using the standard format described in [LoPAR]_, +section B.5.12.2. + +.. _hot-plug-unplug-event-structure: + +Hot plug/unplug event structure +=============================== + +The hot plug specific payload in QEMU is implemented as follows (with all values +encoded in big-endian format): + +.. 
code-block:: c + + struct rtas_event_log_v6_hp { + #define SECTION_ID_HOTPLUG 0x4850 /* HP */ + struct section_header { + uint16_t section_id; /* set to SECTION_ID_HOTPLUG */ + uint16_t section_length; /* sizeof(rtas_event_log_v6_hp), + * plus the length of the DRC name + * if a DRC name identifier is + * specified for hotplug_identifier + */ + uint8_t section_version; /* version 1 */ + uint8_t section_subtype; /* unused */ + uint16_t creator_component_id; /* unused */ + } hdr; + #define RTAS_LOG_V6_HP_TYPE_CPU 1 + #define RTAS_LOG_V6_HP_TYPE_MEMORY 2 + #define RTAS_LOG_V6_HP_TYPE_SLOT 3 + #define RTAS_LOG_V6_HP_TYPE_PHB 4 + #define RTAS_LOG_V6_HP_TYPE_PCI 5 + uint8_t hotplug_type; /* type of resource/device */ + #define RTAS_LOG_V6_HP_ACTION_ADD 1 + #define RTAS_LOG_V6_HP_ACTION_REMOVE 2 + uint8_t hotplug_action; /* action (add/remove) */ + #define RTAS_LOG_V6_HP_ID_DRC_NAME 1 + #define RTAS_LOG_V6_HP_ID_DRC_INDEX 2 + #define RTAS_LOG_V6_HP_ID_DRC_COUNT 3 + #ifdef GUEST_SUPPORTS_MODERN + #define RTAS_LOG_V6_HP_ID_DRC_COUNT_INDEXED 4 + #endif + uint8_t hotplug_identifier; /* type of the resource identifier, + * which serves as the discriminator + * for the 'drc' union field below + */ + #ifdef GUEST_SUPPORTS_MODERN + uint8_t capabilities; /* capability flags, currently unused + * by QEMU + */ + #else + uint8_t reserved; + #endif + union { + uint32_t index; /* DRC index of resource to take action + * on + */ + uint32_t count; /* number of DR resources to take + * action on (guest chooses which) + */ + #ifdef GUEST_SUPPORTS_MODERN + struct { + uint32_t count; /* number of DR resources to take + * action on + */ + uint32_t index; /* DRC index of first resource to take + * action on. 
guest will take action + * on DRC index <index> through + * DRC index <index + count - 1> in + * sequential order + */ + } count_indexed; + #endif + char name[1]; /* string representing the name of the + * DRC to take action on + */ + } drc; + } QEMU_PACKED; + +``ibm,lrdr-capacity`` +===================== + +``ibm,lrdr-capacity`` is a property in the /rtas device tree node that +identifies the dynamic reconfiguration capabilities of the guest. It consists +of a triple consisting of ``<phys>``, ``<size>`` and ``<maxcpus>``. + + ``<phys>``, encoded in BE format represents the maximum address in bytes and + hence the maximum memory that can be allocated to the guest. + + ``<size>``, encoded in BE format represents the size increments in which + memory can be hot-plugged to the guest. + + ``<maxcpus>``, a BE-encoded integer, represents the maximum number of + processors that the guest can have. + +``pseries`` guests use this property to note the maximum allowed CPUs for the +guest. + +``ibm,dynamic-reconfiguration-memory`` +====================================== + +``ibm,dynamic-reconfiguration-memory`` is a device tree node that represents +dynamically reconfigurable logical memory blocks (LMB). This node is generated +only when the guest advertises the support for it via +``ibm,client-architecture-support`` call. Memory that is not dynamically +reconfigurable is represented by ``/memory`` nodes. The properties of this node +that are of interest to the sPAPR memory hotplug implementation in QEMU are +described here. + +``ibm,lmb-size`` +---------------- + +This 64-bit integer defines the size of each dynamically reconfigurable LMB. + +``ibm,associativity-lookup-arrays`` +----------------------------------- + +This property defines a lookup array in which the NUMA associativity +information for each LMB can be found. 
It is a property encoded array +that begins with an integer M, the number of associativity lists followed +by an integer N, the number of entries per associativity list and terminated +by M associativity lists each of length N integers. + +This property provides the same information as given by ``ibm,associativity`` +property in a ``/memory`` node. Each assigned LMB has an index value between +0 and M-1 which is used as an index into this table to select which +associativity list to use for the LMB. This index value for each LMB is defined +in ``ibm,dynamic-memory`` property. + +``ibm,dynamic-memory`` +---------------------- + +This property describes the dynamically reconfigurable memory. It is a +property encoded array that has an integer N, the number of LMBs followed +by N LMB list entries. + +Each LMB list entry consists of the following elements: + +- Logical address of the start of the LMB encoded as a 64-bit integer. This + corresponds to ``reg`` property in ``/memory`` node. +- DRC index of the LMB that corresponds to ``ibm,my-drc-index`` property + in a ``/memory`` node. +- Four bytes reserved for expansion. +- Associativity list index for the LMB that is used as an index into + ``ibm,associativity-lookup-arrays`` property described earlier. This is used + to retrieve the right associativity list to be used for this LMB. +- A 32-bit flags word. The bit at bit position ``0x00000008`` defines whether + the LMB is assigned to the partition as of boot time. + +``ibm,dynamic-memory-v2`` +------------------------- + +This property describes the dynamically reconfigurable memory. This is +an alternate and newer way to describe dynamically reconfigurable memory. +It is a property encoded array that has an integer N (the number of +LMB set entries) followed by N LMB set entries. There is an LMB set entry +for each sequential group of LMBs that share common attributes. 
+ +Each LMB set entry consists of the following elements: + +- Number of sequential LMBs in the entry represented by a 32-bit integer. +- Logical address of the first LMB in the set encoded as a 64-bit integer. +- DRC index of the first LMB in the set. +- Associativity list index that is used as an index into + ``ibm,associativity-lookup-arrays`` property described earlier. This + is used to retrieve the right associativity list to be used for all + the LMBs in this set. +- A 32-bit flags word that applies to all the LMBs in the set. diff --git a/docs/specs/ppc-spapr-hotplug.txt b/docs/specs/ppc-spapr-hotplug.txt deleted file mode 100644 index d4fb2d46d9..0000000000 --- a/docs/specs/ppc-spapr-hotplug.txt +++ /dev/null @@ -1,409 +0,0 @@ -= sPAPR Dynamic Reconfiguration = - -sPAPR/"pseries" guests make use of a facility called dynamic-reconfiguration -to handle hotplugging of dynamic "physical" resources like PCI cards, or -"logical"/paravirtual resources like memory, CPUs, and "physical" -host-bridges, which are generally managed by the host/hypervisor and provided -to guests as virtualized resources. The specifics of dynamic-reconfiguration -are documented extensively in PAPR+ v2.7, Section 13.1. This document -provides a summary of that information as it applies to the implementation -within QEMU. - -== Dynamic-reconfiguration Connectors == - -To manage hotplug/unplug of these resources, a firmware abstraction known as -a Dynamic Resource Connector (DRC) is used to assign a particular dynamic -resource to the guest, and provide an interface for the guest to manage -configuration/removal of the resource associated with it. - -== Device-tree description of DRCs == - -A set of 4 Open Firmware device tree array properties are used to describe -the name/index/power-domain/type of each DRC allocated to a guest at -boot-time. There may be multiple sets of these arrays, rooted at different -paths in the device tree depending on the type of resource the DRCs manage. 
- -In some cases, the DRCs themselves may be provided by a dynamic resource, -such as the DRCs managing PCI slots on a hotplugged PHB. In this case the -arrays would be fetched as part of the device tree retrieval interfaces -for hotplugged resources described under "Guest->Host interface". - -The array properties are described below. Each entry/element in an array -describes the DRC identified by the element in the corresponding position -of ibm,drc-indexes: - -ibm,drc-names: - first 4-bytes: BE-encoded integer denoting the number of entries - each entry: a NULL-terminated <name> string encoded as a byte array - - <name> values for logical/virtual resources are defined in PAPR+ v2.7, - Section 13.5.2.4, and basically consist of the type of the resource - followed by a space and a numerical value that's unique across resources - of that type. - - <name> values for "physical" resources such as PCI or VIO devices are - defined as being "location codes", which are the "location labels" of - each encapsulating device, starting from the chassis down to the - individual slot for the device, concatenated by a hyphen. This provides - a mapping of resources to a physical location in a chassis for debugging - purposes. For QEMU, this mapping is less important, so we assign a - location code that conforms to naming specifications, but is simply a - location label for the slot by itself to simplify the implementation. - The naming convention for location labels is documented in detail in - PAPR+ v2.7, Section 12.3.1.5, and in our case amounts to using "C<n>" - for PCI/VIO device slots, where <n> is unique across all PCI/VIO - device slots. 
- -ibm,drc-indexes: - first 4-bytes: BE-encoded integer denoting the number of entries - each 4-byte entry: BE-encoded <index> integer that is unique across all DRCs - in the machine - - <index> is arbitrary, but in the case of QEMU we try to maintain the - convention used to assign them to pSeries guests on pHyp: - - bit[31:28]: integer encoding of <type>, where <type> is: - 1 for CPU resource - 2 for PHB resource - 3 for VIO resource - 4 for PCI resource - 8 for Memory resource - bit[27:0]: integer encoding of <id>, where <id> is unique across - all resources of specified type - -ibm,drc-power-domains: - first 4-bytes: BE-encoded integer denoting the number of entries - each 4-byte entry: 32-bit, BE-encoded <index> integer that specifies the - power domain the resource will be assigned to. In the case of QEMU - we associated all resources with a "live insertion" domain, where the - power is assumed to be managed automatically. The integer value for - this domain is a special value of -1. - - -ibm,drc-types: - first 4-bytes: BE-encoded integer denoting the number of entries - each entry: a NULL-terminated <type> string encoded as a byte array - - <type> is assigned as follows: - "CPU" for a CPU - "PHB" for a physical host-bridge - "SLOT" for a VIO slot - "28" for a PCI slot - "MEM" for memory resource - -== Guest->Host interface to manage dynamic resources == - -Each DRC is given a globally unique DRC Index, and resources associated with -a particular DRC are configured/managed by the guest via a number of RTAS -calls which reference individual DRCs based on the DRC index. This can be -considered the guest->host interface. 
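The string-valued array properties described above (``ibm,drc-names`` and ``ibm,drc-types``) share one layout: a 4-byte big-endian entry count followed by that many NULL-terminated strings. A minimal C sketch of looking up the entry at a given position follows. The function names ``get_be32`` and ``drc_string_entry`` are hypothetical and chosen only for this illustration. The sketch assumes the whole property value is already available in a byte buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a 32-bit big-endian integer from a byte buffer. */
static inline uint32_t get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Hypothetical helper: return the n-th string of a property laid out as
 * <BE32 count> <string 0> '\0' <string 1> '\0' ...
 * Returns NULL if n is out of range or the buffer is truncated.
 */
static inline const char *drc_string_entry(const uint8_t *prop, size_t len,
                                           uint32_t n)
{
    const uint8_t *p;
    size_t remaining;

    assert(prop != NULL);
    if (len < 4 || n >= get_be32(prop)) {
        return NULL;
    }
    p = prop + 4;
    remaining = len - 4;
    for (;;) {
        const uint8_t *nul = memchr(p, 0, remaining);
        if (nul == NULL) {
            return NULL; /* string not terminated inside the buffer */
        }
        if (n == 0) {
            return (const char *)p;
        }
        n--;
        remaining -= (size_t)(nul - p) + 1;
        p = nul + 1;
    }
}
```

Because all four arrays are positional, the string returned for position ``n`` describes the DRC whose index sits at position ``n`` of ``ibm,drc-indexes``.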
- -rtas-set-power-level: - arg[0]: integer identifying power domain - arg[1]: new power level for the domain, 0-100 - output[0]: status, 0 on success - output[1]: power level after command - - Set the power level for a specified power domain - -rtas-get-power-level: - arg[0]: integer identifying power domain - output[0]: status, 0 on success - output[1]: current power level - - Get the power level for a specified power domain - -rtas-set-indicator: - arg[0]: integer identifying sensor/indicator type - arg[1]: index of sensor, for DR-related sensors this is generally the - DRC index - arg[2]: desired sensor value - output[0]: status, 0 on success - - Set the state of an indicator or sensor. For the purpose of this document we - focus on the indicator/sensor types associated with a DRC. The types are: - - 9001: isolation-state, controls/indicates whether a device has been made - accessible to a guest - - supported sensor values: - 0: isolate, device is made unaccessible by guest OS - 1: unisolate, device is made available to guest OS - - 9002: dr-indicator, controls "visual" indicator associated with device - - supported sensor values: - 0: inactive, resource may be safely removed - 1: active, resource is in use and cannot be safely removed - 2: identify, used to visually identify slot for interactive hotplug - 3: action, in most cases, used in the same manner as identify - - 9003: allocation-state, generally only used for "logical" DR resources to - request the allocation/deallocation of a resource prior to acquiring - it via isolation-state->unisolate, or after releasing it via - isolation-state->isolate, respectively. for "physical" DR (like PCI - hotplug/unplug) the pre-allocation of the resource is implied and - this sensor is unused. 
- - supported sensor values: - 0: unusable, tell firmware/system the resource can be - unallocated/reclaimed and added back to the system resource pool - 1: usable, request the resource be allocated/reserved for use by - guest OS - 2: exchange, used to allocate a spare resource to use for fail-over - in certain situations. unused in QEMU - 3: recover, used to reclaim a previously allocated resource that's - not currently allocated to the guest OS. unused in QEMU - -rtas-get-sensor-state: - arg[0]: integer identifying sensor/indicator type - arg[1]: index of sensor, for DR-related sensors this is generally the - DRC index - output[0]: status, 0 on success - - Used to read an indicator or sensor value. - - For DR-related operations, the only noteworthy sensor is dr-entity-sense, - which has a type value of 9003, as allocation-state does in the case of - rtas-set-indicator. The semantics/encodings of the sensor values are distinct - however: - - supported sensor values for dr-entity-sense (9003) sensor: - 0: empty, - for physical resources: DRC/slot is empty - for logical resources: unused - 1: present, - for physical resources: DRC/slot is populated with a device/resource - for logical resources: resource has been allocated to the DRC - 2: unusable, - for physical resources: unused - for logical resources: DRC has no resource allocated to it - 3: exchange, - for physical resources: unused - for logical resources: resource available for exchange (see - allocation-state sensor semantics above) - 4: recovery, - for physical resources: unused - for logical resources: resource available for recovery (see - allocation-state sensor semantics above) - -rtas-ibm-configure-connector: - arg[0]: guest physical address of 4096-byte work area buffer - arg[1]: 0, or address of additional 4096-byte work area buffer. 
only non-zero - if a prior RTAS response indicated a need for additional memory - output[0]: status: - 0: completed transmittal of device-tree node - 1: instruct guest to prepare for next DT sibling node - 2: instruct guest to prepare for next DT child node - 3: instruct guest to prepare for next DT property - 4: instruct guest to ascend to parent DT node - 5: instruct guest to provide additional work-area buffer - via arg[1] - 990x: instruct guest that operation took too long and to try - again later - - Used to fetch an OF device-tree description of the resource associated with - a particular DRC. The DRC index is encoded in the first 4-bytes of the first - work area buffer. - - Work area layout, using 4-byte offsets: - wa[0]: DRC index of the DRC to fetch device-tree nodes from - wa[1]: 0 (hard-coded) - wa[2]: for next-sibling/next-child response: - wa offset of null-terminated string denoting the new node's name - for next-property response: - wa offset of null-terminated string denoting new property's name - wa[3]: for next-property response (unused otherwise): - byte-length of new property's value - wa[4]: for next-property response (unused otherwise): - new property's value, encoded as an OFDT-compatible byte array - -== hotplug/unplug events == - -For most DR operations, the hypervisor will issue host->guest add/remove events -using the EPOW/check-exception notification framework, where the host issues a -check-exception interrupt, then provides an RTAS event log via an -rtas-check-exception call issued by the guest in response. This framework is -documented by PAPR+ v2.7, and already use in by QEMU for generating powerdown -requests via EPOW events. - -For DR, this framework has been extended to include hotplug events, which were -previously unneeded due to direct manipulation of DR-related guest userspace -tools by host-level management such as an HMC. 
This level of management is not -applicable to PowerKVM, hence the reason for extending the notification -framework to support hotplug events. - -The format for these EPOW-signalled events is described below under -"hotplug/unplug event structure". Note that these events are not -formally part of the PAPR+ specification, and have been superseded by a -newer format, also described below under "hotplug/unplug event structure", -and so are now deemed a "legacy" format. The formats are similar, but the -"modern" format contains additional fields/flags, which are denoted for the -purposes of this documentation with "#ifdef GUEST_SUPPORTS_MODERN" guards. - -QEMU should assume support only for "legacy" fields/flags unless the guest -advertises support for the "modern" format via ibm,client-architecture-support -hcall by setting byte 5, bit 6 of it's ibm,architecture-vec-5 option vector -structure (as described by LoPAPR v11, B.6.2.3). As with "legacy" format events, -"modern" format events are surfaced to the guest via check-exception RTAS calls, -but use a dedicated event source to signal the guest. This event source is -advertised to the guest by the addition of a "hot-plug-events" node under -"/event-sources" node of the guest's device tree using the standard format -described in LoPAPR v11, B.6.12.1. 
-
-== hotplug/unplug event structure ==
-
-The hotplug-specific payload in QEMU is implemented as follows (with all values
-encoded in big-endian format):
-
-struct rtas_event_log_v6_hp {
-#define SECTION_ID_HOTPLUG              0x4850 /* HP */
-    struct section_header {
-        uint16_t section_id;            /* set to SECTION_ID_HOTPLUG */
-        uint16_t section_length;        /* sizeof(rtas_event_log_v6_hp),
-                                         * plus the length of the DRC name
-                                         * if a DRC name identifier is
-                                         * specified for hotplug_identifier
-                                         */
-        uint8_t section_version;        /* version 1 */
-        uint8_t section_subtype;        /* unused */
-        uint16_t creator_component_id;  /* unused */
-    } hdr;
-#define RTAS_LOG_V6_HP_TYPE_CPU         1
-#define RTAS_LOG_V6_HP_TYPE_MEMORY      2
-#define RTAS_LOG_V6_HP_TYPE_SLOT        3
-#define RTAS_LOG_V6_HP_TYPE_PHB         4
-#define RTAS_LOG_V6_HP_TYPE_PCI         5
-    uint8_t hotplug_type;               /* type of resource/device */
-#define RTAS_LOG_V6_HP_ACTION_ADD       1
-#define RTAS_LOG_V6_HP_ACTION_REMOVE    2
-    uint8_t hotplug_action;             /* action (add/remove) */
-#define RTAS_LOG_V6_HP_ID_DRC_NAME          1
-#define RTAS_LOG_V6_HP_ID_DRC_INDEX         2
-#define RTAS_LOG_V6_HP_ID_DRC_COUNT         3
-#ifdef GUEST_SUPPORTS_MODERN
-#define RTAS_LOG_V6_HP_ID_DRC_COUNT_INDEXED 4
-#endif
-    uint8_t hotplug_identifier;         /* type of the resource identifier,
-                                         * which serves as the discriminator
-                                         * for the 'drc' union field below
-                                         */
-#ifdef GUEST_SUPPORTS_MODERN
-    uint8_t capabilities;               /* capability flags, currently unused
-                                         * by QEMU
-                                         */
-#else
-    uint8_t reserved;
-#endif
-    union {
-        uint32_t index;                 /* DRC index of resource to take action
-                                         * on
-                                         */
-        uint32_t count;                 /* number of DR resources to take
-                                         * action on (guest chooses which)
-                                         */
-#ifdef GUEST_SUPPORTS_MODERN
-        struct {
-            uint32_t count;             /* number of DR resources to take
-                                         * action on
-                                         */
-            uint32_t index;             /* DRC index of first resource to take
-                                         * action on.
guest will take action
-                                         * on DRC index <index> through
-                                         * DRC index <index + count - 1> in
-                                         * sequential order
-                                         */
-        } count_indexed;
-#endif
-        char name[1];                   /* string representing the name of the
-                                         * DRC to take action on
-                                         */
-    } drc;
-} QEMU_PACKED;
-
-== ibm,lrdr-capacity ==
-
-ibm,lrdr-capacity is a property in the /rtas device tree node that identifies
-the dynamic reconfiguration capabilities of the guest. It consists of a triple
-consisting of <phys>, <size> and <maxcpus>.
-
-  <phys>, encoded in BE format represents the maximum address in bytes and
-  hence the maximum memory that can be allocated to the guest.
-
-  <size>, encoded in BE format represents the size increments in which
-  memory can be hot-plugged to the guest.
-
-  <maxcpus>, a BE-encoded integer, represents the maximum number of
-  processors that the guest can have.
-
-pseries guests use this property to note the maximum allowed CPUs for the
-guest.
-
-== ibm,dynamic-reconfiguration-memory ==
-
-ibm,dynamic-reconfiguration-memory is a device tree node that represents
-dynamically reconfigurable logical memory blocks (LMB). This node
-is generated only when the guest advertises the support for it via
-ibm,client-architecture-support call. Memory that is not dynamically
-reconfigurable is represented by /memory nodes. The properties of this
-node that are of interest to the sPAPR memory hotplug implementation
-in QEMU are described here.
-
-ibm,lmb-size
-
-This 64bit integer defines the size of each dynamically reconfigurable LMB.
-
-ibm,associativity-lookup-arrays
-
-This property defines a lookup array in which the NUMA associativity
-information for each LMB can be found. It is a property encoded array
-that begins with an integer M, the number of associativity lists followed
-by an integer N, the number of entries per associativity list and terminated
-by M associativity lists each of length N integers.
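
Reviewer note: the ibm,associativity-lookup-arrays layout described in the removed text above (M, then N, then M lists of N integers) can be decoded with a few lines of Python. This is only a sketch. It assumes each integer is a 32-bit big-endian cell, as is usual for device-tree properties; the function name and the sample bytes are hypothetical and are not taken from QEMU.

```python
import struct


def parse_associativity_lookup_arrays(prop):
    """Decode the property bytes into M lists of N integers each.

    Assumption: every integer is a 32-bit big-endian cell.
    """
    if len(prop) < 8:
        raise ValueError("property too short for the M and N header")
    # First two cells: M (number of lists) and N (entries per list).
    m, n = struct.unpack_from(">II", prop, 0)
    expected = 8 + 4 * m * n
    if len(prop) != expected:
        raise ValueError("expected %d bytes, got %d" % (expected, len(prop)))
    # Remaining cells: M lists laid out back to back, N cells each.
    cells = struct.unpack_from(">%dI" % (m * n), prop, 8)
    return [list(cells[i * n:(i + 1) * n]) for i in range(m)]


# Hypothetical sample: M=2 lists with N=4 entries each.
sample = struct.pack(">10I", 2, 4, 0, 0, 0, 0, 0, 0, 0, 1)
lists = parse_associativity_lookup_arrays(sample)
```

An LMB whose associativity list index is 1 would then use `lists[1]`.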
-
-This property provides the same information as given by ibm,associativity
-property in a /memory node. Each assigned LMB has an index value between
-0 and M-1 which is used as an index into this table to select which
-associativity list to use for the LMB. This index value for each LMB
-is defined in ibm,dynamic-memory property.
-
-ibm,dynamic-memory
-
-This property describes the dynamically reconfigurable memory. It is a
-property encoded array that has an integer N, the number of LMBs followed
-by N LMB list entries.
-
-Each LMB list entry consists of the following elements:
-
-- Logical address of the start of the LMB encoded as a 64bit integer. This
-  corresponds to reg property in /memory node.
-- DRC index of the LMB that corresponds to ibm,my-drc-index property
-  in a /memory node.
-- Four bytes reserved for expansion.
-- Associativity list index for the LMB that is used as an index into
-  ibm,associativity-lookup-arrays property described earlier. This
-  is used to retrieve the right associativity list to be used for this
-  LMB.
-- A 32bit flags word. The bit at bit position 0x00000008 defines whether
-  the LMB is assigned to the partition as of boot time.
-
-ibm,dynamic-memory-v2
-
-This property describes the dynamically reconfigurable memory. This is
-an alternate and newer way to describe dynamically reconfigurable memory.
-It is a property encoded array that has an integer N (the number of
-LMB set entries) followed by N LMB set entries. There is an LMB set entry
-for each sequential group of LMBs that share common attributes.
-
-Each LMB set entry consists of the following elements:
-
-- Number of sequential LMBs in the entry represented by a 32bit integer.
-- Logical address of the first LMB in the set encoded as a 64bit integer.
-- DRC index of the first LMB in the set.
-- Associativity list index that is used as an index into
-  ibm,associativity-lookup-arrays property described earlier.
This
-  is used to retrieve the right associativity list to be used for all
-  the LMBs in this set.
-- A 32bit flags word that applies to all the LMBs in the set.
-
-[1] http://thread.gmane.org/gmane.linux.ports.ppc.embedded/75350/focus=106867
diff --git a/docs/specs/ppc-spapr-uv-hcalls.rst b/docs/specs/ppc-spapr-uv-hcalls.rst
new file mode 100644
index 0000000000..a00288deb3
--- /dev/null
+++ b/docs/specs/ppc-spapr-uv-hcalls.rst
@@ -0,0 +1,89 @@
+===================================
+Hypervisor calls and the Ultravisor
+===================================
+
+On PPC64 systems supporting Protected Execution Facility (PEF), system memory
+can be placed in a secured region where only an ultravisor running in firmware
+can provide access to. pSeries guests on such systems can communicate with
+the ultravisor (via ultracalls) to switch to a secure virtual machine (SVM) mode
+where the guest's memory is relocated to this secured region, making its memory
+inaccessible to normal processes/guests running on the host.
+
+The various ultracalls/hypercalls relating to SVM mode are currently only
+documented internally, but are planned for direct inclusion into the Linux on
+Power Architecture Reference document ([LoPAR]_). An internal ACR has been filed
+to reserve a hypercall number range specific to this use case to avoid any
+future conflicts with the IBM internally maintained Power Architecture Platform
+Reference (PAPR+) documentation specification. This document summarizes some of
+these details as they relate to QEMU.
+
+Hypercalls needed by the ultravisor
+===================================
+
+Switching to SVM mode involves a number of hcalls issued by the ultravisor to
+the hypervisor to orchestrate the movement of guest memory to secure memory and
+various other aspects of the SVM mode. Numbers are assigned for these hcalls
+within the reserved range ``0xEF00-0xEF80``. The below documents the hcalls
+relevant to QEMU.
+
+``H_TPM_COMM`` (``0xef10``)
+---------------------------
+
+SVM file systems are encrypted using a symmetric key. This key is then
+wrapped/encrypted using the public key of a trusted system which has the private
+key stored in the system's TPM. An Ultravisor will use this hcall to
+unwrap/unseal the symmetric key using the system's TPM device or a TPM Resource
+Manager associated with the device.
+
+The Ultravisor sets up a separate session key with the TPM in advance during
+host system boot. All sensitive in and out values will be encrypted using the
+session key. Though the hypervisor will see the in and out buffers in raw form,
+any sensitive contents will generally be encrypted using this session key.
+
+Arguments:
+
+  ``r3``: ``H_TPM_COMM`` (``0xef10``)
+
+  ``r4``: ``TPM`` operation, one of:
+
+    ``TPM_COMM_OP_EXECUTE`` (``0x1``): send a request to a TPM and receive a
+    response, opening a new TPM session if one has not already been opened.
+
+    ``TPM_COMM_OP_CLOSE_SESSION`` (``0x2``): close the existing TPM session, if
+    any.
+
+  ``r5``: ``in_buffer``, guest physical address of buffer containing the
+  request. Caller may use the same address for both request and response.
+
+  ``r6``: ``in_size``, size of the in buffer. Must be less than or equal to
+  4 KB.
+
+  ``r7``: ``out_buffer``, guest physical address of buffer to store the
+  response. Caller may use the same address for both request and response.
+
+  ``r8``: ``out_size``, size of the out buffer. Must be at least 4 KB, as this
+  is the maximum request/response size supported by most TPM implementations,
+  including the TPM Resource Manager in the linux kernel.
+
+Return values:
+
+  ``r3``: one of the following values:
+
+    ``H_Success``: request processed successfully.
+
+    ``H_PARAMETER``: invalid TPM operation.
+
+    ``H_P2``: ``in_buffer`` is invalid.
+
+    ``H_P3``: ``in_size`` is invalid.
+
+    ``H_P4``: ``out_buffer`` is invalid.
+
+    ``H_P5``: ``out_size`` is invalid.
+
+    ``H_RESOURCE``: problem communicating with TPM.
+
+    ``H_FUNCTION``: TPM access is not currently allowed/configured.
+
+  ``r4``: For ``TPM_COMM_OP_EXECUTE``, the size of the response will be stored
+  here upon success.
diff --git a/docs/specs/ppc-spapr-uv-hcalls.txt b/docs/specs/ppc-spapr-uv-hcalls.txt
deleted file mode 100644
index 389c2740d7..0000000000
--- a/docs/specs/ppc-spapr-uv-hcalls.txt
+++ /dev/null
@@ -1,76 +0,0 @@
-On PPC64 systems supporting Protected Execution Facility (PEF), system
-memory can be placed in a secured region where only an "ultravisor"
-running in firmware can provide to access it. pseries guests on such
-systems can communicate with the ultravisor (via ultracalls) to switch to a
-secure VM mode (SVM) where the guest's memory is relocated to this secured
-region, making its memory inaccessible to normal processes/guests running on
-the host.
-
-The various ultracalls/hypercalls relating to SVM mode are currently
-only documented internally, but are planned for direct inclusion into the
-public OpenPOWER version of the PAPR specification (LoPAPR/LoPAR). An internal
-ACR has been filed to reserve a hypercall number range specific to this
-use-case to avoid any future conflicts with the internally-maintained PAPR
-specification. This document summarizes some of these details as they relate
-to QEMU.
-
-== hypercalls needed by the ultravisor ==
-
-Switching to SVM mode involves a number of hcalls issued by the ultravisor
-to the hypervisor to orchestrate the movement of guest memory to secure
-memory and various other aspects SVM mode. Numbers are assigned for these
-hcalls within the reserved range 0xEF00-0xEF80. The below documents the
-hcalls relevant to QEMU.
-
-- H_TPM_COMM (0xef10)
-
-  For TPM_COMM_OP_EXECUTE operation:
-    Send a request to a TPM and receive a response, opening a new TPM session
-    if one has not already been opened.
-
-  For TPM_COMM_OP_CLOSE_SESSION operation:
-    Close the existing TPM session, if any.
-
-  Arguments:
-
-    r3 : H_TPM_COMM (0xef10)
-    r4 : TPM operation, one of:
-         TPM_COMM_OP_EXECUTE (0x1)
-         TPM_COMM_OP_CLOSE_SESSION (0x2)
-    r5 : in_buffer, guest physical address of buffer containing the request
-         - Caller may use the same address for both request and response
-    r6 : in_size, size of the in buffer
-         - Must be less than or equal to 4KB
-    r7 : out_buffer, guest physical address of buffer to store the response
-         - Caller may use the same address for both request and response
-    r8 : out_size, size of the out buffer
-         - Must be at least 4KB, as this is the maximum request/response size
-           supported by most TPM implementations, including the TPM Resource
-           Manager in the linux kernel.
-
-  Return values:
-
-    r3 : H_Success     request processed successfully
-         H_PARAMETER   invalid TPM operation
-         H_P2          in_buffer is invalid
-         H_P3          in_size is invalid
-         H_P4          out_buffer is invalid
-         H_P5          out_size is invalid
-         H_RESOURCE    problem communicating with TPM
-         H_FUNCTION    TPM access is not currently allowed/configured
-    r4 : For TPM_COMM_OP_EXECUTE, the size of the response will be stored here
-         upon success.
-
-  Use-case/notes:
-
-    SVM filesystems are encrypted using a symmetric key. This key is then
-    wrapped/encrypted using the public key of a trusted system which has the
-    private key stored in the system's TPM. An Ultravisor will use this
-    hcall to unwrap/unseal the symmetric key using the system's TPM device
-    or a TPM Resource Manager associated with the device.
-
-    The Ultravisor sets up a separate session key with the TPM in advance
-    during host system boot. All sensitive in and out values will be
-    encrypted using the session key. Though the hypervisor will see the 'in'
-    and 'out' buffers in raw form, any sensitive contents will generally be
-    encrypted using this session key.
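
Reviewer note: the operation and size rules that both versions of the H_TPM_COMM text state can be sketched as a small checker. This is an illustration under assumptions, not QEMU's handler. The function name is hypothetical, status codes are returned as symbolic names rather than numeric values, buffer-address validation (``H_P2``/``H_P4``) is left out because it depends on the guest memory layout, and skipping the size checks for ``TPM_COMM_OP_CLOSE_SESSION`` is an assumption the text does not confirm.

```python
TPM_COMM_OP_EXECUTE = 0x1
TPM_COMM_OP_CLOSE_SESSION = 0x2
TPM_BUFFER_SIZE = 4096  # the 4 KB limit stated in the document


def check_tpm_comm_args(op, in_size, out_size):
    """Return the symbolic status name implied by the documented rules."""
    if op == TPM_COMM_OP_CLOSE_SESSION:
        # Assumption: closing a session does not use the buffers.
        return "H_Success"
    if op != TPM_COMM_OP_EXECUTE:
        return "H_PARAMETER"  # invalid TPM operation
    if in_size > TPM_BUFFER_SIZE:
        return "H_P3"  # in_size must be less than or equal to 4 KB
    if out_size < TPM_BUFFER_SIZE:
        return "H_P5"  # out_size must be at least 4 KB
    return "H_Success"


status = check_tpm_comm_args(TPM_COMM_OP_EXECUTE, 512, 4096)
```

A real handler would also validate ``in_buffer`` and ``out_buffer`` and report ``H_RESOURCE`` or ``H_FUNCTION`` when the TPM cannot be reached or is not configured.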
diff --git a/docs/system/ppc/pseries.rst b/docs/system/ppc/pseries.rst
index 1689324815..569237dc0c 100644
--- a/docs/system/ppc/pseries.rst
+++ b/docs/system/ppc/pseries.rst
@@ -110,16 +110,12 @@ can also be found in QEMU documentation:
 .. toctree::
    :maxdepth: 1
 
+   ../../specs/ppc-spapr-hotplug.rst
    ../../specs/ppc-spapr-hcalls.rst
    ../../specs/ppc-spapr-numa.rst
+   ../../specs/ppc-spapr-uv-hcalls.rst
    ../../specs/ppc-spapr-xive.rst
 
-Other documentation available in QEMU docs directory:
-
-* Hot plug (``/docs/specs/ppc-spapr-hotplug.txt``).
-* Hypervisor calls needed by the Ultravisor
-  (``/docs/specs/ppc-spapr-uv-hcalls.txt``).
-
 Switching between the KVM-PR and KVM-HV kernel module
 =====================================================