author    | Peter Maydell <peter.maydell@linaro.org> | 2020-08-24 09:35:21 +0100
committer | Peter Maydell <peter.maydell@linaro.org> | 2020-08-24 09:35:21 +0100
commit    | dd8014e4e904e895435aae9f11a686f072762782 (patch)
tree      | ea1f526128f3d88a92f90cf8833b3adf9c8ff828 /docs
parent    | 8367a77c4d3f6e1e60890f5510304feb2c621611 (diff)
parent    | 3110f0ee19ccdb50adff3dfa1321039f69efddcd (diff)
Merge remote-tracking branch 'remotes/dgibson/tags/ppc-for-5.2-20200818' into staging
ppc patch queue 2020-08-18
Here's my first pull request for qemu-5.2, which has quite a few
accumulated things. Highlights are:
* Preliminary support for POWER10 (Power ISA 3.1) instruction emulation
* Add documentation on the (very confusing) pseries NUMA configuration
* Fix some bugs handling edge cases with XICS, XIVE and kernel_irqchip
* Fix icount for a number of POWER registers
* Many cleanups to error handling in XIVE code
* Validate size of -prom-env data
# gpg: Signature made Tue 18 Aug 2020 05:18:36 BST
# gpg: using RSA key 75F46586AE61A66CC44E87DC6C38CACA20D9B392
# gpg: Good signature from "David Gibson <david@gibson.dropbear.id.au>" [full]
# gpg: aka "David Gibson (Red Hat) <dgibson@redhat.com>" [full]
# gpg: aka "David Gibson (ozlabs.org) <dgibson@ozlabs.org>" [full]
# gpg: aka "David Gibson (kernel.org) <dwg@kernel.org>" [unknown]
# Primary key fingerprint: 75F4 6586 AE61 A66C C44E 87DC 6C38 CACA 20D9 B392
* remotes/dgibson/tags/ppc-for-5.2-20200818: (40 commits)
spapr/xive: Use xive_source_esb_len()
nvram: Exit QEMU if NVRAM cannot contain all -prom-env data
spapr/xive: Simplify error handling of kvmppc_xive_cpu_synchronize_state()
ppc/xive: Simplify error handling in xive_tctx_realize()
spapr/xive: Simplify error handling in kvmppc_xive_connect()
ppc/xive: Fix error handling in vmstate_xive_tctx_*() callbacks
spapr/xive: Fix error handling in kvmppc_xive_post_load()
spapr/kvm: Fix error handling in kvmppc_xive_pre_save()
spapr/xive: Rework error handling of kvmppc_xive_set_source_config()
spapr/xive: Rework error handling in kvmppc_xive_get_queues()
spapr/xive: Rework error handling of kvmppc_xive_[gs]et_queue_config()
spapr/xive: Rework error handling of kvmppc_xive_cpu_[gs]et_state()
spapr/xive: Rework error handling of kvmppc_xive_mmap()
spapr/xive: Rework error handling of kvmppc_xive_source_reset()
spapr/xive: Rework error handling of kvmppc_xive_cpu_connect()
spapr: Simplify error handling in spapr_phb_realize()
spapr/xive: Convert KVM device fd checks to assert()
ppc/xive: Introduce dedicated kvm_irqchip_in_kernel() wrappers
ppc/xive: Rework setup of XiveSource::esb_mmio
target/ppc: Integrate icount to purr, vtb, and tbu40
...
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Diffstat (limited to 'docs')
-rw-r--r-- | docs/specs/index.rst          |   1
-rw-r--r-- | docs/specs/ppc-spapr-numa.rst | 191
-rw-r--r-- | docs/specs/ppc-spapr-xive.rst |  10
3 files changed, 201 insertions, 1 deletion
diff --git a/docs/specs/index.rst b/docs/specs/index.rst
index 426632a475..1b0eb979d5 100644
--- a/docs/specs/index.rst
+++ b/docs/specs/index.rst
@@ -12,6 +12,7 @@ Contents:
    ppc-xive
    ppc-spapr-xive
+   ppc-spapr-numa
    acpi_hw_reduced_hotplug
    tpm
    acpi_hest_ghes
diff --git a/docs/specs/ppc-spapr-numa.rst b/docs/specs/ppc-spapr-numa.rst
new file mode 100644
index 0000000000..e762038022
--- /dev/null
+++ b/docs/specs/ppc-spapr-numa.rst
@@ -0,0 +1,191 @@
+
+NUMA mechanics for sPAPR (pseries machines)
+============================================
+
+NUMA in sPAPR works differently from the System Locality Distance
+Information Table (SLIT) in ACPI. The logic is explained in LOPAPR 1.1,
+chapter 15, "Non Uniform Memory Access (NUMA) Option". This document
+aims to complement that specification, providing details of the elements
+that impact how QEMU views NUMA in pseries.
+
+Associativity and ibm,associativity property
+--------------------------------------------
+
+Associativity is defined as a group of platform resources that has
+similar mean performance (or, in our context here, distance) relative to
+everyone else outside of the group.
+
+The format of the ibm,associativity property varies with the value of
+bit 0 of byte 5 of the ibm,architecture-vec-5 property. The format with
+bit 0 equal to zero is deprecated. The current format, with bit 0 equal
+to one, makes the ibm,associativity property represent the physical
+hierarchy of the platform, as one or more lists that start with the
+highest level grouping and go down to the smallest. Consider the
+following topology:
+
+::
+
+    Mem M1 ---- Proc P1    |
+    -----------------      | Socket S1  ---|
+          chip C1          |               |
+                                           | HW module 1 (MOD1)
+    Mem M2 ---- Proc P2    |               |
+    -----------------      | Socket S2  ---|
+          chip C2          |
+
+The ibm,associativity property for the processors would be:
+
+* P1: {MOD1, S1, C1, P1}
+* P2: {MOD1, S2, C2, P2}
+
+Each allocable resource has an ibm,associativity property. The LOPAPR
+specification allows multiple lists to be present in this property,
+considering that the same resource can have multiple connections to the
+platform.
+
+Relative Performance Distance and ibm,associativity-reference-points
+----------------------------------------------------------------------
+
+The ibm,associativity-reference-points property is an array that is used
+to define the relevant performance/distance-related boundaries, defining
+the NUMA levels for the platform.
+
+The definition of its elements also varies with the value of bit 0 of
+byte 5 of the ibm,architecture-vec-5 property. The format with bit 0
+equal to zero is also deprecated. With the current format, each integer
+of the ibm,associativity-reference-points array represents a 1-based
+ordinal index (i.e. the first element is 1) into the ibm,associativity
+array. The first boundary is the most significant to application
+performance, followed by less significant boundaries. Allocated
+resources that belong to the same performance boundary are expected to
+have a relative NUMA distance that matches the relevancy of the boundary
+itself. Resources that belong to the same first boundary will have the
+shortest distance from each other. Subsequent boundaries represent
+greater distances and degraded performance.
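As an aside, and not part of the patch above, the reference-point lookup just
described can be sketched in a few lines of Python (the helper name
numa_levels and the string encodings of the groupings are made up for this
illustration). It maps each 1-based reference point onto the associativity
entry it selects, which is what the worked example that follows does by hand:

::

  # Map 1-based reference points onto the entries of an ibm,associativity
  # array, most significant boundary first (illustrative sketch only).
  def numa_levels(assoc, reference_points):
      return [(hex(ref), assoc[ref - 1]) for ref in reference_points]

  # ibm,associativity of P1 from the topology above, as a plain Python list.
  P1 = ["MOD1", "S1", "C1", "P1"]

  # Reference points taken from the worked example below.
  print(numa_levels(P1, [0x3, 0x2, 0x1]))
  # -> [('0x3', 'C1'), ('0x2', 'S1'), ('0x1', 'MOD1')]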
+
+Using the previous example, the following setting of reference points
+defines three NUMA levels:
+
+* ibm,associativity-reference-points = {0x3, 0x2, 0x1}
+
+The first NUMA level (0x3) is interpreted as the third element of each
+ibm,associativity array, the second level is the second element and the
+third level is the first element. Let's also consider that elements
+belonging to the first NUMA level have distance equal to 10 from each
+other, and that each NUMA level doubles the distance of the previous one.
+This means that the second level would be 20 and the third level 40. For
+the P1 and P2 processors, we would have the following NUMA levels:
+
+::
+
+  * ibm,associativity-reference-points = {0x3, 0x2, 0x1}
+
+  * P1: associativity{MOD1, S1, C1, P1}
+
+    First NUMA level (0x3)  => associativity[2] = C1
+    Second NUMA level (0x2) => associativity[1] = S1
+    Third NUMA level (0x1)  => associativity[0] = MOD1
+
+  * P2: associativity{MOD1, S2, C2, P2}
+
+    First NUMA level (0x3)  => associativity[2] = C2
+    Second NUMA level (0x2) => associativity[1] = S2
+    Third NUMA level (0x1)  => associativity[0] = MOD1
+
+  P1 and P2 have the same third NUMA level, MOD1: distance between them = 40
+
+Changing the ibm,associativity-reference-points array changes the
+performance distance attributes for the same associativity arrays, as the
+following example illustrates:
+
+::
+
+  * ibm,associativity-reference-points = {0x2}
+
+  * P1: associativity{MOD1, S1, C1, P1}
+
+    First NUMA level (0x2) => associativity[1] = S1
+
+  * P2: associativity{MOD1, S2, C2, P2}
+
+    First NUMA level (0x2) => associativity[1] = S2
+
+  P1 and P2 do not have a common performance boundary. Since this is a
+  one-level NUMA configuration, the distance between them is one boundary
+  above the first level, 20.
+
+
+In a hypothetical platform where all resources inside the same hardware
+module are considered to be on the same performance boundary:
+
+::
+
+  * ibm,associativity-reference-points = {0x1}
+
+  * P1: associativity{MOD1, S1, C1, P1}
+
+    First NUMA level (0x1) => associativity[0] = MOD1
+
+  * P2: associativity{MOD1, S2, C2, P2}
+
+    First NUMA level (0x1) => associativity[0] = MOD1
+
+  P1 and P2 belong to the same first order boundary. The distance between
+  them is 10.
+
+
+How the pseries Linux guest calculates NUMA distances
+======================================================
+
+Another key difference between ACPI SLIT and LOPAPR regarding NUMA is how
+the distances are expressed. The SLIT table provides the NUMA distance
+value between the relevant resources. LOPAPR does not provide a standard
+way to calculate it. We have the ibm,associativity property for each
+resource, which provides a common-performance hierarchy, and the
+ibm,associativity-reference-points array that tells which levels of
+associativity are considered relevant or not.
+
+The result is that each OS is free to implement and to interpret the
+distance as it sees fit. For the pseries Linux guest, each level of NUMA
+doubles the distance of the previous level, and the maximum number of
+levels is limited to MAX_DISTANCE_REF_POINTS = 4 (from
+arch/powerpc/mm/numa.c in the kernel tree). This results in the following
+distances:
+
+* both resources in the first NUMA level: 10
+* resources one NUMA level apart: 20
+* resources two NUMA levels apart: 40
+* resources three NUMA levels apart: 80
+* resources four NUMA levels apart: 160
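As an aside, and not part of the patch above, the doubling rule can be
modeled with a short Python sketch (a rough model of the behaviour described
in this section, not a copy of the kernel code; the function name
guest_distance is made up here). Starting from the most significant boundary,
the distance doubles at every level where the two resources do not share a
grouping:

::

  LOCAL_DISTANCE = 10
  MAX_DISTANCE_REF_POINTS = 4   # bound mentioned above (arch/powerpc/mm/numa.c)

  def guest_distance(assoc_a, assoc_b, reference_points):
      # Walk the boundaries from most to least significant and stop at the
      # first one both resources share, doubling the distance otherwise.
      distance = LOCAL_DISTANCE
      for ref in reference_points[:MAX_DISTANCE_REF_POINTS]:
          if assoc_a[ref - 1] == assoc_b[ref - 1]:
              break
          distance *= 2
      return distance

  P1 = ["MOD1", "S1", "C1", "P1"]
  P2 = ["MOD1", "S2", "C2", "P2"]

  # With ibm,associativity-reference-points = {0x3, 0x2, 0x1}, P1 and P2 only
  # meet at the third level, so the distance is 10 * 2 * 2 = 40.
  print(guest_distance(P1, P2, [0x3, 0x2, 0x1]))   # -> 40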
+
+
+Consequences for QEMU NUMA tuning
+---------------------------------
+
+The way the pseries Linux guest calculates NUMA distances has a direct
+effect on what QEMU users can expect when doing NUMA tuning. As of QEMU
+5.1, this is the default ibm,associativity-reference-points being used in
+the pseries machine:
+
+ibm,associativity-reference-points = {0x4, 0x4, 0x2}
+
+The first and second levels are equal, 0x4, and a third one was added in
+commit a6030d7e0b35 exclusively for NVLink GPU support. This means that,
+regardless of how the ibm,associativity properties are being created in
+the device tree, the pseries Linux guest will only recognize three
+scenarios as far as NUMA distance goes:
+
+* if the resources belong to the same first NUMA level, the distance is 10
+* the second level is skipped since it's equal to the first
+* all resources that aren't an NVLink GPU are guaranteed to belong to the
+  same third NUMA level, having distance = 40
+* for NVLink GPUs, distance = 80 from everything else
+
+In short, we can summarize the NUMA distances seen in pseries Linux
+guests, using QEMU up to 5.1, as follows:
+
+* local distance, i.e. the distance of the resource to its own NUMA node: 10
+* if it's an NVLink GPU device, distance: 80
+* every other resource, distance: 40
+
+This also means that user input in the QEMU command line does not change
+the NUMA distancing inside the guest for the pseries machine.
diff --git a/docs/specs/ppc-spapr-xive.rst b/docs/specs/ppc-spapr-xive.rst
index 6159bc6eed..7144347560 100644
--- a/docs/specs/ppc-spapr-xive.rst
+++ b/docs/specs/ppc-spapr-xive.rst
@@ -61,6 +61,11 @@ depend on the XIVE KVM capability of the host. On older kernels without
 XIVE KVM support, QEMU will use the emulated XIVE device as a fallback
 and on newer kernels (>=5.2), the KVM XIVE device.
 
+XIVE native exploitation mode is not supported for KVM nested guests,
+VMs running under a L1 hypervisor (KVM on pSeries). In that case, the
+hypervisor will not advertise the KVM capability and QEMU will use the
+emulated XIVE device, same as for older versions of KVM.
+
 As a final refinement, the user can also switch the use of the KVM
 device with the machine option ``kernel_irqchip``.
 
@@ -121,6 +126,9 @@ xics        XICS KVM        XICS emul.      XICS KVM
 
 (1) QEMU warns with ``warning: kernel_irqchip requested but unavailable:
     IRQ_XIVE capability must be present for KVM``
+    In some cases (old host kernels or KVM nested guests), one may hit a
+    QEMU/KVM incompatibility due to device destruction in reset. QEMU fails
+    with ``KVM is incompatible with ic-mode=dual,kernel-irqchip=on``
 (2) QEMU fails with ``kernel_irqchip requested but unavailable:
     IRQ_XIVE capability must be present for KVM``
 
@@ -143,7 +151,7 @@ xics        XICS KVM        XICS emul.      XICS KVM
     mode (XICS), either don't set the ic-mode machine property or try
     ic-mode=xics or ic-mode=dual``
 (4) QEMU/KVM incompatibility due to device destruction in reset. QEMU fails
-    with ``KVM is too old to support ic-mode=dual,kernel-irqchip=on``
+    with ``KVM is incompatible with ic-mode=dual,kernel-irqchip=on``
 
 XIVE Device tree properties
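As a usage note, and not part of the patch above, the machine options
discussed in the ppc-spapr-xive.rst hunks can be combined on the command line
as shown below (a minimal illustrative invocation with no guest image
attached). Per the notes above, on hosts without XIVE KVM support this
combination can fail with the ``KVM is incompatible with
ic-mode=dual,kernel-irqchip=on`` error.

::

  # Request the in-kernel irqchip together with the dual interrupt
  # controller mode on a pseries machine.
  qemu-system-ppc64 -accel kvm \
      -machine pseries,ic-mode=dual,kernel_irqchip=on -nographic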