author    | Peter Maydell <peter.maydell@linaro.org> | 2020-08-24 09:35:21 +0100
committer | Peter Maydell <peter.maydell@linaro.org> | 2020-08-24 09:35:21 +0100
commit    | dd8014e4e904e895435aae9f11a686f072762782 (patch)
tree      | ea1f526128f3d88a92f90cf8833b3adf9c8ff828 /docs
parent    | 8367a77c4d3f6e1e60890f5510304feb2c621611 (diff)
parent    | 3110f0ee19ccdb50adff3dfa1321039f69efddcd (diff)
Merge remote-tracking branch 'remotes/dgibson/tags/ppc-for-5.2-20200818' into staging
ppc patch queue 2020-08-18
Here's my first pull request for qemu-5.2, which has quite a few
accumulated things. Highlights are:
* Preliminary support for POWER10 (Power ISA 3.1) instruction emulation
* Add documentation on the (very confusing) pseries NUMA configuration
* Fix some bugs handling edge cases with XICS, XIVE and kernel_irqchip
* Fix icount for a number of POWER registers
* Many cleanups to error handling in XIVE code
* Validate size of -prom-env data
# gpg: Signature made Tue 18 Aug 2020 05:18:36 BST
# gpg: using RSA key 75F46586AE61A66CC44E87DC6C38CACA20D9B392
# gpg: Good signature from "David Gibson <david@gibson.dropbear.id.au>" [full]
# gpg: aka "David Gibson (Red Hat) <dgibson@redhat.com>" [full]
# gpg: aka "David Gibson (ozlabs.org) <dgibson@ozlabs.org>" [full]
# gpg: aka "David Gibson (kernel.org) <dwg@kernel.org>" [unknown]
# Primary key fingerprint: 75F4 6586 AE61 A66C C44E 87DC 6C38 CACA 20D9 B392
* remotes/dgibson/tags/ppc-for-5.2-20200818: (40 commits)
spapr/xive: Use xive_source_esb_len()
nvram: Exit QEMU if NVRAM cannot contain all -prom-env data
spapr/xive: Simplify error handling of kvmppc_xive_cpu_synchronize_state()
ppc/xive: Simplify error handling in xive_tctx_realize()
spapr/xive: Simplify error handling in kvmppc_xive_connect()
ppc/xive: Fix error handling in vmstate_xive_tctx_*() callbacks
spapr/xive: Fix error handling in kvmppc_xive_post_load()
spapr/kvm: Fix error handling in kvmppc_xive_pre_save()
spapr/xive: Rework error handling of kvmppc_xive_set_source_config()
spapr/xive: Rework error handling in kvmppc_xive_get_queues()
spapr/xive: Rework error handling of kvmppc_xive_[gs]et_queue_config()
spapr/xive: Rework error handling of kvmppc_xive_cpu_[gs]et_state()
spapr/xive: Rework error handling of kvmppc_xive_mmap()
spapr/xive: Rework error handling of kvmppc_xive_source_reset()
spapr/xive: Rework error handling of kvmppc_xive_cpu_connect()
spapr: Simplify error handling in spapr_phb_realize()
spapr/xive: Convert KVM device fd checks to assert()
ppc/xive: Introduce dedicated kvm_irqchip_in_kernel() wrappers
ppc/xive: Rework setup of XiveSource::esb_mmio
target/ppc: Integrate icount to purr, vtb, and tbu40
...
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Diffstat (limited to 'docs')
-rw-r--r-- | docs/specs/index.rst          |   1
-rw-r--r-- | docs/specs/ppc-spapr-numa.rst | 191
-rw-r--r-- | docs/specs/ppc-spapr-xive.rst |  10
3 files changed, 201 insertions, 1 deletion
diff --git a/docs/specs/index.rst b/docs/specs/index.rst
index 426632a475..1b0eb979d5 100644
--- a/docs/specs/index.rst
+++ b/docs/specs/index.rst
@@ -12,6 +12,7 @@ Contents:
    ppc-xive
    ppc-spapr-xive
+   ppc-spapr-numa
    acpi_hw_reduced_hotplug
    tpm
    acpi_hest_ghes
diff --git a/docs/specs/ppc-spapr-numa.rst b/docs/specs/ppc-spapr-numa.rst
new file mode 100644
index 0000000000..e762038022
--- /dev/null
+++ b/docs/specs/ppc-spapr-numa.rst
@@ -0,0 +1,191 @@
+
+NUMA mechanics for sPAPR (pseries machines)
+============================================
+
+NUMA in sPAPR works differently from the System Locality Distance
+Information Table (SLIT) in ACPI. The logic is explained in LOPAPR 1.1,
+chapter 15, "Non Uniform Memory Access (NUMA) Option". This document
+aims to complement that specification, providing details of the elements
+that impact how QEMU views NUMA in pseries.
+
+Associativity and ibm,associativity property
+--------------------------------------------
+
+Associativity is defined as a group of platform resources that has
+similar mean performance (or, in our context here, distance) relative to
+everyone else outside of the group.
+
+The format of the ibm,associativity property varies with the value of
+bit 0 of byte 5 of the ibm,architecture-vec-5 property. The format with
+bit 0 equal to zero is deprecated. The current format, with bit 0 equal
+to one, makes the ibm,associativity property represent the physical
+hierarchy of the platform, as one or more lists that start with the
+highest level grouping and go down to the smallest. Consider the
+following topology:
+
+::
+
+    Mem M1 ---- Proc P1    |
+    -----------------      | Socket S1  ---|
+          chip C1          |               |
+                                           | HW module 1 (MOD1)
+    Mem M2 ---- Proc P2    |               |
+    -----------------      | Socket S2  ---|
+          chip C2          |
+
+The ibm,associativity property for the processors would be:
+
+* P1: {MOD1, S1, C1, P1}
+* P2: {MOD1, S2, C2, P2}
+
+Each allocable resource has an ibm,associativity property. The LOPAPR
+specification allows multiple lists to be present in this property,
+considering that the same resource can have multiple connections to the
+platform.
+
+Relative Performance Distance and ibm,associativity-reference-points
+----------------------------------------------------------------------
+
+The ibm,associativity-reference-points property is an array that is used
+to define the relevant performance/distance-related boundaries, defining
+the NUMA levels for the platform.
+
+The definition of its elements also varies with the value of bit 0 of
+byte 5 of the ibm,architecture-vec-5 property. The format with bit 0
+equal to zero is also deprecated. With the current format, each integer
+of the ibm,associativity-reference-points array represents a 1-based
+ordinal index (i.e. the first element is 1) into the ibm,associativity
+array. The first boundary is the most significant to application
+performance, followed by less significant boundaries. Allocated
+resources that belong to the same performance boundary are expected to
+have a relative NUMA distance that matches the relevancy of the boundary
+itself. Resources that belong to the same first boundary will have the
+shortest distance from each other. Subsequent boundaries represent
+greater distances and degraded performance.
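As an aside, and not part of the patch above, the reference-point lookup just
described can be sketched in a few lines of Python (the helper name
numa_levels and the string encodings of the groupings are made up for this
illustration). It maps each 1-based reference point onto the associativity
entry it selects, which is what the worked example that follows does by hand:

::

  # Map 1-based reference points onto the entries of an ibm,associativity
  # array, most significant boundary first (illustrative sketch only).
  def numa_levels(assoc, reference_points):
      return [(hex(ref), assoc[ref - 1]) for ref in reference_points]

  # ibm,associativity of P1 from the topology above, as a plain Python list.
  P1 = ["MOD1", "S1", "C1", "P1"]

  # Reference points taken from the worked example below.
  print(numa_levels(P1, [0x3, 0x2, 0x1]))
  # -> [('0x3', 'C1'), ('0x2', 'S1'), ('0x1', 'MOD1')]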
+
+Using the previous example, the following setting of reference points
+defines three NUMA levels:
+
+* ibm,associativity-reference-points = {0x3, 0x2, 0x1}
+
+The first NUMA level (0x3) is interpreted as the third element of each
+ibm,associativity array, the second level is the second element and the
+third level is the first element. Let's also consider that elements
+belonging to the first NUMA level have distance equal to 10 from each
+other, and that each NUMA level doubles the distance of the previous one.
+This means that the second level would be 20 and the third level 40. For
+the P1 and P2 processors, we would have the following NUMA levels:
+
+::
+
+  * ibm,associativity-reference-points = {0x3, 0x2, 0x1}
+
+  * P1: associativity{MOD1, S1, C1, P1}
+
+    First NUMA level (0x3)  => associativity[2] = C1
+    Second NUMA level (0x2) => associativity[1] = S1
+    Third NUMA level (0x1)  => associativity[0] = MOD1
+
+  * P2: associativity{MOD1, S2, C2, P2}
+
+    First NUMA level (0x3)  => associativity[2] = C2
+    Second NUMA level (0x2) => associativity[1] = S2
+    Third NUMA level (0x1)  => associativity[0] = MOD1
+
+  P1 and P2 have the same third NUMA level, MOD1: distance between them = 40
+
+Changing the ibm,associativity-reference-points array changes the
+performance distance attributes for the same associativity arrays, as the
+following example illustrates:
+
+::
+
+  * ibm,associativity-reference-points = {0x2}
+
+  * P1: associativity{MOD1, S1, C1, P1}
+
+    First NUMA level (0x2) => associativity[1] = S1
+
+  * P2: associativity{MOD1, S2, C2, P2}
+
+    First NUMA level (0x2) => associativity[1] = S2
+
+  P1 and P2 do not have a common performance boundary. Since this is a
+  one-level NUMA configuration, the distance between them is one boundary
+  above the first level, 20.
+
+
+In a hypothetical platform where all resources inside the same hardware
+module are considered to be on the same performance boundary:
+
+::
+
+  * ibm,associativity-reference-points = {0x1}
+
+  * P1: associativity{MOD1, S1, C1, P1}
+
+    First NUMA level (0x1) => associativity[0] = MOD1
+
+  * P2: associativity{MOD1, S2, C2, P2}
+
+    First NUMA level (0x1) => associativity[0] = MOD1
+
+  P1 and P2 belong to the same first order boundary. The distance between
+  them is 10.
+
+
+How the pseries Linux guest calculates NUMA distances
+======================================================
+
+Another key difference between ACPI SLIT and LOPAPR regarding NUMA is how
+the distances are expressed. The SLIT table provides the NUMA distance
+value between the relevant resources. LOPAPR does not provide a standard
+way to calculate it. We have the ibm,associativity property for each
+resource, which provides a common-performance hierarchy, and the
+ibm,associativity-reference-points array that tells which levels of
+associativity are considered relevant or not.
+
+The result is that each OS is free to implement and to interpret the
+distance as it sees fit. For the pseries Linux guest, each level of NUMA
+doubles the distance of the previous level, and the maximum number of
+levels is limited to MAX_DISTANCE_REF_POINTS = 4 (from
+arch/powerpc/mm/numa.c in the kernel tree). This results in the following
+distances:
+
+* both resources in the first NUMA level: 10
+* resources one NUMA level apart: 20
+* resources two NUMA levels apart: 40
+* resources three NUMA levels apart: 80
+* resources four NUMA levels apart: 160
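As an aside, and not part of the patch above, the doubling rule can be
modeled with a short Python sketch (a rough model of the behaviour described
in this section, not a copy of the kernel code; the function name
guest_distance is made up here). Starting from the most significant boundary,
the distance doubles at every level where the two resources do not share a
grouping:

::

  LOCAL_DISTANCE = 10
  MAX_DISTANCE_REF_POINTS = 4   # bound mentioned above (arch/powerpc/mm/numa.c)

  def guest_distance(assoc_a, assoc_b, reference_points):
      # Walk the boundaries from most to least significant and stop at the
      # first one both resources share, doubling the distance otherwise.
      distance = LOCAL_DISTANCE
      for ref in reference_points[:MAX_DISTANCE_REF_POINTS]:
          if assoc_a[ref - 1] == assoc_b[ref - 1]:
              break
          distance *= 2
      return distance

  P1 = ["MOD1", "S1", "C1", "P1"]
  P2 = ["MOD1", "S2", "C2", "P2"]

  # With ibm,associativity-reference-points = {0x3, 0x2, 0x1}, P1 and P2 only
  # meet at the third level, so the distance is 10 * 2 * 2 = 40.
  print(guest_distance(P1, P2, [0x3, 0x2, 0x1]))   # -> 40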
+
+
+Consequences for QEMU NUMA tuning
+---------------------------------
+
+The way the pseries Linux guest calculates NUMA distances has a direct
+effect on what QEMU users can expect when doing NUMA tuning. As of QEMU
+5.1, this is the default ibm,associativity-reference-points being used in
+the pseries machine:
+
+ibm,associativity-reference-points = {0x4, 0x4, 0x2}
+
+The first and second levels are equal, 0x4, and a third one was added in
+commit a6030d7e0b35 exclusively for NVLink GPU support. This means that,
+regardless of how the ibm,associativity properties are being created in
+the device tree, the pseries Linux guest will only recognize three
+scenarios as far as NUMA distance goes:
+
+* if the resources belong to the same first NUMA level, the distance is 10
+* the second level is skipped since it's equal to the first
+* all resources that aren't an NVLink GPU are guaranteed to belong to the
+  same third NUMA level, having distance = 40
+* for NVLink GPUs, distance = 80 from everything else
+
+In short, we can summarize the NUMA distances seen in pseries Linux
+guests, using QEMU up to 5.1, as follows:
+
+* local distance, i.e. the distance of the resource to its own NUMA node: 10
+* if it's an NVLink GPU device, distance: 80
+* every other resource, distance: 40
+
+This also means that user input in the QEMU command line does not change
+the NUMA distancing inside the guest for the pseries machine.
diff --git a/docs/specs/ppc-spapr-xive.rst b/docs/specs/ppc-spapr-xive.rst
index 6159bc6eed..7144347560 100644
--- a/docs/specs/ppc-spapr-xive.rst
+++ b/docs/specs/ppc-spapr-xive.rst
@@ -61,6 +61,11 @@ depend on the XIVE KVM capability of the host. On older kernels without
 XIVE KVM support, QEMU will use the emulated XIVE device as a fallback
 and on newer kernels (>=5.2), the KVM XIVE device.
 
+XIVE native exploitation mode is not supported for KVM nested guests,
+VMs running under a L1 hypervisor (KVM on pSeries). In that case, the
+hypervisor will not advertise the KVM capability and QEMU will use the
+emulated XIVE device, same as for older versions of KVM.
+
 As a final refinement, the user can also switch the use of the KVM
 device with the machine option ``kernel_irqchip``.
 
@@ -121,6 +126,9 @@ xics        XICS KVM        XICS emul.      XICS KVM
 
 (1) QEMU warns with ``warning: kernel_irqchip requested but unavailable:
     IRQ_XIVE capability must be present for KVM``
+    In some cases (old host kernels or KVM nested guests), one may hit a
+    QEMU/KVM incompatibility due to device destruction in reset. QEMU fails
+    with ``KVM is incompatible with ic-mode=dual,kernel-irqchip=on``
 (2) QEMU fails with ``kernel_irqchip requested but unavailable:
     IRQ_XIVE capability must be present for KVM``
 
@@ -143,7 +151,7 @@ xics        XICS KVM        XICS emul.      XICS KVM
     mode (XICS), either don't set the ic-mode machine property or try
     ic-mode=xics or ic-mode=dual``
 (4) QEMU/KVM incompatibility due to device destruction in reset. QEMU fails
-    with ``KVM is too old to support ic-mode=dual,kernel-irqchip=on``
+    with ``KVM is incompatible with ic-mode=dual,kernel-irqchip=on``
 
 XIVE Device tree properties
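As a usage note, and not part of the patch above, the machine options
discussed in the ppc-spapr-xive.rst hunks can be combined on the command line
as shown below (a minimal illustrative invocation with no guest image
attached). Per the notes above, on hosts without XIVE KVM support this
combination can fail with the ``KVM is incompatible with
ic-mode=dual,kernel-irqchip=on`` error.

::

  # Request the in-kernel irqchip together with the dual interrupt
  # controller mode on a pseries machine.
  qemu-system-ppc64 -accel kvm \
      -machine pseries,ic-mode=dual,kernel_irqchip=on -nographic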