aboutsummaryrefslogtreecommitdiff
path: root/include/hw/ppc/spapr.h
AgeCommit message (Collapse)Author
2020-03-17ppc/spapr: Add FWNMI System Reset stateNicholas Piggin
The FWNMI option must deliver system reset interrupts to their registered address, and there are a few constraints on the handler addresses specified in PAPR. Add the system reset address state and checks. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Message-Id: <20200316142613.121089-4-npiggin@gmail.com> Reviewed-by: Greg Kurz <groug@kaod.org> Reviwed-by: Mahesh Salgaonkar <mahesh@linux.ibm.com> Reviewed-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-03-17ppc/spapr: Change FWNMI namesNicholas Piggin
The option is called "FWNMI", and it involves more than just machine checks, also machine checks can be delivered without the FWNMI option, so re-name various things to reflect that. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Message-Id: <20200316142613.121089-3-npiggin@gmail.com> Reviewed-by: Greg Kurz <groug@kaod.org> Reviewed-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-03-17spapr/rtas: Reserve space for RTAS blob and logAlexey Kardashevskiy
At the moment SLOF reserves space for RTAS and instantiates the RTAS blob which is 20 bytes binary blob calling an hypercall. The rest of the RTAS area is a log which SLOF has no idea about but QEMU does. This moves RTAS sizing to QEMU and this overrides the size from SLOF. The only remaining problem is that SLOF copies the number of bytes it reserved (2KB for now) so QEMU needs to reserve at least this much; SLOF will be fixed separately to check that rtas-size from QEMU is enough for those 20 bytes for the H_RTAS hcall. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Message-Id: <20200316011841.99970-1-aik@ozlabs.ru> Reviewed-by: Greg Kurz <groug@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-03-17spapr: Don't clamp RMA to 16GiB on new machine typesDavid Gibson
In spapr_machine_init() we clamp the size of the RMA to 16GiB and the comment saying why doesn't make a whole lot of sense. In fact, this was done because the real mode handling code elsewhere limited the RMA in TCG mode to the maximum value configurable in LPCR[RMLS], 16GiB. But, * Actually LPCR[RMLS] has been able to encode a 256GiB size for a very long time, we just didn't implement it properly in the softmmu * LPCR[RMLS] shouldn't really be relevant anyway, it only was because we used to abuse the RMOR based translation mode in order to handle the fact that we're not modelling the hypervisor parts of the cpu We've now removed those limitations in the modelling so the 16GiB clamp no longer serves a function. However, we can't just remove the limit universally: that would break migration to earlier qemu versions, where the 16GiB RMLS limit still applies, no matter how bad the reasons for it are. So, we replace the 16GiB clamp, with a clamp to a limit defined in the machine type class. We set it to 16 GiB for machine types 4.2 and earlier, but set it to 0 meaning unlimited for the new 5.0 machine type. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Greg Kurz <groug@kaod.org> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
2020-03-17spapr: Don't attempt to clamp RMA to VRMA constraintDavid Gibson
The Real Mode Area (RMA) is the part of memory which a guest can access when in real (MMU off) mode. Of course, for a guest under KVM, the MMU isn't really turned off, it's just in a special translation mode - Virtual Real Mode Area (VRMA) - which looks like real mode in guest mode. The mechanics of how this works when using the hash MMU (HPT) put a constraint on the size of the RMA, which depends on the size of the HPT. So, the latter part of spapr_setup_hpt_and_vrma() clamps the RMA we advertise to the guest based on this VRMA limit. There are several things wrong with this: 1) spapr_setup_hpt_and_vrma() doesn't actually clamp, it takes the minimum of Node 0 memory size and the VRMA limit. That will *often* work the same as clamping, but there can be other constraints on RMA size which supersede Node 0 memory size. We have real bugs caused by this (currently worked around in the guest kernel) 2) Some callers of spapr_setup_hpt_and_vrma() are in a situation where we're past the point that we can actually advertise an RMA limit to the guest 3) But most fundamentally, the VRMA limit depends on host configuration (page size) which shouldn't be visible to the guest, but this partially exposes it. This can cause problems with migration in certain edge cases, although we will mostly get away with it. In practice, this clamping is almost never applied anyway. With 64kiB pages and the normal rules for sizing of the HPT, the theoretical VRMA limit will be 4x(guest memory size) and so never hit. It will hit with 4kiB pages, where it will be (guest memory size)/4. However all mainstream distro kernels for POWER have used a 64kiB page size for at least 10 years. So, simply replace this logic with a check that the RMA we've calculated based only on guest visible configuration will fit within the host implied VRMA limit. This can break if running HPT guests on a host kernel with 4kiB page size. As noted that's very rare. There also exist several possible workarounds: * Change the host kernel to use 64kiB pages * Use radix MMU (RPT) guests instead of HPT * Use 64kiB hugepages on the host to back guest memory * Increase the guest memory size so that the RMA hits one of the fixed limits before the RMA limit. This is relatively easy on POWER8 which has a 16GiB limit, harder on POWER9 which has a 1TiB limit. * Use a guest NUMA configuration which artificially constrains the RMA within the VRMA limit (the RMA must always fit within Node 0). Previously, on KVM, we also temporarily reduced the rma_size to 256M so that the we'd load the kernel and initrd safely, regardless of the VRMA limit. This was a) confusing, b) could significantly limit the size of images we could load and c) introduced a behavioural difference between KVM and TCG. So we remove that as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Greg Kurz <groug@kaod.org>
2020-03-17spapr: Handle pending hot plug/unplug requests at CASGreg Kurz
If a hot plug or unplug request is pending at CAS, we currently trigger a CAS reboot, which severely increases the guest boot time. This is because SLOF doesn't handle hot plug events and we had no way to fix the FDT that gets presented to the guest. We can do better thanks to recent changes in QEMU and SLOF: - we now return a full FDT to SLOF during CAS - SLOF was fixed to correctly detect any device that was either added or removed since boot time and to update its internal DT accordingly. The right solution is to process all pending hot plug/unplug requests during CAS: convert hot plugged devices to cold plugged devices and remove the hot unplugged ones, which is exactly what spapr_drc_reset() does. Also clear all hot plug events that are currently queued since they're no longer relevant. Note that SLOF cannot currently populate hot plugged PCI bridges or PHBs at CAS. Until this limitation is lifted, SLOF will reset the machine when this scenario occurs : this will allow the FDT to be fully processed when SLOF is started again (ie. the same effect as the CAS reboot that would occur anyway without this patch). Signed-off-by: Greg Kurz <groug@kaod.org> Message-Id: <158257222352.4102917.8984214333937947307.stgit@bahia.lan> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-02-21spapr: Allow changing offset for -kernel imageAlexey Kardashevskiy
This allows moving the kernel in the guest memory. The option is useful for step debugging (as Linux is linked at 0x0); it also allows loading grub which is normally linked to run at 0x20000. This uses the existing kernel address by default. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Message-Id: <20200203032943.121178-6-aik@ozlabs.ru> Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-02-21spapr: Add Hcalls to support PAPR NVDIMM deviceShivaprasad G Bhat
This patch implements few of the necessary hcalls for the nvdimm support. PAPR semantics is such that each NVDIMM device is comprising of multiple SCM(Storage Class Memory) blocks. The guest requests the hypervisor to bind each of the SCM blocks of the NVDIMM device using hcalls. There can be SCM block unbind requests in case of driver errors or unplug(not supported now) use cases. The NVDIMM label read/writes are done through hcalls. Since each virtual NVDIMM device is divided into multiple SCM blocks, the bind, unbind, and queries using hcalls on those blocks can come independently. This doesn't fit well into the qemu device semantics, where the map/unmap are done at the (whole)device/object level granularity. The patch doesnt actually bind/unbind on hcalls but let it happen at the device_add/del phase itself instead. The guest kernel makes bind/unbind requests for the virtual NVDIMM device at the region level granularity. Without interleaving, each virtual NVDIMM device is presented as a separate guest physical address range. So, there is no way a partial bind/unbind request can come for the vNVDIMM in a hcall for a subset of SCM blocks of a virtual NVDIMM. Hence it is safe to do bind/unbind everything during the device_add/del. Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com> Message-Id: <158131059899.2897.11515211602702956854.stgit@lep8c.aus.stglabs.ibm.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-02-03migration: Include migration support for machine check handlingAravinda Prasad
This patch includes migration support for machine check handling. Especially this patch blocks VM migration requests until the machine check error handling is complete as these errors are specific to the source hardware and is irrelevant on the target hardware. Signed-off-by: Aravinda Prasad <arawinda.p@gmail.com> [Do not set FWNMI cap in post_load, now its done in .apply hook] Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com> Message-Id: <20200130184423.20519-7-ganeshgr@linux.ibm.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-02-03ppc: spapr: Handle "ibm,nmi-register" and "ibm,nmi-interlock" RTAS callsAravinda Prasad
This patch adds support in QEMU to handle "ibm,nmi-register" and "ibm,nmi-interlock" RTAS calls. The machine check notification address is saved when the OS issues "ibm,nmi-register" RTAS call. This patch also handles the case when multiple processors experience machine check at or about the same time by handling "ibm,nmi-interlock" call. In such cases, as per PAPR, subsequent processors serialize waiting for the first processor to issue the "ibm,nmi-interlock" call. The second processor that also received a machine check error waits till the first processor is done reading the error log. The first processor issues "ibm,nmi-interlock" call when the error log is consumed. Signed-off-by: Aravinda Prasad <arawinda.p@gmail.com> [Register fwnmi RTAS calls in core_rtas_register_types() where other RTAS calls are registered] Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com> Message-Id: <20200130184423.20519-6-ganeshgr@linux.ibm.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-02-03target/ppc: Build rtas error log upon an MCEAravinda Prasad
Upon a machine check exception (MCE) in a guest address space, KVM causes a guest exit to enable QEMU to build and pass the error to the guest in the PAPR defined rtas error log format. This patch builds the rtas error log, copies it to the rtas_addr and then invokes the guest registered machine check handler. The handler in the guest takes suitable action(s) depending on the type and criticality of the error. For example, if an error is unrecoverable memory corruption in an application inside the guest, then the guest kernel sends a SIGBUS to the application. For recoverable errors, the guest performs recovery actions and logs the error. Signed-off-by: Aravinda Prasad <arawinda.p@gmail.com> [Assume SLOF has allocated enough room for rtas error log] Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Message-Id: <20200130184423.20519-5-ganeshgr@linux.ibm.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-02-03target/ppc: Handle NMI guest exitAravinda Prasad
Memory error such as bit flips that cannot be corrected by hardware are passed on to the kernel for handling. If the memory address in error belongs to guest then the guest kernel is responsible for taking suitable action. Patch [1] enhances KVM to exit guest with exit reason set to KVM_EXIT_NMI in such cases. This patch handles KVM_EXIT_NMI exit. [1] https://www.spinics.net/lists/kvm-ppc/msg12637.html (e20bbd3d and related commits) Signed-off-by: Aravinda Prasad <arawinda.p@gmail.com> Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Greg Kurz <groug@kaod.org> Message-Id: <20200130184423.20519-4-ganeshgr@linux.ibm.com> [dwg: #ifdefs to fix compile for 32-bit target] Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-02-03ppc: spapr: Introduce FWNMI capabilityAravinda Prasad
Introduce fwnmi an spapr capability and add a helper function which tries to enable it, which would be used by following patch of the series. This patch by itself does not change the existing behavior. Signed-off-by: Aravinda Prasad <arawinda.p@gmail.com> [eliminate cap_ppc_fwnmi, add fwnmi cap to migration state and reprhase the commit message] Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Message-Id: <20200130184423.20519-3-ganeshgr@linux.ibm.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-12-17spapr: Fold h_cas_compose_response() into h_client_architecture_support()David Gibson
spapr_h_cas_compose_response() handles the last piece of the PAPR feature negotiation process invoked via the ibm,client-architecture-support OF call. Its only caller is h_client_architecture_support() which handles most of the rest of that process. I believe it was placed in a separate file originally to handle some fiddly dependencies between functions, but mostly it's just confusing to have the CAS process split into two pieces like this. Now that compose response is simplified (by just generating the whole device tree anew), it's cleaner to just fold it into h_client_architecture_support(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Cedric Le Goater <clg@fr.ibm.com> Reviewed-by: Greg Kurz <groug@kaod.org>
2019-10-24spapr: Move SpaprIrq::nr_xirqs to SpaprMachineClassDavid Gibson
For the benefit of peripheral device allocation, the number of available irqs really wants to be the same on a given machine type version, regardless of what irq backends we are using. That's the case now, but only because we make sure the different SpaprIrq instances have the same value except for the special legacy one. Since this really only depends on machine type version, move the value to SpaprMachineClass instead of SpaprIrq. This also puts the code to set it to the lower value on old machine types right next to setting legacy_irq_allocation, which needs to go hand in hand. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Greg Kurz <groug@kaod.org> Reviewed-by: Cédric Le Goater <clg@kaod.org>
2019-10-24spapr: Formalize notion of active interrupt controllerDavid Gibson
spapr now has the mechanism of constructing both XICS and XIVE instances of the SpaprInterruptController interface. However, only one of the interrupt controllers will actually be active at any given time, depending on feature negotiation with the guest. This is handled in the current code via spapr_irq_current() which checks the OV5 vector from feature negotiation to determine the current backend. Determining the active controller at the point we need it like this can be pretty confusing, because it makes it very non obvious at what points the active controller can change. This can make it difficult to reason about the code and where a change of active controller could appear in sequence with other events. Make this mechanism more explicit by adding an 'active_intc' pointer and an explicit spapr_irq_update_active_intc() function to update it from the CAS state. We also add hooks on the intc backend which will get called when it is activated or deactivated. For now we just introduce the switch and hooks, later patches will actually start using them. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Greg Kurz <groug@kaod.org> Reviewed-by: Cédric Le Goater <clg@kaod.org>
2019-10-24spapr: Set VSMT to smp_threads by defaultGreg Kurz
Support for setting VSMT is available in KVM since linux-4.13. Most distros that support KVM on POWER already have it. It thus seem reasonable enough to have the default machine to set VSMT to smp_threads. This brings contiguous VCPU ids and thus brings their upper bound down to the machine's max_cpus. This is especially useful for XIVE KVM devices, which may thus allocate only one VP descriptor per VCPU. Signed-off-by: Greg Kurz <groug@kaod.org> Message-Id: <157010411885.246126.12610015369068227139.stgit@bahia.lan> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-10-04spapr: Stop providing RTAS blobAlexey Kardashevskiy
SLOF implements one itself so let's remove it from QEMU. It is one less image and simpler setup as the RTAS blob never stays in its initial place anyway as the guest OS always decides where to put it. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Greg Kurz <groug@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-10-04spapr: Simplify handling of pre ISA 3.0 guest workaround handlingDavid Gibson
Certain old guest versions don't understand the radix MMU introduced with POWER ISA 3.0, but incorrectly select it if presented with the option at CAS time. We workaround this in qemu by explicitly excluding the radix (and other ISA 3.0 linked) options if the guest doesn't explicitly note support for ISA 3.0. This is handled by the 'cas_legacy_guest_workaround' flag, which is pretty vague. Rename it to 'cas_pre_isa3_guest' to be clearer about what it's for. In addition, we unnecessarily call spapr_populate_pa_features() with different options when initially constructing the device tree and when adjusting it at CAS time. At the initial construct time cas_pre_isa3_guest is already false, so we can still use the flag, rather than explicitly overriding it to be false at the callsite. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: Greg Kurz <groug@kaod.org> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
2019-08-29spapr_pci: Advertise BAR reallocation capabilityAlexey Kardashevskiy
The pseries guests do not normally allocate PCI resources and rely on the system firmware doing so. Furthermore at least at some point in the past the pseries guests won't even allowed to change BARs, probably it is still the case for phyp. So since the initial commit we have [1] which prevents resource reallocation. This is not a problem until we want specific BAR alignments, for example, PAGE_SIZE==64k to make sure we can still map MMIO BARs directly. For the boot time devices we handle this in SLOF [2] but since QEMU's RTAS does not allocate BARs, the guest does this instead and does not align BARs even if Linux is given pci=resource_alignment=16@pci:0:0 as PCI_PROBE_ONLY makes Linux ignore alignment requests. ARM folks added a dial to control PCI_PROBE_ONLY via the device tree [3]. This makes use of the dial to advertise to the guest that we can handle BAR reassignments. This limits the change to the latest pseries machine to avoid old guests explosion. We do not remove the flag from [1] as pseries guests are still supported under phyp so having that removed may cause problems. [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/platforms/pseries/setup.c?h=v5.1#n773 [2] https://git.qemu.org/?p=SLOF.git;a=blob;f=board-qemu/slof/pci-phb.fs;h=06729bcf77a0d4e900c527adcd9befe2a269f65d;hb=HEAD#l338 [3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f81c11af Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Message-Id: <20190719043734.108462-1-aik@ozlabs.ru> Reviewed-by: Greg Kurz <groug@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-08-21spapr: Implement ibm,suspend-meNicholas Piggin
This has been useful to modify and test the Linux pseries suspend code but it requires modification to the guest to call it (due to being gated by other unimplemented features). It is not otherwise used by Linux yet, but work is slowly progressing there. This allows a (lightly modified) guest kernel to suspend with `echo mem > /sys/power/state` and be resumed with system_wakeup monitor command. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Message-Id: <20190722061752.22114-2-npiggin@gmail.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-08-21spapr: initial implementation for H_TPM_COMM/spapr-tpm-proxyMichael Roth
This implements the H_TPM_COMM hypercall, which is used by an Ultravisor to pass TPM commands directly to the host's TPM device, or a TPM Resource Manager associated with the device. This also introduces a new virtual device, spapr-tpm-proxy, which is used to configure the host TPM path to be used to service requests sent by H_TPM_COMM hcalls, for example: -device spapr-tpm-proxy,id=tpmp0,host-path=/dev/tpmrm0 By default, no spapr-tpm-proxy will be created, and hcalls will return H_FUNCTION. The full specification for this hypercall can be found in docs/specs/ppc-spapr-uv-hcalls.txt Since SVM-related hcalls like H_TPM_COMM use a reserved range of 0xEF00-0xEF80, we introduce a separate hcall table here to handle them. Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com Message-Id: <20190717205842.17827-3-mdroth@linux.vnet.ibm.com> [dwg: Corrected #include for upstream change] Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-08-21spapr: Implement dispatch tracking for tcgNicholas Piggin
Implement cpu_exec_enter/exit on ppc which calls into new methods of the same name in PPCVirtualHypervisorClass. These are used by spapr to implement the splpar VPA dispatch counter initially. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Message-Id: <20190718034214.14948-2-npiggin@gmail.com> [dwg: Removed unnecessary CONFIG_USER_ONLY checks as suggested by gkurz] Reviewed-by: Greg Kurz <groug@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-07-02xics/spapr: Register RTAS/hypercalls once at machine initGreg Kurz
QEMU may crash when running a spapr machine in 'dual' interrupt controller mode on some older (but not that old, eg. ubuntu 18.04.2) KVMs with partial XIVE support: qemu-system-ppc64: hw/ppc/spapr_rtas.c:411: spapr_rtas_register: Assertion `!name || !rtas_table[token].name' failed. XICS is controlled by the guest thanks to a set of RTAS calls. Depending on whether KVM XICS is used or not, the RTAS calls are handled by KVM or QEMU. In both cases, QEMU needs to expose the RTAS calls to the guest through the "rtas" node of the device tree. The spapr_rtas_register() helper takes care of all of that: it adds the RTAS call token to the "rtas" node and registers a QEMU callback to be invoked when the guest issues the RTAS call. In the KVM XICS case, QEMU registers a dummy callback that just prints an error since it isn't supposed to be invoked, ever. Historically, the XICS controller was setup during machine init and released during final teardown. This changed when the 'dual' interrupt controller mode was added to the spapr machine: in this case we need to tear the XICS down and set it up again during machine reset. The crash happens because we indeed have an incompatibility with older KVMs that forces QEMU to fallback on emulated XICS, which tries to re-registers the same RTAS calls. This could be fixed by adding proper rollback that would unregister RTAS calls on error. But since the emulated RTAS calls in QEMU can now detect when they are mistakenly called while KVM XICS is in use, it seems simpler to register them once and for all at machine init. This fixes the crash and allows to remove some now useless lines of code. Signed-off-by: Greg Kurz <groug@kaod.org> Message-Id: <156044429963.125694.13710679451927268758.stgit@bahia.lab.toulouse-stg.fr.ibm.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-05-29spapr: Don't migrate the hpt_maxpagesize cap to older machine typesGreg Kurz
Commit 0b8c89be7f7b added the hpt_maxpagesize capability to the migration stream. This is okay for new machine types but it breaks backward migration to older QEMUs, which don't expect the extra subsection. Add a compatibility boolean flag to the sPAPR machine class and use it to skip migration of the capability for machine types 4.0 and older. This fixes migration to an older QEMU. Note that the destination will emit a warning: qemu-system-ppc64: warning: cap-hpt-max-page-size lower level (16) in incoming stream than on destination (24) This is expected and harmless though. It is okay to migrate from a lower HPT maximum page size (64k) to a greater one (16M). Fixes: 0b8c89be7f7b "spapr: Add forgotten capability to migration stream" Based-on: <20190522074016.10521-3-clg@kaod.org> Signed-off-by: Greg Kurz <groug@kaod.org> Message-Id: <155853262675.1158324.17301777846476373459.stgit@bahia.lan> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-05-29spapr: Add forgotten capability to migration streamDavid Gibson
spapr machine capabilities are supposed to be sent in the migration stream so that we can sanity check the source and destination have compatible configuration. Unfortunately, when we added the hpt-max-page-size capability, we forgot to add it to the migration state. This means that we can generate spurious warnings when both ends are configured for large pages, or potentially fail to warn if the source is configured for huge pages, but the destination is not. Fixes: 2309832afda "spapr: Maximum (HPT) pagesize property" Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Cédric Le Goater <clg@kaod.org>
2019-04-26ppc/hash64: Rework R and C bit updatesBenjamin Herrenschmidt
With MT-TCG, we are now running translation in a racy way, thus we need to mimic hardware when it comes to updating the R and C bits, by doing byte stores. The current "store_hpte" abstraction is ill suited for this, we replace it with two separate callbacks for setting R and C. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Cédric Le Goater <clg@kaod.org> Message-Id: <20190411080004.8690-4-clg@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-04-26spapr/rtas: modify spapr_rtas_register() to remove RTAS handlersCédric Le Goater
Removing RTAS handlers will become necessary when the new pseries machine supporting multiple interrupt mode is introduced. Signed-off-by: Cédric Le Goater <clg@kaod.org> Message-Id: <20190321144914.19934-9-clg@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-04-26spapr: Support NVIDIA V100 GPU with NVLink2Alexey Kardashevskiy
NVIDIA V100 GPUs have on-board RAM which is mapped into the host memory space and accessible as normal RAM via an NVLink bus. The VFIO-PCI driver implements special regions for such GPUs and emulates an NVLink bridge. NVLink2-enabled POWER9 CPUs also provide address translation services which includes an ATS shootdown (ATSD) register exported via the NVLink bridge device. This adds a quirk to VFIO to map the GPU memory and create an MR; the new MR is stored in a PCI device as a QOM link. The sPAPR PCI uses this to get the MR and map it to the system address space. Another quirk does the same for ATSD. This adds additional steps to sPAPR PHB setup: 1. Search for specific GPUs and NPUs, collect findings in sPAPRPHBState::nvgpus, manage system address space mappings; 2. Add device-specific properties such as "ibm,npu", "ibm,gpu", "memory-block", "link-speed" to advertise the NVLink2 function to the guest; 3. Add "mmio-atsd" to vPHB to advertise the ATSD capability; 4. Add new memory blocks (with extra "linux,memory-usable" to prevent the guest OS from accessing the new memory until it is onlined) and npuphb# nodes representing an NPU unit for every vPHB as the GPU driver uses it for link discovery. This allocates space for GPU RAM and ATSD like we do for MMIOs by adding 2 new parameters to the phb_placement() hook. Older machine types set these to zero. This puts new memory nodes in a separate NUMA node to as the GPU RAM needs to be configured equally distant from any other node in the system. Unlike the host setup which assigns numa ids from 255 downwards, this adds new NUMA nodes after the user configures nodes or from 1 if none were configured. This adds requirement similar to EEH - one IOMMU group per vPHB. The reason for this is that ATSD registers belong to a physical NPU so they cannot invalidate translations on GPUs attached to another NPU. It is guaranteed by the host platform as it does not mix NVLink bridges or GPUs from different NPU in the same IOMMU group. If more than one IOMMU group is detected on a vPHB, this disables ATSD support for that vPHB and prints a warning. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> [aw: for vfio portions] Acked-by: Alex Williamson <alex.williamson@redhat.com> Message-Id: <20190312082103.130561-1-aik@ozlabs.ru> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-03-29spapr: Simplify handling of host-serial and host-model valuesDavid Gibson
27461d69a0f "ppc: add host-serial and host-model machine attributes (CVE-2019-8934)" introduced 'host-serial' and 'host-model' machine properties for spapr to explicitly control the values advertised to the guest in device tree properties with the same names. The previous behaviour on KVM was to unconditionally populate the device tree with the real host serial number and model, which leaks possibly sensitive information about the host to the guest. To maintain compatibility for old machine types, we allowed those props to be set to "passthrough" to take the value from the host as before. Or they could be set to "none" to explicitly omit the device tree items. Special casing specific values on what's otherwise a user supplied string is very ugly. So, this patch simplifies things by implementing the backwards compatibility in a different way: we have a machine class flag set for the older machines, and we only load the host values into the device tree if A) they're not set by the user and B) we have that flag set. This does mean that the "passthrough" functionality is no longer available with the current machine type. That's ok though: if a user or management layer really wants the information passed through they can read it themselves (OpenStack Nova already does something similar for x86). It also means the user can't explicitly ask for the values to be omitted on the old machine types. I think that's an acceptable trade-off: if you care enough about not leaking the host information you can either move to the new machine type, or use a dummy value for the properties. For the new machine type, this also removes an odd inconsistency between running on a POWER and non-POWER (or non-Linux) hosts: if the host information couldn't be read from where we expect (in the host's device tree as exposed by Linux), we'd fallback to omitting the guest device tree items. While we're there, improve some poorly worded comments, and the help text for the properties. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> Reviewed-by: Greg Kurz <groug@kaod.org> Tested-by: Greg Kurz <groug@kaod.org>
2019-03-12spapr: Use CamelCase properlyDavid Gibson
The qemu coding standard is to use CamelCase for type and structure names, and the pseries code follows that... sort of. There are quite a lot of places where we bend the rules in order to preserve the capitalization of internal acronyms like "PHB", "TCE", "DIMM" and most commonly "sPAPR". That was a bad idea - it frequently leads to names ending up with hard to read clusters of capital letters, and means they don't catch the eye as type identifiers, which is kind of the point of the CamelCase convention in the first place. In short, keeping type identifiers look like CamelCase is more important than preserving standard capitalization of internal "words". So, this patch renames a heap of spapr internal type names to a more standard CamelCase. In addition to case changes, we also make some other identifier renames: VIOsPAPR* -> SpaprVio* The reverse word ordering was only ever used to mitigate the capital cluster, so revert to the natural ordering. VIOsPAPRVTYDevice -> SpaprVioVty VIOsPAPRVLANDevice -> SpaprVioVlan Brevity, since the "Device" didn't add useful information sPAPRDRConnector -> SpaprDrc sPAPRDRConnectorClass -> SpaprDrcClass Brevity, and makes it clearer this is the same thing as a "DRC" mentioned in many other places in the code This is 100% a mechanical search-and-replace patch. It will, however, conflict with essentially any and all outstanding patches touching the spapr code. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-03-12spapr_iommu: Do not replay mappings from just created DMA windowAlexey Kardashevskiy
On sPAPR vfio_listener_region_add() is called in 2 situations: 1. a new listener is registered from vfio_connect_container(); 2. a new IOMMU Memory Region is added from rtas_ibm_create_pe_dma_window(). In both cases vfio_listener_region_add() calls memory_region_iommu_replay() to notify newly registered IOMMU notifiers about existing mappings which is totally desirable for case 1. However for case 2 it is nothing but noop as the window has just been created and has no valid mappings so replaying those does not do anything. It is barely noticeable with usual guests but if the window happens to be really big, such no-op replay might take minutes and trigger RCU stall warnings in the guest. For example, a upcoming GPU RAM memory region mapped at 64TiB (right after SPAPR_PCI_LIMIT) causes a 64bit DMA window to be at least 128TiB which is (128<<40)/0x10000=2.147.483.648 TCEs to replay. This mitigates the problem by adding an "skipping_replay" flag to sPAPRTCETable and defining sPAPR own IOMMU MR replay() hook which does exactly the same thing as the generic one except it returns early if @skipping_replay==true. Another way of fixing this would be delaying replay till the very first H_PUT_TCE but this does not work if in-kernel H_PUT_TCE handler is enabled (a likely case). When "ibm,create-pe-dma-window" is complete, the guest will map only required regions of the huge DMA window. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Message-Id: <20190307050518.64968-2-aik@ozlabs.ru> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-03-12spapr: Force SPAPR_MEMORY_BLOCK_SIZE to be a hwaddr (64-bit)David Gibson
SPAPR_MEMORY_BLOCK_SIZE is logically a difference in memory addresses, and hence of type hwaddr which is 64-bit. Previously it wasn't marked as such which means that it could be treated as 32-bit. That will work in some circumstances but if multiplied by another 32-bit value it could lead to a 32-bit overflow and an incorrect result. One specific instance of this in spapr_lmb_dt_populate() was spotted by Coverity (CID 1399145). Reported-by: Peter Maydell <peter.maydell@linaro.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-03-12target/ppc/spapr: Add SPAPR_CAP_CCF_ASSISTSuraj Jitindar Singh
Introduce a new spapr_cap SPAPR_CAP_CCF_ASSIST to be used to indicate the requirement for a hw-assisted version of the count cache flush workaround. The count cache flush workaround is a software workaround which can be used to flush the count cache on context switch. Some revisions of hardware may have a hardware accelerated flush, in which case the software flush can be shortened. This cap is used to set the availability of such hardware acceleration for the count cache flush routine. The availability of such hardware acceleration is indicated by the H_CPU_CHAR_BCCTR_FLUSH_ASSIST flag being set in the characteristics returned from the KVM_PPC_GET_CPU_CHAR ioctl. Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Message-Id: <20190301031912.28809-2-sjitindarsingh@gmail.com> [dwg: Small style fixes] Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-03-12target/ppc/spapr: Add workaround option to SPAPR_CAP_IBSSuraj Jitindar Singh
The spapr_cap SPAPR_CAP_IBS is used to indicate the level of capability for mitigations for indirect branch speculation. Currently the available values are broken (default), fixed-ibs (fixed by serialising indirect branches) and fixed-ccd (fixed by diabling the count cache). Introduce a new value for this capability denoted workaround, meaning that software can work around the issue by flushing the count cache on context switch. This option is available if the hypervisor sets the H_CPU_BEHAV_FLUSH_COUNT_CACHE flag in the cpu behaviours returned from the KVM_PPC_GET_CPU_CHAR ioctl. Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Message-Id: <20190301031912.28809-1-sjitindarsingh@gmail.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-03-12target/ppc/spapr: Add SPAPR_CAP_LARGE_DECREMENTERSuraj Jitindar Singh
Add spapr_cap SPAPR_CAP_LARGE_DECREMENTER to be used to control the availability of the large decrementer for a guest. Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Message-Id: <20190301024317.22137-1-sjitindarsingh@gmail.com> [dwg: Trivial style fix] Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-02-26spapr: add hotplug hooks for PHB hotplugGreg Kurz
Hotplugging PHBs is a machine-level operation, but PHBs reside on the main system bus, so we register spapr machine as the handler for the main system bus. Provide the usual pre-plug, plug and unplug-request handlers. Move the checking of the PHB index to the pre-plug handler. It is okay to do that and assert in the realize function because the pre-plug handler is always called, even for the oldest machine types we support. Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com> (Fixed interrupt controller phandle in "interrupt-map" and TCE table size in "ibm,dma-window" FDT fragment, Greg Kurz) Signed-off-by: Greg Kurz <groug@kaod.org> Message-Id: <155059672926.1466090.13612804072190051439.stgit@bahia.lab.toulouse-stg.fr.ibm.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-02-26spapr: create DR connectors for PHBsMichael Roth
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Greg Kurz <groug@kaod.org> Message-Id: <155059670389.1466090.10015601248906623076.stgit@bahia.lab.toulouse-stg.fr.ibm.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-02-26spapr: Generate FDT fragment for CPUs at configure connector timeGreg Kurz
Signed-off-by: Greg Kurz <groug@kaod.org> Message-Id: <155059666839.1466090.3833376527523126752.stgit@bahia.lab.toulouse-stg.fr.ibm.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-02-26spapr: Generate FDT fragment for LMBs at configure connector timeGreg Kurz
Signed-off-by: Greg Kurz <groug@kaod.org> Message-Id: <155059666331.1466090.6766540766297333313.stgit@bahia.lab.toulouse-stg.fr.ibm.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-02-26target/ppc/spapr: Set LPCR:HR when using Radix modeBenjamin Herrenschmidt
The HW relies on LPCR:HR along with the PATE to determine whether to use Radix or Hash mode. In fact it uses LPCR:HR more commonly than the PATE. For us, it's also more efficient to do so, especially since unlike the HW we do not maintain a cache of the current PATE and HV PATE in a generic place. Prepare the grounds for that by ensuring that LPCR:HR is set properly on SPAPR machines. Another option would have been to use a callback to get the PATE but this gets messy when implementing bare metal support, it's much simpler (and faster) to use LPCR. Since existing migration streams may not have it, fix it up in spapr_post_load() as well based on the pseudo-PATE entry that we keep. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Cédric Le Goater <clg@kaod.org> Message-Id: <20190215170029.15641-2-clg@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-02-26ppc: add host-serial and host-model machine attributes (CVE-2019-8934)Prasad J Pandit
On ppc hosts, hypervisor shares following system attributes - /proc/device-tree/system-id - /proc/device-tree/model with a guest. This could lead to information leakage and misuse.[*] Add machine attributes to control such system information exposure to a guest. [*] https://wiki.openstack.org/wiki/OSSN/OSSN-0028 Reported-by: Daniel P. Berrangé <berrange@redhat.com> Fix-suggested-by: Daniel P. Berrangé <berrange@redhat.com> Signed-off-by: Prasad J Pandit <pjp@fedoraproject.org> Message-Id: <20190218181349.23885-1-ppandit@redhat.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> Reviewed-by: Greg Kurz <groug@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-02-18spapr/irq: Use the base ICP class for KVMGreg Kurz
The base ICP class knows how to interact with KVM. Adapt sPAPR to use it instead of the ICP KVM class. Signed-off-by: Greg Kurz <groug@kaod.org> Message-Id: <155023080638.1011724.792095453419098948.stgit@bahia.lan> Reviewed-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-02-17spapr: Rename xics to intc in interrupt controller agnostic codeGreg Kurz
All this code is used with both the XICS and XIVE interrupt controllers. Signed-off-by: Greg Kurz <groug@kaod.org> Reviewed-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-01-22ppc: Fix duplicated typedefs to be able to compile with Clang in gnu99 modeThomas Huth
When compiling the ppc code with clang and -std=gnu99, there are a couple of warnings/errors like this one: CC ppc64-softmmu/hw/intc/xics.o In file included from hw/intc/xics.c:35: include/hw/ppc/xics.h:43:25: error: redefinition of typedef 'ICPState' is a C11 feature [-Werror,-Wtypedef-redefinition] typedef struct ICPState ICPState; ^ target/ppc/cpu.h:1181:25: note: previous definition is here typedef struct ICPState ICPState; ^ Work around the problems by including the proper headers in spapr.h and by using struct forward declarations in cpu.h. Reviewed-by: Greg Kurz <groug@kaod.org> Signed-off-by: Thomas Huth <thuth@redhat.com>
2019-01-09spapr: move the qemu_irq array under the machineCédric Le Goater
The qemu_irq array is now allocated at the machine level using a sPAPR IRQ set_irq handler depending on the chosen interrupt mode. The use of this handler is slightly inefficient today but it will become necessary when the 'dual' interrupt mode is introduced. Signed-off-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-01-09ppc/spapr: Receive and store device tree blob from SLOFAlexey Kardashevskiy
SLOF receives a device tree and updates it with various properties before switching to the guest kernel and QEMU is not aware of any changes made by SLOF. Since there is no real RTAS (QEMU implements it), it makes sense to pass the SLOF final device tree to QEMU to let it implement RTAS related tasks better, such as PCI host bus adapter hotplug. Specifially, now QEMU can find out the actual XICS phandle (for PHB hotplug) and the RTAS linux,rtas-entry/base properties (for firmware assisted NMI - FWNMI). This stores the initial DT blob in the sPAPR machine and replaces it in the KVMPPC_H_UPDATE_DT (new private hypercall) handler. This adds an @update_dt_enabled machine property to allow backward migration. SLOF already has a hypercall since https://github.com/aik/SLOF/commit/e6fc84652c9c0073f9183 This makes use of the new fdt_check_full() helper. In order to allow the configure script to pick the correct DTC version, this adjusts the DTC presense test. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Greg Kurz <groug@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Greg Kurz <groug@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-01-09spapr: Add H-Call H_HOME_NODE_ASSOCIATIVITYLaurent Vivier
H_HOME_NODE_ASSOCIATIVITY H-Call returns the associativity domain designation associated with the identifier input parameter This fixes a crash when we try to hotplug a CPU in memory-less and CPU-less numa node. In this case, the kernel tries to online the node, but without the information provided by this h-call, the node id, it cannot and the CPU is started while the node is not onlined. It also removes the warning message from the kernel: VPHN is not supported. Disabling polling.. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: Greg Kurz <groug@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2018-12-21spapr: introduce an 'ic-mode' machine optionCédric Le Goater
This option is used to select the interrupt controller mode (XICS or XIVE) with which the machine will operate. XICS being the default mode for now. When running a machine with the XIVE interrupt mode backend, the guest OS is required to have support for the XIVE exploitation mode. In the case of legacy OS, the mode selected by CAS should be XICS and the OS should fail to boot. However, QEMU could possibly detect it, terminate the boot process and reset to stop in the SLOF firmware. This is not yet handled. Signed-off-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2018-12-21spapr: add an extra OV5 field to the sPAPR IRQ backendCédric Le Goater
The interrupt modes supported by the hypervisor are advertised to the guest with new bits definitions of the option vector 5 of property "ibm,arch-vec-5-platform-support. The byte 23 bits 0-1 of the OV5 are defined as follow : 0b00 PAPR 2.7 and earlier (Legacy systems) 0b01 XIVE Exploitation mode only 0b10 Either available If the client/guest selects the XIVE interrupt mode, it informs the hypervisor by returning the value 0b01 in byte 23 bits 0-1. A 0b00 value indicates the use of the XICS interrupt mode (Legacy systems). The sPAPR IRQ backend is extended with these definitions and the values are directly used to populate the "ibm,arch-vec-5-platform-support" property. The interrupt mode is advertised under TCG and under KVM. Although a KVM XIVE device is not yet available, the machine can still operate with kernel_irqchip=off. However, we apply a restriction on the CPU which is required to be a POWER9 when a XIVE interrupt controller is in use. Signed-off-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>