aboutsummaryrefslogtreecommitdiff
path: root/include/hw/ppc/spapr.h
AgeCommit message (Collapse)Author
2022-12-21hw/ppc/spapr: Reduce "vof.h" inclusionPhilippe Mathieu-Daudé
Currently objects including "hw/ppc/spapr.h" are forced to be target specific due to the inclusion of "vof.h" in "spapr.h". "spapr.h" only uses a Vof pointer, so doesn't require the structure declaration. The only place where Vof structure is accessed is in spapr.c, so include "vof.h" there, and forward declare the structure in "spapr.h". Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com> Message-Id: <20221213123550.39302-4-philmd@linaro.org> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
2022-09-20hw/ppc/spapr: Fix code style problems reported by checkpatchBernhard Beschow
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com> Signed-off-by: Bernhard Beschow <shentey@gmail.com> Message-Id: <20220919231720.163121-5-shentey@gmail.com> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
2022-07-06ppc/spapr: Implement H_WATCHDOGAlexey Kardashevskiy
The new PAPR 2.12 defines a watchdog facility managed via the new H_WATCHDOG hypercall. This adds H_WATCHDOG support which a proposed driver for pseries uses: https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=303120 This was tested by running QEMU with a debug kernel and command line: -append \ "pseries-wdt.timeout=60 pseries-wdt.nowayout=1 pseries-wdt.action=2" and running "echo V > /dev/watchdog0" inside the VM. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com> Message-Id: <20220622051008.1067464-1-aik@ozlabs.ru> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
2022-07-06spapr/ddw: Reset DMA when the last non-default window is removedAlexey Kardashevskiy
PAPR+/LoPAPR says: === The platform must restore the default DMA window for the PE on a call to the ibm,remove-pe-dma-window RTAS call when all of the following are true: a. The call removes the last DMA window remaining for the PE. b. The DMA window being removed is not the default window === This resets DMA as PAPR mandates. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com> Message-Id: <20220622052955.1069903-1-aik@ozlabs.ru> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
2022-05-26pseries: allow setting stdout-path even on machines with a VGAPaolo Bonzini
-machine graphics=off is the usual way to tell the firmware or the OS that the user wants a serial console. The pseries machine however does not support this, and never adds the stdout-path node to the device tree if a VGA device is provided. This is in addition to the other magic behavior of VGA devices, which is to add a keyboard and mouse to the default USB bus. Split spapr->has_graphics in two variables so that the two behaviors can be separated: the USB devices remains the same, but the stdout-path is added even with "-device VGA -machine graphics=off". Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220507054826.124936-1-pbonzini@redhat.com> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
2022-04-20ppc/spapr/ddw: Add 2M pagesizeAlexey Kardashevskiy
Recently the LoPAPR spec got a new 2MB pagesize to support in Dynamic DMA Windows API (DDW), this adds the new flag. Linux supports it since https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=38727311871 Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Message-Id: <20220321071945.918669-1-aik@ozlabs.ru> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
2022-02-18spapr: implement nested-hv capability for the virtual hypervisorNicholas Piggin
This implements the Nested KVM HV hcall API for spapr under TCG. The L2 is switched in when the H_ENTER_NESTED hcall is made, and the L1 is switched back in returned from the hcall when a HV exception is sent to the vhyp. Register state is copied in and out according to the nested KVM HV hcall API specification. The hdecr timer is started when the L2 is switched in, and it provides the HDEC / 0x980 return to L1. The MMU re-uses the bare metal radix 2-level page table walker by using the get_pate method to point the MMU to the nested partition table entry. MMU faults due to partition scope errors raise HV exceptions and accordingly are routed back to the L1. The MMU does not tag translations for the L1 (direct) vs L2 (nested) guests, so the TLB is flushed on any L1<->L2 transition (hcall entry and exit). Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Cédric Le Goater <clg@kaod.org> [ clg: checkpatch fixes ] Message-Id: <20220216102545.1808018-10-npiggin@gmail.com> Signed-off-by: Cédric Le Goater <clg@kaod.org>
2022-02-18spapr: nvdimm: Implement H_SCM_FLUSH hcallShivaprasad G Bhat
The patch adds support for the SCM flush hcall for the nvdimm devices. To be available for exploitation by guest through the next patch. The hcall is applicable only for new SPAPR specific device class which is also introduced in this patch. The hcall expects the semantics such that the flush to return with H_LONG_BUSY_ORDER_10_MSEC when the operation is expected to take longer time along with a continue_token. The hcall to be called again by providing the continue_token to get the status. So, all fresh requests are put into a 'pending' list and flush worker is submitted to the thread pool. The thread pool completion callbacks move the requests to 'completed' list, which are cleaned up after collecting the return status for the guest in subsequent hcall from the guest. The semantics makes it necessary to preserve the continue_tokens and their return status across migrations. So, the completed flush states are forwarded to the destination and the pending ones are restarted at the destination in post_load. The necessary nvdimm flush specific vmstate structures are also introduced in this patch which are to be saved in the new SPAPR specific nvdimm device to be introduced in the following patch. Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com> Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com> Message-Id: <164396254862.109112.16675611182159105748.stgit@ltczzess4.aus.stglabs.ibm.com> Signed-off-by: Cédric Le Goater <clg@kaod.org>
2021-09-30spapr_numa.c: FORM2 NUMA affinity supportDaniel Henrique Barboza
The main feature of FORM2 affinity support is the separation of NUMA distances from ibm,associativity information. This allows for a more flexible and straightforward NUMA distance assignment without relying on complex associations between several levels of NUMA via ibm,associativity matches. Another feature is its extensibility. This base support contains the facilities for NUMA distance assignment, but in the future more facilities will be added for latency, performance, bandwidth and so on. This patch implements the base FORM2 affinity support as follows: - the use of FORM2 associativity is indicated by using bit 2 of byte 5 of ibm,architecture-vec-5. A FORM2 aware guest can choose to use FORM1 or FORM2 affinity. Setting both forms will default to FORM2. We're not advertising FORM2 for pseries-6.1 and older machine versions to prevent guest visible changes in those; - ibm,associativity-reference-points has a new semantic. Instead of being used to calculate distances via NUMA levels, it's now used to indicate the primary domain index in the ibm,associativity domain of each resource. In our case it's set to {0x4}, matching the position where we already place logical_domain_id; - two new RTAS DT artifacts are introduced: ibm,numa-lookup-index-table and ibm,numa-distance-table. The index table is used to list all the NUMA logical domains of the platform, in ascending order, and allows for spartial NUMA configurations (although QEMU ATM doesn't support that). ibm,numa-distance-table is an array that contains all the distances from the first NUMA node to all other nodes, then the second NUMA node distances to all other nodes and so on; - get_max_dist_ref_points(), get_numa_assoc_size() and get_associativity() now checks for OV5_FORM2_AFFINITY and returns FORM2 values if the guest selected FORM2 affinity during CAS. Reviewed-by: Greg Kurz <groug@kaod.org> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> Message-Id: <20210920174947.556324-7-danielhb413@gmail.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2021-09-30spapr_numa.c: rename numa_assoc_array to FORM1_assoc_arrayDaniel Henrique Barboza
Introducing a new NUMA affinity, FORM2, requires a new mechanism to switch between affinity modes after CAS. Also, we want FORM2 data structures and functions to be completely separated from the existing FORM1 code, allowing us to avoid adding new code that inherits the existing complexity of FORM1. The idea of switching values used by the write_dt() functions in spapr_numa.c was already introduced in the previous patch, and the same approach will be used when dealing with the FORM1 and FORM2 arrays. We can accomplish that by that by renaming the existing numa_assoc_array to FORM1_assoc_array, which now is used exclusively to handle FORM1 affinity data. A new helper get_associativity() is then introduced to be used by the write_dt() functions to retrieve the current ibm,associativity array of a given node, after considering affinity selection that might have been done during CAS. All code that was using numa_assoc_array now needs to retrieve the array by calling this function. This will allow for an easier plug of FORM2 data later on. Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> Message-Id: <20210920174947.556324-5-danielhb413@gmail.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2021-09-30spapr_numa.c: parametrize FORM1 macrosDaniel Henrique Barboza
The next preliminary step to introduce NUMA FORM2 affinity is to make the existing code independent of FORM1 macros and values, i.e. MAX_DISTANCE_REF_POINTS, NUMA_ASSOC_SIZE and VCPU_ASSOC_SIZE. This patch accomplishes that by doing the following: - move the NUMA related macros from spapr.h to spapr_numa.c where they are used. spapr.h gets instead a 'NUMA_NODES_MAX_NUM' macro that is used to refer to the maximum number of NUMA nodes, including GPU nodes, that the machine can support; - MAX_DISTANCE_REF_POINTS and NUMA_ASSOC_SIZE are renamed to FORM1_DIST_REF_POINTS and FORM1_NUMA_ASSOC_SIZE. These FORM1 specific macros are used in FORM1 init functions; - code that uses MAX_DISTANCE_REF_POINTS now retrieves the max_dist_ref_points value using get_max_dist_ref_points(). NUMA_ASSOC_SIZE is replaced by get_numa_assoc_size() and VCPU_ASSOC_SIZE is replaced by get_vcpu_assoc_size(). These functions are used by the generic device tree functions and h_home_node_associativity() and will allow them to switch between FORM1 and FORM2 without changing their core logic. Reviewed-by: Greg Kurz <groug@kaod.org> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> Message-Id: <20210920174947.556324-4-danielhb413@gmail.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2021-07-09target/ppc: Support for H_RPT_INVALIDATE hcallBharata B Rao
If KVM_CAP_RPT_INVALIDATE KVM capability is enabled, then - indicate the availability of H_RPT_INVALIDATE hcall to the guest via ibm,hypertas-functions property. - Enable the hcall Both the above are done only if the new sPAPR machine capability cap-rpt-invalidate is set. Signed-off-by: Bharata B Rao <bharata@linux.ibm.com> Message-Id: <20210706112440.1449562-3-bharata@linux.ibm.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2021-07-09spapr: Fix implementation of Open Firmware client interfaceAlexey Kardashevskiy
This addresses the comments from v22. The functional changes are (the VOF ones need retesting with Pegasos2): (VOF) setprop will start failing if the machine class callback did not handle it; (VOF) unit addresses are lowered in path_offset(); (SPAPR) /chosen/bootargs is initialized from kernel_cmdline if the client did not change it. Fixes: 5c991e5d4378 ("spapr: Implement Open Firmware client interface") Cc: BALATON Zoltan <balaton@eik.bme.hu> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Message-Id: <20210708065625.548396-1-aik@ozlabs.ru> Tested-by: BALATON Zoltan <balaton@eik.bme.hu> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2021-07-09target/ppc/spapr: Update H_GET_CPU_CHARACTERISTICS L1D cache flush bitsNicholas Piggin
There are several new L1D cache flush bits added to the hcall which reflect hardware security features for speculative cache access issues. These behaviours are now being specified as negative in order to simplify patched kernel compatibility with older firmware (a new problem found in existing systems would automatically be vulnerable). [dwg: Technically this changes behaviour for existing machine types. After discussion with Nick, we've determined this is safe, because the worst that will happen if a guest gets the wrong information due to a migration is that it will perform some unnecessary workarounds, but will remain correct and secure (well, as secure as it was going to be anyway). In addition the change only affects cap-cfpc=safe which is not enabled by default, and in fact is not possible to set on any current hardware (though it's expected it will be possible on POWER10)] Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Message-Id: <20210615044107.1481608-1-npiggin@gmail.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2021-07-09spapr: Implement Open Firmware client interfaceAlexey Kardashevskiy
The PAPR platform describes an OS environment that's presented by a combination of a hypervisor and firmware. The features it specifies require collaboration between the firmware and the hypervisor. Since the beginning, the runtime component of the firmware (RTAS) has been implemented as a 20 byte shim which simply forwards it to a hypercall implemented in qemu. The boot time firmware component is SLOF - but a build that's specific to qemu, and has always needed to be updated in sync with it. Even though we've managed to limit the amount of runtime communication we need between qemu and SLOF, there's some, and it has become increasingly awkward to handle as we've implemented new features. This implements a boot time OF client interface (CI) which is enabled by a new "x-vof" pseries machine option (stands for "Virtual Open Firmware). When enabled, QEMU implements the custom H_OF_CLIENT hcall which implements Open Firmware Client Interface (OF CI). This allows using a smaller stateless firmware which does not have to manage the device tree. The new "vof.bin" firmware image is included with source code under pc-bios/. It also includes RTAS blob. This implements a handful of CI methods just to get -kernel/-initrd working. In particular, this implements the device tree fetching and simple memory allocator - "claim" (an OF CI memory allocator) and updates "/memory@0/available" to report the client about available memory. This implements changing some device tree properties which we know how to deal with, the rest is ignored. To allow changes, this skips fdt_pack() when x-vof=on as not packing the blob leaves some room for appending. In absence of SLOF, this assigns phandles to device tree nodes to make device tree traversing work. When x-vof=on, this adds "/chosen" every time QEMU (re)builds a tree. This adds basic instances support which are managed by a hash map ihandle -> [phandle]. Before the guest started, the used memory is: 0..e60 - the initial firmware 8000..10000 - stack 400000.. - kernel 3ea0000.. - initramdisk This OF CI does not implement "interpret". Unlike SLOF, this does not format uninitialized nvram. Instead, this includes a disk image with pre-formatted nvram. With this basic support, this can only boot into kernel directly. However this is just enough for the petitboot kernel and initradmdisk to boot from any possible source. Note this requires reasonably recent guest kernel with: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=df5be5be8735 The immediate benefit is much faster booting time which especially crucial with fully emulated early CPU bring up environments. Also this may come handy when/if GRUB-in-the-userspace sees light of the day. This separates VOF and sPAPR in a hope that VOF bits may be reused by other POWERPC boards which do not support pSeries. This assumes potential support for booting from QEMU backends such as blockdev or netdev without devices/drivers used. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Message-Id: <20210625055155.2252896-1-aik@ozlabs.ru> Reviewed-by: BALATON Zoltan <balaton@eik.bme.hu> [dwg: Adjusted some includes which broke compile in some more obscure compilation setups] Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2021-07-09spapr: tune rtas-sizeAlexey Kardashevskiy
QEMU reserves space for RTAS via /rtas/rtas-size which tells the client how much space the RTAS requires to work which includes the RTAS binary blob implementing RTAS runtime. Because pseries supports FWNMI which requires plenty of space, QEMU reserves more than 2KB which is enough for the RTAS blob as it is just 20 bytes (under QEMU). Since FWNMI reset delivery was added, RTAS_SIZE macro is not used anymore. This replaces RTAS_SIZE with RTAS_MIN_SIZE and uses it in the /rtas/rtas-size calculation to account for the RTAS blob. Fixes: 0e236d347790 ("ppc/spapr: Implement FWNMI System Reset delivery") Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Message-Id: <20210622070336.1463250-1-aik@ozlabs.ru> Reviewed-by: Greg Kurz <groug@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2021-06-03spapr: Don't hijack current_machine->boot_orderGreg Kurz
QEMU 6.0 moved all the -boot variables to the machine. Especially, the removal of the boot_order static changed the handling of '-boot once' from: if (boot_once) { qemu_boot_set(boot_once, &error_fatal); qemu_register_reset(restore_boot_order, g_strdup(boot_order)); } to if (current_machine->boot_once) { qemu_boot_set(current_machine->boot_once, &error_fatal); qemu_register_reset(restore_boot_order, g_strdup(current_machine->boot_order)); } This means that we now register as subsequent boot order a copy of current_machine->boot_once that was just set with the previous call to qemu_boot_set(), i.e. we never transition away from the once boot order. It is certainly fragile^Wwrong for the spapr code to hijack a field of the base machine type object like that. The boot order rework simply turned this software boundary violation into an actual bug. Have the spapr code to handle that with its own field in SpaprMachineState. Also kfree() the initial boot device string when "once" was used. Fixes: 4b7acd2ac821 ("vl: clean up -boot variables") Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1960119 Cc: pbonzini@redhat.com Signed-off-by: Greg Kurz <groug@kaod.org> Message-Id: <20210521160735.1901914-1-groug@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2021-05-19hw/ppc: moved hcalls that depend on softmmuLucas Mateus Castro (alqotel)
The hypercalls h_enter, h_remove, h_bulk_remove, h_protect, and h_read, have been moved to spapr_softmmu.c with the functions they depend on. The functions is_ram_address and push_sregs_to_kvm_pr are not static anymore as functions on both spapr_hcall.c and spapr_softmmu.c depend on them. The hypercalls h_resize_hpt_prepare and h_resize_hpt_commit have been divided, the KVM part stayed in spapr_hcall.c while the softmmu part was moved to spapr_softmmu.c Signed-off-by: Lucas Mateus Castro (alqotel) <lucas.araujo@eldorado.org.br> Message-Id: <20210506163941.106984-2-lucas.araujo@eldorado.org.br> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2021-05-19hw/ppc/spapr.c: Extract MMU mode error reporting into a functionFabiano Rosas
A following patch will make use of it. Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com> Message-Id: <20210505001130.3999968-2-farosas@linux.ibm.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2021-05-04spapr.h: increase FDT_MAX_SIZEDaniel Henrique Barboza
Certain SMP topologies stress, e.g. 1 thread/core, 2048 cores and 1 socket, stress the current maximum size of the pSeries FDT: Calling ibm,client-architecture-support...qemu-system-ppc64: error creating device tree: (fdt_setprop(fdt, offset, "ibm,processor-segment-sizes", segs, sizeof(segs))): FDT_ERR_NOSPACE 2048 is the default NR_CPUS value for the pSeries kernel. It's expected that users will want QEMU to be able to handle this kind of configuration. Bumping FDT_MAX_SIZE to 2MB is enough for these setups to be created. Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> Message-Id: <20210408204049.221802-3-danielhb413@gmail.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2021-05-04ppc: Rename current DAWR macros and variablesRavi Bangoria
Power10 is introducing second DAWR. Use real register names (with suffix 0) from ISA for current macros and variables used by Qemu. One exception to this is KVM_REG_PPC_DAWR[X]. This is from kernel uapi header and thus not changed in kernel as well as Qemu. Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com> Reviewed-by: Greg Kurz <groug@kaod.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Message-Id: <20210412114433.129702-3-ravi.bangoria@linux.ibm.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2021-05-04ppc/spapr: Add support for implement support for H_SCM_HEALTHVaibhav Jain
Add support for H_SCM_HEALTH hcall described at [1] for spapr nvdimms. This enables guest to detect the 'unarmed' status of a specific spapr nvdimm identified by its DRC and if its unarmed, mark the region backed by the nvdimm as read-only. The patch adds h_scm_health() to handle the H_SCM_HEALTH hcall which returns two 64-bit bitmaps (health bitmap, health bitmap mask) derived from 'struct nvdimm->unarmed' member. Linux kernel side changes to enable handling of 'unarmed' nvdimms for ppc64 are proposed at [2]. References: [1] "Hypercall Op-codes (hcalls)" https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/powerpc/papr_hcalls.rst#n220 [2] "powerpc/papr_scm: Mark nvdimm as unarmed if needed during probe" https://lore.kernel.org/linux-nvdimm/20210329113103.476760-1-vaibhav@linux.ibm.com/ Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com> Message-Id: <20210402102128.213943-1-vaibhav@linux.ibm.com> Reviewed-by: Greg Kurz <groug@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2021-03-31spapr: Fix typo in the patb_entry commentAlexey Kardashevskiy
There is no H_REGISTER_PROCESS_TABLE, it is H_REGISTER_PROC_TBL handler for which is still called h_register_process_table() though. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Message-Id: <20210225032335.64245-1-aik@ozlabs.ru> Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2021-03-10spapr.c: send QAPI event when memory hotunplug failsDaniel Henrique Barboza
Recent changes allowed the pSeries machine to rollback the hotunplug process for the DIMM when the guest kernel signals, via a reconfiguration of the DR connector, that it's not going to release the LMBs. Let's also warn QAPI listerners about it. One place to do it would be right after the unplug state is cleaned up, spapr_clear_pending_dimm_unplug_state(). This would mean that the function is now doing more than cleaning up the pending dimm state though. This patch does the following changes in spapr.c: - send a QAPI event to inform that we experienced a failure in the hotunplug of the DIMM; - rename spapr_clear_pending_dimm_unplug_state() to spapr_memory_unplug_rollback(). This is a better fit for what the function is now doing, and it makes callers care more about what the function goal is and less about spapr.c internals such as clearing the pending dimm unplug state. Reviewed-by: Greg Kurz <groug@kaod.org> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> Message-Id: <20210302141019.153729-3-danielhb413@gmail.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2021-03-10spapr_drc.c: use DRC reconfiguration to cleanup DIMM unplug stateDaniel Henrique Barboza
Handling errors in memory hotunplug in the pSeries machine is more complex than any other device type, because there are all the complications that other devices has, and more. For instance, determining a timeout for a DIMM hotunplug must consider if it's a Hash-MMU or a Radix-MMU guest, because Hash guests takes longer to hotunplug DIMMs. The size of the DIMM is also a factor, given that longer DIMMs naturally takes longer to be hotunplugged from the kernel. And there's also the guest memory usage to be considered: if there's a process that is consuming memory that would be lost by the DIMM unplug, the kernel will postpone the unplug process until the process finishes, and then initiate the regular hotunplug process. The first two considerations are manageable, but the last one is a deal breaker. There is no sane way for the pSeries machine to determine the memory load in the guest when attempting a DIMM hotunplug - and even if there was a way, the guest can start using all the RAM in the middle of the unplug process and invalidate our previous assumptions - and in result we can't even begin to calculate a timeout for the operation. This means that we can't implement a viable timeout mechanism for memory unplug in pSeries. Going back to why we would consider an unplug timeout, the reason is that we can't know if the kernel is giving up the unplug. Turns out that, sometimes, we can. Consider a failed memory hotunplug attempt where the kernel will error out with the following message: 'pseries-hotplug-mem: Memory indexed-count-remove failed, adding any removed LMBs' This happens when there is a LMB that the kernel gave up in removing, and the LMBs previously marked for removal are now being added back. This happens in the pseries kernel in [1], dlpar_memory_remove_by_ic() into dlpar_add_lmb(), and after that update_lmb_associativity_index(). In this function, the kernel is configuring the LMB DRC connector again. Note that this is a valid usage in LOPAR, as stated in section "ibm,configure-connector RTAS Call": 'A subsequent sequence of calls to ibm,configure-connector with the same entry from the “ibm,drc-indexes” or “ibm,drc-info” property will restart the configuration of devices which were not completely configured.' We can use this kernel behavior in our favor. If a DRC connector reconfiguration for a LMB that we marked as unplug pending happens, this indicates that the kernel changed its mind about the unplug and is reasserting that it will keep using all the LMBs of the DIMM. In this case, it's safe to assume that the whole DIMM device unplug was cancelled. This patch hops into rtas_ibm_configure_connector() and, in the scenario described above, clear the unplug state for the DIMM device. This will not solve all the problems we still have with memory unplug, but it will cover this case where the kernel reconfigures LMBs after a failed unplug. We are a bit more resilient, without using an unreliable timeout, and we didn't make the remaining error cases any worse. [1] arch/powerpc/platforms/pseries/hotplug-memory.c Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> Message-Id: <20210222194531.62717-6-danielhb413@gmail.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2021-02-10spapr: move spapr_machine_using_legacy_numa() to spapr_numa.cDaniel Henrique Barboza
This function is used only in spapr_numa.c. Tested-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: Greg Kurz <groug@kaod.org> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> Message-Id: <20210128174213.1349181-2-danielhb413@gmail.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2021-01-19spapr_hcall.c: make do_client_architecture_support staticDaniel Henrique Barboza
The function is called only inside spapr_hcall.c. Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> Message-Id: <20210114180628.1675603-3-danielhb413@gmail.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2021-01-19spapr.h: fix trailing whitespace in phb_placementDaniel Henrique Barboza
This whitespace was messing with lots of diffs if you happen to use an editor that eliminates trailing whitespaces on file save. Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> Message-Id: <20210114180628.1675603-2-danielhb413@gmail.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2021-01-19spapr: Improve handling of memory unplug with old guestsGreg Kurz
Since commit 1e8b5b1aa16b ("spapr: Allow memory unplug to always succeed") trying to unplug memory from a guest that doesn't support it (eg. rhel6) no longer generates an error like it used to. Instead, it leaves the memory around : only a subsequent reboot or manual use of drmgr within the guest can complete the hot-unplug sequence. A flag was added to SpaprMachineClass so that this new behavior only applies to the default machine type. We can do better. CAS processes all pending hot-unplug requests. This means that we don't really care about what the guest supports if the hot-unplug request happens before CAS. All guests that we care for, even old ones, set enough bits in OV5 that lead to a non-empty bitmap in spapr->ov5_cas. Use that as a heuristic to decide if CAS has already occured or not. Always accept unplug requests that happen before CAS since CAS will process them. Restore the previous behavior of rejecting them after CAS when we know that the guest doesn't support memory hot-unplug. This behavior is suitable for all machine types : this allows to drop the pre_6_0_memory_unplug flag. Fixes: 1e8b5b1aa16b ("spapr: Allow memory unplug to always succeed") Signed-off-by: Greg Kurz <groug@kaod.org> Message-Id: <161012708715.801107.11418801796987916516.stgit@bahia.lan> Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2021-01-06spapr: Fix buffer overflow in spapr_numa_associativity_init()Greg Kurz
Running a guest with 128 NUMA nodes crashes QEMU: ../../util/error.c:59: error_setv: Assertion `*errp == NULL' failed. The crash happens when setting the FWNMI migration blocker: 2861 if (spapr_get_cap(spapr, SPAPR_CAP_FWNMI) == SPAPR_CAP_ON) { 2862 /* Create the error string for live migration blocker */ 2863 error_setg(&spapr->fwnmi_migration_blocker, 2864 "A machine check is being handled during migration. The handler" 2865 "may run and log hardware error on the destination"); 2866 } Inspection reveals that papr->fwnmi_migration_blocker isn't NULL: (gdb) p spapr->fwnmi_migration_blocker $1 = (Error *) 0x8000000004000000 Since this is the only place where papr->fwnmi_migration_blocker is set, this means someone wrote there in our back. Further analysis points to spapr_numa_associativity_init(), especially the part that initializes the associative arrays for NVLink GPUs: max_nodes_with_gpus = nb_numa_nodes + NVGPU_MAX_NUM; ie. max_nodes_with_gpus = 128 + 6, but the array isn't sized to accommodate the 6 extra nodes: struct SpaprMachineState { . . . uint32_t numa_assoc_array[MAX_NODES][NUMA_ASSOC_SIZE]; Error *fwnmi_migration_blocker; }; and the following loops happily overwrite spapr->fwnmi_migration_blocker, and probably more: for (i = nb_numa_nodes; i < max_nodes_with_gpus; i++) { spapr->numa_assoc_array[i][0] = cpu_to_be32(MAX_DISTANCE_REF_POINTS); for (j = 1; j < MAX_DISTANCE_REF_POINTS; j++) { uint32_t gpu_assoc = smc->pre_5_1_assoc_refpoints ? SPAPR_GPU_NUMA_ID : cpu_to_be32(i); spapr->numa_assoc_array[i][j] = gpu_assoc; } spapr->numa_assoc_array[i][MAX_DISTANCE_REF_POINTS] = cpu_to_be32(i); } Fix the size of the array. This requires "hw/ppc/spapr.h" to see NVGPU_MAX_NUM. Including "hw/pci-host/spapr.h" introduces a circular dependency that breaks the build, so this moves the definition of NVGPU_MAX_NUM to "hw/ppc/spapr.h" instead. Reported-by: Min Deng <mdeng@redhat.com> BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1908693 Fixes: dd7e1d7ae431 ("spapr_numa: move NVLink2 associativity handling to spapr_numa.c") Cc: danielhb413@gmail.com Signed-off-by: Greg Kurz <groug@kaod.org> Message-Id: <160829960428.734871.12634150161215429514.stgit@bahia.lan> Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2021-01-06spapr: Allow memory unplug to always succeedGreg Kurz
It is currently impossible to hot-unplug a memory device between machine reset and CAS. (qemu) device_del dimm1 Error: Memory hot unplug not supported for this guest This limitation was introduced in order to provide an explicit error path for older guests that didn't support hot-plug event sources (and thus memory hot-unplug). The linux kernel has been supporting these since 4.11. All recent enough guests are thus capable of handling the removal of a memory device at all time, including during early boot. Lift the limitation for the latest machine type. This means that trying to unplug memory from a guest that doesn't support it will likely just do nothing and the memory will only get removed at next reboot. Such older guests can still get the existing behavior by using an older machine type. Signed-off-by: Greg Kurz <groug@kaod.org> Message-Id: <160794035064.23292.17560963281911312439.stgit@bahia.lan> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-12-14spapr: Pass sPAPR machine state down to spapr_pci_switch_vga()Greg Kurz
This allows to drop a user of qdev_get_machine(). Signed-off-by: Greg Kurz <groug@kaod.org> Message-Id: <20201209170052.1431440-4-groug@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-12-14spapr: Make PHB placement functions and spapr_pre_plug_phb() return statusGreg Kurz
Read documentation in "qapi/error.h" and changelog of commit e3fe3988d785 ("error: Document Error API usage rules") for rationale. Signed-off-by: Greg Kurz <groug@kaod.org> Message-Id: <20201120234208.683521-7-groug@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-10-28spapr: Improve spapr_reallocate_hpt() error reportingGreg Kurz
spapr_reallocate_hpt() has three users, two of which pass &error_fatal and the third one, htab_load(), passes &local_err, uses it to detect failures and simply propagates -EINVAL up to vmstate_load(), which will cause QEMU to exit. It is thus confusing that spapr_reallocate_hpt() doesn't return right away when an error is detected in some cases. Also, the comment suggesting that the caller is welcome to try to carry on seems like a remnant in this respect. This can be improved: - change spapr_reallocate_hpt() to always report a negative errno on failure, either as reported by KVM or -ENOSPC if the HPT is smaller than what was asked, - use that to detect failures in htab_load() which is preferred over checking &local_err, - propagate this negative errno to vmstate_load() because it is more accurate than propagating -EINVAL for all possible errors. [dwg: Fix compile error due to omitted prelim patch] Signed-off-by: Greg Kurz <groug@kaod.org> Message-Id: <160371605460.305923.5890143959901241157.stgit@bahia.lan> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-10-09spapr: add spapr_machine_using_legacy_numa() helperDaniel Henrique Barboza
The changes to come to NUMA support are all guest visible. In theory we could just create a new 5_1 class option flag to avoid the changes to cascade to 5.1 and under. The reality is that these changes are only relevant if the machine has more than one NUMA node. There is no need to change guest behavior that has been around for years needlesly. This new helper will be used by the next patches to determine whether we should retain the (soon to be) legacy NUMA behavior in the pSeries machine. The new behavior will only be exposed if: - machine is pseries-5.2 and newer; - more than one NUMA node is declared in NUMA state. Reviewed-by: Greg Kurz <groug@kaod.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> Message-Id: <20201007172849.302240-2-danielhb413@gmail.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-10-09spapr: Add a return value to spapr_check_pagesize()Greg Kurz
As recommended in "qapi/error.h", return true on success and false on failure. This allows to reduce error propagation overhead in the callers. Signed-off-by: Greg Kurz <groug@kaod.org> Message-Id: <20200914123505.612812-14-groug@kaod.org> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-10-09spapr: Add a return value to spapr_set_vcpu_id()Greg Kurz
As recommended in "qapi/error.h", return true on success and false on failure. This allows to reduce error propagation overhead in the callers. Signed-off-by: Greg Kurz <groug@kaod.org> Message-Id: <20200914123505.612812-11-groug@kaod.org> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-09-18Use OBJECT_DECLARE_SIMPLE_TYPE when possibleEduardo Habkost
This converts existing DECLARE_INSTANCE_CHECKER usage to OBJECT_DECLARE_SIMPLE_TYPE when possible. $ ./scripts/codeconverter/converter.py -i \ --pattern=AddObjectDeclareSimpleType $(git grep -l '' -- '*.[ch]') Signed-off-by: Eduardo Habkost <ehabkost@redhat.com> Acked-by: Paul Durrant <paul@xen.org> Message-Id: <20200916182519.415636-6-ehabkost@redhat.com> Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
2020-09-18Use OBJECT_DECLARE_TYPE when possibleEduardo Habkost
This converts existing DECLARE_OBJ_CHECKERS usage to OBJECT_DECLARE_TYPE when possible. $ ./scripts/codeconverter/converter.py -i \ --pattern=AddObjectDeclareType $(git grep -l '' -- '*.[ch]') Signed-off-by: Eduardo Habkost <ehabkost@redhat.com> Reviewed-by: Cédric Le Goater <clg@kaod.org> Acked-by: Paul Durrant <paul@xen.org> Message-Id: <20200916182519.415636-5-ehabkost@redhat.com> Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
2020-09-09Use DECLARE_*CHECKER* macrosEduardo Habkost
Generated using: $ ./scripts/codeconverter/converter.py -i \ --pattern=TypeCheckMacro $(git grep -l '' -- '*.[ch]') Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Message-Id: <20200831210740.126168-12-ehabkost@redhat.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Message-Id: <20200831210740.126168-13-ehabkost@redhat.com> Message-Id: <20200831210740.126168-14-ehabkost@redhat.com> Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
2020-09-09Move QOM typedefs and add missing includesEduardo Habkost
Some typedefs and macros are defined after the type check macros. This makes it difficult to automatically replace their definitions with OBJECT_DECLARE_TYPE. Patch generated using: $ ./scripts/codeconverter/converter.py -i \ --pattern=QOMStructTypedefSplit $(git grep -l '' -- '*.[ch]') which will split "typdef struct { ... } TypedefName" declarations. Followed by: $ ./scripts/codeconverter/converter.py -i --pattern=MoveSymbols \ $(git grep -l '' -- '*.[ch]') which will: - move the typedefs and #defines above the type check macros - add missing #include "qom/object.h" lines if necessary Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Message-Id: <20200831210740.126168-9-ehabkost@redhat.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Message-Id: <20200831210740.126168-10-ehabkost@redhat.com> Message-Id: <20200831210740.126168-11-ehabkost@redhat.com> Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
2020-09-08spapr_numa: create a vcpu associativity helperDaniel Henrique Barboza
The work to be done in h_home_node_associativity() intersects with what is already done in spapr_numa_fixup_cpu_dt(). This patch creates a new helper, spapr_numa_get_vcpu_assoc(), to be used for both spapr_numa_fixup_cpu_dt() and h_home_node_associativity(). While we're at it, use memcpy() instead of loop assignment to created the returned array. Reviewed-by: Greg Kurz <groug@kaod.org> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> Message-Id: <20200904172422.617460-3-danielhb413@gmail.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-09-08spapr: introduce SpaprMachineState::numa_assoc_arrayDaniel Henrique Barboza
The next step to centralize all NUMA/associativity handling in the spapr machine is to create a 'one stop place' for all things ibm,associativity. This patch introduces numa_assoc_array, a 2 dimensional array that will store all ibm,associativity arrays of all NUMA nodes. This array is initialized in a new spapr_numa_associativity_init() function, called in spapr_machine_init(). It is being initialized with the same values used in other ibm,associativity properties around spapr files (i.e. all zeros, last value is node_id). The idea is to remove all hardcoded definitions and FDT writes of ibm,associativity arrays, doing instead a call to the new helper spapr_numa_write_associativity_dt() helper, that will be able to write the DT with the correct values. We'll start small, handling the trivial cases first. The remaining instances of ibm,associativity will be handled next. Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> Message-Id: <20200903220639.563090-2-danielhb413@gmail.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-08-27spapr: Move typedef SpaprMachineState to spapr.hEduardo Habkost
Move the typedef from spapr_irq.h to spapr.h, and use "struct SpaprMachineState" in the spapr_*.h headers (to avoid circular header dependencies). This will make future conversion to OBJECT_DECLARE* easier. Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> Signed-off-by: Eduardo Habkost <ehabkost@redhat.com> Tested-By: Roman Bolshakov <r.bolshakov@yadro.com> Message-Id: <20200825192110.3528606-28-ehabkost@redhat.com> Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
2020-07-20spapr: Add a new level of NUMA for GPUsReza Arbab
NUMA nodes corresponding to GPU memory currently have the same affinity/distance as normal memory nodes. Add a third NUMA associativity reference point enabling us to give GPU nodes more distance. This is guest visible information, which shouldn't change under a running guest across migration between different qemu versions, so make the change effective only in new (pseries > 5.0) machine types. Before, `numactl -H` output in a guest with 4 GPUs (nodes 2-5): node distances: node 0 1 2 3 4 5 0: 10 40 40 40 40 40 1: 40 10 40 40 40 40 2: 40 40 10 40 40 40 3: 40 40 40 10 40 40 4: 40 40 40 40 10 40 5: 40 40 40 40 40 10 After: node distances: node 0 1 2 3 4 5 0: 10 40 80 80 80 80 1: 40 10 80 80 80 80 2: 80 80 10 80 80 80 3: 80 80 80 10 80 80 4: 80 80 80 80 10 80 5: 80 80 80 80 80 10 These are the same distances as on the host, mirroring the change made to host firmware in skiboot commit f845a648b8cb ("numa/associativity: Add a new level of NUMA for GPU's"). Signed-off-by: Reza Arbab <arbab@linux.ibm.com> Message-Id: <20200716225655.24289-1-arbab@linux.ibm.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-05-27ppc/spapr: Add hotremovable flag on DIMM LMBs on drmem_v2Leonardo Bras
On reboot, all memory that was previously added using object_add and device_add is placed in this DIMM area. The new SPAPR_LMB_FLAGS_HOTREMOVABLE flag helps Linux to put this memory in the correct memory zone, so no unmovable allocations are made there, allowing the object to be easily hot-removed by device_del and object_del. This new flag was accepted in Power Architecture documentation. Signed-off-by: Leonardo Bras <leobras.c@gmail.com> Reviewed-by: Bharata B Rao <bharata@linux.ibm.com> Message-Id: <20200511200201.58537-1-leobras.c@gmail.com> [dwg: Fixed syntax error spotted by Cédric Le Goater] Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-05-15Drop more @errp parameters after previous commitMarkus Armbruster
Several functions can't fail anymore: ich9_pm_add_properties(), device_add_bootindex_property(), ppc_compat_add_property(), spapr_caps_add_properties(), PropertyInfo.create(). Drop their @errp parameter. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20200505152926.18877-16-armbru@redhat.com>
2020-05-07spapr: Drop CAS reboot flagGreg Kurz
The CAS reboot flag is false by default and all the locations that could set it to true have been dropped. This means that all code blocks depending on the flag being set is dead code and the other code blocks should be executed always. Just do that and drop the now uneeded CAS reboot flag. Fix a comment on the way to make checkpatch happy. Signed-off-by: Greg Kurz <groug@kaod.org> Message-Id: <158514994893.478799.11772512888322840990.stgit@bahia.lan> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-05-07spapr/cas: Separate CAS handling from rebuilding the FDTAlexey Kardashevskiy
At the moment "ibm,client-architecture-support" ("CAS") is implemented in SLOF and QEMU assists via the custom H_CAS hypercall which copies an updated flatten device tree (FDT) blob to the SLOF memory which it then uses to update its internal tree. When we enable the OpenFirmware client interface in QEMU, we won't need to copy the FDT to the guest as the client is expected to fetch the device tree using the client interface. This moves FDT rebuild out to a separate helper which is going to be called from the "ibm,client-architecture-support" handler and leaves writing FDT to the guest in the H_CAS handler. This should not cause any behavioral change. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Message-Id: <20200310050733.29805-3-aik@ozlabs.ru> Signed-off-by: Greg Kurz <groug@kaod.org> Message-Id: <158514994229.478799.2178881312094922324.stgit@bahia.lan> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-03-17ppc/spapr: Add FWNMI System Reset stateNicholas Piggin
The FWNMI option must deliver system reset interrupts to their registered address, and there are a few constraints on the handler addresses specified in PAPR. Add the system reset address state and checks. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Message-Id: <20200316142613.121089-4-npiggin@gmail.com> Reviewed-by: Greg Kurz <groug@kaod.org> Reviwed-by: Mahesh Salgaonkar <mahesh@linux.ibm.com> Reviewed-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>