aboutsummaryrefslogtreecommitdiff
path: root/hw/ppc/spapr_pci.c
AgeCommit message (Collapse)Author
2017-09-08spapr_pci: parent the MSI memory region to the PHBGreg Kurz
This memory region should be owned by the PHB. This ensures the PHB cannot be finalized as long as the the region is guest visible, or used by a CPU or a device. Signed-off-by: Greg Kurz <groug@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2017-09-08spapr_pci: use memory_region_add_subregion() with DMA windowsGreg Kurz
Passing a null priority to memory_region_add_subregion_overlap() is strictly equivalent to calling memory_region_add_subregion(). Signed-off-by: Greg Kurz <groug@kaod.org> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2017-07-25spapr_pci: Fix obsolete comment about MSIX encoding in addr/dataAlexey Kardashevskiy
f1c2dc7c866a "spapr-pci: rework MSI/MSIX" (07/2013) changed MSIX encoding but forgot to change the comment so this changes it. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2017-07-17spapr: Cleanups relating to DRC awaiting_release fieldDavid Gibson
'awaiting_release' indicates that the host has requested an unplug of the device attached to the DRC, but the guest has not (yet) put the device into a state where it is safe to complete removal. 1. Rename it to 'unplug_requested' which to me at least is clearer 2. Remove the ->release_pending() method used to check this from outside spapr_drc.c. The method only plausibly has one implementation, so use a plain function (spapr_drc_unplug_requested()) instead. 3. Remove it from the migration stream. Attempting to migrate mid-unplug is broken not just for spapr - in general management has no good way to determine if the device should be present on the destination or not. So, until that's fixed, there's no point adding extra things to the stream. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Greg Kurz <groug@kaod.org> Tested-by: Daniel Barboza <danielhb@linux.vnet.ibm.com>
2017-07-17spapr: Refactor spapr_drc_detach()David Gibson
This function has two unused parameters - remove them. It also sets awaiting_release on all paths, except one. On that path setting it is harmless, since it will be immediately cleared by spapr_drc_release(). So factor it out of the if statements. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Greg Kurz <groug@kaod.org> Tested-by: Daniel Barboza <danielhb@linux.vnet.ibm.com>
2017-07-17spapr: Treat devices added before inbound migration as coldpluggedLaurent Vivier
When migrating a guest which has already had devices hotplugged, libvirt typically starts the destination qemu with -incoming defer, adds those hotplugged devices with qmp, then initiates the incoming migration. This causes problems for the management of spapr DRC state. Because the device is treated as hotplugged, it goes into a DRC state for a device immediately after it's plugged, but before the guest has acknowledged its presence. However, chances are the guest on the source machine *has* acknowledged the device's presence and configured it. If the source has fully configured the device, then DRC state won't be sent in the migration stream: for maximum migration compatibility with earlier versions we don't migrate DRCs in coldplug-equivalent state. That means that the DRC effectively changes state over the migrate, causing problems later on. In addition, logging hotplug events for these devices isn't what we want because a) those events should already have been issued on the source host and b) the event queue should get wiped out by the incoming state anyway. In short, what we really want is to treat devices added before an incoming migration as if they were coldplugged. To do this, we first add a spapr_drc_hotplugged() helper which determines if the device is hotplugged in the sense relevant for DRC state management. We only send hotplug events when this is true. Second, when we add a device which isn't hotplugged in this sense, we force a reset of the DRC state - this ensures the DRC is in a coldplug-equivalent state (there isn't usually a system reset between these device adds and the incoming migration). This is based on an earlier patch by Laurent Vivier, cleaned up and extended. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Greg Kurz <groug@kaod.org> Tested-by: Daniel Barboza <danielhb@linux.vnet.ibm.com>
2017-07-11spapr: Only report host/guest IOMMU page size mismatches on KVMDavid Gibson
We print a warning if the spapr IOMMU isn't configured to support a page size matching the host page size backing RAM. When that's the case we need more complex logic to translate VFIO mappings, which is slower. But, it's not so slow that it would be at all noticeable against the general slowness of TCG. So, only warn when using KVM. This removes some noisy and unhelpful warnings from make check on hosts with page sizes which typically differ from those on POWER (e.g. Sparc). Reported-by: Peter Maydell <peter.maydell@linaro.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Thomas Huth <thuth@redhat.com>
2017-07-11spapr: Use unplug_request for PCI hot unplugDavid Gibson
AIUI, ->unplug_request in the HotplugHandler is used for "soft" unplug, where acknowledgement from the guest is required before completing the unplug, whereas ->unplug is used for "hard" unplug where qemu unilaterally removes the device, and the guest just has to cope with its sudden absence. For spapr we (correctly) use ->unplug_request for CPU and memory hot unplug but we use ->unplug for PCI. While I think it might be possible to support "hard" PCI unplug within the PAPR model, that's not how it actually works now. Although it's called from ->unplug, the PCI unplug path will usually just mark the device for removal, with completion of the unplug delayed until userspace responds to the unplug notification. If the guest doesn't respond as expected, that could delay the unplug completion arbitrarily long. To reflect that, change the PCI unplug path to be called from ->unplug_request. We also rename spapr_phb_hot_plug_child() and spapr_phb_hot_unplug_child() to spapr_pci_plug() and spapr_pci_unplug_request() to more obviously reflect the callbacks they're implementing. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: Greg Kurz <groug@kaod.org>
2017-07-11spapr: Remove unnecessary differences between hotplug and coldplug pathsDavid Gibson
spapr_drc_attach() has a 'coldplug' parameter which sets the DRC into configured state initially, instead of the usual ISOLATED/UNUSABLE state. It turns out this is unnecessary: although coldplugged devices do need to be in CONFIGURED state once the guest starts, that will already be accomplished by the reset code which will move DRCs for already plugged devices into a coldplug equivalent state. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: Greg Kurz <groug@kaod.org>
2017-07-11spapr: fix migration to pseries machine < 2.8Laurent Vivier
since commit 5c4537bd ("spapr: Fix 2.7<->2.8 migration of PCI host bridge"), some migration fields are forged from the new ones in spapr_pci_pre_save(). It works well, except when the number of MSI devices is 0, because in this case the function exits immediately. This fix moves the migration code before the exit code. The problem can be reproduced with these commands: source qemu-2.9: qemu-system-ppc64 -monitor stdio -M pseries-2.6 -nodefaults -S destination qemu-2.6: qemu-system-ppc64 -monitor stdio -M pseries-2.6 -nodefaults \ -incoming tcp:0:4444 on the source: migrate tcp:localhost:4444 Destination fails with the following error: qemu-system-ppc64: error while loading state for instance 0x0 of device 'spapr_pci' qemu-system-ppc64: load of migration failed: Invalid argument Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: Greg Kurz <groug@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2017-06-28vmstate: error hint for failed equal checksHalil Pasic
In some cases a failing VMSTATE_*_EQUAL does not mean we detected a bug, but it's actually the best we can do. Especially in these cases a verbose error message is required. Let's introduce infrastructure for specifying a error hint to be used if equal check fails. Let's do this by adding a parameter to the _EQUAL macros called _err_hint. Also change all current users to pass NULL as last parameter so nothing changes for them. Signed-off-by: Halil Pasic <pasic@linux.vnet.ibm.com> Message-Id: <20170623144823.42936-1-pasic@linux.vnet.ibm.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>
2017-06-08spapr: Fold spapr_phb_{add,remove}_pci_device() into their only callersDavid Gibson
Both functions are fairly short, and so are their callers. There's no particular logical distinction between them, so fold them together. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Michael Roth <mdroth@linux.vnet.ibm.com> Acked-by: Michael Roth <mdroth@linux.vnet.ibm.com>
2017-06-08spapr: Change DRC attach & detach methods to functionsDavid Gibson
DRC objects have attach & detach methods, but there's only one implementation. Although there are some differences in its behaviour for different DRC types, the overall structure is the same, so while we might want different method implementations for some parts, we're unlikely to want them for the top-level functions. So, replace them with direct function calls. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Michael Roth <mdroth@linux.vnet.ibm.com> Acked-by: Michael Roth <mdroth@linux.vnet.ibm.com>
2017-06-08spapr: Clean up DR entity sense handlingDavid Gibson
DRC classes have an entity_sense method to determine (in a specific PAPR sense) the presence or absence of a device plugged into a DRC. However, we only have one implementation of the method, which explicitly tests for different DRC types. This changes it to instead have different method implementations for the two cases: "logical" and "physical" DRCs. While we're at it, the entity sense method always returns RTAS_OUT_SUCCESS, and the interesting value is returned via pass-by-reference. Simplify this to directly return the value we care about Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Michael Roth <mdroth@linux.vnet.ibm.com> Acked-by: Michael Roth <mdroth@linux.vnet.ibm.com>
2017-06-06spapr: Clean up spapr_dr_connector_by_*()David Gibson
* Change names to something less ludicrously verbose * Now that we have QOM subclasses for the different DRC types, use a QOM typename instead of a PAPR type value parameter The latter allows removal of the get_type_shift() helper. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Michael Roth <mdroth@linux.vnet.ibm.com> Acked-by: Michael Roth <mdroth@linux.vnet.ibm.com>
2017-06-06spapr: Introduce DRC subclassesDavid Gibson
Currently we only have a single QOM type for all DRCs, but lots of places where we switch behaviour based on the DRC's PAPR defined type. This is a poor use of our existing type system. So, instead create QOM subclasses for each PAPR defined DRC type. We also introduce intermediate subclasses for physical and logical DRCs, a division which will be useful later on. Instead of being stored in the DRC object itself, the PAPR type is now stored in the class structure. There are still many places where we switch directly on the PAPR type value, but this at least provides the basis to start to remove those. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Michael Roth <mdroth@linux.vnet.ibm.com> Acked-by: Michael Roth <mdroth@linux.vnet.ibm.com>
2017-06-06spapr: Make DRC get_index and get_type methods into plain functionsDavid Gibson
These two methods only have one implementation, and the spec they're implementing means any other implementation is unlikely, verging on impossible. So replace them with simple functions. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Laurent Vivier <lvivier@redhat.com> Tested-by: Daniel Barboza <danielhb@linux.vnet.ibm.com>
2017-05-25hw/ppc: removing drc->detach_cb and drc->detach_cb_opaqueDaniel Henrique Barboza
The pointer drc->detach_cb is being used as a way of informing the detach() function inside spapr_drc.c which cb to execute. This information can also be retrieved simply by checking drc->type and choosing the right callback based on it. In this context, detach_cb is redundant information that must be managed. After the previous spapr_lmb_release change, no detach_cb_opaques are being used by any of the three callbacks functions. This is yet another information that is now unused and, on top of that, can't be migrated either. This patch makes the following changes: - removal of detach_cb_opaque. the 'opaque' argument was removed from the callbacks and from the detach() function of sPAPRConnectorClass. The attribute detach_cb_opaque of sPAPRConnector was removed. - removal of detach_cb from the detach() call. The function pointer detach_cb of sPAPRConnector was removed. detach() now uses a switch(drc->type) to execute the apropriate callback. To achieve this, spapr_core_release, spapr_lmb_release and spapr_phb_remove_pci_device_cb callbacks were made public to be visible inside detach(). Signed-off-by: Daniel Henrique Barboza <danielhb@linux.vnet.ibm.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2017-05-17sysbus: Set user_creatable=false by default on TYPE_SYS_BUS_DEVICEEduardo Habkost
commit 33cd52b5d7b9adfd009e95f07e6c64dd88ae2a31 unset cannot_instantiate_with_device_add_yet in TYPE_SYSBUS, making all sysbus devices appear on "-device help" and lack the "no-user" flag in "info qdm". To fix this, we can set user_creatable=false by default on TYPE_SYS_BUS_DEVICE, but this requires setting user_creatable=true explicitly on the sysbus devices that actually work with -device. Fortunately today we have just a few has_dynamic_sysbus=1 machines: virt, pc-q35-*, ppce500, and spapr. virt, ppce500, and spapr have extra checks to ensure just a few device types can be instantiated: * virt supports only TYPE_VFIO_CALXEDA_XGMAC, TYPE_VFIO_AMD_XGBE. * ppce500 supports only TYPE_ETSEC_COMMON. * spapr supports only TYPE_SPAPR_PCI_HOST_BRIDGE. This patch sets user_creatable=true explicitly on those 4 device classes. Now, the more complex cases: pc-q35-*: q35 has no sysbus device whitelist yet (which is a separate bug). We are in the process of fixing it and building a sysbus whitelist on q35, but in the meantime we can fix the "-device help" and "info qdm" bugs mentioned above. Also, despite not being strictly necessary for fixing the q35 bug, reducing the list of user_creatable=true devices will help us be more confident when building the q35 whitelist. xen: We also have a hack at xen_set_dynamic_sysbus(), that sets has_dynamic_sysbus=true at runtime when using the Xen accelerator. This hack is only used to allow xen-backend devices to be dynamically plugged/unplugged. This means today we can use -device with the following 22 device types, that are the ones compiled into the qemu-system-x86_64 and qemu-system-i386 binaries: * allwinner-ahci * amd-iommu * cfi.pflash01 * esp * fw_cfg_io * fw_cfg_mem * generic-sdhci * hpet * intel-iommu * ioapic * isabus-bridge * kvmclock * kvm-ioapic * kvmvapic * SUNW,fdtwo * sysbus-ahci * sysbus-fdc * sysbus-ohci * unimplemented-device * virtio-mmio * xen-backend * xen-sysdev This patch adds user_creatable=true explicitly to those devices, temporarily, just to keep 100% compatibility with existing behavior of q35. Subsequent patches will remove user_creatable=true from the devices that are really not meant to user-creatable on any machine, and remove the FIXME comment from the ones that are really supposed to be user-creatable. This is being done in separate patches because we still don't have an obvious list of devices that will be whitelisted by q35, and I would like to get each device reviewed individually. Cc: Alexander Graf <agraf@suse.de> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Alistair Francis <alistair.francis@xilinx.com> Cc: Beniamino Galvani <b.galvani@gmail.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Cornelia Huck <cornelia.huck@de.ibm.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: "Edgar E. Iglesias" <edgar.iglesias@gmail.com> Cc: Eduardo Habkost <ehabkost@redhat.com> Cc: Frank Blaschka <frank.blaschka@de.ibm.com> Cc: Gabriel L. Somlo <somlo@cmu.edu> Cc: Gerd Hoffmann <kraxel@redhat.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: John Snow <jsnow@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kevin Wolf <kwolf@redhat.com> Cc: Laszlo Ersek <lersek@redhat.com> Cc: Marcel Apfelbaum <marcel@redhat.com> Cc: Markus Armbruster <armbru@redhat.com> Cc: Max Reitz <mreitz@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Maydell <peter.maydell@linaro.org> Cc: Pierre Morel <pmorel@linux.vnet.ibm.com> Cc: Prasad J Pandit <pjp@fedoraproject.org> Cc: qemu-arm@nongnu.org Cc: qemu-block@nongnu.org Cc: qemu-ppc@nongnu.org Cc: Richard Henderson <rth@twiddle.net> Cc: Rob Herring <robh@kernel.org> Cc: Shannon Zhao <zhaoshenglong@huawei.com> Cc: sstabellini@kernel.org Cc: Thomas Huth <thuth@redhat.com> Cc: Yi Min Zhao <zyimin@linux.vnet.ibm.com> Acked-by: John Snow <jsnow@redhat.com> Acked-by: Juergen Gross <jgross@suse.com> Acked-by: Marcel Apfelbaum <marcel@redhat.com> Signed-off-by: Eduardo Habkost <ehabkost@redhat.com> Message-Id: <20170503203604.31462-3-ehabkost@redhat.com> Reviewed-by: Markus Armbruster <armbru@redhat.com> [ehabkost: Small changes at sysbus_device_class_init() comments] Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
2017-04-26spapr_pci: Removed unused includeAlexey Kardashevskiy
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2017-04-26spapr_pci: Warn when RAM page size is not enabled in IOMMU page maskAlexey Kardashevskiy
If a page size used by QEMU is not enabled in the PHB IOMMU page mask, in-kernel acceleration of TCE handling won't be enabled and performance might be slower than expected. This prints a warning if system page size is not enabled. This should print a warning if huge pages are enabled but sphb.pgsz still uses the default value of 4K|64K. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2017-03-14pseries: Don't expose PCIe extended config space on older machine typesDavid Gibson
bb9986452 "spapr_pci: Advertise access to PCIe extended config space" allowed guests to access the extended config space of PCI Express devices via the PAPR interfaces, even though the paravirtualized bus mostly acts like plain PCI. However, that patch enabled access unconditionally, including for existing machine types, which is an unwise change in behaviour. This patch limits the change to pseries-2.9 (and later) machine types. Suggested-by: Andrea Bolognani <abologna@redhat.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2017-03-03spapr_pci: Advertise access to PCIe extended config spaceDavid Gibson
The (paravirtual) PCI host bridge on the 'pseries' machine in most regards acts like a regular PCI bus, rather than a PCIe bus. Despite this, though, it does allow access to the PCIe extended config space. We already implemented the RTAS methods to allow this access.. but forgot to put the markers into the device tree so that guest's know it is there. This adds them in. With this, a pseries guest is able to view extended config space on (for example an e1000e device. This should be enough to allow guests to use at least some PCIe devices. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2017-03-01ppc/xics: use the QOM interface to get irqsCédric Le Goater
Signed-off-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2017-03-01ppc/xics: store the ICS object under the sPAPR machineCédric Le Goater
A list of ICS objects was introduced under the XICS object for the PowerNV machine but, for the sPAPR machine, it brings extra complexity as there is only a single ICS. To simplify the code, let's add the ICS pointer under the sPAPR machine and try to reduce the use of this list where possible. Also, change the xics_spapr_*() routines to use an ICS object instead of an XICSState and change their name to reflect that these are specific to the sPAPR ICS object. Signed-off-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2017-03-01spapr/pci: populate PCI DT in reverse orderGreg Kurz
Since commit 1d2d974244c6 "spapr_pci: enumerate and add PCI device tree", QEMU populates the PCI device tree in the opposite order compared to SLOF. Before 1d2d974244c6: Populating /pci@800000020000000 00 0000 (D) : 1af4 1000 virtio [ net ] 00 0800 (D) : 1af4 1001 virtio [ block ] 00 1000 (D) : 1af4 1009 virtio [ network ] Populating /pci@800000020000000/unknown-legacy-device@2 7e5294b8 : /pci@800000020000000 7e52b998 : |-- ethernet@0 7e52c0c8 : |-- scsi@1 7e52c7e8 : +-- unknown-legacy-device@2 ok Since 1d2d974244c6: Populating /pci@800000020000000 00 1000 (D) : 1af4 1009 virtio [ network ] Populating /pci@800000020000000/unknown-legacy-device@2 00 0800 (D) : 1af4 1001 virtio [ block ] 00 0000 (D) : 1af4 1000 virtio [ net ] 7e5e8118 : /pci@800000020000000 7e5ea6a0 : |-- unknown-legacy-device@2 7e5eadb8 : |-- scsi@1 7e5eb4d8 : +-- ethernet@0 ok This behaviour change is not actually a bug since no assumptions should be made on DT ordering. But it has no real justification either, other than being the consequence of the way fdt_add_subnode() inserts new elements to the front of the FDT rather than adding them to the tail. This patch reverts to the historical SLOF ordering by walking PCI devices in reverse order. This reconciles pseries with x86 machine types behavior. It is expected to make things easier when porting existing applications to power. Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com> Tested-by: Thomas Huth <thuth@redhat.com> Reviewed-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> (slight update to the changelog) Signed-off-by: Greg Kurz <groug@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2017-03-01spapr: generate DT node namesLaurent Vivier
When DT node names for PCI devices are generated by SLOF, they are generated according to the type of the device (for instance, ethernet for virtio-net-pci device). Node name for hotplugged devices is generated by QEMU. This patch adds the mechanic to QEMU to create the node name according to the device type too. The data structure has been roughly copied from OpenBIOS/OpenHackware, node names from SLOF. Example: Hotplugging some PCI cards with QEMU monitor: device_add virtio-tablet-pci device_add virtio-serial-pci device_add virtio-mouse-pci device_add virtio-scsi-pci device_add virtio-gpu-pci device_add ne2k_pci device_add nec-usb-xhci device_add intel-hda What we can see in linux device tree: for dir in /proc/device-tree/pci@800000020000000/*@*/; do echo $dir cat $dir/name echo done WITHOUT this patch: /proc/device-tree/pci@800000020000000/pci@0/ pci /proc/device-tree/pci@800000020000000/pci@1/ pci /proc/device-tree/pci@800000020000000/pci@2/ pci /proc/device-tree/pci@800000020000000/pci@3/ pci /proc/device-tree/pci@800000020000000/pci@4/ pci /proc/device-tree/pci@800000020000000/pci@5/ pci /proc/device-tree/pci@800000020000000/pci@6/ pci /proc/device-tree/pci@800000020000000/pci@7/ pci WITH this patch: /proc/device-tree/pci@800000020000000/communication-controller@1/ communication-controller /proc/device-tree/pci@800000020000000/display@4/ display /proc/device-tree/pci@800000020000000/ethernet@5/ ethernet /proc/device-tree/pci@800000020000000/input-controller@0/ input-controller /proc/device-tree/pci@800000020000000/mouse@2/ mouse /proc/device-tree/pci@800000020000000/multimedia-device@7/ multimedia-device /proc/device-tree/pci@800000020000000/scsi@3/ scsi /proc/device-tree/pci@800000020000000/usb-xhci@6/ usb-xhci Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: Thomas Huth <thuth@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2016-11-23spapr: Fix 2.7<->2.8 migration of PCI host bridgeDavid Gibson
daa2369 "spapr_pci: Add a 64-bit MMIO window" subtly broke migration from qemu-2.7 to the current version. It split the device's MMIO window into two pieces for 32-bit and 64-bit MMIO. The patch included backwards compatibility code to convert the old property into the new format. However, the property value was also transferred in the migration stream and compared with a (probably unwise) VMSTATE_EQUAL. So, the "raw" value from 2.7 is compared to the new style converted value from (pre-)2.8 giving a mismatch and migration failure. Along with the actual field that caused the breakage, there are several other ill-advised VMSTATE_EQUAL()s. To fix forwards migration, we read the values in the stream into scratch variables and ignore them, instead of comparing for equality. To fix backwards migration, we populate those scratch variables in pre_save() with adjusted values to match the old behaviour. To permit the eventual possibility of removing this cruft from the stream, we only include these compatibility fields if a new 'pre-2.8-migration' property is set. We clear it on the pseries-2.8 machine type, which obviously can't be migrated backwards, but set it on earlier machine type versions. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Reviewed-by: Thomas Huth <thuth@redhat.com> Reviewed-by: Greg Kurz <groug@kaod.org> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
2016-11-23Revert "spapr: Fix migration of PCI host bridges from qemu-2.7"David Gibson
This reverts commit 9b54ca0ba781012eeea4237b7c4832ba2ea81d89. The commit above corrected a migration breakage between qemu-2.7 and qemu-2.8. However it did so by advancing the migration version for the PCI host bridge, which obviously breaks migration backwards to earlier qemu versions. Although it's not totally essential, we'd like to maintain the possibility for backwards migration, so revert the change in preparation for a better fix. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Thomas Huth <thuth@redhat.com> Reviewed-by: Greg Kurz <groug@kaod.org> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
2016-11-15spapr: Fix migration of PCI host bridges from qemu-2.7David Gibson
daa2369 "spapr_pci: Add a 64-bit MMIO window" subtly broke migration from qemu-2.7 to the current version. It split the device's MMIO window into two pieces for 32-bit and 64-bit MMIO. The patch included backwards compatibility code to convert the old property into the new format. However, the property value was also transferred in the migration stream and compared with a (probably unwise) VMSTATE_EQUAL. So, the "raw" value from 2.7 is compared to the new style converted value from (pre-)2.8 giving a mismatch and migration failure. Although it would be technically possible to fix this in a way allowing backwards migration, that would leave an ugly legacy around indefinitely. This patch takes the simpler approach of bumping the migration version, dropping the unwise VMSTATE_EQUAL (and some equally unwise ones around it) and ignoring them on an incoming migration. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
2016-10-28spapr_pci: advertise explicit numa IDs even when there's 1 nodeMichael Roth
With the addition of "numa_node" properties for PHBs we began advertising NUMA affinity in cases where nb_numa_nodes > 1. Since the default on the guest side is to make no assumptions about PHB NUMA affinity (defaulting to -1), there is still a valid use-case for explicitly defining a PHB's NUMA affinity even when there's just one node. In particular, some workloads make faulty assumptions about /sys/bus/pci/<devid>/numa_node being >= 0, warranting the use of this property as a workaround even if there's just 1 PHB or NUMA node. Enable this use-case by always advertising the PHB's NUMA affinity if "numa_node" has been explicitly set. We could achieve this by relaxing the check to simply be nb_numa_nodes > 0, but even safer would be to check numa_info[nodeid].present explicitly, and to fail at start time for cases where it does not exist. This has an additional affect of no longer advertising PHB NUMA affinity unconditionally if nb_numa_nodes > 1 and "numa_node" property is unset/-1, but since the default value on the guest side for each PHB is also -1, the behavior should be the same for that situation. We could still retain the old behavior if desired, but the decision seems arbitrary, so we take the simpler route. Cc: Alexey Kardashevskiy <aik@ozlabs.ru> Cc: Shivaprasad G. Bhat <shivapbh@in.ibm.com> Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2016-10-16spapr: Improved placement of PCI host bridges in guest memory mapDavid Gibson
Currently, the MMIO space for accessing PCI on pseries guests begins at 1 TiB in guest address space. Each PCI host bridge (PHB) has a 64 GiB chunk of address space in which it places its outbound PIO and 32-bit and 64-bit MMIO windows. This scheme as several problems: - It limits guest RAM to 1 TiB (though we have a limited fix for this now) - It limits the total MMIO window to 64 GiB. This is not always enough for some of the large nVidia GPGPU cards - Putting all the windows into a single 64 GiB area means that naturally aligning things within there will waste more address space. In addition there was a miscalculation in some of the defaults, which meant that the MMIO windows for each PHB actually slightly overran the 64 GiB region for that PHB. We got away without nasty consequences because the overrun fit within an unused area at the beginning of the next PHB's region, but it's not pretty. This patch implements a new scheme which addresses those problems, and is also closer to what bare metal hardware and pHyp guests generally use. Because some guest versions (including most current distro kernels) can't access PCI MMIO above 64 TiB, we put all the PCI windows between 32 TiB and 64 TiB. This is broken into 1 TiB chunks. The first 1 TiB contains the PIO (64 kiB) and 32-bit MMIO (2 GiB) windows for all of the PHBs. Each subsequent TiB chunk contains a naturally aligned 64-bit MMIO window for one PHB each. This reduces the number of allowed PHBs (without full manual configuration of all the windows) from 256 to 31, but this should still be plenty in practice. We also change some of the default window sizes for manually configured PHBs to saner values. Finally we adjust some tests and libqos so that it correctly uses the new default locations. Ideally it would parse the device tree given to the guest, but that's a more complex problem for another time. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Laurent Vivier <lvivier@redhat.com>
2016-10-16spapr_pci: Add a 64-bit MMIO windowDavid Gibson
On real hardware, and under pHyp, the PCI host bridges on Power machines typically advertise two outbound MMIO windows from the guest's physical memory space to PCI memory space: - A 32-bit window which maps onto 2GiB..4GiB in the PCI address space - A 64-bit window which maps onto a large region somewhere high in PCI address space (traditionally this used an identity mapping from guest physical address to PCI address, but that's not always the case) The qemu implementation in spapr-pci-host-bridge, however, only supports a single outbound MMIO window, however. At least some Linux versions expect the two windows however, so we arranged this window to map onto the PCI memory space from 2 GiB..~64 GiB, then advertised it as two contiguous windows, the "32-bit" window from 2G..4G and the "64-bit" window from 4G..~64G. This approach means, however, that the 64G window is not naturally aligned. In turn this limits the size of the largest BAR we can map (which does have to be naturally aligned) to roughly half of the total window. With some large nVidia GPGPU cards which have huge memory BARs, this is starting to be a problem. This patch adds true support for separate 32-bit and 64-bit outbound MMIO windows to the spapr-pci-host-bridge implementation, each of which can be independently configured. The 32-bit window always maps to 2G.. in PCI space, but the PCI address of the 64-bit window can be configured (it defaults to the same as the guest physical address). So as not to break possible existing configurations, as long as a 64-bit window is not specified, a large single window can be specified. This will appear the same way to the guest as the old approach, although it's now implemented by two contiguous memory regions rather than a single one. For now, this only adds the possibility of 64-bit windows. The default configuration still uses the legacy mode. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Laurent Vivier <lvivier@redhat.com>
2016-10-16spapr_pci: Delegate placement of PCI host bridges to machine typeDavid Gibson
The 'spapr-pci-host-bridge' represents the virtual PCI host bridge (PHB) for a PAPR guest. Unlike on x86, it's routine on Power (both bare metal and PAPR guests) to have numerous independent PHBs, each controlling a separate PCI domain. There are two ways of configuring the spapr-pci-host-bridge device: first it can be done fully manually, specifying the locations and sizes of all the IO windows. This gives the most control, but is very awkward with 6 mandatory parameters. Alternatively just an "index" can be specified which essentially selects from an array of predefined PHB locations. The PHB at index 0 is automatically created as the default PHB. The current set of default locations causes some problems for guests with large RAM (> 1 TiB) or PCI devices with very large BARs (e.g. big nVidia GPGPU cards via VFIO). Obviously, for migration we can only change the locations on a new machine type, however. This is awkward, because the placement is currently decided within the spapr-pci-host-bridge code, so it breaks abstraction to look inside the machine type version. So, this patch delegates the "default mode" PHB placement from the spapr-pci-host-bridge device back to the machine type via a public method in sPAPRMachineClass. It's still a bit ugly, but it's about the best we can do. For now, this just changes where the calculation is done. It doesn't change the actual location of the host bridges, or any other behaviour. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Laurent Vivier <lvivier@redhat.com>
2016-10-14ppc/xics: Make the ICSState a listBenjamin Herrenschmidt
Instead of an array of fixed sized blocks, use a list, as we will need to have sources with variable number of interrupts. SPAPR only uses a single entry. Native will create more. If performance becomes an issue we can add some hashed lookup but for now this will do fine. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> [ move the initialization of list to xics_common_initfn, restore xirr_owner after migration and move restoring to icp_post_load] Signed-off-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> [ clg: removed the icp_post_load() changes from nikunj patchset v3: http://patchwork.ozlabs.org/patch/646008/ ] Signed-off-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2016-09-23spapr_pci: Add numa node idAlexey Kardashevskiy
This adds a numa id property to a PHB to allow linking passed PCI device to CPU/memory. It is up to the management stack to do CPU/memory pinning to the node with the actual PCI device. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> [dwg: Renamed property from "node" to "numa_node" to match the similar one in the pxb device] Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2016-07-05spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)Alexey Kardashevskiy
This adds support for Dynamic DMA Windows (DDW) option defined by the SPAPR specification which allows to have additional DMA window(s) The "ddw" property is enabled by default on a PHB but for compatibility the pseries-2.6 machine and older disable it. This also creates a single DMA window for the older machines to maintain backward migration. This implements DDW for PHB with emulated and VFIO devices. The host kernel support is required. The advertised IOMMU page sizes are 4K and 64K; 16M pages are supported but not advertised by default, in order to enable them, the user has to specify "pgsz" property for PHB and enable huge pages for RAM. The existing linux guests try creating one additional huge DMA window with 64K or 16MB pages and map the entire guest RAM to. If succeeded, the guest switches to dma_direct_ops and never calls TCE hypercalls (H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM and not waste time on map/unmap later. This adds a "dma64_win_addr" property which is a bus address for the 64bit window and by default set to 0x800.0000.0000.0000 as this is what the modern POWER8 hardware uses and this allows having emulated and VFIO devices on the same bus. This adds 4 RTAS handlers: * ibm,query-pe-dma-window * ibm,create-pe-dma-window * ibm,remove-pe-dma-window * ibm,reset-pe-dma-window These are registered from type_init() callback. These RTAS handlers are implemented in a separate file to avoid polluting spapr_iommu.c with PCI. This changes sPAPRPHBState::dma_liobn to an array to allow 2 LIOBNs and updates all references to dma_liobn. However this does not add 64bit LIOBN to the migration stream as in fact even 32bit LIOBN is rather pointless there (as it is a PHB property and the management software can/should pass LIOBNs via CLI) but we keep it for the backward migration support. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2016-07-05spapr_iommu: Realloc guest visible TCE table when starting/stopping listeningAlexey Kardashevskiy
The sPAPR TCE tables manage 2 copies when VFIO is using an IOMMU - a guest view of the table and a hardware TCE table. If there is no VFIO presense in the address space, then just the guest view is used, if this is the case, it is allocated in the KVM. However since there is no support yet for VFIO in KVM TCE hypercalls, when we start using VFIO, we need to move the guest view from KVM to the userspace; and we need to do this for every IOMMU on a bus with VFIO devices. This implements the callbacks for the sPAPR IOMMU - notify_started() reallocated the guest view to the user space, notify_stopped() does the opposite. This removes explicit spapr_tce_set_need_vfio() call from PCI hotplug path as the new callbacks do this better - they notify IOMMU at the exact moment when the configuration is changed, and this also includes the case of PCI hot unplug. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2016-07-01ppc/xics: Replace "icp" with "xics" in most placesBenjamin Herrenschmidt
The "ICP" is a different object than the "XICS". For historical reasons, we have a number of places where we name a variable "icp" while it contains a XICSState pointer. There *is* an ICPState structure too so this makes the code really confusing. This is a mechanical replacement of all those instances to use the name "xics" instead. There should be no functional change. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> [spapr_cpu_init has been moved to spapr_cpu_core.c, change there] Signed-off-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2016-07-01ppc/xics: Rename existing xics to xics_spaprBenjamin Herrenschmidt
The common class doesn't change, the KVM one is sPAPR specific. Rename variables and functions to xics_spapr. Retain the type name as "xics" to preserve migration for existing sPAPR guests. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2016-06-07spapr_pci: Drop cannot_instantiate_with_device_add_yet=falseMarkus Armbruster
It's become redundant since it was added in commit 09aa9a5 "spapr-pci: enable adding PHB via -device". Cc: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2016-06-07spapr_pci: Add and export DMA resetting helperAlexey Kardashevskiy
This will be later used by the "ibm,reset-pe-dma-window" RTAS handler which resets the DMA configuration to the defaults. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2016-06-07spapr_pci: Reset DMA config on PHB resetAlexey Kardashevskiy
LoPAPR dictates that during system reset all DMA windows must be removed and the default DMA32 window must be created so does the patch. At the moment there is just one window supported so no change in behaviour is expected. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2016-06-07spapr_iommu: Add root memory regionAlexey Kardashevskiy
We are going to have multiple DMA windows at different offsets on a PCI bus. For the sake of migration, we will have as many TCE table objects pre-created as many windows supported. So we need a way to map windows dynamically onto a PCI bus when migration of a table is completed but at this stage a TCE table object does not have access to a PHB to ask it to map a DMA window backed by just migrated TCE table. This adds a "root" memory region (UINT64_MAX long) to the TCE object. This new region is mapped on a PCI bus with enabled overlapping as there will be one root MR per TCE table, each of them mapped at 0. The actual IOMMU memory region is a subregion of the root region and a TCE table enables/disables this subregion and maps it at the specific offset inside the root MR which is 1:1 mapping of a PCI address space. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Thomas Huth <thuth@redhat.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2016-06-07spapr_iommu: Introduce "enabled" state for TCE tableAlexey Kardashevskiy
Currently TCE tables are created once at start and their sizes never change. We are going to change that by introducing a Dynamic DMA windows support where DMA configuration may change during the guest execution. This changes spapr_tce_new_table() to create an empty zero-size IOMMU memory region (IOMMU MR). Only LIOBN is assigned by the time of creation. It still will be called once at the owner object (VIO or PHB) creation. This introduces an "enabled" state for TCE table objects, some helper functions are added: - spapr_tce_table_enable() receives TCE table parameters, stores in sPAPRTCETable and allocates a guest view of the TCE table (in the user space or KVM) and sets the correct size on the IOMMU MR; - spapr_tce_table_disable() disposes the table and resets the IOMMU MR size; it is made public as the following DDW code will be using it. This changes the PHB reset handler to do the default DMA initialization instead of spapr_phb_realize(). This does not make differenct now but later with more than just one DMA window, we will have to remove them all and create the default one on a system reset. No visible change in behaviour is expected except the actual table will be reallocated every reset. We might optimize this later. The other way to implement this would be dynamically create/remove the TCE table QOM objects but this would make migration impossible as the migration code expects all QOM objects to exist at the receiver so we have to have TCE table objects created when migration begins. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2016-05-27spapr_pci: Use correct DMA LIOBN when composing the device treeAlexey Kardashevskiy
The user could have picked LIOBN via the CLI but the device tree rendering code would still use the value derived from the PHB index (which is the default fallback if LIOBN is not set in the CLI). This replaces SPAPR_PCI_LIOBN() with the actual DMA LIOBN value. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2016-05-27spapr: ensure device trees are always associated with DRCJianjun Duan
There are possible racing situations involving hotplug events and guest migration. For cases where a hotplug event is migrated, or the guest is in the process of fetching device tree at the time of migration, we need to ensure the device tree is created and associated with the corresponding DRC for devices that were hotplugged on the source, but 'coldplugged' on the target. Signed-off-by: Jianjun Duan <duanj@linux.vnet.ibm.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2016-05-19dma: do not depend on kvm_enabled()Paolo Bonzini
Memory barriers are needed also by Xen and, when the ioeventfd bugs are fixed, by TCG as well. sysemu/kvm.h is not anymore needed in sysemu/dma.h, move it to the actual users. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-04-23hw/ppc/spapr: Fix crash when specifying bad parameters to spapr-pci-host-bridgeThomas Huth
QEMU currently crashes when using bad parameters for the spapr-pci-host-bridge device: $ qemu-system-ppc64 -device spapr-pci-host-bridge,buid=0x123,liobn=0x321,mem_win_addr=0x1,io_win_addr=0x10 Segmentation fault The problem is that spapr_tce_find_by_liobn() might return NULL, but the code in spapr_populate_pci_dt() does not check for this condition and then tries to dereference this NULL pointer. Apart from that, the return value of spapr_populate_pci_dt() also has to be checked for all PCI buses, not only for the last one, to make sure we catch all errors. Signed-off-by: Thomas Huth <thuth@redhat.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2016-03-22hw: explicitly include qemu-common.h and cpu.hPaolo Bonzini
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>