aboutsummaryrefslogtreecommitdiff
path: root/hw/vfio/pci.c
AgeCommit message (Collapse)Author
2019-06-13vfio/common: Introduce vfio_set_irq_signaling helperEric Auger
The code used to assign an interrupt index/subindex to an eventfd is duplicated many times. Let's introduce an helper that allows to set/unset the signaling for an ACTION_TRIGGER, ACTION_MASK or ACTION_UNMASK action. In the error message, we now use errno in case of any VFIO_DEVICE_SET_IRQS ioctl failure. Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: Li Qiang <liq3ea@gmail.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2019-06-13vfio/pci: Allow MSI-X relocation to fixup bogus PBAAlex Williamson
The MSI-X relocation code can sometimes be used to work around bogus MSI-X capabilities, but this test for whether the PBA is outside of the specified BAR causes the device to error before we can apply a relocation. Let it proceed if we intend to relocate MSI-X anyway. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2019-06-13vfio/pci: Hide Resizable BAR capabilityAlex Williamson
The resizable BAR capability is currently exposed read-only from the kernel and we don't yet implement a protocol for virtualizing it to the VM. Exposing it to the guest read-only introduces poor behavior as the guest has no reason to test that a control register write is accepted by the hardware. This can lead to cases where the guest OS assumes the BAR has been resized, but it hasn't. This has been observed when assigning AMD Vega GPUs. Note, this does not preclude future enablement of resizable BARs, but it's currently incorrect to expose this capability as read-only, so better to not expose it at all. Reported-by: James Courtier-Dutton <james.dutton@gmail.com> Tested-by: James Courtier-Dutton <james.dutton@gmail.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2019-06-12Include qemu/module.h where needed, drop it from qemu-common.hMarkus Armbruster
Signed-off-by: Markus Armbruster <armbru@redhat.com> Message-Id: <20190523143508.25387-4-armbru@redhat.com> [Rebased with conflicts resolved automatically, except for hw/usb/dev-hub.c hw/misc/exynos4210_rng.c hw/misc/bcm2835_rng.c hw/misc/aspeed_scu.c hw/display/virtio-vga.c hw/arm/stm32f205_soc.c; ui/cocoa.m fixed up]
2019-06-06hw/vfio/pci: Use the QOM DEVICE() macro to access DeviceState.qdevPhilippe Mathieu-Daudé
Rather than looking inside the definition of a DeviceState with "s->qdev", use the QOM prefered style: "DEVICE(s)". This patch was generated using the following Coccinelle script: // Use DEVICE() macros to access DeviceState.qdev @use_device_macro_to_access_qdev@ expression obj; identifier dev; @@ -&obj->dev.qdev +DEVICE(obj) Suggested-by: Peter Maydell <peter.maydell@linaro.org> Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> Message-Id: <20190528164020.32250-10-philmd@redhat.com> Signed-off-by: Laurent Vivier <laurent@vivier.eu>
2019-05-22pci: msix: move 'MSIX_CAP_LENGTH' to header fileLi Qiang
'MSIX_CAP_LENGTH' is defined in two .c file. Move it to hw/pci/msix.h file to reduce duplicated code. CC: qemu-trivial@nongnu.org Signed-off-by: Li Qiang <liq3ea@163.com> Message-Id: <20190521151543.92274-5-liq3ea@163.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Laurent Vivier <laurent@vivier.eu>
2019-05-22vfio: pci: make "vfio-pci-nohotplug" as MACROLi Qiang
The QOMConventions recommends we should use TYPE_FOO for a TypeInfo's name. Though "vfio-pci-nohotplug" is not used in other parts, for consistency we should make this change. CC: qemu-trivial@nongnu.org Signed-off-by: Li Qiang <liq3ea@163.com> Message-Id: <20190521151543.92274-2-liq3ea@163.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Laurent Vivier <laurent@vivier.eu>
2019-04-26spapr: Support NVIDIA V100 GPU with NVLink2Alexey Kardashevskiy
NVIDIA V100 GPUs have on-board RAM which is mapped into the host memory space and accessible as normal RAM via an NVLink bus. The VFIO-PCI driver implements special regions for such GPUs and emulates an NVLink bridge. NVLink2-enabled POWER9 CPUs also provide address translation services which includes an ATS shootdown (ATSD) register exported via the NVLink bridge device. This adds a quirk to VFIO to map the GPU memory and create an MR; the new MR is stored in a PCI device as a QOM link. The sPAPR PCI uses this to get the MR and map it to the system address space. Another quirk does the same for ATSD. This adds additional steps to sPAPR PHB setup: 1. Search for specific GPUs and NPUs, collect findings in sPAPRPHBState::nvgpus, manage system address space mappings; 2. Add device-specific properties such as "ibm,npu", "ibm,gpu", "memory-block", "link-speed" to advertise the NVLink2 function to the guest; 3. Add "mmio-atsd" to vPHB to advertise the ATSD capability; 4. Add new memory blocks (with extra "linux,memory-usable" to prevent the guest OS from accessing the new memory until it is onlined) and npuphb# nodes representing an NPU unit for every vPHB as the GPU driver uses it for link discovery. This allocates space for GPU RAM and ATSD like we do for MMIOs by adding 2 new parameters to the phb_placement() hook. Older machine types set these to zero. This puts new memory nodes in a separate NUMA node to as the GPU RAM needs to be configured equally distant from any other node in the system. Unlike the host setup which assigns numa ids from 255 downwards, this adds new NUMA nodes after the user configures nodes or from 1 if none were configured. This adds requirement similar to EEH - one IOMMU group per vPHB. The reason for this is that ATSD registers belong to a physical NPU so they cannot invalidate translations on GPUs attached to another NPU. It is guaranteed by the host platform as it does not mix NVLink bridges or GPUs from different NPU in the same IOMMU group. If more than one IOMMU group is detected on a vPHB, this disables ATSD support for that vPHB and prints a warning. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> [aw: for vfio portions] Acked-by: Alex Williamson <alex.williamson@redhat.com> Message-Id: <20190312082103.130561-1-aik@ozlabs.ru> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-04-18vfio: Report warnings with warn_report(), not error_printf()Markus Armbruster
Cc: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Markus Armbruster <armbru@redhat.com> Message-Id: <20190417190641.26814-8-armbru@redhat.com> Acked-by: Alex Williamson <alex.williamson@redhat.com>
2019-03-11vfio/display: add xres + yres propertiesGerd Hoffmann
This allows configure the display resolution which the vgpu should use. The information will be passed to the guest using EDID, so the mdev driver must support the vfio edid region for this to work. Signed-off-by: Gerd Hoffmann <kraxel@redhat.com> Reviewed-by: Liam Merwick <liam.merwick@oracle.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2019-01-24trace: forbid use of %m in trace event format stringsDaniel P. Berrangé
The '%m' format instructs glibc's printf()/syslog() implementation to insert the contents of strerror(errno). Since this is a glibc extension it should generally be avoided in QEMU due to need for portability to a variety of platforms. Even though vfio is Linux-only code that could otherwise use "%m", it must still be avoided in trace-events files because several of the backends do not use the format string and so this error information is invisible to them. The errno string value should be given as an explicit trace argument instead, making it accessible to all backends. This also allows it to work correctly with future patches that use the format string with systemtap's simple printf code. Reviewed-by: Eric Blake <eblake@redhat.com> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com> Message-id: 20190123120016.4538-4-berrange@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2018-12-19vfio/pci: Remove PCIe Link Status emulationAlex Williamson
Now that the downstream port will virtually negotiate itself to the link status of the downstream device, we can remove this emulation. It's not clear that it was every terribly useful anyway. Tested-by: Geoffrey McRae <geoff@hostfission.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2018-12-19pcie: Create enums for link speed and widthAlex Williamson
In preparation for reporting higher virtual link speeds and widths, create enums and macros to help us manage them. Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com> Tested-by: Geoffrey McRae <geoff@hostfission.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2018-10-19vfio: Clean up error reporting after previous commitMarkus Armbruster
The previous commit changed vfio's warning messages from vfio warning: DEV-NAME: Could not frobnicate to warning: vfio DEV-NAME: Could not frobnicate To match this change, change error messages from vfio error: DEV-NAME: On fire to vfio DEV-NAME: On fire Note the loss of "error". If we think marking error messages that way is a good idea, we should mark *all* error messages, i.e. make error_report() print it. Cc: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Markus Armbruster <armbru@redhat.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> Message-Id: <20181017082702.5581-7-armbru@redhat.com>
2018-10-19vfio: Use warn_report() & friends to report warningsMarkus Armbruster
The vfio code reports warnings like error_report(WARN_PREFIX "Could not frobnicate", DEV-NAME); where WARN_PREFIX is defined so the message comes out as vfio warning: DEV-NAME: Could not frobnicate This usage predates the introduction of warn_report() & friends in commit 97f40301f1d. It's time to convert to that interface. Since these functions already prefix the message with "warning: ", replace WARN_PREFIX by VFIO_MSG_PREFIX, so the messages come out like warning: vfio DEV-NAME: Could not frobnicate The next commit will replace ERR_PREFIX. Cc: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Markus Armbruster <armbru@redhat.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Message-Id: <20181017082702.5581-6-armbru@redhat.com>
2018-10-19error: Fix use of error_prepend() with &error_fatal, &error_abortMarkus Armbruster
From include/qapi/error.h: * Pass an existing error to the caller with the message modified: * error_propagate(errp, err); * error_prepend(errp, "Could not frobnicate '%s': ", name); Fei Li pointed out that doing error_propagate() first doesn't work well when @errp is &error_fatal or &error_abort: the error_prepend() is never reached. Since I doubt fixing the documentation will stop people from getting it wrong, introduce error_propagate_prepend(), in the hope that it lures people away from using its constituents in the wrong order. Update the instructions in error.h accordingly. Convert existing error_prepend() next to error_propagate to error_propagate_prepend(). If any of these get reached with &error_fatal or &error_abort, the error messages improve. I didn't check whether that's the case anywhere. Cc: Fei Li <fli@suse.com> Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Message-Id: <20181017082702.5581-2-armbru@redhat.com>
2018-10-15vfio-pci: make vfio-pci device more QOM conventionalLi Qiang
Define a TYPE_VFIO_PCI and drop DO_UPCAST. Signed-off-by: Li Qiang <liq3ea@gmail.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-10-15hw/vfio/display: add ramfb supportGerd Hoffmann
So we have a boot display when using a vgpu as primary display. ramfb depends on a fw_cfg file. fw_cfg files can not be added and removed at runtime, therefore a ramfb-enabled vfio device can't be hotplugged. Add a nohotplug variant of the vfio-pci device (as child class). Add the ramfb property to the nohotplug variant only. So to enable the vgpu display with boot support use this: -device vfio-pci-nohotplug,display=on,ramfb=on,sysfsdev=... Signed-off-by: Gerd Hoffmann <kraxel@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-08-23vfio/pci: Handle subsystem realpath() returning NULLAlex Williamson
Fix error reported by Coverity where realpath can return NULL, resulting in a segfault in strcmp(). This should never happen given that we're working through regularly structured sysfs paths, but trivial enough to easily avoid. Fixes: 238e91728503 ("vfio/ccw/pci: Allow devices to opt-in for ballooning") Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-08-17vfio/ccw/pci: Allow devices to opt-in for ballooningAlex Williamson
If a vfio assigned device makes use of a physical IOMMU, then memory ballooning is necessarily inhibited due to the page pinning, lack of page level granularity at the IOMMU, and sufficient notifiers to both remove the page on balloon inflation and add it back on deflation. However, not all devices are backed by a physical IOMMU. In the case of mediated devices, if a vendor driver is well synchronized with the guest driver, such that only pages actively used by the guest driver are pinned by the host mdev vendor driver, then there should be no overlap between pages available for the balloon driver and pages actively in use by the device. Under these conditions, ballooning should be safe. vfio-ccw devices are always mediated devices and always operate under the constraints above. Therefore we can consider all vfio-ccw devices as balloon compatible. The situation is far from straightforward with vfio-pci. These devices can be physical devices with physical IOMMU backing or mediated devices where it is unknown whether a physical IOMMU is in use or whether the vendor driver is well synchronized to the working set of the guest driver. The safest approach is therefore to assume all vfio-pci devices are incompatible with ballooning, but allow user opt-in should they have further insight into mediated devices. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-07-11vfio/pci: do not set the PCIDevice 'has_rom' attributeCédric Le Goater
PCI devices needing a ROM allocate an optional MemoryRegion with pci_add_option_rom(). pci_del_option_rom() does the cleanup when the device is destroyed. The only action taken by this routine is to call vmstate_unregister_ram() which clears the id string of the optional ROM RAMBlock and now, also flags the RAMBlock as non-migratable. This was recently added by commit b895de502717 ("migration: discard non-migratable RAMBlocks"), . VFIO devices do their own loading of the PCI option ROM in vfio_pci_size_rom(). The memory region is switched to an I/O region and the PCI attribute 'has_rom' is set but the RAMBlock of the ROM region is not allocated. When the associated PCI device is deleted, pci_del_option_rom() calls vmstate_unregister_ram() which tries to flag a NULL RAMBlock, leading to a SEGV. It seems that 'has_rom' was set to have memory_region_destroy() called, but since commit 469b046ead06 ("memory: remove memory_region_destroy") this is not necessary anymore as the MemoryRegion is freed automagically. Remove the PCIDevice 'has_rom' attribute setting in vfio. Fixes: b895de502717 ("migration: discard non-migratable RAMBlocks") Signed-off-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-07-02hw/vfio: Use the IEC binary prefix definitionsPhilippe Mathieu-Daudé
It eases code review, unit is explicit. Patch generated using: $ git grep -E '(1024|2048|4096|8192|(<<|>>).?(10|20|30))' hw/ include/hw/ and modified manually. Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org> Message-Id: <20180625124238.25339-38-f4bug@amsat.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-06-05vfio/pci: Default display option to "off"Alex Williamson
Commit a9994687cb9b ("vfio/display: core & wireup") added display support to vfio-pci with the default being "auto", which breaks existing VMs when the vGPU requires GL support but had no previous requirement for a GL compatible configuration. "Off" is the safer default as we impose no new requirements to VM configurations. Fixes: a9994687cb9b ("vfio/display: core & wireup") Cc: qemu-stable@nongnu.org Cc: Gerd Hoffmann <kraxel@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-06-05vfio/quirks: Enable ioeventfd quirks to be handled by vfio directlyAlex Williamson
With vfio ioeventfd support, we can program vfio-pci to perform a specified BAR write when an eventfd is triggered. This allows the KVM ioeventfd to be wired directly to vfio-pci, entirely avoiding userspace handling for these events. On the same micro-benchmark where the ioeventfd got us to almost 90% of performance versus disabling the GeForce quirks, this gets us to within 95%. Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-06-05vfio/quirks: ioeventfd quirk accelerationAlex Williamson
The NVIDIA BAR0 quirks virtualize the PCI config space mirrors found in device MMIO space. Normally PCI config space is considered a slow path and further optimization is unnecessary, however NVIDIA uses a register here to enable the MSI interrupt to re-trigger. Exiting to QEMU for this MSI-ACK handling can therefore rate limit our interrupt handling. Fortunately the MSI-ACK write is easily detected since the quirk MemoryRegion otherwise has very few accesses, so simply looking for consecutive writes with the same data is sufficient, in this case 10 consecutive writes with the same data and size is arbitrarily chosen. We configure the KVM ioeventfd with data match, so there's no risk of triggering for the wrong data or size, but we do risk that pathological driver behavior might consume all of QEMU's file descriptors, so we cap ourselves to 10 ioeventfds for this purpose. In support of the above, generic ioeventfd infrastructure is added for vfio quirks. This automatically initializes an ioeventfd list per quirk, disables and frees ioeventfds on exit, and allows ioeventfds marked as dynamic to be dropped on device reset. The rationale for this latter feature is that useful ioeventfds may depend on specific driver behavior and since we necessarily place a cap on our use of ioeventfds, a machine reset is a reasonable point at which to assume a new driver and re-profile. Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-06-05vfio/quirks: Add quirk reset callbackAlex Williamson
Quirks can be self modifying, provide a hook to allow them to cleanup on device reset if desired. Reviewed-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-04-27ui: introduce vfio_display_resetTina Zhang
During guest OS reboot, guest framebuffer is invalid. It will cause bugs, if the invalid guest framebuffer is still used by host. This patch is to introduce vfio_display_reset which is invoked during vfio display reset. This vfio_display_reset function is used to release the invalid display resource, disable scanout mode and replace the invalid surface with QemuConsole's DisplaySurafce. This patch can fix the GPU hang issue caused by gd_egl_draw during guest OS reboot. Changes v3->v4: - Move dma-buf based display check into the vfio_display_reset(). (Gerd) Changes v2->v3: - Limit vfio_display_reset to dma-buf based vfio display. (Gerd) Changes v1->v2: - Use dpy_gfx_update_full() update screen after reset. (Gerd) - Remove dpy_gfx_switch_surface(). (Gerd) Signed-off-by: Tina Zhang <tina.zhang@intel.com> Message-id: 1524820266-27079-3-git-send-email-tina.zhang@intel.com Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
2018-03-13ppc/spapr, vfio: Turn off MSIX emulation for VFIO devicesAlexey Kardashevskiy
This adds a possibility for the platform to tell VFIO not to emulate MSIX so MMIO memory regions do not get split into chunks in flatview and the entire page can be registered as a KVM memory slot and make direct MMIO access possible for the guest. This enables the entire MSIX BAR mapping to the guest for the pseries platform in order to achieve the maximum MMIO preformance for certain devices. Tested on: LSI Logic / Symbios Logic SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02) Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-03-13vfio-pci: Allow mmap of MSIX BARAlexey Kardashevskiy
At the moment we unconditionally avoid mapping MSIX data of a BAR and emulate MSIX table in QEMU. However it is 1) not always necessary as a platform may provide a paravirt interface for MSIX configuration; 2) can affect the speed of MMIO access by emulating them in QEMU when frequently accessed registers share same system page with MSIX data, this is particularly a problem for systems with the page size bigger than 4KB. A new capability - VFIO_REGION_INFO_CAP_MSIX_MAPPABLE - has been added to the kernel [1] which tells the userspace that mapping of the MSIX data is possible now. This makes use of it so from now on QEMU tries mapping the entire BAR as a whole and emulate MSIX on top of that. [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a32295c612c57990d17fb0f41e7134394b2f35f6 Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-03-13vfio/display: core & wireupGerd Hoffmann
Infrastructure for display support. Must be enabled using 'display' property. Signed-off-by: Gerd Hoffmann <kraxel@redhat.com> Reviewed By: Kirti Wankhede <kwankhede@nvidia.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-03-06use g_path_get_basename instead of basenameJulia Suvorova
basename(3) and dirname(3) modify their argument and may return pointers to statically allocated memory which may be overwritten by subsequent calls. g_path_get_basename and g_path_get_dirname have no such issues, and therefore more preferable. Signed-off-by: Julia Suvorova <jusual@mail.ru> Message-Id: <1519888086-4207-1-git-send-email-jusual@mail.ru> Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-02-13Merge remote-tracking branch 'remotes/mst/tags/for_upstream' into stagingPeter Maydell
virtio,vhost,pci,pc: features, fixes and cleanups - new stats in virtio balloon - virtio eventfd rework for boot speedup - vhost memory rework for boot speedup - fixes and cleanups all over the place Signed-off-by: Michael S. Tsirkin <mst@redhat.com> # gpg: Signature made Tue 13 Feb 2018 16:29:55 GMT # gpg: using RSA key 281F0DB8D28D5469 # gpg: Good signature from "Michael S. Tsirkin <mst@kernel.org>" # gpg: aka "Michael S. Tsirkin <mst@redhat.com>" # Primary key fingerprint: 0270 606B 6F3C DF3D 0B17 0970 C350 3912 AFBE 8E67 # Subkey fingerprint: 5D09 FD08 71C8 F85B 94CA 8A0D 281F 0DB8 D28D 5469 * remotes/mst/tags/for_upstream: (22 commits) virtio-balloon: include statistics of disk/file caches acpi-test: update FADT lpc: drop pcie host dependency tests: acpi: fix FADT not being compared to reference table hw/pci-bridge: fix pcie root port's IO hints capability libvhost-user: Support across-memory-boundary access libvhost-user: Fix resource leak virtio-balloon: unref the memory region before continuing pci: removed the is_express field since a uniform interface was inserted virtio-blk: enable multiple vectors when using multiple I/O queues pci/bus: let it has higher migration priority pci-bridge/i82801b11: clear bridge registers on platform reset vhost: Move log_dirty check vhost: Merge and delete unused callbacks vhost: Clean out old vhost_set_memory and friends vhost: Regenerate region list from changed sections list vhost: Merge sections added to temporary list vhost: Simplify ring verification checks vhost: Build temporary section list and deref after commit virtio: improve virtio devices initialization time ... Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2018-02-09Move include qemu/option.h from qemu-common.h to actual usersMarkus Armbruster
qemu-common.h includes qemu/option.h, but most places that include the former don't actually need the latter. Drop the include, and add it to the places that actually need it. While there, drop superfluous includes of both headers, and separate #include from file comment with a blank line. This cleanup makes the number of objects depending on qemu/option.h drop from 4545 (out of 4743) to 284 in my "build everything" tree. Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org> Signed-off-by: Markus Armbruster <armbru@redhat.com> Message-Id: <20180201111846.21846-20-armbru@redhat.com> [Semantic conflict with commit bdd6a90a9e in block/nvme.c resolved]
2018-02-08pci: removed the is_express field since a uniform interface was insertedYoni Bettan
according to Eduardo Habkost's commit fd3b02c889 all PCIEs now implement INTERFACE_PCIE_DEVICE so we don't need is_express field anymore. Devices that implements only INTERFACE_PCIE_DEVICE (is_express == 1) or devices that implements only INTERFACE_CONVENTIONAL_PCI_DEVICE (is_express == 0) where not affected by the change. The only devices that were affected are those that are hybrid and also had (is_express == 1) - therefor only: - hw/vfio/pci.c - hw/usb/hcd-xhci.c - hw/xen/xen_pt.c For those 3 I made sure that QEMU_PCI_CAP_EXPRESS is on in instance_init() Reviewed-by: Marcel Apfelbaum <marcel@redhat.com> Reviewed-by: Eduardo Habkost <ehabkost@redhat.com> Signed-off-by: Yoni Bettan <ybettan@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2018-02-06vfio/pci: Add option to disable GeForce quirksAlex Williamson
These quirks are necessary for GeForce, but not for Quadro/GRID/Tesla assignment. Leaving them enabled is fully functional and provides the most compatibility, but due to the unique NVIDIA MSI ACK behavior[1], it also introduces latency in re-triggering the MSI interrupt. This overhead is typically negligible, but has been shown to adversely affect some (very) high interrupt rate applications. This adds the vfio-pci device option "x-no-geforce-quirks=" which can be set to "on" to disable this additional overhead. A follow-on optimization for GeForce might be to make use of an ioeventfd to allow KVM to trigger an irqfd in the kernel vfio-pci driver, avoiding the bounce through userspace to handle this device write. [1] Background: the NVIDIA driver has been observed to issue a write to the MMIO mirror of PCI config space in BAR0 in order to allow the MSI interrupt for the device to retrigger. Older reports indicated a write of 0xff to the (read-only) MSI capability ID register, while more recently a write of 0x0 is observed at config space offset 0x704, non-architected, extended config space of the device (BAR0 offset 0x88704). Virtualization of this range is only required for GeForce. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-02-06vfio/pci: Allow relocating MSI-X MMIOAlex Williamson
Recently proposed vfio-pci kernel changes (v4.16) remove the restriction preventing userspace from mmap'ing PCI BARs in areas overlapping the MSI-X vector table. This change is primarily intended to benefit host platforms which make use of system page sizes larger than the PCI spec recommendation for alignment of MSI-X data structures (ie. not x86_64). In the case of POWER systems, the SPAPR spec requires the VM to program MSI-X using hypercalls, rendering the MSI-X vector table unused in the VM view of the device. However, ARM64 platforms also support 64KB pages and rely on QEMU emulation of MSI-X. Regardless of the kernel driver allowing mmaps overlapping the MSI-X vector table, emulation of the MSI-X vector table also prevents direct mapping of device MMIO spaces overlapping this page. Thanks to the fact that PCI devices have a standard self discovery mechanism, we can try to resolve this by relocating the MSI-X data structures, either by creating a new PCI BAR or extending an existing BAR and updating the MSI-X capability for the new location. There's even a very slim chance that this could benefit devices which do not adhere to the PCI spec alignment guidelines on x86_64 systems. This new x-msix-relocation option accepts the following choices: off: Disable MSI-X relocation, use native device config (default) auto: Use a known good combination for the platform/device (none yet) bar0..bar5: Specify the target BAR for MSI-X data structures If compatible, the target BAR will either be created or extended and the new portion will be used for MSI-X emulation. The first obvious user question with this option is how to determine whether a given platform and device might benefit from this option. In most cases, the answer is that it won't, especially on x86_64. Devices often dedicate an entire BAR to MSI-X and therefore no performance sensitive registers overlap the MSI-X area. Take for example: # lspci -vvvs 0a:00.0 0a:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection ... Region 0: Memory at db680000 (32-bit, non-prefetchable) [size=512K] Region 3: Memory at db7f8000 (32-bit, non-prefetchable) [size=16K] ... Capabilities: [70] MSI-X: Enable+ Count=10 Masked- Vector table: BAR=3 offset=00000000 PBA: BAR=3 offset=00002000 This device uses the 16K bar3 for MSI-X with the vector table at offset zero and the pending bits arrary at offset 8K, fully honoring the PCI spec alignment guidance. The data sheet specifically refers to this as an MSI-X BAR. This device would not see a benefit from MSI-X relocation regardless of the platform, regardless of the page size. However, here's another example: # lspci -vvvs 02:00.0 02:00.0 Serial Attached SCSI controller: xxxxxxxx ... Region 0: I/O ports at c000 [size=256] Region 1: Memory at ef640000 (64-bit, non-prefetchable) [size=64K] Region 3: Memory at ef600000 (64-bit, non-prefetchable) [size=256K] ... Capabilities: [c0] MSI-X: Enable+ Count=16 Masked- Vector table: BAR=1 offset=0000e000 PBA: BAR=1 offset=0000f000 Here the MSI-X data structures are placed on separate 4K pages at the end of a 64KB BAR. If our host page size is 4K, we're likely fine, but at 64KB page size, MSI-X emulation at that location prevents the entire BAR from being directly mapped into the VM address space. Overlapping performance sensitive registers then starts to be a very likely scenario on such a platform. At this point, the user could enable tracing on vfio_region_read and vfio_region_write to determine more conclusively if device accesses are being trapped through QEMU. Upon finding a device and platform in need of MSI-X relocation, the next problem is how to choose target PCI BAR to host the MSI-X data structures. A few key rules to keep in mind for this selection include: * There are only 6 BAR slots, bar0..bar5 * 64-bit BARs occupy two BAR slots, 'lspci -vvv' lists the first slot * PCI BARs are always a power of 2 in size, extending == doubling * The maximum size of a 32-bit BAR is 2GB * MSI-X data structures must reside in an MMIO BAR Using these rules, we can evaluate each BAR of the second example device above as follows: bar0: I/O port BAR, incompatible with MSI-X tables bar1: BAR could be extended, incurring another 64KB of MMIO bar2: Unavailable, bar1 is 64-bit, this register is used by bar1 bar3: BAR could be extended, incurring another 256KB of MMIO bar4: Unavailable, bar3 is 64bit, this register is used by bar3 bar5: Available, empty BAR, minimum additional MMIO A secondary optimization we might wish to make in relocating MSI-X is to minimize the additional MMIO required for the device, therefore we might test the available choices in order of preference as bar5, bar1, and finally bar3. The original proposal for this feature included an 'auto' option which would choose bar5 in this case, but various drivers have been found that make assumptions about the properties of the "first" BAR or the size of BARs such that there appears to be no foolproof automatic selection available, requiring known good combinations to be sourced from users. This patch is pre-enabled for an 'auto' selection making use of a validated lookup table, but no entries are yet identified. Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Eric Auger <eric.auger@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-02-06vfio/pci: Emulate BARsAlex Williamson
The kernel provides similar emulation of PCI BAR register access to QEMU, so up until now we've used that for things like BAR sizing and storing the BAR address. However, if we intend to resize BARs or add BARs that don't exist on the physical device, we need to switch to the pure QEMU emulation of the BAR. Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Eric Auger <eric.auger@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-02-06vfio/pci: Add base BAR MemoryRegionAlex Williamson
Add one more layer to our stack of MemoryRegions, this base region allows us to register BARs independently of the vfio region or to extend the size of BARs which do map to a region. This will be useful when we want hypervisor defined BARs or sections of BARs, for purposes such as relocating MSI-X emulation. We therefore call msix_init() based on this new base MemoryRegion, while the quirks, which only modify regions still operate on those sub-MemoryRegions. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-12-05pci: Eliminate redundant PCIDevice::bus pointerDavid Gibson
The bus pointer in PCIDevice is basically redundant with QOM information. It's always initialized to the qdev_get_parent_bus(), the only difference is the type. Therefore this patch eliminates the field, instead creating a pci_get_bus() helper to do the type mangling to derive it conveniently from the QOM Device object underneath. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Eduardo Habkost <ehabkost@redhat.com> Reviewed-by: Marcel Apfelbaum <marcel@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com>
2017-10-15pci: Add interface names to hybrid PCI devicesEduardo Habkost
The following devices support both PCI Express and Conventional PCI, by including special code to handle the QEMU_PCI_CAP_EXPRESS flag and/or conditional pcie_endpoint_cap_init() calls: * vfio-pci (is_express=1, but legacy PCI handled by vfio_populate_device()) * vmxnet3 (is_express=0, but PCIe handled by vmxnet3_realize()) * pvscsi (is_express=0, but PCIe handled by pvscsi_realize()) * virtio-pci (is_express=0, but PCIe handled by virtio_pci_dc_realize(), and additional legacy PCI code at virtio_pci_realize()) * base-xhci (is_express=1, but pcie_endpoint_cap_init() call is conditional on pci_bus_is_express(dev->bus) * Note that xhci does not clear QEMU_PCI_CAP_EXPRESS like the other hybrid devices Cc: Dmitry Fleytman <dmitry@daynix.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Gerd Hoffmann <kraxel@redhat.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Signed-off-by: Eduardo Habkost <ehabkost@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Marcel Apfelbaum <marcel@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2017-10-03vfio/pci: Add NVIDIA GPUDirect Cliques supportAlex Williamson
NVIDIA has defined a specification for creating GPUDirect "cliques", where devices with the same clique ID support direct peer-to-peer DMA. When running on bare-metal, tools like NVIDIA's p2pBandwidthLatencyTest (part of cuda-samples) determine which GPUs can support peer-to-peer based on chipset and topology. When running in a VM, these tools have no visibility to the physical hardware support or topology. This option allows the user to specify hints via a vendor defined capability. For instance: <qemu:commandline> <qemu:arg value='-set'/> <qemu:arg value='device.hostdev0.x-nv-gpudirect-clique=0'/> <qemu:arg value='-set'/> <qemu:arg value='device.hostdev1.x-nv-gpudirect-clique=1'/> <qemu:arg value='-set'/> <qemu:arg value='device.hostdev2.x-nv-gpudirect-clique=1'/> </qemu:commandline> This enables two cliques. The first is a singleton clique with ID 0, for the first hostdev defined in the XML (note that since cliques define peer-to-peer sets, singleton clique offer no benefit). The subsequent two hostdevs are both added to clique ID 1, indicating peer-to-peer is possible between these devices. QEMU only provides validation that the clique ID is valid and applied to an NVIDIA graphics device, any validation that the resulting cliques are functional and valid is the user's responsibility. The NVIDIA specification allows a 4-bit clique ID, thus valid values are 0-15. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-10-03vfio/pci: Add virtual capabilities quirk infrastructureAlex Williamson
If the hypervisor needs to add purely virtual capabilties, give us a hook through quirks to do that. Note that we determine the maximum size for a capability based on the physical device, if we insert a virtual capability, that can change. Therefore if maximum size is smaller after added virt capabilities, use that. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-10-03vfio/pci: Do not unwind on errorAlex Williamson
If vfio_add_std_cap() errors then going to out prepends irrelevant errors for capabilities we haven't attempted to add as we unwind our recursive stack. Just return error. Fixes: 7ef165b9a8d9 ("vfio/pci: Pass an error object to vfio_add_capabilities") Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-07-26vfio/pci: fix use of freed memoryPhilippe Mathieu-Daudé
hw/vfio/pci.c:308:29: warning: Use of memory after it is freed qemu_set_fd_handler(*pfd, NULL, NULL, vdev); ^~~~ Reported-by: Clang Static Analyzer Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-07-10vfio/pci: Fixup v0 PCIe capabilitiesAlex Williamson
Intel 82599 VFs report a PCIe capability version of 0, which is invalid. The earliest version of the PCIe spec used version 1. This causes Windows to fail startup on the device and it will be disabled with error code 10. Our choices are either to drop the PCIe cap on such devices, which has the side effect of likely preventing the guest from discovering any extended capabilities, or performing a fixup to update the capability to the earliest valid version. This implements the latter. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-07-10vfio: Test realized when using VFIOGroup.device_list iteratorAlex Williamson
VFIOGroup.device_list is effectively our reference tracking mechanism such that we can teardown a group when all of the device references are removed. However, we also use this list from our machine reset handler for processing resets that affect multiple devices. Generally device removals are fully processed (exitfn + finalize) when this reset handler is invoked, however if the removal is triggered via another reset handler (piix4_reset->acpi_pcihp_reset) then the device exitfn may run, but not finalize. In this case we hit asserts when we start trying to access PCI helpers since much of the PCI state of the device is released. To resolve this, add a pointer to the Object DeviceState in our common base-device and skip non-realized devices as we iterate. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-07-03pci: Replace pci_add_capability2() with pci_add_capability()Mao Zhongyi
After the patch 'Make errp the last parameter of pci_add_capability()', pci_add_capability() and pci_add_capability2() now do exactly the same. So drop the wrapper pci_add_capability() of pci_add_capability2(), then replace the pci_add_capability2() with pci_add_capability() everywhere. Cc: pbonzini@redhat.com Cc: rth@twiddle.net Cc: ehabkost@redhat.com Cc: mst@redhat.com Cc: dmitry@daynix.com Cc: jasowang@redhat.com Cc: marcel@redhat.com Cc: alex.williamson@redhat.com Cc: armbru@redhat.com Suggested-by: Eduardo Habkost <ehabkost@redhat.com> Signed-off-by: Mao Zhongyi <maozy.fnst@cn.fujitsu.com> Reviewed-by: Marcel Apfelbaum <marcel@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2017-07-03pci: Make errp the last parameter of pci_add_capability()Mao Zhongyi
Add Error argument for pci_add_capability() to leverage the errp to pass info on errors. This way is helpful for its callers to make a better error handling when moving to 'realize'. Cc: pbonzini@redhat.com Cc: rth@twiddle.net Cc: ehabkost@redhat.com Cc: mst@redhat.com Cc: jasowang@redhat.com Cc: marcel@redhat.com Cc: alex.williamson@redhat.com Cc: armbru@redhat.com Signed-off-by: Mao Zhongyi <maozy.fnst@cn.fujitsu.com> Reviewed-by: Marcel Apfelbaum <marcel@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2017-07-03pci: Fix the wrong assertion.Mao Zhongyi
pci_add_capability returns a strictly positive value on success, correct asserts. Cc: dmitry@daynix.com Cc: jasowang@redhat.com Cc: kraxel@redhat.com Cc: alex.williamson@redhat.com Cc: armbru@redhat.com Cc: marcel@redhat.com Signed-off-by: Mao Zhongyi <maozy.fnst@cn.fujitsu.com> Reviewed-by: Marcel Apfelbaum <marcel@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2017-05-03vfio/pci: Fix incorrect error messageDong Jia Shi
When the "No host device provided" error occurs, the hint message that starts with "Use -vfio-pci," makes no sense, since "-vfio-pci" is not a valid command line parameter. Correct this by replacing "-vfio-pci" with "-device vfio-pci". Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>