path: root/hw/vfio
Age | Commit message | Author
2022-05-24 | hw/vfio/pci-quirks: Resolve redundant property getters | Bernhard Beschow
The QOM API already provides getters for uint64 and uint32 values, so reuse them. Signed-off-by: Bernhard Beschow <shentey@gmail.com> Reviewed-by: Alistair Francis <alistair.francis@wdc.com> Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org> Message-Id: <20220301225220.239065-2-shentey@gmail.com> Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
2022-05-13 | linux-headers: Update to v5.18-rc6 | Alex Williamson
Update to c5eb0a61238d ("Linux 5.18-rc6"). Mechanical search and replace of vfio defines with white space massaging. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2022-05-06 | vfio/common: Rename VFIOGuestIOMMU::iommu into ::iommu_mr | Yi Liu
Rename the VFIOGuestIOMMU iommu field to iommu_mr, making it clearer that it is an IOMMU memory region. No functional change intended. Signed-off-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20220502094223.36384-4-yi.l.liu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2022-05-06 | vfio/pci: Use vbasedev local variable in vfio_realize() | Eric Auger
Use a VFIODevice local variable to improve code readability. No functional change intended. Signed-off-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20220502094223.36384-3-yi.l.liu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2022-05-06 | hw/vfio/pci: fix vfio_pci_hot_reset_result trace point | Eric Auger
The "%m" format specifier is not interpreted by the trace infrastructure, so "%m" is output instead of the actual errno string. Fix it by outputting strerror(errno). Signed-off-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20220502094223.36384-2-yi.l.liu@intel.com [aw: replace commit log as provided by Eric] Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2022-05-06 | vfio/common: remove spurious tpm-crb-cmd misalignment warning | Eric Auger
The CRB command buffer currently is a RAM MemoryRegion and, given its base address alignment, it causes an error report in vfio_listener_region_add(). This region could have been a RAM device region, easing the detection of such a safe situation, but that option was not well received. So let's add a helper function that uses the memory region owner type to detect that the situation is safe with respect to the assignment. Other device types can be checked here if this kind of problem occurs again. Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org> Acked-by: Stefan Berger <stefanb@linux.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Link: https://lore.kernel.org/r/20220506132510.1847942-3-eric.auger@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2022-05-06 | vfio/common: Fix a small boundary issue of a trace | Xiang Chen
Most places in the vfio trace code (such as trace_vfio_region_region_mmap()) use [offset, offset + size - 1] to indicate a range of length size, except trace_vfio_region_sparse_mmap_entry(). So change it for trace_vfio_region_sparse_mmap_entry(); however, if size is zero the trace would underflow, so move the trace and emit it only if size is not zero. Signed-off-by: Xiang Chen <chenxiang66@hisilicon.com> Link: https://lore.kernel.org/r/1650100104-130737-1-git-send-email-chenxiang66@hisilicon.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
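A minimal sketch of the adjusted trace, assuming the usual sparse-mmap area loop:
    if (sparse->areas[i].size) {
        /* Report an inclusive range; skip zero-sized areas to avoid the
         * (offset + size - 1) underflow. */
        trace_vfio_region_sparse_mmap_entry(sparse->areas[i].offset,
                                            sparse->areas[i].offset +
                                            sparse->areas[i].size - 1);
    }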
2022-05-06 | vfio: defer to commit kvm irq routing when enable msi/msix | Longpeng(Mike)
In the migration resume phase, all unmasked MSI-X vectors need to be set up when loading the VF state. However, the setup takes longer if the VM has more VFs and each VF has more unmasked vectors. The hot spot is kvm_irqchip_commit_routes: it scans and updates all already-assigned irqfds on each invocation, so more vectors mean more time to process them.
  vfio_pci_load_config
    vfio_msix_enable
      msix_set_vector_notifiers
        for (vector = 0; vector < dev->msix_entries_nr; vector++) {
          vfio_msix_vector_do_use
            vfio_add_kvm_msi_virq
              kvm_irqchip_commit_routes <-- expensive
        }
We can reduce the cost by committing only once, outside the loop. The routes are cached in kvm_state; we commit them first and then bind an irqfd for each vector. The test VM has 128 vcpus and 8 VFs (each with 65 vectors); measuring the cost of vfio_msix_enable for each VF shows that 90+% of the cost can be reduced.
  VF      Count of irqfds[*]   Original (ms)   With this patch (ms)
  1st      65                        8                  2
  2nd     130                       15                  2
  3rd     195                       22                  2
  4th     260                       24                  3
  5th     325                       36                  2
  6th     390                       44                  3
  7th     455                       51                  3
  8th     520                       58                  4
  Total                            258                 21
  [*] Count of irqfds: how many already-assigned irqfds need to be processed in this round.
The optimization can be applied to the MSI type too. Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com> Link: https://lore.kernel.org/r/20220326060226.1892-6-longpeng2@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
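A hedged sketch of the batching idea; the two-pass loop structure here is illustrative and not the exact hunks of the patch:
    /* Pass 1: allocate and cache the MSI routes without committing them. */
    for (i = 0; i < vdev->nr_vectors; i++) {
        vdev->msi_vectors[i].virq =
            kvm_irqchip_add_msi_route(kvm_state, i, &vdev->pdev);
    }
    /* Commit the cached routing table once for all vectors. */
    kvm_irqchip_commit_routes(kvm_state);
    /* Pass 2: bind one irqfd per vector against the committed routes. */
    for (i = 0; i < vdev->nr_vectors; i++) {
        kvm_irqchip_add_irqfd_notifier_gsi(kvm_state,
                                           &vdev->msi_vectors[i].kvm_interrupt,
                                           NULL, vdev->msi_vectors[i].virq);
    }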
2022-05-06Revert "vfio: Avoid disabling and enabling vectors repeatedly in VFIO migration"Longpeng(Mike)
Commit ecebe53fe993 ("vfio: Avoid disabling and enabling vectors repeatedly in VFIO migration") avoids inefficiently disabling and enabling vectors repeatedly and lets the unmasked vectors be enabled one by one. But we want to batch multiple routes and defer the commit, and only commit once outside the loop of setting vector notifiers, so we cannot enable the vectors one by one in the loop now. Revert that commit and we will take another way in the next patch, it can not only avoid disabling/enabling vectors repeatedly, but also satisfy our requirement of defer to commit. Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com> Link: https://lore.kernel.org/r/20220326060226.1892-5-longpeng2@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2022-05-06 | vfio: simplify the failure path in vfio_msi_enable | Longpeng(Mike)
Use vfio_msi_disable_common to simplify the error handling in vfio_msi_enable. Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com> Link: https://lore.kernel.org/r/20220326060226.1892-4-longpeng2@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2022-05-06 | vfio: move re-enabling INTX out of the common helper | Longpeng(Mike)
Move re-enabling INTX out of the common helper; the callers should decide whether to re-enable it. Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com> Link: https://lore.kernel.org/r/20220326060226.1892-3-longpeng2@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2022-05-06 | vfio: simplify the conditional statements in vfio_msi_enable | Longpeng(Mike)
It's unnecessary to test against the specific return value of VFIO_DEVICE_SET_IRQS, since any positive return is an error indicating the number of vectors we should retry with. Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com> Link: https://lore.kernel.org/r/20220326060226.1892-2-longpeng2@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
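A minimal sketch of the simplified retry check (local variable names are illustrative):
    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
    if (ret > 0) {
        /* Any positive return value is the number of vectors to retry with;
         * no need to compare against a specific value. */
        vdev->nr_vectors = ret;
        /* ... retry the enable path with the reduced vector count ... */
    }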
2022-04-06 | Replace qemu_real_host_page variables with inlined functions | Marc-André Lureau
Replace the global variables with inlined helper functions. getpagesize() is very likely annotated with a "const" function attribute (at least with glibc), and thus optimization should apply even better. This avoids the need for a constructor initialization too. Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com> Message-Id: <20220323155743.1585078-12-marcandre.lureau@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
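Roughly, the replacement helpers look like this (a simplified sketch of the idea, not the exact QEMU definitions):
    static inline uintptr_t qemu_real_host_page_size(void)
    {
        return getpagesize();   /* declared in <unistd.h> */
    }

    static inline intptr_t qemu_real_host_page_mask(void)
    {
        return -(intptr_t)qemu_real_host_page_size();
    }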
2022-03-21 | Use g_new() & friends where that makes obvious sense | Markus Armbruster
g_new(T, n) is neater than g_malloc(sizeof(T) * n). It's also safer, for two reasons. One, it catches multiplication overflowing size_t. Two, it returns T * rather than void *, which lets the compiler catch more type errors. This commit only touches allocations with size arguments of the form sizeof(T). Patch created mechanically with: $ spatch --in-place --sp-file scripts/coccinelle/use-g_new-etc.cocci \ --macro-file scripts/cocci-macro-file.h FILES... Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org> Reviewed-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Acked-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Message-Id: <20220315144156.1595462-4-armbru@redhat.com> Reviewed-by: Pavel Dovgalyuk <Pavel.Dovgalyuk@ispras.ru>
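For illustration, the two spellings side by side (the variable names are hypothetical):
    VFIORegion *regions;

    regions = g_malloc(sizeof(VFIORegion) * nr_regions);   /* before */
    regions = g_new(VFIORegion, nr_regions);               /* after: overflow-checked, typed */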
2022-03-15 | kvm/msi: do explicit commit when adding msi routes | Longpeng(Mike)
We invoke kvm_irqchip_commit_routes() for each addition to the MSI route table, which is not efficient if we are adding lots of routes in some cases. This patch lets the callers invoke kvm_irqchip_commit_routes(), so they can decide how to optimize. [1] https://lists.gnu.org/archive/html/qemu-devel/2021-11/msg00967.html Signed-off-by: Longpeng <longpeng2@huawei.com> Message-Id: <20220222141116.2091-3-longpeng2@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-21 | Mark remaining global TypeInfo instances as const | Bernhard Beschow
More than 1k of TypeInfo instances are already marked as const. Mark the remaining ones, too. This commit was created with: git grep -z -l 'static TypeInfo' -- '*.c' | \ xargs -0 sed -i 's/static TypeInfo/static const TypeInfo/' Signed-off-by: Bernhard Beschow <shentey@gmail.com> Reviewed-by: Alistair Francis <alistair.francis@wdc.com> Reviewed-by: Thomas Huth <thuth@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org> Reviewed-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: Igor Mammedov <imammedo@redhat.com> Acked-by: Corey Minyard <cminyard@mvista.com> Message-id: 20220117145805.173070-2-shentey@gmail.com Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
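The resulting pattern, shown with a hypothetical device type:
    static const TypeInfo foo_device_info = {
        .name          = "foo-device",
        .parent        = TYPE_DEVICE,
        .instance_size = sizeof(FooDeviceState),
        .class_init    = foo_device_class_init,
    };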
2021-11-17 | vfio: Fix memory leak of hostwin | Peng Liang
hostwin is allocated and added to hostwin_list in vfio_host_win_add, but it is only deleted from hostwin_list in vfio_host_win_del, which causes a memory leak. Also, freeing all elements of hostwin_list is missing in vfio_disconnect_container. Fixes: 2e4109de8e58 ("vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2)") CC: qemu-stable@nongnu.org Signed-off-by: Peng Liang <liangpeng10@huawei.com> Link: https://lore.kernel.org/r/20211117014739.1839263-1-liangpeng10@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
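A hedged sketch of the container-teardown part of the fix, assuming hostwin_list is a QLIST linked through a hostwin_next field:
    VFIOHostDMAWindow *hostwin, *next;

    QLIST_FOREACH_SAFE(hostwin, &container->hostwin_list, hostwin_next, next) {
        QLIST_REMOVE(hostwin, hostwin_next);
        g_free(hostwin);
    }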
2021-11-01 | vfio/common: Add a trace point when a MMIO RAM section cannot be mapped | Kunkun Jiang
The MSI-X structures of some devices and other non-MSI-X structures may be in the same BAR. They may share one host page, especially with a large page granularity such as 64K. For example, the MSI-X table size of the 82599 NIC is 0x30 and its offset in BAR 3 (size 64KB) is 0x0. vfio_listener_region_add() will be called to map the remaining range (0x30-0xffff). If the host page size is 64KB, it will return early at 'int128_ge(int128_make64(iova), llend)' without any message. Let's add a trace point to inform users, like commit 5c08600547c0 ("vfio: Use a trace point when a RAM section cannot be DMA mapped") did. Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com> Link: https://lore.kernel.org/r/20211027090406.761-3-jiangkunkun@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2021-11-01 | vfio/pci: Add support for mmapping sub-page MMIO BARs after live migration | Kunkun Jiang
We can expand the MemoryRegions of sub-page MMIO BARs in vfio_pci_write_config() to improve IO performance for some devices. However, the MemoryRegions of the destination VM are not expanded any more after live migration, because their addresses have been updated in vmstate_load_state() (vfio_pci_load_config) and vfio_sub_page_bar_update_mapping() will not be called. This may result in poor performance after live migration. So iterate over the BARs in vfio_pci_load_config() and try to update the sub-page BARs. Reported-by: Nianyao Tang <tangnianyao@huawei.com> Reported-by: Qixin Gan <ganqixin@huawei.com> Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com> Link: https://lore.kernel.org/r/20211027090406.761-2-jiangkunkun@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2021-10-15 | qdev: Base object creation on QDict rather than QemuOpts | Kevin Wolf
QDicts are both what QMP natively uses and what the keyval parser produces. Going through QemuOpts isn't useful for either one, so switch the main device creation function to QDicts. By sharing more code with the -object/object-add code path, we can even reduce the code size a bit. This commit doesn't remove the detour through QemuOpts from any code path yet, but it allows the following commits to do so. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Message-Id: <20211008133442.141332-15-kwolf@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Peter Krempa <pkrempa@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-09-30 | memory: Name all the memory listeners | Peter Xu
Provide a name field for all the memory listeners. It can be used to identify which memory listener is which. Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Message-Id: <20210817013553.30584-2-peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 | memory: Add RAM_PROTECTED flag to skip IOMMU mappings | Sean Christopherson
Add a new RAMBlock flag to denote "protected" memory, i.e. memory that looks and acts like RAM but is inaccessible via normal mechanisms, including DMA. Use the flag to skip protected memory regions when mapping RAM for DMA in VFIO. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Yang Zhong <yang.zhong@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
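A hedged sketch of how the new flag might be consulted when deciding whether to DMA-map a section; the surrounding function and return convention are illustrative:
    if (memory_region_is_protected(section->mr)) {
        /* Protected RAM looks like RAM but is not accessible for DMA,
         * so do not create a VFIO mapping for it. */
        return true;   /* treat the section as skipped */
    }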
2021-09-16 | hw/vfio: Fix typo in comments | Cai Huoqing
Fix typo in comments: *programatically ==> programmatically *disconecting ==> disconnecting *mulitple ==> multiple *timout ==> timeout *regsiter ==> register *forumula ==> formula Signed-off-by: Cai Huoqing <caihuoqing@baidu.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-Id: <20210730012613.2198-1-caihuoqing@baidu.com> Signed-off-by: Laurent Vivier <laurent@vivier.eu>
2021-09-06 | vfio-ccw: forward halt/clear errors | Cornelia Huck
hsch and csch basically have two parts: execute the command, and perform the halt/clear function. For fully emulated subchannels, it is pretty clear how it will work: check the subchannel state, and actually 'perform the halt/clear function' and set cc 0 if everything looks good. For passthrough subchannels, some of the checking is done within QEMU, but some has to be done within the kernel. QEMU's subchannel state may be such that we can perform the async function, but the kernel may still get a cc != 0 when it is actually executing the instruction. In that case, we need to set the condition actually encountered by the kernel; if we set cc 0 on error, we would actually need to inject an interrupt as well. Signed-off-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com> Tested-by: Jared Rossi <jrossi@linux.ibm.com> Message-Id: <20210705163952.736020-2-cohuck@redhat.com> Signed-off-by: Thomas Huth <thuth@redhat.com>
2021-08-26 | vfio: Avoid error_propagate() after migrate_add_blocker() | Markus Armbruster
When migrate_add_blocker(blocker, &err) is followed by error_propagate(errp, err), we can often just as well do migrate_add_blocker(..., errp). This is the case in vfio_migration_probe(). Prior art: commit 386f6c07d2 "error: Avoid error_propagate() after migrate_add_blocker()". Cc: Kirti Wankhede <kwankhede@nvidia.com> Cc: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Markus Armbruster <armbru@redhat.com> Message-Id: <20210720125408.387910-8-armbru@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com> Reviewed by: Kirti Wankhede <kwankhede@nvidia.com> Acked-by: Michael S. Tsirkin <mst@redhat.com>
2021-07-14 | vfio/pci: Add pba_offset PCI quirk for BAIDU KUNLUN AI processor | Cai Huoqing
Fix the pba_offset initialization value for the BAIDU KUNLUN Virtual Function device. The KUNLUN hardware returns an incorrect value for the VF PBA offset, so add a quirk that instead returns a hardcoded value of 0xb400. Signed-off-by: Cai Huoqing <caihuoqing@baidu.com> Link: https://lore.kernel.org/r/20210713093743.942-1-caihuoqing@baidu.com [aw: comment & whitespace tuning] Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2021-07-14 | vfio/pci: Change to use vfio_pci_is() | Cai Huoqing
Make use of vfio_pci_is() helper function. Signed-off-by: Cai Huoqing <caihuoqing@baidu.com> Link: https://lore.kernel.org/r/20210713014831.742-1-caihuoqing@baidu.com [aw: commit log wording] Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2021-07-14 | vfio: Fix CID 1458134 in vfio_register_ram_discard_listener() | David Hildenbrand
CID 1458134: Integer handling issues (BAD_SHIFT) In expression "1 << ctz64(container->pgsizes)", left shifting by more than 31 bits has undefined behavior. The shift amount, "ctz64(container->pgsizes)", is 64. Commit 5e3b981c330c ("vfio: Support for RamDiscardManager in the !vIOMMU case") added an assertion that our granularity is at least as big as the page size. Although unlikely, we could have a page size that does not fit into 32 bit. In that case, we'd try shifting by more than 31 bit. Let's use 1ULL instead and make sure we're not shifting by more than 63 bit by asserting that any bit in container->pgsizes is set. Fixes: CID 1458134 Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Eduardo Habkost <ehabkost@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Dr. David Alan Gilbert <dgilbert@redhat.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Auger Eric <eric.auger@redhat.com> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Cc: teawater <teawaterz@linux.alibaba.com> Cc: Marek Kedzierski <mkedzier@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Igor Mammedov <imammedo@redhat.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com> Link: https://lore.kernel.org/r/20210712083135.15755-1-david@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
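A minimal sketch of the fix described above:
    /* container->pgsizes must have at least one bit set, and 1ULL keeps the
     * shift well-defined even when ctz64() returns 32 or more. */
    g_assert(container->pgsizes != 0);
    uint64_t granularity = 1ULL << ctz64(container->pgsizes);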
2021-07-12 | Merge remote-tracking branch 'remotes/cohuck-gitlab/tags/s390x-20210708' into staging | Peter Maydell
s390x updates:
  - add gen16 cpumodels
  - refactor/cleanup some code
  - bugfixes
# gpg: Signature made Thu 08 Jul 2021 12:26:21 BST
# gpg: using EDDSA key 69A3B536F5CBFC65208026C1DE88BB5641DE66C1
# gpg: issuer "cohuck@redhat.com"
# gpg: Good signature from "Cornelia Huck <conny@cornelia-huck.de>" [unknown]
# gpg: aka "Cornelia Huck <huckc@linux.vnet.ibm.com>" [full]
# gpg: aka "Cornelia Huck <cornelia.huck@de.ibm.com>" [full]
# gpg: aka "Cornelia Huck <cohuck@kernel.org>" [unknown]
# gpg: aka "Cornelia Huck <cohuck@redhat.com>" [unknown]
# Primary key fingerprint: C3D0 D66D C362 4FF6 A8C0 18CE DECF 6B93 C6F0 2FAF
# Subkey fingerprint: 69A3 B536 F5CB FC65 2080 26C1 DE88 BB56 41DE 66C1
* remotes/cohuck-gitlab/tags/s390x-20210708:
  target/s390x: split sysemu part of cpu models
  target/s390x: move kvm files into kvm/
  target/s390x: remove kvm-stub.c
  target/s390x: use kvm_enabled() to wrap call to kvm_s390_get_hpage_1m
  target/s390x: make helper.c sysemu-only
  target/s390x: split cpu-dump from helper.c
  target/s390x: move sysemu-only code out to cpu-sysemu.c
  target/s390x: start moving TCG-only code to tcg/
  target/s390x: rename internal.h to s390x-internal.h
  target/s390x: remove tcg-stub.c
  hw/s390x: only build tod-tcg from the CONFIG_TCG build
  hw/s390x: tod: make explicit checks for accelerators when initializing
  hw/s390x: rename tod-qemu.c to tod-tcg.c
  target/s390x: meson: add target_user_arch
  s390x/tcg: Fix m5 vs. m4 field for VECTOR MULTIPLY SUM LOGICAL
  target/s390x: Fix CC set by CONVERT TO FIXED/LOGICAL
  s390x/cpumodel: add 3931 and 3932
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2021-07-08 | vfio: Disable only uncoordinated discards for VFIO_TYPE1 iommus | David Hildenbrand
We support coordinated discarding of RAM using the RamDiscardManager for the VFIO_TYPE1 iommus. Let's unlock support for coordinated discards, keeping uncoordinated discards (e.g., via virtio-balloon) disabled if possible. This unlocks virtio-mem + vfio on x86-64. Note that vfio used via "nvme://" by the block layer has to be implemented/unlocked separately. For now, virtio-mem only supports x86-64; we don't restrict RamDiscardManager to x86-64, though: arm64 and s390x are supposed to work as well, and we'll test once unlocking virtio-mem support. The spapr IOMMUs will need special care, to be tackled later, e.g., once supporting virtio-mem. Note: The block size of a virtio-mem device has to be set to a sane size, depending on the maximum hotplug size, to not run out of vfio mappings. The default virtio-mem block size is usually in the range of a couple of MBs. The maximum number of mappings is 64k, shared with other users. Assume you want to hotplug 256GB using virtio-mem - the block size would have to be set to at least 8 MiB (resulting in 32768 separate mappings). Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Dr. David Alan Gilbert <dgilbert@redhat.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Auger Eric <eric.auger@redhat.com> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Cc: teawater <teawaterz@linux.alibaba.com> Cc: Marek Kedzierski <mkedzier@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20210413095531.25603-14-david@redhat.com> Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
2021-07-08 | vfio: Support for RamDiscardManager in the vIOMMU case | David Hildenbrand
vIOMMU support works already with RamDiscardManager as long as guests only map populated memory. Both populated and discarded memory are mapped into &address_space_memory, where vfio_get_xlat_addr() will find that memory, to create the vfio mapping. Sane guests will never map discarded memory (e.g., unplugged memory blocks in virtio-mem) into an IOMMU - or keep it mapped into an IOMMU while memory is getting discarded. However, there are two cases where a malicious guest could trigger pinning of more memory than intended. One case is easy to handle: the guest trying to map discarded memory into an IOMMU. The other case is harder to handle: the guest keeping memory mapped in the IOMMU while it is getting discarded. We would have to walk over all mappings when discarding memory and identify if any mapping would be a violation. Let's keep it simple for now and print a warning, indicating that setting RLIMIT_MEMLOCK can mitigate such attacks. We have to take care of incoming migration: at the point the IOMMUs get restored and start creating mappings in vfio, RamDiscardManager implementations might not be back up and running yet: let's add runstate priorities to enforce the order when restoring. Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Dr. David Alan Gilbert <dgilbert@redhat.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Auger Eric <eric.auger@redhat.com> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Cc: teawater <teawaterz@linux.alibaba.com> Cc: Marek Kedzierski <mkedzier@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20210413095531.25603-10-david@redhat.com> Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
2021-07-08 | vfio: Sanity check maximum number of DMA mappings with RamDiscardManager | David Hildenbrand
Although RamDiscardManager can handle running into the maximum number of DMA mappings by propagating errors when creating a DMA mapping, we want to sanity check and warn the user early that there is a theoretical setup issue and that virtio-mem might not be able to provide as much memory towards a VM as desired. As suggested by Alex, let's use the number of KVM memory slots to guess how many other mappings we might see over time. Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Dr. David Alan Gilbert <dgilbert@redhat.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Auger Eric <eric.auger@redhat.com> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Cc: teawater <teawaterz@linux.alibaba.com> Cc: Marek Kedzierski <mkedzier@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20210413095531.25603-9-david@redhat.com> Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
2021-07-08 | vfio: Query and store the maximum number of possible DMA mappings | David Hildenbrand
Let's query the maximum number of possible DMA mappings by querying the available mappings when creating the container (before any mappings are created). We'll use this information soon to perform some sanity checks and warn the user. Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Dr. David Alan Gilbert <dgilbert@redhat.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Auger Eric <eric.auger@redhat.com> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Cc: teawater <teawaterz@linux.alibaba.com> Cc: Marek Kedzierski <mkedzier@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20210413095531.25603-8-david@redhat.com> Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
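A hedged sketch of reading the value from the VFIO_IOMMU_GET_INFO capability chain; the helper name and the dma_max_mappings field are illustrative, while the UAPI structures and defines come from linux-headers:
    /* Walk the capability chain returned by VFIO_IOMMU_GET_INFO and record
     * how many DMA mappings are still available. */
    static void vfio_query_dma_avail(VFIOContainer *container,
                                     struct vfio_iommu_type1_info *info)
    {
        uint32_t off = info->cap_offset;

        if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
            return;
        }
        while (off) {
            struct vfio_info_cap_header *hdr =
                (struct vfio_info_cap_header *)((uint8_t *)info + off);

            if (hdr->id == VFIO_IOMMU_TYPE1_INFO_DMA_AVAIL) {
                struct vfio_iommu_type1_info_dma_avail *cap = (void *)hdr;

                container->dma_max_mappings = cap->avail;
                return;
            }
            off = hdr->next;
        }
    }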
2021-07-08 | vfio: Support for RamDiscardManager in the !vIOMMU case | David Hildenbrand
Implement support for RamDiscardManager, to prepare for virtio-mem support. Instead of mapping the whole memory section, we only map "populated" parts and update the mapping when notified about discarding/population of memory via the RamDiscardListener. Similarly, when syncing the dirty bitmaps, sync only the actually mapped (populated) parts by replaying via the notifier. Using virtio-mem with vfio is still blocked via ram_block_discard_disable()/ram_block_discard_require() after this patch. Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Dr. David Alan Gilbert <dgilbert@redhat.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Auger Eric <eric.auger@redhat.com> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Cc: teawater <teawaterz@linux.alibaba.com> Cc: Marek Kedzierski <mkedzier@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20210413095531.25603-7-david@redhat.com> Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
2021-07-07 | target/s390x: move kvm files into kvm/ | Cho, Yu-Chen
Move the KVM files into kvm/ and, after the reshuffling, update MAINTAINERS accordingly. Make use of the new directory: target/s390x/kvm/ Signed-off-by: Claudio Fontana <cfontana@suse.de> Signed-off-by: Cho, Yu-Chen <acho@suse.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: Thomas Huth <thuth@redhat.com> Message-Id: <20210707105324.23400-14-acho@suse.com> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
2021-06-21 | s390x/css: Add passthrough IRB | Eric Farman
Wire in the subchannel callback for building the IRB ESW and ECW space for passthrough devices, and copy the hardware's ESW into the IRB we are building. If the hardware presented concurrent sense, then copy that sense data into the IRB's ECW space. Signed-off-by: Eric Farman <farman@linux.ibm.com> Message-Id: <20210617232537.1337506-5-farman@linux.ibm.com> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
2021-06-18 | vfio/migration: Correct device state from vmstate change for savevm case | Kirti Wankhede
Set the _SAVING flag for the device state from the vmstate change handler when it gets called from savevm. Currently the savevm/suspend state transition is seen as: _RUNNING -> _STOP -> Stop-and-copy -> _STOP. It should be: _RUNNING -> Stop-and-copy -> _STOP. The transition from _RUNNING to _STOP occurs in vfio_vmstate_change(): when the vmstate changes from running to !running, the _RUNNING flag is reset, but at the same time, when vfio_vmstate_change() is called for RUN_STATE_SAVE_VM, the _SAVING bit should be set. Reported by: Yishai Hadas <yishaih@nvidia.com> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com> Message-Id: <1623177441-27496-1-git-send-email-kwankhede@nvidia.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2021-06-18 | vfio: Fix unregister SaveVMHandler in vfio_migration_finalize | Kunkun Jiang
In vfio_migration_init(), the SaveVMHandlers are registered for the VFIO device, but the corresponding 'unregister' operation is missing. This leads to a 'Segmentation fault (core dumped)' in qemu_savevm_state_setup() when performing live migration after a VFIO device has been hot removed. Fixes: 7c2f5f75f94 ("vfio: Register SaveVMHandlers for VFIO device") Reported-by: Qixin Gan <ganqixin@huawei.com> Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com> Message-Id: <20210527123101.289-1-jiangkunkun@huawei.com> Reviewed by: Kirti Wankhede <kwankhede@nvidia.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2021-06-02 | docs: fix references to docs/devel/tracing.rst | Stefano Garzarella
Commit e50caf4a5c ("tracing: convert documentation to rST") converted docs/devel/tracing.txt to docs/devel/tracing.rst. We still have several references to the old file, so let's fix them with the following command: sed -i s/tracing.txt/tracing.rst/ $(git grep -l docs/devel/tracing.txt) Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-Id: <20210517151702.109066-2-sgarzare@redhat.com> Signed-off-by: Thomas Huth <thuth@redhat.com>
2021-05-20 | vfio-ccw: Attempt to clean up all IRQs on error | Eric Farman
The vfio_ccw_unrealize() routine makes an unconditional attempt to unregister every IRQ notifier, though they may not have been registered in the first place (when running on an older kernel, for example). Let's mirror this behavior in the error cleanups in vfio_ccw_realize() so that if/when new IRQs are added, it is less confusing to recognize the necessary procedures. The worst case scenario would be some extra messages about an undefined IRQ, but since this is an error exit that won't be the only thing to worry about. And regarding those messages, let's change it to a warning instead of an error, to better reflect their severity. The existing code in both paths handles everything anyway. Signed-off-by: Eric Farman <farman@linux.ibm.com> Acked-by: Matthew Rosato <mjrosato@linux.ibm.com> Message-Id: <20210428143652.1571487-1-farman@linux.ibm.com> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
2021-05-20 | vfio-ccw: Permit missing IRQs | Eric Farman
Commit 690e29b91102 ("vfio-ccw: Refactor ccw irq handler") changed one of the checks for the IRQ notifier registration from saying "the host needs to recognize the only IRQ that exists" to saying "the host needs to recognize ANY IRQ that exists." And this worked fine, because the subsequent change to support the CRW IRQ notifier doesn't get into this code when running on an older kernel, thanks to a guard by a capability region. The later addition of the REQ(uest) IRQ by commit b2f96f9e4f5f ("vfio-ccw: Connect the device request notifier") broke this assumption because there is no matching capability region. Thus, running new QEMU on an older kernel fails with: vfio: unexpected number of irqs 2 Let's adapt the message here so that there's a better clue of what IRQ is missing. Furthermore, let's make the REQ(uest) IRQ not fail when attempting to register it, to permit running vfio-ccw on a newer QEMU with an older kernel. Fixes: b2f96f9e4f5f ("vfio-ccw: Connect the device request notifier") Signed-off-by: Eric Farman <farman@linux.ibm.com> Message-Id: <20210421152053.2379873-1-farman@linux.ibm.com> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
2021-05-02 | Do not include cpu.h if it's not really necessary | Thomas Huth
Stop including cpu.h in files that don't need it. Signed-off-by: Thomas Huth <thuth@redhat.com> Message-Id: <20210416171314.2074665-4-thuth@redhat.com> Signed-off-by: Laurent Vivier <laurent@vivier.eu>
2021-05-02 | Do not include sysemu/sysemu.h if it's not really necessary | Thomas Huth
Stop including sysemu/sysemu.h in files that don't need it. Signed-off-by: Thomas Huth <thuth@redhat.com> Message-Id: <20210416171314.2074665-2-thuth@redhat.com> Signed-off-by: Laurent Vivier <laurent@vivier.eu>
2021-05-02 | hw: Do not include hw/sysbus.h if it is not necessary | Thomas Huth
Many files include hw/sysbus.h without needing it. Remove the superfluous include statements. Signed-off-by: Thomas Huth <thuth@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org> Message-Id: <20210327082804.2259480-1-thuth@redhat.com> Signed-off-by: Laurent Vivier <laurent@vivier.eu>
2021-05-02 | hw: Remove superfluous includes of hw/hw.h | Thomas Huth
The include/hw/hw.h header only has a prototype for hw_error(), so it does not make sense to include this in files that do not use this function. Signed-off-by: Thomas Huth <thuth@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org> Message-Id: <20210326151848.2217216-1-thuth@redhat.com> Signed-off-by: Laurent Vivier <laurent@vivier.eu>
2021-03-16 | vfio/migrate: Move switch of dirty tracking into vfio_memory_listener | Keqian Zhu
For now the switch of vfio dirty page tracking is integrated into @vfio_save_handler. The reason is that some PCI vendor drivers may start dirty tracking based on the _SAVING state of the device, so if dirty tracking is started before setting the device state, vfio will report full-dirty to QEMU. However, the dirty bitmaps of all ramblocks are fully set when ram saving is set up, so it doesn't matter whether the device is in the _SAVING state when vfio dirty tracking starts. Moreover, this logic causes some problems [1]. The object of dirty tracking is guest memory, but the object of @vfio_save_handler is device state, which produces unnecessary coupling and conflicts: 1. Coupling: their saving granules differ (per-VM vs. per-device). vfio will enable dirty_page_tracking for each device, but once is actually enough. 2. Conflicts: ram_save_setup() traverses all memory_listeners to execute their log_start() and log_sync() hooks to get the first-round dirty bitmap, which is used by the bulk stage of ram saving. However, as vfio dirty tracking is not yet started, it can't get the dirty bitmap from vfio, so we give up the chance to handle vfio dirty pages at the bulk stage. Moving the switch of vfio dirty_page_tracking into vfio_memory_listener solves the above problems. Besides, do not require devices to be in the _SAVING state for vfio_sync_dirty_bitmap(). [1] https://www.spinics.net/lists/kvm/msg229967.html Reported-by: Zenghui Yu <yuzenghui@huawei.com> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210309031913.11508-1-zhukeqian1@huawei.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2021-03-16 | vfio: Support host translation granule size | Kunkun Jiang
cpu_physical_memory_set_dirty_lebitmap() can quickly deal with the dirty pages of memory by bitmap-traveling, regardless of whether the bitmap is aligned correctly or not. It supports pages in a bitmap of host page size, so it is better to set bitmap_pgsize to the host page size to support more translation granule sizes. [aw: The Fixes commit below introduced code to restrict migration support to configurations where the target page size intersects the host dirty page support. For example, a 4K guest on a 4K host. Due to the above flexibility in bitmap handling, this restriction unnecessarily prevents mixed target/host page sizes that could otherwise be supported. Use host page size for dirty bitmap.] Fixes: 87ea529c502 ("vfio: Get migration capability flags for container") Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com> Message-Id: <20210304133446.1521-1-jiangkunkun@huawei.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2021-03-16 | vfio: Avoid disabling and enabling vectors repeatedly in VFIO migration | Shenming Lu
In VFIO migration resume phase and some guest startups, there are already unmasked vectors in the vector table when calling vfio_msix_enable(). So in order to avoid inefficiently disabling and enabling vectors repeatedly, let's allocate all needed vectors first and then enable these unmasked vectors one by one without disabling. Signed-off-by: Shenming Lu <lushenming@huawei.com> Message-Id: <20210310030233.1133-4-lushenming@huawei.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2021-03-16 | vfio: Set the priority of the VFIO VM state change handler explicitly | Shenming Lu
In the VFIO VM state change handler, when stopping the VM, the _RUNNING bit in device_state is cleared, which makes the VFIO device stop, including no longer generating interrupts. Then we can save the pending states of all interrupts in the GIC VM state change handler (on ARM). So we have to set the priority of the VFIO VM state change handler explicitly (like virtio devices do) to ensure it is called before the GIC's when saving. Signed-off-by: Shenming Lu <lushenming@huawei.com> Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Message-Id: <20210310030233.1133-3-lushenming@huawei.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
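A hedged sketch of registering the handler with an explicit priority; the priority value and field names are illustrative, not the exact ones used by the patch:
    /* Register with a priority so this handler runs before the GIC's
     * VM state change handler when the VM is stopped for saving. */
    migration->vm_state =
        qemu_add_vm_change_state_handler_prio(vfio_vmstate_change,
                                              vbasedev, 1 /* illustrative */);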
2021-03-16 | vfio: Move the saving of the config space to the right place in VFIO migration | Shenming Lu
On ARM64 the VFIO SET_IRQS ioctl is dependent on the VM interrupt setup; if the VFIO PCI device config space is restored before the VGIC, an error might occur in the kernel. So move the saving of the config space to the non-iterable process, so that it is restored after the VGIC according to their priorities. As for the possible dependence of the device-specific migration data on its config space, the vendor driver can include any config info it needs in its own data stream. Signed-off-by: Shenming Lu <lushenming@huawei.com> Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com> Message-Id: <20210310030233.1133-2-lushenming@huawei.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>