aboutsummaryrefslogtreecommitdiff
path: root/hw/virtio
AgeCommit message (Collapse)Author
2023-11-07Merge tag 'for_upstream' of https://git.kernel.org/pub/scm/virt/kvm/mst/qemu ↵Stefan Hajnoczi
into staging virtio,pc,pci: features, fixes virtio sound card support vhost-user: back-end state migration cxl: line length reduction enabling fabric management vhost-vdpa: shadow virtqueue hash calculation Support shadow virtqueue RSS Support tests: CPU topology related smbios test cases Fixes, cleanups all over the place Signed-off-by: Michael S. Tsirkin <mst@redhat.com> # -----BEGIN PGP SIGNATURE----- # # iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmVKDDoPHG1zdEByZWRo # YXQuY29tAAoJECgfDbjSjVRpF08H/0Zts8uvkHbgiOEJw4JMHU6/VaCipfIYsp01 # GSfwYOyEsXJ7GIxKWaCiMnWXEm7tebNCPKf3DoUtcAojQj3vuF9XbWBKw/bfRn83 # nGO/iiwbYViSKxkwqUI+Up5YiN9o0M8gBFrY0kScPezbnYmo5u2bcADdEEq6gH68 # D0Ea8i+WmszL891ypvgCDBL2ObDk3qX3vA5Q6J2I+HKX2ofJM59BwaKwS5ghw+IG # BmbKXUZJNjUQfN9dQ7vJuiuqdknJ2xUzwW2Vn612ffarbOZB1DZ6ruWlrHty5TjX # 0w4IXEJPBgZYbX9oc6zvTQnbLDBJbDU89mnme0TcmNMKWmQKTtc= # =vEv+ # -----END PGP SIGNATURE----- # gpg: Signature made Tue 07 Nov 2023 18:06:50 HKT # gpg: using RSA key 5D09FD0871C8F85B94CA8A0D281F0DB8D28D5469 # gpg: issuer "mst@redhat.com" # gpg: Good signature from "Michael S. Tsirkin <mst@kernel.org>" [full] # gpg: aka "Michael S. Tsirkin <mst@redhat.com>" [full] # Primary key fingerprint: 0270 606B 6F3C DF3D 0B17 0970 C350 3912 AFBE 8E67 # Subkey fingerprint: 5D09 FD08 71C8 F85B 94CA 8A0D 281F 0DB8 D28D 5469 * tag 'for_upstream' of https://git.kernel.org/pub/scm/virt/kvm/mst/qemu: (63 commits) acpi/tests/avocado/bits: enable console logging from bits VM acpi/tests/avocado/bits: enforce 32-bit SMBIOS entry point hw/cxl: Add tunneled command support to mailbox for switch cci. hw/cxl: Add dummy security state get hw/cxl/type3: Cleanup multiple CXL_TYPE3() calls in read/write functions hw/cxl/mbox: Add Get Background Operation Status Command hw/cxl: Add support for device sanitation hw/cxl/mbox: Wire up interrupts for background completion hw/cxl/mbox: Add support for background operations hw/cxl: Implement Physical Ports status retrieval hw/pci-bridge/cxl_downstream: Set default link width and link speed hw/cxl/mbox: Add Physical Switch Identify command. hw/cxl/mbox: Add Information and Status / Identify command hw/cxl: Add a switch mailbox CCI function hw/pci-bridge/cxl_upstream: Move defintion of device to header. hw/cxl/mbox: Generalize the CCI command processing hw/cxl/mbox: Pull the CCI definition out of the CXLDeviceState hw/cxl/mbox: Split mailbox command payload into separate input and output hw/cxl/mbox: Pull the payload out of struct cxl_cmd and make instances constant hw/cxl: Fix a QEMU_BUILD_BUG_ON() in switch statement scope issue. ... Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2023-11-07vhost-user-fs: Implement internal migrationHanna Czenczek
A virtio-fs device's VM state consists of: - the virtio device (vring) state (VMSTATE_VIRTIO_DEVICE) - the back-end's (virtiofsd's) internal state We get/set the latter via the new vhost operations to transfer migratory state. It is its own dedicated subsection, so that for external migration, it can be disabled. Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Hanna Czenczek <hreitz@redhat.com> Message-Id: <20231016134243.68248-8-hreitz@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-11-07vhost: Add high-level state save/load functionsHanna Czenczek
vhost_save_backend_state() and vhost_load_backend_state() can be used by vhost front-ends to easily save and load the back-end's state to/from the migration stream. Because we do not know the full state size ahead of time, vhost_save_backend_state() simply reads the data in 1 MB chunks, and writes each chunk consecutively into the migration stream, prefixed by its length. EOF is indicated by a 0-length chunk. Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Hanna Czenczek <hreitz@redhat.com> Message-Id: <20231016134243.68248-7-hreitz@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-11-07vhost-user: Interface for migration state transferHanna Czenczek
Add the interface for transferring the back-end's state during migration as defined previously in vhost-user.rst. Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Hanna Czenczek <hreitz@redhat.com> Message-Id: <20231016134243.68248-6-hreitz@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-11-07Merge tag 'pull-vfio-20231106' of https://github.com/legoater/qemu into stagingStefan Hajnoczi
vfio queue: * Support for non 64b IOVA space * Introduction of a PCIIOMMUOps callback structure to ease future extensions * Fix for a buffer overrun when writing the VF token * PPC cleanups preparing ground for IOMMUFD support # -----BEGIN PGP SIGNATURE----- # # iQIzBAABCAAdFiEEoPZlSPBIlev+awtgUaNDx8/77KEFAmVI+bIACgkQUaNDx8/7 # 7KHW4g/9FmgX0k2Elm1BAul3slJtuBT8/iHKfK19rhXICxhxS5xBWJA8FmosTWAT # 91YqQJhOHARxLd9VROfv8Fq8sAo+Ys8bP3PTXh5satjY5gR9YtmMSVqvsAVLn7lv # a/0xp7wPJt2UeKzvRNUqFXNr7yHPwxFxbJbmmAJbNte8p+TfE2qvojbJnu7BjJbg # sTtS/vFWNJwtuNYTkMRoiZaUKEoEZ8LnslOqKUjgeO59g4i3Dq8e2JCmHANPFWUK # cWmr7AqcXgXEnLSDWTtfN53bjcSCYkFVb4WV4Wv1/7hUF5jQ4UR0l3B64xWe0M3/ # Prak3bWOM/o7JwLBsgaWPngXA9V0WFBTXVF4x5qTwhuR1sSV8MxUvTKxI+qqiEzA # FjU89oSZ+zXId/hEUuTL6vn1Th8/6mwD0L9ORchNOQUKzCjBzI4MVPB09nM3AdPC # LGThlufsZktdoU2KjMHpc+gMIXQYsxkgvm07K5iZTZ5eJ4tV5KB0aPvTZppGUxe1 # YY9og9F3hxjDHQtEuSY2rzBQI7nrUpd1ZI5ut/3ZgDWkqD6aGRtMme4n4GsGsYb2 # Ht9+d2RL9S8uPUh+7rV8K/N3+vXgXRaEYTuAScKtflEbA7YnZA5nUdMng8x0kMTQ # Y73XCd4UGWDfSSZsgaIHGkM/MRIHgmlrfcwPkWqWW9vF+92O6Hw= # =/Du0 # -----END PGP SIGNATURE----- # gpg: Signature made Mon 06 Nov 2023 22:35:30 HKT # gpg: using RSA key A0F66548F04895EBFE6B0B6051A343C7CFFBECA1 # gpg: Good signature from "Cédric Le Goater <clg@redhat.com>" [unknown] # gpg: aka "Cédric Le Goater <clg@kaod.org>" [unknown] # gpg: WARNING: This key is not certified with a trusted signature! # gpg: There is no indication that the signature belongs to the owner. # Primary key fingerprint: A0F6 6548 F048 95EB FE6B 0B60 51A3 43C7 CFFB ECA1 * tag 'pull-vfio-20231106' of https://github.com/legoater/qemu: (22 commits) vfio/common: Move vfio_host_win_add/del into spapr.c vfio/spapr: Make vfio_spapr_create/remove_window static vfio/container: Move spapr specific init/deinit into spapr.c vfio/container: Move vfio_container_add/del_section_window into spapr.c vfio/container: Move IBM EEH related functions into spapr_pci_vfio.c util/uuid: Define UUID_STR_LEN from UUID_NONE string util/uuid: Remove UUID_FMT_LEN vfio/pci: Fix buffer overrun when writing the VF token util/uuid: Add UUID_STR_LEN definition hw/pci: modify pci_setup_iommu() to set PCIIOMMUOps test: Add some tests for range and resv-mem helpers virtio-iommu: Consolidate host reserved regions and property set ones virtio-iommu: Implement set_iova_ranges() callback virtio-iommu: Record whether a probe request has been issued range: Introduce range_inverse_array() virtio-iommu: Introduce per IOMMUDevice reserved regions util/reserved-region: Add new ReservedRegion helpers range: Make range_compare() public virtio-iommu: Rename reserved_regions into prop_resv_regions vfio: Collect container iova range info ... Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2023-11-06Revert "hw/virtio/virtio-pmem: Replace impossible check by assertion"Maciej S. Szmigiero
This reverts commit 5960f254dbb46f0c7a9f5f44bf4d27c19c34cb97 since the previous commit made this situation possible again. Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
2023-11-03hw/pci: modify pci_setup_iommu() to set PCIIOMMUOpsYi Liu
This patch modifies pci_setup_iommu() to set PCIIOMMUOps instead of setting PCIIOMMUFunc. PCIIOMMUFunc is used to get an address space for a PCI device in vendor specific way. The PCIIOMMUOps still offers this functionality. But using PCIIOMMUOps leaves space to add more iommu related vendor specific operations. Cc: Kevin Tian <kevin.tian@intel.com> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com> Cc: Peter Xu <peterx@redhat.com> Cc: Eric Auger <eric.auger@redhat.com> Cc: Yi Sun <yi.y.sun@linux.intel.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Eric Auger <eric.auger@redhat.com> Cc: Peter Maydell <peter.maydell@linaro.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Andrey Smirnov <andrew.smirnov@gmail.com> Cc: Helge Deller <deller@gmx.de> Cc: Hervé Poussineau <hpoussin@reactos.org> Cc: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk> Cc: BALATON Zoltan <balaton@eik.bme.hu> Cc: Elena Ufimtseva <elena.ufimtseva@oracle.com> Cc: Jagannathan Raman <jag.raman@oracle.com> Cc: Matthew Rosato <mjrosato@linux.ibm.com> Cc: Eric Farman <farman@linux.ibm.com> Cc: Halil Pasic <pasic@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Thomas Huth <thuth@redhat.com> Cc: Helge Deller <deller@gmx.de> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Eric Auger <eric.auger@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> [ clg: - refreshed on latest QEMU - included hw/remote/iommu.c - documentation update - asserts in pci_setup_iommu() - removed checks on iommu_bus->iommu_ops->get_address_space - included Elroy PCI host (PA-RISC) ] Signed-off-by: Cédric Le Goater <clg@redhat.com>
2023-11-03virtio-iommu: Consolidate host reserved regions and property set onesEric Auger
Up to now we were exposing to the RESV_MEM probe requests the reserved memory regions set though the reserved-regions array property. Combine those with the host reserved memory regions if any. Those latter are tagged as RESERVED. We don't have more information about them besides then cannot be mapped. Reserved regions set by property have higher priority. Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: "Michael S. Tsirkin" <mst@redhat.com> Tested-by: Yanghang Liu <yanghliu@redhat.com> Signed-off-by: Cédric Le Goater <clg@redhat.com>
2023-11-03virtio-iommu: Implement set_iova_ranges() callbackEric Auger
The implementation populates the array of per IOMMUDevice host reserved ranges. It is forbidden to have conflicting sets of host IOVA ranges to be applied onto the same IOMMU MR (implied by different host devices). In case the callback is called after the probe request has been issues by the driver, a warning is issued. Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: "Michael S. Tsirkin" <mst@redhat.com> Tested-by: Yanghang Liu <yanghliu@redhat.com> Signed-off-by: Cédric Le Goater <clg@redhat.com>
2023-11-03virtio-iommu: Record whether a probe request has been issuedEric Auger
Add an IOMMUDevice 'probe_done' flag to record that the driver already issued a probe request on that device. This will be useful to double check host reserved regions aren't notified after the probe and hence are not taken into account by the driver. Signed-off-by: Eric Auger <eric.auger@redhat.com> Suggested-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Reviewed-by: "Michael S. Tsirkin" <mst@redhat.com> Tested-by: Yanghang Liu <yanghliu@redhat.com> Signed-off-by: Cédric Le Goater <clg@redhat.com>
2023-11-03virtio-iommu: Introduce per IOMMUDevice reserved regionsEric Auger
For the time being the per device reserved regions are just a duplicate of IOMMU wide reserved regions. Subsequent patches will combine those with host reserved regions, if any. Signed-off-by: Eric Auger <eric.auger@redhat.com> Tested-by: Yanghang Liu <yanghliu@redhat.com> Reviewed-by: "Michael S. Tsirkin" <mst@redhat.com> Signed-off-by: Cédric Le Goater <clg@redhat.com>
2023-11-03virtio-iommu: Rename reserved_regions into prop_resv_regionsEric Auger
Rename VirtIOIOMMU (nb_)reserved_regions fields with the "prop_" prefix to highlight those fields are set through a property, at machine level. They are IOMMU wide. A subsequent patch will introduce per IOMMUDevice reserved regions that will include both those IOMMU wide property reserved regions plus, sometimes, host reserved regions, if the device is backed by a host device protected by a physical IOMMU. Also change nb_ prefix by nr_. Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Cédric Le Goater <clg@redhat.com> Reviewed-by: "Michael S. Tsirkin" <mst@redhat.com> Signed-off-by: Cédric Le Goater <clg@redhat.com>
2023-11-03memory: Let ReservedRegion use RangeEric Auger
A reserved region is a range tagged with a type. Let's directly use the Range type in the prospect to reuse some of the library helpers shipped with the Range type. Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Cédric Le Goater <clg@redhat.com> Reviewed-by: "Michael S. Tsirkin" <mst@redhat.com> Signed-off-by: Cédric Le Goater <clg@redhat.com>
2023-11-01cpr: relax vhost migration blockersSteve Sistare
vhost blocks migration if logging is not supported to track dirty memory, and vhost-user blocks it if the log cannot be saved to a shm fd. vhost-vdpa blocks migration if both hosts do not support all the device's features using a shadow VQ, for tracking requests and dirty memory. vhost-scsi blocks migration if storage cannot be shared across hosts, or if state cannot be migrated. None of these conditions apply if the old and new qemu processes do not run concurrently, and if new qemu starts on the same host as old, which is the case for cpr. Narrow the scope of these blockers so they only apply to normal mode. They will not block cpr modes when they are added in subsequent patches. No functional change until a new mode is added. Signed-off-by: Steve Sistare <steven.sistare@oracle.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <1698263069-406971-5-git-send-email-steven.sistare@oracle.com>
2023-11-01migration: Use vmstate_register_any()Juan Quintela
This are the easiest cases, where we were already using VMSTATE_INSTANCE_ID_ANY. Reviewed-by: Stefan Berger <stefanb@linux.ibm.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <20231020090731.28701-3-quintela@redhat.com>
2023-10-31virtio: use defer_call() in virtio_irqfd_notify()Stefan Hajnoczi
virtio-blk and virtio-scsi invoke virtio_irqfd_notify() to send Used Buffer Notifications from an IOThread. This involves an eventfd write(2) syscall. Calling this repeatedly when completing multiple I/O requests in a row is wasteful. Use the defer_call() API to batch together virtio_irqfd_notify() calls made during thread pool (aio=threads), Linux AIO (aio=native), and io_uring (aio=io_uring) completion processing. Behavior is unchanged for emulated devices that do not use defer_call_begin()/defer_call_end() since defer_call() immediately invokes the callback when called outside a defer_call_begin()/defer_call_end() region. fio rw=randread bs=4k iodepth=64 numjobs=8 IOPS increases by ~9% with a single IOThread and 8 vCPUs. iodepth=1 decreases by ~1% but this could be noise. Detailed performance data and configuration specifics are available here: https://gitlab.com/stefanha/virt-playbooks/-/tree/blk_io_plug-irqfd This duplicates the BH that virtio-blk uses for batching. The next commit will remove it. Reviewed-by: Eric Blake <eblake@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20230913200045.1024233-4-stefanha@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-10-25kvm: require KVM_CAP_IOEVENTFD and KVM_CAP_IOEVENTFD_ANY_LENGTHPaolo Bonzini
KVM_CAP_IOEVENTFD_ANY_LENGTH was added in Linux 4.4, released in 2016. Assume that it is present. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-25kvm: assume that many ioeventfds can be createdPaolo Bonzini
NR_IOBUS_DEVS was increased to 200 in Linux 2.6.34. By Linux 3.5 it had increased to 1000 and later ioeventfds were changed to not count against the limit. But the earlier limit of 200 would already be enough for kvm_check_many_ioeventfds() to be true, so remove the check. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-23Merge tag 'for_upstream' of https://git.kernel.org/pub/scm/virt/kvm/mst/qemu ↵Stefan Hajnoczi
into staging virtio,pc,pci: features, cleanups infrastructure for vhost-vdpa shadow work piix south bridge rework reconnect for vhost-user-scsi dummy ACPI QTG DSM for cxl tests, cleanups, fixes all over the place Signed-off-by: Michael S. Tsirkin <mst@redhat.com> # -----BEGIN PGP SIGNATURE----- # # iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmU06PMPHG1zdEByZWRo # YXQuY29tAAoJECgfDbjSjVRpNIsH/0DlKti86VZLJ6PbNqsnKxoK2gg05TbEhPZU # pQ+RPDaCHpFBsLC5qsoMJwvaEQFe0e49ZFemw7bXRzBxgmbbNnZ9ArCIPqT+rvQd # 7UBmyC+kacVyybZatq69aK2BHKFtiIRlT78d9Izgtjmp8V7oyKoz14Esh8wkE+FT # ypHUa70Addi6alNm6BVkm7bxZxi0Wrmf3THqF8ViYvufzHKl7JR5e17fKWEG0BqV # 9W7AeHMnzJ7jkTvBGUw7g5EbzFn7hPLTbO4G/VW97k0puS4WRX5aIMkVhUazsRIa # zDOuXCCskUWuRapiCwY0E4g7cCaT8/JR6JjjBaTgkjJgvo5Y8Eg= # =ILek # -----END PGP SIGNATURE----- # gpg: Signature made Sun 22 Oct 2023 02:18:43 PDT # gpg: using RSA key 5D09FD0871C8F85B94CA8A0D281F0DB8D28D5469 # gpg: issuer "mst@redhat.com" # gpg: Good signature from "Michael S. Tsirkin <mst@kernel.org>" [full] # gpg: aka "Michael S. Tsirkin <mst@redhat.com>" [full] # Primary key fingerprint: 0270 606B 6F3C DF3D 0B17 0970 C350 3912 AFBE 8E67 # Subkey fingerprint: 5D09 FD08 71C8 F85B 94CA 8A0D 281F 0DB8 D28D 5469 * tag 'for_upstream' of https://git.kernel.org/pub/scm/virt/kvm/mst/qemu: (62 commits) intel-iommu: Report interrupt remapping faults, fix return value MAINTAINERS: Add include/hw/intc/i8259.h to the PC chip section vhost-user: Fix protocol feature bit conflict tests/acpi: Update DSDT.cxl with QTG DSM hw/cxl: Add QTG _DSM support for ACPI0017 device tests/acpi: Allow update of DSDT.cxl hw/i386/cxl: ensure maxram is greater than ram size for calculating cxl range vhost-user: fix lost reconnect vhost-user-scsi: start vhost when guest kicks vhost-user-scsi: support reconnect to backend vhost: move and rename the conn retry times vhost-user-common: send get_inflight_fd once hw/i386/pc_piix: Make PIIX4 south bridge usable in PC machine hw/isa/piix: Implement multi-process QEMU support also for PIIX4 hw/isa/piix: Resolve duplicate code regarding PCI interrupt wiring hw/isa/piix: Reuse PIIX3's PCI interrupt triggering in PIIX4 hw/isa/piix: Rename functions to be shared for PCI interrupt triggering hw/isa/piix: Reuse PIIX3 base class' realize method in PIIX4 hw/isa/piix: Share PIIX3's base class with PIIX4 hw/isa/piix: Harmonize names of reset control memory regions ... Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2023-10-22vhost-user: fix lost reconnectLi Feng
When the vhost-user is reconnecting to the backend, and if the vhost-user fails at the get_features in vhost_dev_init(), then the reconnect will fail and it will not be retriggered forever. The reason is: When the vhost-user fails at get_features, the vhost_dev_cleanup will be called immediately. vhost_dev_cleanup calls 'memset(hdev, 0, sizeof(struct vhost_dev))'. The reconnect path is: vhost_user_blk_event vhost_user_async_close(.. vhost_user_blk_disconnect ..) qemu_chr_fe_set_handlers <----- clear the notifier callback schedule vhost_user_async_close_bh The vhost->vdev is null, so the vhost_user_blk_disconnect will not be called, then the event fd callback will not be reinstalled. All vhost-user devices have this issue, including vhost-user-blk/scsi. With this patch, if the vdev->vdev is null, the fd callback will still be reinstalled. Fixes: 71e076a07d ("hw/virtio: generalise CHR_EVENT_CLOSED handling") Signed-off-by: Li Feng <fengli@smartx.com> Reviewed-by: Raphael Norwitz <raphael.norwitz@nutanix.com> Message-Id: <20231009044735.941655-6-fengli@smartx.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-10-22vhost: move and rename the conn retry timesLi Feng
Multiple devices need this macro, move it to a common header. Signed-off-by: Li Feng <fengli@smartx.com> Reviewed-by: Raphael Norwitz <raphael.norwitz@nutanix.com> Message-Id: <20231009044735.941655-3-fengli@smartx.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-10-22virtio: call ->vhost_reset_device() during resetStefan Hajnoczi
vhost-user-scsi has a VirtioDeviceClass->reset() function that calls ->vhost_reset_device(). The other vhost devices don't notify the vhost device upon reset. Stateful vhost devices may need to handle device reset in order to free resources or prevent stale device state from interfering after reset. Call ->vhost_device_reset() from virtio_reset() so that that vhost devices are notified of device reset. This patch affects behavior as follows: - vhost-kernel: No change in behavior since ->vhost_reset_device() is not implemented. - vhost-user: back-ends that negotiate VHOST_USER_PROTOCOL_F_RESET_DEVICE now receive a VHOST_USER_DEVICE_RESET message upon device reset. Otherwise there is no change in behavior. DPDK, SPDK, libvhost-user, and the vhost-user-backend crate do not negotiate VHOST_USER_PROTOCOL_F_RESET_DEVICE automatically. - vhost-vdpa: an extra SET_STATUS 0 call is made during device reset. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-Id: <20231004014532.1228637-4-stefanha@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Raphael Norwitz <raphael.norwitz@nutanix.com> Reviewed-by: Hanna Czenczek <hreitz@redhat.com>
2023-10-22vhost-backend: remove vhost_kernel_reset_device()Stefan Hajnoczi
vhost_kernel_reset_device() invokes RESET_OWNER, which disassociates the owner process from the device. The device is left non-operational since SET_OWNER is only called once during startup in vhost_dev_init(). vhost_kernel_reset_device() is never called so this latent bug never appears. Get rid of vhost_kernel_reset_device() for now. If someone needs it in the future they'll need to implement it correctly. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-Id: <20231004014532.1228637-3-stefanha@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Raphael Norwitz <raphael.norwitz@nutanix.com> Reviewed-by: Hanna Czenczek <hreitz@redhat.com>
2023-10-22vhost-user: do not send RESET_OWNER on device resetStefan Hajnoczi
The VHOST_USER_RESET_OWNER message is deprecated in the spec: This is no longer used. Used to be sent to request disabling all rings, but some back-ends interpreted it to also discard connection state (this interpretation would lead to bugs). It is recommended that back-ends either ignore this message, or use it to disable all rings. The only caller of vhost_user_reset_device() is vhost_user_scsi_reset(). It checks that F_RESET_DEVICE was negotiated before calling it: static void vhost_user_scsi_reset(VirtIODevice *vdev) { VHostSCSICommon *vsc = VHOST_SCSI_COMMON(vdev); struct vhost_dev *dev = &vsc->dev; /* * Historically, reset was not implemented so only reset devices * that are expecting it. */ if (!virtio_has_feature(dev->protocol_features, VHOST_USER_PROTOCOL_F_RESET_DEVICE)) { return; } if (dev->vhost_ops->vhost_reset_device) { dev->vhost_ops->vhost_reset_device(dev); } } Therefore VHOST_USER_RESET_OWNER is actually never sent by vhost_user_reset_device(). Remove the dead code. This effectively moves the vhost-user protocol specific code from vhost-user-scsi.c into vhost-user.c where it belongs. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-Id: <20231004014532.1228637-2-stefanha@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Raphael Norwitz <raphael.norwitz@nutanix.com>
2023-10-22vhost-user: call VHOST_USER_SET_VRING_ENABLE synchronouslyLaszlo Ersek
(1) The virtio-1.2 specification <http://docs.oasis-open.org/virtio/virtio/v1.2/virtio-v1.2.html> writes: > 3 General Initialization And Device Operation > 3.1 Device Initialization > 3.1.1 Driver Requirements: Device Initialization > > [...] > > 7. Perform device-specific setup, including discovery of virtqueues for > the device, optional per-bus setup, reading and possibly writing the > device’s virtio configuration space, and population of virtqueues. > > 8. Set the DRIVER_OK status bit. At this point the device is “live”. and > 4 Virtio Transport Options > 4.1 Virtio Over PCI Bus > 4.1.4 Virtio Structure PCI Capabilities > 4.1.4.3 Common configuration structure layout > 4.1.4.3.2 Driver Requirements: Common configuration structure layout > > [...] > > The driver MUST configure the other virtqueue fields before enabling the > virtqueue with queue_enable. > > [...] (The same statements are present in virtio-1.0 identically, at <http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html>.) These together mean that the following sub-sequence of steps is valid for a virtio-1.0 guest driver: (1.1) set "queue_enable" for the needed queues as the final part of device initialization step (7), (1.2) set DRIVER_OK in step (8), (1.3) immediately start sending virtio requests to the device. (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES special virtio feature is negotiated, then virtio rings start in disabled state, according to <https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states>. In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for enabling vrings. Therefore setting "queue_enable" from the guest (1.1) -- which is technically "buffered" on the QEMU side until the guest sets DRIVER_OK (1.2) -- is a *control plane* operation, which -- after (1.2) -- travels from the guest through QEMU to the vhost-user backend, using a unix domain socket. Whereas sending a virtio request (1.3) is a *data plane* operation, which evades QEMU -- it travels from guest to the vhost-user backend via eventfd. This means that operations ((1.1) + (1.2)) and (1.3) travel through different channels, and their relative order can be reversed, as perceived by the vhost-user backend. That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs against the Rust-language virtiofsd version 1.7.2. (Which uses version 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost crate.) Namely, when VirtioFsDxe binds a virtiofs device, it goes through the device initialization steps (i.e., control plane operations), and immediately sends a FUSE_INIT request too (i.e., performs a data plane operation). In the Rust-language virtiofsd, this creates a race between two components that run *concurrently*, i.e., in different threads or processes: - Control plane, handling vhost-user protocol messages: The "VhostUserSlaveReqHandlerMut::set_vring_enable" method [crates/vhost-user-backend/src/handler.rs] handles VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled" flag according to the message processed. - Data plane, handling virtio / FUSE requests: The "VringEpollHandler::handle_event" method [crates/vhost-user-backend/src/event_loop.rs] handles the incoming virtio / FUSE request, consuming the virtio kick at the same time. If the vring's "enabled" flag is set, the virtio / FUSE request is processed genuinely. If the vring's "enabled" flag is clear, then the virtio / FUSE request is discarded. Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*. However, if the data plane processor in virtiofsd wins the race, then it sees the FUSE_INIT *before* the control plane processor took notice of VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane processor. Therefore the latter drops FUSE_INIT on the floor, and goes back to waiting for further virtio / FUSE requests with epoll_wait. Meanwhile OVMF is stuck waiting for the FUSET_INIT response -- a deadlock. The deadlock is not deterministic. OVMF hangs infrequently during first boot. However, OVMF hangs almost certainly during reboots from the UEFI shell. The race can be "reliably masked" by inserting a very small delay -- a single debug message -- at the top of "VringEpollHandler::handle_event", i.e., just before the data plane processor checks the "enabled" field of the vring. That delay suffices for the control plane processor to act upon VHOST_USER_SET_VRING_ENABLE. We can deterministically prevent the race in QEMU, by blocking OVMF inside step (1.2) -- i.e., in the write to the device status register that "unleashes" queue enablement -- until VHOST_USER_SET_VRING_ENABLE actually *completes*. That way OVMF's VCPU cannot advance to the FUSE_INIT submission before virtiofsd's control plane processor takes notice of the queue being enabled. Wait for VHOST_USER_SET_VRING_ENABLE completion by: - setting the NEED_REPLY flag on VHOST_USER_SET_VRING_ENABLE, and waiting for the reply, if the VHOST_USER_PROTOCOL_F_REPLY_ACK vhost-user feature has been negotiated, or - performing a separate VHOST_USER_GET_FEATURES *exchange*, which requires a backend response regardless of VHOST_USER_PROTOCOL_F_REPLY_ACK. Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost) Cc: Eugenio Perez Martin <eperezma@redhat.com> Cc: German Maglione <gmaglione@redhat.com> Cc: Liu Jiang <gerry@linux.alibaba.com> Cc: Sergio Lopez Pascual <slp@redhat.com> Cc: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Laszlo Ersek <lersek@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Tested-by: Albert Esteve <aesteve@redhat.com> [lersek@redhat.com: work Eugenio's explanation into the commit message, about QEMU containing step (1.1) until step (1.2)] Reviewed-by: Eugenio Pérez <eperezma@redhat.com> Message-Id: <20231002203221.17241-8-lersek@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-10-22vhost-user: allow "vhost_set_vring" to wait for a replyLaszlo Ersek
The "vhost_set_vring" function already centralizes the common parts of "vhost_user_set_vring_num", "vhost_user_set_vring_base" and "vhost_user_set_vring_enable". We'll want to allow some of those callers to wait for a reply. Therefore, rebase "vhost_set_vring" from just "vhost_user_write" to "vhost_user_write_sync", exposing the "wait_for_reply" parameter. This is purely refactoring -- there is no observable change. That's because: - all three callers pass in "false" for "wait_for_reply", which disables all logic in "vhost_user_write_sync" except the call to "vhost_user_write"; - the fds=NULL and fd_num=0 arguments of the original "vhost_user_write" call inside "vhost_set_vring" are hard-coded within "vhost_user_write_sync". Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost) Cc: Eugenio Perez Martin <eperezma@redhat.com> Cc: German Maglione <gmaglione@redhat.com> Cc: Liu Jiang <gerry@linux.alibaba.com> Cc: Sergio Lopez Pascual <slp@redhat.com> Cc: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Laszlo Ersek <lersek@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Tested-by: Albert Esteve <aesteve@redhat.com> Reviewed-by: Eugenio Pérez <eperezma@redhat.com> Message-Id: <20231002203221.17241-7-lersek@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-10-22vhost-user: hoist "write_sync", "get_features", "get_u64"Laszlo Ersek
In order to avoid a forward-declaration for "vhost_user_write_sync" in a subsequent patch, hoist "vhost_user_write_sync" -> "vhost_user_get_features" -> "vhost_user_get_u64" just above "vhost_set_vring". This is purely code movement -- no observable change. Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost) Cc: Eugenio Perez Martin <eperezma@redhat.com> Cc: German Maglione <gmaglione@redhat.com> Cc: Liu Jiang <gerry@linux.alibaba.com> Cc: Sergio Lopez Pascual <slp@redhat.com> Cc: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Laszlo Ersek <lersek@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Tested-by: Albert Esteve <aesteve@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Eugenio Pérez <eperezma@redhat.com> Message-Id: <20231002203221.17241-6-lersek@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-10-22vhost-user: flatten "enforce_reply" into "vhost_user_write_sync"Laszlo Ersek
At this point, only "vhost_user_write_sync" calls "enforce_reply"; embed the latter into the former. This is purely refactoring -- no observable change. Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost) Cc: Eugenio Perez Martin <eperezma@redhat.com> Cc: German Maglione <gmaglione@redhat.com> Cc: Liu Jiang <gerry@linux.alibaba.com> Cc: Sergio Lopez Pascual <slp@redhat.com> Cc: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Laszlo Ersek <lersek@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Tested-by: Albert Esteve <aesteve@redhat.com> Reviewed-by: Eugenio Pérez <eperezma@redhat.com> Message-Id: <20231002203221.17241-5-lersek@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-10-22vhost-user: factor out "vhost_user_write_sync"Laszlo Ersek
The tails of the "vhost_user_set_vring_addr" and "vhost_user_set_u64" functions are now byte-for-byte identical. Factor the common tail out to a new function called "vhost_user_write_sync". This is purely refactoring -- no observable change. Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost) Cc: Eugenio Perez Martin <eperezma@redhat.com> Cc: German Maglione <gmaglione@redhat.com> Cc: Liu Jiang <gerry@linux.alibaba.com> Cc: Sergio Lopez Pascual <slp@redhat.com> Cc: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Laszlo Ersek <lersek@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Tested-by: Albert Esteve <aesteve@redhat.com> Reviewed-by: Eugenio Pérez <eperezma@redhat.com> Message-Id: <20231002203221.17241-4-lersek@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-10-22vhost-user: tighten "reply_supported" scope in "set_vring_addr"Laszlo Ersek
In the vhost_user_set_vring_addr() function, we calculate "reply_supported" unconditionally, even though we'll only need it if "wait_for_reply" is also true. Restrict the scope of "reply_supported" to the minimum. This is purely refactoring -- no observable change. Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost) Cc: Eugenio Perez Martin <eperezma@redhat.com> Cc: German Maglione <gmaglione@redhat.com> Cc: Liu Jiang <gerry@linux.alibaba.com> Cc: Sergio Lopez Pascual <slp@redhat.com> Cc: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Laszlo Ersek <lersek@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Tested-by: Albert Esteve <aesteve@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Eugenio Pérez <eperezma@redhat.com> Message-Id: <20231002203221.17241-3-lersek@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-10-22vhost-user: strip superfluous whitespaceLaszlo Ersek
Cc: "Michael S. Tsirkin" <mst@redhat.com> (supporter:vhost) Cc: Eugenio Perez Martin <eperezma@redhat.com> Cc: German Maglione <gmaglione@redhat.com> Cc: Liu Jiang <gerry@linux.alibaba.com> Cc: Sergio Lopez Pascual <slp@redhat.com> Cc: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Laszlo Ersek <lersek@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Tested-by: Albert Esteve <aesteve@redhat.com> Reviewed-by: Eugenio Pérez <eperezma@redhat.com> Message-Id: <20231002203221.17241-2-lersek@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-10-20Merge tag 'migration-20231020-pull-request' of ↵Stefan Hajnoczi
https://gitlab.com/juan.quintela/qemu into staging Migration Pull request (20231020) In this pull request: - disable analyze-migration on s390x (thomas) - Fix parse_ramblock() (peter) - start merging live update (steve) - migration-test support for using several binaries (fabiano) - multifd cleanups (fabiano) CI: https://gitlab.com/juan.quintela/qemu/-/pipelines/1042492801 Please apply. # -----BEGIN PGP SIGNATURE----- # # iQIzBAABCAAdFiEEGJn/jt6/WMzuA0uC9IfvGFhy1yMFAmUyJMsACgkQ9IfvGFhy # 1yP0AQ/9ELr6VJ0crqzfGm2dy2emnZMaQhDtzR4Kk4ciZF6U+GiATdGN9hK499mP # 6WzRIjtSzwD8YZvhLfegxIVTGcEttaM93uXFPznWrk7gwny6QTvuA4qtcRYejTSl # wE4GQQOsSrukVCUlqcZtY/t2aphVWQzlx8RRJE3XGaodT1gNLMjd+xp34NbbOoR3 # 32ixpSPUCOGvCd7hb+HG7pEzk+905Pn2URvbdiP71uqhgJZdjMAv8ehSGD3kufdg # FMrZyIEq7Eguk2bO1+7ZiVuIafXXRloIVqi1ENmjIyNDa/Rlv2CA85u0CfgeP6qY # Ttj+MZaz8PIhf97IJEILFn+NDXYgsGqEFl//uNbLuTeCpmr9NPhBzLw8CvCefPrR # rwBs3J+QbDHWX9EYjk6QZ9QfYJy/DXkl0KfdNtQy9Wf+0o1mHDn5/y3s782T24aJ # lGo0ph4VJLBNOx58rpgmoO5prRIjqzF5w4j8pCSeGUC4Bcub5af4TufYrwaf+cps # iIbNFx79dLXBlfkKIn7i9RLpz7641Fs/iTQ/MZh1eyvX++UDXAPWnbd4GDYOEewA # U3WKsTs/ipIbY8nqaO4j1VMzADPUfetBXznBw60xsZcfjynFJsPV6/F/0OpUupdv # qPEY4LZ2uwP4K7AlzrUzUn2f3BKrspL0ObX0qTn0WJ8WX5Jp/YA= # =m+uB # -----END PGP SIGNATURE----- # gpg: Signature made Thu 19 Oct 2023 23:57:15 PDT # gpg: using RSA key 1899FF8EDEBF58CCEE034B82F487EF185872D723 # gpg: Good signature from "Juan Quintela <quintela@redhat.com>" [full] # gpg: aka "Juan Quintela <quintela@trasno.org>" [full] # Primary key fingerprint: 1899 FF8E DEBF 58CC EE03 4B82 F487 EF18 5872 D723 * tag 'migration-20231020-pull-request' of https://gitlab.com/juan.quintela/qemu: tests/qtest: Don't print messages from query instances tests/qtest/migration: Allow user to specify a machine type tests/qtest/migration: Support more than one QEMU binary tests/qtest/migration: Set q35 as the default machine for x86_86 tests/qtest/migration: Specify the geometry of the bootsector tests/qtest/migration: Define a machine for all architectures tests/qtest/migration: Introduce find_common_machine_version tests/qtest: Introduce qtest_resolve_machine_alias tests/qtest: Introduce qtest_has_machine_with_env tests/qtest: Allow qtest_get_machines to use an alternate QEMU binary tests/qtest: Introduce qtest_init_with_env tests/qtest: Allow qtest_qemu_binary to use a custom environment variable migration/multifd: Stop checking p->quit in multifd_send_thread migration: simplify notifiers migration: Fix parse_ramblock() on overwritten retvals migration: simplify blockers tests/qtest/migration-test: Disable the analyze-migration.py test on s390x Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2023-10-20migration: simplify blockersSteve Sistare
Modify migrate_add_blocker and migrate_del_blocker to take an Error ** reason. This allows migration to own the Error object, so that if an error occurs in migrate_add_blocker, migration code can free the Error and clear the client handle, simplifying client code. It also simplifies the migrate_del_blocker call site. In addition, this is a pre-requisite for a proposed future patch that would add a mode argument to migration requests to support live update, and maintain a list of blockers for each mode. A blocker may apply to a single mode or to multiple modes, and passing Error** will allow one Error object to be registered for multiple modes. No functional change. Signed-off-by: Steve Sistare <steven.sistare@oracle.com> Tested-by: Michael Galaxy <mgalaxy@akamai.com> Reviewed-by: Michael Galaxy <mgalaxy@akamai.com> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <1697634216-84215-1-git-send-email-steven.sistare@oracle.com>
2023-10-19hw/virtio/virtio-pmem: Replace impossible check by assertionPhilippe Mathieu-Daudé
The get_memory_region() handler is used when (un)plugging the device, which can only occur *after* it is realized. virtio_pmem_realize() ensure the instance can not be realized without 'memdev'. Remove the superfluous check, replacing it by an assertion. Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Manos Pitsidianakis <manos.pitsidianakis@linaro.org> Message-Id: <20231017140150.44995-2-philmd@linaro.org>
2023-10-18vhost: Expose vhost_svq_available_slots()Hawkins Jiawei
Next patches in this series will delay the polling and checking of buffers until either the SVQ is full or control commands shadow buffers are full, no longer perform an immediate poll and check of the device's used buffers for each CVQ state load command. To achieve this, this patch exposes vhost_svq_available_slots(), allowing QEMU to know whether the SVQ is full. Signed-off-by: Hawkins Jiawei <yin31149@gmail.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Message-Id: <25938079f0bd8185fd664c64e205e629f7a966be.1697165821.git.yin31149@gmail.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2023-10-17Merge tag 'gpu-pull-request' of https://gitlab.com/marcandre.lureau/qemu ↵Stefan Hajnoczi
into staging virtio-gpu rutabaga support # -----BEGIN PGP SIGNATURE----- # # iQJQBAABCAA6FiEEh6m9kz+HxgbSdvYt2ujhCXWWnOUFAmUtP5YcHG1hcmNhbmRy # ZS5sdXJlYXVAcmVkaGF0LmNvbQAKCRDa6OEJdZac5X9CD/4s1n/GZyDr9bh04V03 # otAqtq2CSyuUOviqBrqxYgraCosUD1AuX8WkDy5cCPtnKC4FxRjgVlm9s7K/yxOW # xZ78e4oVgB1F3voOq6LgtKK6BRG/BPqNzq9kuGcayCHQbSxg7zZVwa702Y18r2ZD # pjOhbZCrJTSfASL7C3e/rm7798Wk/hzSrClGR56fbRAVgQ6Lww2L97/g0nHyDsWK # DrCBrdqFtKjpLeUHmcqqS4AwdpG2SyCgqE7RehH/wOhvGTxh/JQvHbLGWK2mDC3j # Qvs8mClC5bUlyNQuUz7lZtXYpzCW6VGMWlz8bIu+ncgSt6RK1TRbdEfDJPGoS4w9 # ZCGgcTxTG/6BEO76J/VpydfTWDo1FwQCQ0Vv7EussGoRTLrFC3ZRFgDWpqCw85yi # AjPtc0C49FHBZhK0l1CoJGV4gGTDtD9jTYN0ffsd+aQesOjcsgivAWBaCOOQWUc8 # KOv9sr4kLLxcnuCnP7p/PuVRQD4eg0TmpdS8bXfnCzLSH8fCm+n76LuJEpGxEBey # 3KPJPj/1BNBgVgew+znSLD/EYM6YhdK2gF5SNrYsdR6UcFdrPED/xmdhzFBeVym/ # xbBWqicDw4HLn5YrJ4tzqXje5XUz5pmJoT5zrRMXTHiu4pjBkEXO/lOdAoFwSy8M # WNOtmSyB69uCrbyLw6xE2/YX8Q== # =5a/Z # -----END PGP SIGNATURE----- # gpg: Signature made Mon 16 Oct 2023 09:50:14 EDT # gpg: using RSA key 87A9BD933F87C606D276F62DDAE8E10975969CE5 # gpg: issuer "marcandre.lureau@redhat.com" # gpg: Good signature from "Marc-André Lureau <marcandre.lureau@redhat.com>" [full] # gpg: aka "Marc-André Lureau <marcandre.lureau@gmail.com>" [full] # Primary key fingerprint: 87A9 BD93 3F87 C606 D276 F62D DAE8 E109 7596 9CE5 * tag 'gpu-pull-request' of https://gitlab.com/marcandre.lureau/qemu: docs/system: add basic virtio-gpu documentation gfxstream + rutabaga: enable rutabaga gfxstream + rutabaga: meson support gfxstream + rutabaga: add initial support for gfxstream gfxstream + rutabaga prep: added need defintions, fields, and options virtio-gpu: blob prep virtio-gpu: hostmem virtio-gpu: CONTEXT_INIT feature virtio: Add shared memory capability Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2023-10-16virtio: Add shared memory capabilityDr. David Alan Gilbert
Define a new capability type 'VIRTIO_PCI_CAP_SHARED_MEMORY_CFG' to allow defining shared memory regions with sizes and offsets of 2^32 and more. Multiple instances of the capability are allowed and distinguished by a device-specific 'id'. Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Antonio Caggiano <antonio.caggiano@collabora.com> Signed-off-by: Gurchetan Singh <gurchetansingh@chromium.org> Tested-by: Alyssa Ross <hi@alyssa.is> Tested-by: Huang Rui <ray.huang@amd.com> Tested-by: Akihiko Odaki <akihiko.odaki@daynix.com> Acked-by: Huang Rui <ray.huang@amd.com> Reviewed-by: Gurchetan Singh <gurchetansingh@chromium.org> Reviewed-by: Akihiko Odaki <akihiko.odaki@daynix.com>
2023-10-12virtio-mem: Mark memslot alias memory regions unmergeableDavid Hildenbrand
Let's mark the memslot alias memory regions as unmergable, such that flatview and vhost won't merge adjacent memory region aliases and we can atomically map/unmap individual aliases without affecting adjacent alias memory regions. This handles vhost and vfio in multiple-memslot mode correctly (which do not support atomic memslot updates) and avoids the temporary removal of large memslots, which can be an expensive operation. For example, vfio might have to unpin + repin a lot of memory, which is undesired. Message-ID: <20230926185738.277351-19-david@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com>
2023-10-12memory,vhost: Allow for marking memory device memory regions unmergeableDavid Hildenbrand
Let's allow for marking memory regions unmergeable, to teach flatview code and vhost to not merge adjacent aliases to the same memory region into a larger memory section; instead, we want separate aliases to stay separate such that we can atomically map/unmap aliases without affecting other aliases. This is desired for virtio-mem mapping device memory located on a RAM memory region via multiple aliases into a memory region container, resulting in separate memslots that can get (un)mapped atomically. As an example with virtio-mem, the layout would look something like this: [...] 0000000240000000-00000020bfffffff (prio 0, i/o): device-memory 0000000240000000-000000043fffffff (prio 0, i/o): virtio-mem 0000000240000000-000000027fffffff (prio 0, ram): alias memslot-0 @mem2 0000000000000000-000000003fffffff 0000000280000000-00000002bfffffff (prio 0, ram): alias memslot-1 @mem2 0000000040000000-000000007fffffff 00000002c0000000-00000002ffffffff (prio 0, ram): alias memslot-2 @mem2 0000000080000000-00000000bfffffff [...] Without unmergable memory regions, all three memslots would get merged into a single memory section. For example, when mapping another alias (e.g., virtio-mem-memslot-3) or when unmapping any of the mapped aliases, memory listeners will first get notified about the removal of the big memory section to then get notified about re-adding of the new (differently merged) memory section(s). In an ideal world, memory listeners would be able to deal with that atomically, like KVM nowadays does. However, (a) supporting this for other memory listeners (vhost-user, vfio) is fairly hard: temporary removal can result in all kinds of issues on concurrent access to guest memory; and (b) this handling is undesired, because temporarily removing+readding can consume quite some time on bigger memslots and is not efficient (e.g., vfio unpinning and repinning pages ...). Let's allow for marking a memory region unmergeable, such that we can atomically (un)map aliases to the same memory region, similar to (un)mapping individual DIMMs. Similarly, teach vhost code to not redo what flatview core stopped doing: don't merge such sections. Merging in vhost code is really only relevant for handling random holes in boot memory where; without this merging, the vhost-user backend wouldn't be able to mmap() some boot memory backed on hugetlb. We'll use this for virtio-mem next. Message-ID: <20230926185738.277351-18-david@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com>
2023-10-12virtio-mem: Expose device memory dynamically via multiple memslots if enabledDavid Hildenbrand
Having large virtio-mem devices that only expose little memory to a VM is currently a problem: we map the whole sparse memory region into the guest using a single memslot, resulting in one gigantic memslot in KVM. KVM allocates metadata for the whole memslot, which can result in quite some memory waste. Assuming we have a 1 TiB virtio-mem device and only expose little (e.g., 1 GiB) memory, we would create a single 1 TiB memslot and KVM has to allocate metadata for that 1 TiB memslot: on x86, this implies allocating a significant amount of memory for metadata: (1) RMAP: 8 bytes per 4 KiB, 8 bytes per 2 MiB, 8 bytes per 1 GiB -> For 1 TiB: 2147483648 + 4194304 + 8192 = ~ 2 GiB (0.2 %) With the TDP MMU (cat /sys/module/kvm/parameters/tdp_mmu) this gets allocated lazily when required for nested VMs (2) gfn_track: 2 bytes per 4 KiB -> For 1 TiB: 536870912 = ~512 MiB (0.05 %) (3) lpage_info: 4 bytes per 2 MiB, 4 bytes per 1 GiB -> For 1 TiB: 2097152 + 4096 = ~2 MiB (0.0002 %) (4) 2x dirty bitmaps for tracking: 2x 1 bit per 4 KiB page -> For 1 TiB: 536870912 = 64 MiB (0.006 %) So we primarily care about (1) and (2). The bad thing is, that the memory consumption *doubles* once SMM is enabled, because we create the memslot once for !SMM and once for SMM. Having a 1 TiB memslot without the TDP MMU consumes around: * With SMM: 5 GiB * Without SMM: 2.5 GiB Having a 1 TiB memslot with the TDP MMU consumes around: * With SMM: 1 GiB * Without SMM: 512 MiB ... and that's really something we want to optimize, to be able to just start a VM with small boot memory (e.g., 4 GiB) and a virtio-mem device that can grow very large (e.g., 1 TiB). Consequently, using multiple memslots and only mapping the memslots we really need can significantly reduce memory waste and speed up memslot-related operations. Let's expose the sparse RAM memory region using multiple memslots, mapping only the memslots we currently need into our device memory region container. The feature can be enabled using "dynamic-memslots=on" and requires "unplugged-inaccessible=on", which is nowadays the default. Once enabled, we'll auto-detect the number of memslots to use based on the memslot limit provided by the core. We'll use at most 1 memslot per gigabyte. Note that our global limit of memslots accross all memory devices is currently set to 256: even with multiple large virtio-mem devices, we'd still have a sane limit on the number of memslots used. The default is to not dynamically map memslot for now ("dynamic-memslots=off"). The optimization must be enabled manually, because some vhost setups (e.g., hotplug of vhost-user devices) might be problematic until we support more memslots especially in vhost-user backends. Note that "dynamic-memslots=on" is just a hint that multiple memslots *may* be used for internal optimizations, not that multiple memslots *must* be used. The actual number of memslots that are used is an internal detail: for example, once memslot metadata is no longer an issue, we could simply stop optimizing for that. Migration source and destination can differ on the setting of "dynamic-memslots". Message-ID: <20230926185738.277351-17-david@redhat.com> Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com>
2023-10-12virtio-mem: Update state to match bitmap as soon as it's been migratedDavid Hildenbrand
It's cleaner and future-proof to just have other state that depends on the bitmap state to be updated as soon as possible when restoring the bitmap. So factor out informing RamDiscardListener into a functon and call it in case of early migration right after we restored the bitmap. Message-ID: <20230926185738.277351-16-david@redhat.com> Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com>
2023-10-12virtio-mem: Pass non-const VirtIOMEM via virtio_mem_range_cbDavid Hildenbrand
Let's prepare for a user that has to modify the VirtIOMEM device state. Message-ID: <20230926185738.277351-15-david@redhat.com> Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com>
2023-10-12memory-device,vhost: Support automatic decision on the number of memslotsDavid Hildenbrand
We want to support memory devices that can automatically decide how many memslots they will use. In the worst case, they have to use a single memslot. The target use cases are virtio-mem and the hyper-v balloon. Let's calculate a reasonable limit such a memory device may use, and instruct the device to make a decision based on that limit. Use a simple heuristic that considers: * A memslot soft-limit for all memory devices of 256; also, to not consume too many memslots -- which could harm performance. * Actually still free and unreserved memslots * The percentage of the remaining device memory region that memory device will occupy. Further, while we properly check before plugging a memory device whether there still is are free memslots, we have other memslot consumers (such as boot memory, PCI BARs) that don't perform any checks and might dynamically consume memslots without any prior reservation. So we might succeed in plugging a memory device, but once we dynamically map a PCI BAR we would be in trouble. Doing accounting / reservation / checks for all such users is problematic (e.g., sometimes we might temporarily split boot memory into two memslots, triggered by the BIOS). We use the historic magic memslot number of 509 as orientation to when supporting 256 memory devices -> memslots (leaving 253 for boot memory and other devices) has been proven to work reliable. We'll fallback to suggesting a single memslot if we don't have at least 509 total memslots. Plugging vhost devices with less than 509 memslots available while we have memory devices plugged that consume multiple memslots due to automatic decisions can be problematic. Most configurations might just fail due to "limit < used + reserved", however, it can also happen that these memory devices would suddenly consume memslots that would actually be required by other memslot consumers (boot, PCI BARs) later. Note that this has always been sketchy with vhost devices that support only a small number of memslots; but we don't want to make it any worse.So let's keep it simple and simply reject plugging such vhost devices in such a configuration. Eventually, all vhost devices that want to be fully compatible with such memory devices should support a decent number of memslots (>= 509). Message-ID: <20230926185738.277351-13-david@redhat.com> Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com>
2023-10-12vhost: Add vhost_get_max_memslots()David Hildenbrand
Let's add vhost_get_max_memslots(). Message-ID: <20230926185738.277351-12-david@redhat.com> Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com>
2023-10-12memory-device,vhost: Support memory devices that dynamically consume memslotsDavid Hildenbrand
We want to support memory devices that have a dynamically managed memory region container as device memory region. This device memory region maps multiple RAM memory subregions (e.g., aliases to the same RAM memory region), whereby these subregions can be (un)mapped on demand. Each RAM subregion will consume a memslot in KVM and vhost, resulting in such a new device consuming memslots dynamically, and initially usually 0. We already track the number of used vs. required memslots for all memslots. From that, we can derive the number of reserved memslots that must not be used otherwise. The target use case is virtio-mem and the hyper-v balloon, which will dynamically map aliases to RAM memory region into their device memory region container. Properly document what's supported and what's not and extend the vhost memslot check accordingly. Message-ID: <20230926185738.277351-10-david@redhat.com> Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com>
2023-10-12vhost: Return number of free memslotsDavid Hildenbrand
Let's return the number of free slots instead of only checking if there is a free slot. Required to support memory devices that consume multiple memslots. This is a preparation for memory devices that consume multiple memslots. Message-ID: <20230926185738.277351-6-david@redhat.com> Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com>
2023-10-12vhost: Remove vhost_backend_can_merge() callbackDavid Hildenbrand
Checking whether the memory regions are equal is sufficient: if they are equal, then most certainly the contained fd is equal. The whole vhost-user memslot handling is suboptimal and overly complicated. We shouldn't have to lookup a RAM memory regions we got notified about in vhost_user_get_mr_data() using a host pointer. But that requires a bigger rework -- especially an alternative vhost_set_mem_table() backend call that simply consumes MemoryRegionSections. For now, let's just drop vhost_backend_can_merge(). Message-ID: <20230926185738.277351-3-david@redhat.com> Acked-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Igor Mammedov <imammedo@redhat.com> Acked-by: Igor Mammedov <imammedo@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com>
2023-10-12vhost: Rework memslot filtering and fix "used_memslot" trackingDavid Hildenbrand
Having multiple vhost devices, some filtering out fd-less memslots and some not, can mess up the "used_memslot" accounting. Consequently our "free memslot" checks become unreliable and we might run out of free memslots at runtime later. An example sequence which can trigger a potential issue that involves different vhost backends (vhost-kernel and vhost-user) and hotplugged memory devices can be found at [1]. Let's make the filtering mechanism less generic and distinguish between backends that support private memslots (without a fd) and ones that only support shared memslots (with a fd). Track the used_memslots for both cases separately and use the corresponding value when required. Note: Most probably we should filter out MAP_PRIVATE fd-based RAM regions (for example, via memory-backend-memfd,...,shared=off or as default with memory-backend-file) as well. When not using MAP_SHARED, it might not work as expected. Add a TODO for now. [1] https://lkml.kernel.org/r/fad9136f-08d3-3fd9-71a1-502069c000cf@redhat.com Message-ID: <20230926185738.277351-2-david@redhat.com> Fixes: 988a27754bbb ("vhost: allow backends to filter memory sections") Cc: Tiwei Bie <tiwei.bie@intel.com> Acked-by: Igor Mammedov <imammedo@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com>
2023-10-06hw/virtio/vhost: Silence compiler warnings in vhost code when using -WshadowThomas Huth
Rename a variable in vhost_dev_sync_region() and remove a superfluous declaration in vhost_commit() to make this code compilable with "-Wshadow". Signed-off-by: Thomas Huth <thuth@redhat.com> Message-ID: <20231004114809.105672-1-thuth@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-By: Michael Tokarev <mjt@tls.msk.ru> Signed-off-by: Markus Armbruster <armbru@redhat.com>
2023-10-06hw/virtio/virtio-pci: Avoid compiler warning with -WshadowThomas Huth
"len" is used as parameter of the functions virtio_write_config() and virtio_read_config(), and additionally as a local variable, so this causes a compiler warning when compiling with "-Wshadow" and can be confusing for the reader. Rename the local variables to "caplen" to avoid this problem. Signed-off-by: Thomas Huth <thuth@redhat.com> Message-ID: <20231004095302.99037-1-thuth@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Signed-off-by: Markus Armbruster <armbru@redhat.com>