2024-03-12hw/i386/pc: Avoid one use of the current_machine globalBernhard Beschow
The RTC can be accessed through the X86 machine instance, so rather than passing the RTC it's possible to pass the machine state instead. This avoids pc_boot_set() from having to access the current_machine global. Signed-off-by: Bernhard Beschow <shentey@gmail.com> Message-Id: <20240303185332.1408-3-shentey@gmail.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
2024-03-12hw/i386/pc: Remove "rtc_state" link againBernhard Beschow
Commit 99e1c1137b6f "hw/i386/pc: Populate RTC attribute directly" made linking the "rtc_state" property unnecessary and removed it. Commit 84e945aad2d0 "vl, pc: turn -no-fd-bootchk into a machine property" accidentally reintroduced the link. Remove it again since it is not needed. Fixes: 84e945aad2d0 "vl, pc: turn -no-fd-bootchk into a machine property" Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Bernhard Beschow <shentey@gmail.com> Message-Id: <20240303185332.1408-2-shentey@gmail.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
2024-03-12Revert "hw/i386/pc: Confine system flash handling to pc_sysfw"Bernhard Beschow
Specifying the property `-M pflash0` results in a regression: qemu-system-x86_64: Property 'pc-q35-9.0-machine.pflash0' not found Revert the change for now until a solution is found. This reverts commit 6f6ad2b24582593d8feb00434ce2396840666227. Reported-by: Volker Rümelin <vr_qemu@t-online.de> Signed-off-by: Bernhard Beschow <shentey@gmail.com> Message-Id: <20240226215909.30884-3-shentey@gmail.com> Tested-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-12Revert "hw/i386/pc_sysfw: Inline pc_system_flash_create() and remove it"Bernhard Beschow
Commit 6f6ad2b24582 "hw/i386/pc: Confine system flash handling to pc_sysfw" causes a regression when specifying the property `-M pflash0` in the PCI PC machines: qemu-system-x86_64: Property 'pc-q35-9.0-machine.pflash0' not found In order to revert the commit, the commit below must be reverted first. This reverts commit cb05cc16029bb0a61ac5279ab7b3b90dcf2aa69f. Signed-off-by: Bernhard Beschow <shentey@gmail.com> Message-Id: <20240226215909.30884-2-shentey@gmail.com> Tested-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-12pc: q35: Bump max_cpus to 4096 vcpusAni Sinha
Since commit f10a570b093e6 ("KVM: x86: Add CONFIG_KVM_MAX_NR_VCPUS to allow up to 4096 vCPUs") the Linux kernel can support up to a maximum of 4096 vcpus when MAXSMP is enabled in the kernel. At present, QEMU has been tested to correctly boot a Linux guest with 4096 vcpus using the current edk2 upstream master branch that has the fixes corresponding to the following two PRs: https://github.com/tianocore/edk2/pull/5410 https://github.com/tianocore/edk2/pull/5418 The changes merged into edk2 with the above PRs will be in the upcoming 2024-05 release. With current SeaBIOS firmware, it boots fine with 4096 vcpus already. So bump up the value of max_cpus to 4096 for q35 machine versions 9 and newer. Q35 machine versions 8.2 and older continue to support a maximum of 1024 vcpus as before for compatibility reasons. If KVM is not able to support the specified number of vcpus, QEMU returns the following error messages: $ ./qemu-system-x86_64 -cpu host -accel kvm -machine q35 -smp 1728 qemu-system-x86_64: -accel kvm: warning: Number of SMP cpus requested (1728) exceeds the recommended cpus supported by KVM (12) qemu-system-x86_64: -accel kvm: warning: Number of hotpluggable cpus requested (1728) exceeds the recommended cpus supported by KVM (12) Number of SMP cpus requested (1728) exceeds the maximum cpus supported by KVM (1024) Cc: Daniel P. Berrangé <berrange@redhat.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Julia Suvorova <jusual@redhat.com> Cc: kraxel@redhat.com Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> Reviewed-by: Igor Mammedov <imammedo@redhat.com> Reviewed-by: Gerd Hoffmann <kraxel@redhat.com> Signed-off-by: Ani Sinha <anisinha@redhat.com> Message-Id: <20240228143351.3967-1-anisinha@redhat.com> Reviewed-by: Zhao Liu <zhao1.liu@intel.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-12hw/pci: Always call pcie_sriov_pf_reset()Akihiko Odaki
Call pcie_sriov_pf_reset() from pci_do_device_reset() just as we do for msi_reset() and msix_reset() to prevent duplicating code for each SR-IOV PF. Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> Message-Id: <20240228-reuse-v8-5-282660281e60@daynix.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Sriram Yagnaraman <sriram.yagnaraman@ericsson.com>
2024-03-12pcie_sriov: Do not reset NumVFs after disabling VFsAkihiko Odaki
The spec does not say that NumVFs is reset after disabling VFs, except when resetting the PF. Clearing it is guest visible and out of spec, even though Linux doesn't rely on this value being preserved, so we never noticed. Fixes: 7c0fa8dff811 ("pcie: Add support for Single Root I/O Virtualization (SR/IOV)") Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> Message-Id: <20240228-reuse-v8-4-282660281e60@daynix.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-12pcie_sriov: Reset SR-IOV extended capabilityAkihiko Odaki
pcie_sriov_pf_disable_vfs() is called when resetting the PF, but it only disables VFs and does not reset SR-IOV extended capability, leaking the state and making the VF Enable register inconsistent with the actual state. Replace pcie_sriov_pf_disable_vfs() with pcie_sriov_pf_reset(), which does not only disable VFs but also resets the capability. Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> Message-Id: <20240228-reuse-v8-3-282660281e60@daynix.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Sriram Yagnaraman <sriram.yagnaraman@ericsson.com>
2024-03-12pcie_sriov: Validate NumVFsAkihiko Odaki
The guest may write NumVFs greater than TotalVFs and that can lead to buffer overflow in VF implementations. Cc: qemu-stable@nongnu.org Fixes: CVE-2024-26327 Fixes: 7c0fa8dff811 ("pcie: Add support for Single Root I/O Virtualization (SR/IOV)") Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> Message-Id: <20240228-reuse-v8-2-282660281e60@daynix.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Sriram Yagnaraman <sriram.yagnaraman@ericsson.com>
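The bounds check described above can be sketched as follows. This is a minimal, self-contained model and not the QEMU code: `ToySriov`, its fields and `toy_sriov_enable_vfs()` are hypothetical names, and the assumption that an out-of-range NumVFs simply results in no VFs being enabled is an illustration of the idea rather than a statement about the actual fix.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the SR-IOV state relevant to the check. */
typedef struct {
    uint16_t total_vfs;   /* TotalVFs: upper bound advertised by the PF */
    uint16_t num_vfs;     /* NumVFs: value written by the guest */
    uint16_t enabled_vfs; /* number of VFs actually instantiated */
} ToySriov;

/* Enable VFs, refusing a guest-written NumVFs that exceeds TotalVFs. */
static bool toy_sriov_enable_vfs(ToySriov *s)
{
    if (s->num_vfs > s->total_vfs) {
        return false; /* out of range: enable nothing (assumed behaviour) */
    }
    s->enabled_vfs = s->num_vfs;
    return true;
}
```

With such a check in place, per-VF arrays sized by TotalVFs can no longer be indexed past their end by a guest-chosen NumVFs.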
2024-03-12hw/nvme: Use pcie_sriov_num_vfs()Akihiko Odaki
nvme_sriov_pre_write_ctrl() used to directly inspect SR-IOV configurations to know the number of VFs being disabled due to SR-IOV configuration writes, but the logic was flawed and resulted in out-of-bound memory access. It assumed PCI_SRIOV_NUM_VF always has the number of currently enabled VFs, but it actually doesn't in the following cases: - PCI_SRIOV_NUM_VF has been set but PCI_SRIOV_CTRL_VFE has never been. - PCI_SRIOV_NUM_VF was written after PCI_SRIOV_CTRL_VFE was set. - VFs were only partially enabled because of realization failure. It is a responsibility of pcie_sriov to interpret SR-IOV configurations and pcie_sriov does it correctly, so use pcie_sriov_num_vfs(), which it provides, to get the number of enabled VFs before and after SR-IOV configuration writes. Cc: qemu-stable@nongnu.org Fixes: CVE-2024-26328 Fixes: 11871f53ef8e ("hw/nvme: Add support for the Virtualization Management command") Suggested-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> Message-Id: <20240228-reuse-v8-1-282660281e60@daynix.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-12Implement SMBIOS type 9 v2.6Felix Wu
Signed-off-by: Felix Wu <flwu@google.com> Signed-off-by: Nabih Estefan <nabihestefan@google.com> Message-Id: <20240221170027.1027325-3-nabihestefan@google.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-12Implement base of SMBIOS type 9 descriptor.Felix Wu
Version 2.1+. Signed-off-by: Felix Wu <flwu@google.com> Signed-off-by: Nabih Estefan <nabihestefan@google.com> Message-Id: <20240221170027.1027325-2-nabihestefan@google.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-12hw/intc: Check @errp to handle the error of IOAPICCommonClass.realize()Zhao Liu
IOAPICCommonClass implements its own private realize(), and this private realize() allows error. Since IOAPICCommonClass.realize() returns void, to check the error, dereference @errp with ERRP_GUARD(). Signed-off-by: Zhao Liu <zhao1.liu@intel.com> Message-Id: <20240223085653.1255438-8-zhao1.liu@linux.intel.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
2024-03-12hw/vfio/iommufd: Fix missing ERRP_GUARD() in iommufd_cdev_getfd()Zhao Liu
As the comment in qapi/error says, dereferencing @errp requires ERRP_GUARD(): * = Why, when and how to use ERRP_GUARD() = * * Without ERRP_GUARD(), use of the @errp parameter is restricted: * - It must not be dereferenced, because it may be null. ... * ERRP_GUARD() lifts these restrictions. * * To use ERRP_GUARD(), add it right at the beginning of the function. * @errp can then be used without worrying about the argument being * NULL or &error_fatal. * * Using it when it's not needed is safe, but please avoid cluttering * the source with useless code. But in iommufd_cdev_getfd(), @errp is dereferenced without ERRP_GUARD(): if (*errp) { error_prepend(errp, VFIO_MSG_PREFIX, path); } Currently, since vfio_attach_device() - the caller of iommufd_cdev_getfd() - is always called in DeviceClass.realize() context and never receives a NULL @errp parameter, iommufd_cdev_getfd() hasn't triggered the bug of dereferencing a NULL @errp. To follow the requirement of @errp, add the missing ERRP_GUARD() in iommufd_cdev_getfd(). Suggested-by: Markus Armbruster <armbru@redhat.com> Signed-off-by: Zhao Liu <zhao1.liu@intel.com> Reviewed-by: Markus Armbruster <armbru@redhat.com> Message-Id: <20240223085653.1255438-7-zhao1.liu@linux.intel.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
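Why `*errp` may only be read after guarding can be illustrated with a toy model. Everything below is hypothetical: `Error` is a stand-in struct rather than QEMU's type, and `toy_step()` and `toy_caller()` are invented names. The manual redirection to a local variable only mimics the part of ERRP_GUARD() that makes dereferencing safe; to my understanding the real macro additionally handles &error_fatal and propagates the local error back to the caller automatically.

```c
#include <stddef.h>

/* Toy stand-in for an error object; not QEMU's real definition. */
typedef struct Error {
    const char *msg;
} Error;

static Error toy_error = { "operation failed" };

/* Hypothetical callee that returns void and reports failure only via errp. */
static void toy_step(int fail, Error **errp)
{
    if (fail && errp) {
        *errp = &toy_error;
    }
}

/*
 * Caller that has to inspect the error because toy_step() returns void.
 * Redirecting a NULL errp to a local variable makes "*errp" safe to read.
 */
static int toy_caller(int fail, Error **errp)
{
    Error *local_err = NULL;

    if (!errp) {
        errp = &local_err; /* never dereference a NULL errp */
    }
    toy_step(fail, errp);
    if (*errp) {
        return -1;
    }
    return 0;
}
```

Without the redirection, `toy_caller(1, NULL)` would dereference a NULL pointer, which is the latent bug class this series addresses.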
2024-03-12hw/pci-bridge/cxl_upstream: Fix missing ERRP_GUARD() in cxl_usp_realize()Zhao Liu
As the comment in qapi/error says, dereferencing @errp requires ERRP_GUARD(): * = Why, when and how to use ERRP_GUARD() = * * Without ERRP_GUARD(), use of the @errp parameter is restricted: * - It must not be dereferenced, because it may be null. ... * ERRP_GUARD() lifts these restrictions. * * To use ERRP_GUARD(), add it right at the beginning of the function. * @errp can then be used without worrying about the argument being * NULL or &error_fatal. * * Using it when it's not needed is safe, but please avoid cluttering * the source with useless code. But in cxl_usp_realize(), @errp is dereferenced without ERRP_GUARD(): cxl_doe_cdat_init(cxl_cstate, errp); if (*errp) { goto err_cap; } Here we check *errp, because cxl_doe_cdat_init() returns void. And since cxl_usp_realize() - as a PCIDeviceClass.realize() method - never receives a NULL @errp parameter, it hasn't triggered the bug of dereferencing a NULL @errp. To follow the requirement of @errp, add the missing ERRP_GUARD() in cxl_usp_realize(). Suggested-by: Markus Armbruster <armbru@redhat.com> Signed-off-by: Zhao Liu <zhao1.liu@intel.com> Reviewed-by: Markus Armbruster <armbru@redhat.com> Message-Id: <20240223085653.1255438-6-zhao1.liu@linux.intel.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Thomas Huth <thuth@redhat.com>
2024-03-12hw/misc/xlnx-versal-trng: Check returned bool in trng_prop_fault_event_set()Zhao Liu
As the comment in qapi/error says, dereferencing @errp requires ERRP_GUARD(): * = Why, when and how to use ERRP_GUARD() = * * Without ERRP_GUARD(), use of the @errp parameter is restricted: * - It must not be dereferenced, because it may be null. ... * ERRP_GUARD() lifts these restrictions. * * To use ERRP_GUARD(), add it right at the beginning of the function. * @errp can then be used without worrying about the argument being * NULL or &error_fatal. * * Using it when it's not needed is safe, but please avoid cluttering * the source with useless code. But in trng_prop_fault_event_set(), @errp is dereferenced without ERRP_GUARD(): visit_type_uint32(v, name, events, errp); if (*errp) { return; } Currently, since trng_prop_fault_event_set() never receives a NULL @errp parameter as a "set" method of an object property, it hasn't triggered the bug of dereferencing a NULL @errp. And since visit_type_uint32() returns bool, check the returned bool directly instead of dereferencing @errp; then we needn't add the missing ERRP_GUARD(). Suggested-by: Markus Armbruster <armbru@redhat.com> Signed-off-by: Zhao Liu <zhao1.liu@intel.com> Message-Id: <20240223085653.1255438-5-zhao1.liu@linux.intel.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
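The alternative chosen in this commit, checking a returned bool instead of reading `*errp`, can be sketched like this. The names are hypothetical stand-ins: `toy_parse_u32()` plays the role of a bool-returning visitor call and `toy_set_events()` the role of the property setter; neither is a QEMU function.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for an error object; not QEMU's real definition. */
typedef struct Error {
    const char *msg;
} Error;

static Error toy_range_error = { "value out of range" };

/* Hypothetical helper: returns true on success, false after setting *errp. */
static bool toy_parse_u32(int64_t in, uint32_t *out, Error **errp)
{
    if (in < 0 || in > UINT32_MAX) {
        if (errp) {
            *errp = &toy_range_error;
        }
        return false;
    }
    *out = (uint32_t)in;
    return true;
}

/* Setter-like caller: relies on the return value and never reads *errp. */
static void toy_set_events(uint32_t *events, int64_t in, Error **errp)
{
    uint32_t value;

    if (!toy_parse_u32(in, &value, errp)) {
        return; /* error already reported through errp, if provided */
    }
    *events = value;
}
```

Because the caller never dereferences `errp`, it is safe even when `errp` is NULL and needs no guard.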
2024-03-12hw/mem/cxl_type3: Fix missing ERRP_GUARD() in ct3_realize()Zhao Liu
As the comment in qapi/error says, dereferencing @errp requires ERRP_GUARD(): * = Why, when and how to use ERRP_GUARD() = * * Without ERRP_GUARD(), use of the @errp parameter is restricted: * - It must not be dereferenced, because it may be null. ... * ERRP_GUARD() lifts these restrictions. * * To use ERRP_GUARD(), add it right at the beginning of the function. * @errp can then be used without worrying about the argument being * NULL or &error_fatal. * * Using it when it's not needed is safe, but please avoid cluttering * the source with useless code. But in ct3_realize(), @errp is dereferenced without ERRP_GUARD(): cxl_doe_cdat_init(cxl_cstate, errp); if (*errp) { goto err_free_special_ops; } Here we check *errp, because cxl_doe_cdat_init() returns void. And since ct3_realize() - as a PCIDeviceClass.realize() method - never receives a NULL @errp parameter, it hasn't triggered the bug of dereferencing a NULL @errp. To follow the requirement of @errp, add the missing ERRP_GUARD() in ct3_realize(). Suggested-by: Markus Armbruster <armbru@redhat.com> Signed-off-by: Zhao Liu <zhao1.liu@intel.com> Reviewed-by: Markus Armbruster <armbru@redhat.com> Message-Id: <20240223085653.1255438-4-zhao1.liu@linux.intel.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
2024-03-12hw/display/macfb: Fix missing ERRP_GUARD() in macfb_nubus_realize()Zhao Liu
As the comment in qapi/error says, dereferencing @errp requires ERRP_GUARD(): * = Why, when and how to use ERRP_GUARD() = * * Without ERRP_GUARD(), use of the @errp parameter is restricted: * - It must not be dereferenced, because it may be null. ... * ERRP_GUARD() lifts these restrictions. * * To use ERRP_GUARD(), add it right at the beginning of the function. * @errp can then be used without worrying about the argument being * NULL or &error_fatal. * * Using it when it's not needed is safe, but please avoid cluttering * the source with useless code. But in macfb_nubus_realize(), @errp is dereferenced without ERRP_GUARD(): ndc->parent_realize(dev, errp); if (*errp) { return; } Here we check *errp, because ndc->parent_realize(), as a DeviceClass.realize() callback, returns void. And since macfb_nubus_realize(), also a DeviceClass.realize() method, never receives a NULL @errp parameter, it hasn't triggered the bug of dereferencing a NULL @errp. To follow the requirement of @errp, add the missing ERRP_GUARD() in macfb_nubus_realize(). Suggested-by: Markus Armbruster <armbru@redhat.com> Signed-off-by: Zhao Liu <zhao1.liu@intel.com> Reviewed-by: Markus Armbruster <armbru@redhat.com> Message-Id: <20240223085653.1255438-3-zhao1.liu@linux.intel.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-12hw/cxl/cxl-host: Fix missing ERRP_GUARD() in cxl_fixed_memory_window_config()Zhao Liu
As the comment in qapi/error says, dereferencing @errp requires ERRP_GUARD(): * = Why, when and how to use ERRP_GUARD() = * * Without ERRP_GUARD(), use of the @errp parameter is restricted: * - It must not be dereferenced, because it may be null. ... * ERRP_GUARD() lifts these restrictions. * * To use ERRP_GUARD(), add it right at the beginning of the function. * @errp can then be used without worrying about the argument being * NULL or &error_fatal. * * Using it when it's not needed is safe, but please avoid cluttering * the source with useless code. But in cxl_fixed_memory_window_config(), @errp is dereferenced in 2 places without ERRP_GUARD(): fw->enc_int_ways = cxl_interleave_ways_enc(fw->num_targets, errp); if (*errp) { return; } and fw->enc_int_gran = cxl_interleave_granularity_enc(object->interleave_granularity, errp); if (*errp) { return; } In both places we check "*errp", because neither function returns a suitable error code. And since machine_set_cfmw() - the caller of cxl_fixed_memory_window_config() - never receives a NULL @errp parameter as the "set" method of an object property, cxl_fixed_memory_window_config() hasn't triggered the bug of dereferencing a NULL @errp. To follow the requirement of @errp, add the missing ERRP_GUARD() in cxl_fixed_memory_window_config(). Suggested-by: Markus Armbruster <armbru@redhat.com> Signed-off-by: Zhao Liu <zhao1.liu@intel.com> Reviewed-by: Markus Armbruster <armbru@redhat.com> Message-Id: <20240223085653.1255438-2-zhao1.liu@linux.intel.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
2024-03-12hw/virtio: Add support for VDPA network simulation devicesHao Chen
This patch adds support for VDPA network simulation devices. The device is developed based on virtio-net and tap backend, and supports hardware live migration function. For more details, please refer to "docs/system/devices/vdpa-net.rst" Signed-off-by: Hao Chen <chenh@yusur.tech> Message-Id: <20240221073802.2888022-1-chenh@yusur.tech> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-12hw/virtio: check owner for removing objectsAlbert Esteve
Shared objects lack spoofing protection. For VHOST_USER_BACKEND_SHARED_OBJECT_REMOVE messages received by the vhost-user interface, any backend was allowed to remove entries from the shared table just by knowing the UUID. Only the owner of the entry shall be allowed to remove their resources from the table. To fix that, add a check for all *SHARED_OBJECT_REMOVE messages received. A vhost device can only remove TYPE_VHOST_DEV entries that it owns; otherwise skip the removal and inform the device in the answer that the entry has not been removed. Signed-off-by: Albert Esteve <aesteve@redhat.com> Acked-by: Stefan Hajnoczi <stefanha@redhat.com> Message-Id: <20240219143423.272012-2-aesteve@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
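The ownership rule can be sketched with a small stand-alone table. This is an illustration under stated assumptions, not the QEMU implementation: `ToyTable`, `toy_add()`, `toy_remove()` and the integer `owner_id` are hypothetical, and the real code identifies owners and entry types differently.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_ENTRIES 8
#define TOY_UUID_LEN 36 /* textual UUID length without the terminating NUL */

typedef struct {
    bool in_use;
    char uuid[TOY_UUID_LEN + 1];
    int owner_id; /* hypothetical id of the backend that added the entry */
} ToyEntry;

typedef struct {
    ToyEntry entries[TOY_MAX_ENTRIES];
} ToyTable;

/* Register an entry for the given owner; fails when the table is full. */
static bool toy_add(ToyTable *t, const char *uuid, int owner_id)
{
    for (size_t i = 0; i < TOY_MAX_ENTRIES; i++) {
        ToyEntry *e = &t->entries[i];
        if (!e->in_use) {
            strncpy(e->uuid, uuid, TOY_UUID_LEN);
            e->uuid[TOY_UUID_LEN] = '\0';
            e->owner_id = owner_id;
            e->in_use = true;
            return true;
        }
    }
    return false;
}

/* Remove an entry only if the requester owns it; report whether it was removed. */
static bool toy_remove(ToyTable *t, const char *uuid, int requester_id)
{
    for (size_t i = 0; i < TOY_MAX_ENTRIES; i++) {
        ToyEntry *e = &t->entries[i];
        if (e->in_use && strcmp(e->uuid, uuid) == 0) {
            if (e->owner_id != requester_id) {
                return false; /* known UUID, wrong owner: skip the removal */
            }
            e->in_use = false;
            return true;
        }
    }
    return false; /* unknown UUID */
}
```

The boolean result corresponds to the answer sent back to the device: knowing the UUID alone is no longer enough to remove an entry.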
2024-03-12hw/audio/virtio-sound: return correct command response sizeVolker Rümelin
The payload size returned by command VIRTIO_SND_R_PCM_INFO is wrong. The code in process_cmd() assumes that all commands return only a virtio_snd_hdr payload, but some commands like VIRTIO_SND_R_PCM_INFO may return an additional payload. Add a zero initialized payload_size variable to struct virtio_snd_ctrl_command to allow for additional payloads. Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com> Signed-off-by: Volker Rümelin <vr_qemu@t-online.de> Message-Id: <20240218083351.8524-1-vr_qemu@t-online.de> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
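The size computation can be sketched as follows. The struct and function names are hypothetical stand-ins (only the idea of a zero-initialized `payload_size` field comes from the message above), so treat this as a model of the arithmetic rather than the actual virtio-sound code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the fixed response header. */
typedef struct {
    uint32_t code;
} ToyHdr;

/* Hypothetical stand-in for a control command being processed. */
typedef struct {
    ToyHdr resp;
    size_t payload_size; /* extra bytes after the header; 0 for most commands */
} ToyCtrlCommand;

/* Total number of bytes to report back for this command. */
static size_t toy_response_size(const ToyCtrlCommand *cmd)
{
    return sizeof(cmd->resp) + cmd->payload_size;
}
```

A command that returns only the header leaves `payload_size` at zero, while an info-style command sets it to the size of the additional data it wrote.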
2024-03-12hw/pci-bridge/pxb-cxl: Drop RAS capability from host bridge.Jonathan Cameron
This CXL component isn't allowed to have a RAS capability. Whilst this should be harmless as software is not expected to look here, good to clean it up. Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Message-Id: <20240215155206.2736-1-Jonathan.Cameron@huawei.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-12vdpa: trace skipped memory sectionsEugenio Pérez
Sometimes, certain parts are not being skipped in vhost_vdpa_listener_region_del, but they are skipped in vhost_vdpa_listener_region_add, or vice versa. The vhost-vdpa code expects all parts to maintain their properties, so we're adding a trace to help with debugging when any part is skipped. Signed-off-by: Eugenio Pérez <eperezma@redhat.com> Message-Id: <20240215103616.330518-3-eperezma@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-12vdpa: stash memory region properties in varsEugenio Pérez
The next changes use these variables, so avoid calling the memory region functions repeatedly. No functional change intended. Signed-off-by: Eugenio Pérez <eperezma@redhat.com> Message-Id: <20240215103616.330518-2-eperezma@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-12pcie: Support PCIe Gen5/Gen6 link speedsLukas Stockner
This patch extends the PCIe link speed option so that slots can be configured as supporting 32GT/s (Gen5) or 64GT/s (Gen6) speeds. This is as simple as setting the appropriate bit in LnkCap2 and the appropriate value in LnkCap and LnkCtl2. Signed-off-by: Lukas Stockner <lstockner@genesiscloud.com> Message-Id: <20240215012326.3272366-1-lstockner@genesiscloud.com> Reviewed-by: Manos Pitsidianakis <manos.pitsidianakis@linaro.org> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
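The relationship between the speed value and the LnkCap2 bits can be sketched as below. The enum and function names are hypothetical. The encodings (speed values 1 to 6, and a Supported Link Speeds Vector that uses one bit per speed starting at register bit 1) are given as I recall them from the PCIe register layout, and the assumption that a slot also advertises all lower speeds is mine.

```c
#include <stdint.h>

/* Link speed encodings as used in the LnkCap "Max Link Speed" field. */
enum {
    TOY_SPEED_2_5G = 1,
    TOY_SPEED_5G   = 2,
    TOY_SPEED_8G   = 3,
    TOY_SPEED_16G  = 4,
    TOY_SPEED_32G  = 5, /* Gen5 */
    TOY_SPEED_64G  = 6  /* Gen6 */
};

/*
 * Build the LnkCap2 Supported Link Speeds Vector for a slot whose maximum
 * speed is max_speed: one bit per supported speed, speed N at register bit N.
 */
static uint32_t toy_lnkcap2_speed_vector(unsigned max_speed)
{
    uint32_t bits = 0;

    for (unsigned s = TOY_SPEED_2_5G; s <= max_speed && s <= TOY_SPEED_64G; s++) {
        bits |= UINT32_C(1) << s;
    }
    return bits;
}
```

Under these assumptions, supporting the two new speeds means accepting the values 5 and 6 and setting two further bits in the vector.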
2024-03-12libvhost-user: Mark mmap'ed region memory as MADV_DONTDUMPDavid Hildenbrand
We already use MAP_NORESERVE to deal with sparse memory regions. Let's also set madvise(MADV_DONTDUMP), otherwise a crash of the process can result in us allocating all memory in the mmap'ed region for dumping purposes. This change implies that the mmap'ed rings won't be included in a coredump. If ever required for debugging purposes, we could mark only the mapped rings MADV_DODUMP. Ignore errors during madvise() for now. Reviewed-by: Raphael Norwitz <raphael@enfabrica.net> Acked-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20240214151701.29906-15-david@redhat.com> Tested-by: Mario Casquero <mcasquer@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
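A minimal Linux-only sketch of the pattern follows. It is not the libvhost-user code: `toy_map_region()` is a hypothetical helper, and it maps anonymous memory so that the example is self-contained, whereas the library maps a file descriptor received from the front-end. The no-reserve behaviour is requested here through the `MAP_NORESERVE` mmap flag.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for MAP_ANONYMOUS, MAP_NORESERVE and MADV_DONTDUMP */
#endif
#include <stddef.h>
#include <sys/mman.h>

/* Map a sparse region and exclude it from core dumps; NULL on failure. */
static void *toy_map_region(size_t size)
{
    void *addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

    if (addr == MAP_FAILED) {
        return NULL;
    }
#ifdef MADV_DONTDUMP
    /* Best effort: a failing madvise() is deliberately ignored. */
    (void)madvise(addr, size, MADV_DONTDUMP);
#endif
    return addr;
}
```

If the contents of such a mapping were ever needed in a core dump, the same range could be marked with `MADV_DODUMP` again.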
2024-03-12libvhost-user: Dynamically remap rings after (temporarily?) removing memory regionsDavid Hildenbrand
Currently, we try to remap all rings whenever we add a single new memory region. That doesn't quite make sense, because we already map rings when setting the ring address, and panic if that goes wrong. Likely, that handling was simply copied from set_mem_table code, where we actually have to remap all rings. Remapping all rings might require us to walk quite a lot of memory regions to perform the address translations. Ideally, we'd simply remove that remapping. However, let's be a bit careful. There might be some weird corner cases where we might temporarily remove a single memory region (e.g., resize it), that would have worked for now. Further, a ring might be located on hotplugged memory, and as the VM reboots, we might unplug that memory, to hotplug memory before resetting the ring addresses. So let's unmap affected rings as we remove a memory region, and try dynamically mapping the ring again when required. Acked-by: Raphael Norwitz <raphael@enfabrica.net> Acked-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20240214151701.29906-14-david@redhat.com> Tested-by: Mario Casquero <mcasquer@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-12libvhost-user: Factor out vq usability checkDavid Hildenbrand
Let's factor it out to prepare for further changes. Reviewed-by: Raphael Norwitz <raphael@enfabrica.net> Acked-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20240214151701.29906-13-david@redhat.com> Tested-by: Mario Casquero <mcasquer@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-12libvhost-user: Use most of mmap_offset as fd_offsetDavid Hildenbrand
In the past, QEMU would create memory regions that could partially cover hugetlb pages, making mmap() fail if we would use the mmap_offset as an fd_offset. For that reason, we never used the mmap_offset as an offset into the fd and instead always mapped the fd from the very start. However, that can easily result in us mmap'ing a lot of unnecessary parts of an fd, possibly repeatedly. QEMU nowadays does not create memory regions that partially cover huge pages -- it never really worked with postcopy. QEMU handles merging of regions that partially cover huge pages (due to holes in boot memory) since 2018 in c1ece84e7c93 ("vhost: Huge page align and merge"). Let's be a bit careful and not unconditionally convert the mmap_offset into an fd_offset. Instead, let's simply detect the hugetlb size and pass as much as we can as fd_offset, making sure that we call mmap() with a properly aligned offset. With QEMU and a virtio-mem device that is fully plugged (50GiB using 50 memslots) the qemu-storage daemon process consumes in the VA space 1281GiB before this change and 58GiB after this change. 
================ Vhost user message ================ Request: VHOST_USER_ADD_MEM_REG (37) Flags: 0x9 Size: 40 Fds: 59 Adding region 4 guest_phys_addr: 0x0000000200000000 memory_size: 0x0000000040000000 userspace_addr: 0x00007fb73bffe000 old mmap_offset: 0x0000000080000000 fd_offset: 0x0000000080000000 new mmap_offset: 0x0000000000000000 mmap_addr: 0x00007f02f1bdc000 Successfully added new region ================ Vhost user message ================ Request: VHOST_USER_ADD_MEM_REG (37) Flags: 0x9 Size: 40 Fds: 59 Adding region 5 guest_phys_addr: 0x0000000240000000 memory_size: 0x0000000040000000 userspace_addr: 0x00007fb77bffe000 old mmap_offset: 0x00000000c0000000 fd_offset: 0x00000000c0000000 new mmap_offset: 0x0000000000000000 mmap_addr: 0x00007f0284000000 Successfully added new region Reviewed-by: Raphael Norwitz <raphael@enfabrica.net> Acked-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20240214151701.29906-12-david@redhat.com> Tested-by: Mario Casquero <mcasquer@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
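The offset split described in this commit can be sketched as pure arithmetic. `ToyOffsets` and `toy_split_offset()` are hypothetical names, and `align` stands for whatever alignment mmap() requires for the fd (the hugetlb page size for hugetlb-backed memory, per the message above); how that size is detected is left out.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t fd_offset;   /* aligned part, passed to mmap() as the fd offset */
    uint64_t mmap_offset; /* remainder, still applied inside the mapping */
} ToyOffsets;

/* Move as much of mmap_offset as alignment permits into fd_offset. */
static ToyOffsets toy_split_offset(uint64_t mmap_offset, uint64_t align)
{
    ToyOffsets o;

    assert(align != 0 && (align & (align - 1)) == 0); /* power of two */
    o.fd_offset = mmap_offset & ~(align - 1);
    o.mmap_offset = mmap_offset - o.fd_offset;
    return o;
}
```

With a 2 MiB alignment, the first region in the log above (old mmap_offset 0x80000000) yields fd_offset 0x80000000 and a new mmap_offset of 0, matching the printed values.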
2024-03-12libvhost-user: Speedup gpa_to_mem_region() and vu_gpa_to_va()David Hildenbrand
Let's speed up GPA to memory region / virtual address lookup. Store the memory regions ordered by guest physical addresses, and use binary search for address translation, as well as when adding/removing memory regions. Most importantly, this will speed up GPA->VA address translation when we have many memslots. Reviewed-by: Raphael Norwitz <raphael@enfabrica.net> Acked-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20240214151701.29906-11-david@redhat.com> Tested-by: Mario Casquero <mcasquer@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
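The lookup idea can be sketched as a binary search over regions kept sorted by guest physical address. `ToyRegion` and `toy_find_region()` are hypothetical names, and the sketch assumes ascending order and non-overlapping regions; the library's actual ordering and field names may differ.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t gpa;  /* guest physical start address */
    uint64_t size; /* length of the region in bytes */
} ToyRegion;

/*
 * Find the region containing gpa in an array sorted by ascending start
 * address with no overlaps. Returns NULL when no region covers gpa.
 */
static const ToyRegion *toy_find_region(const ToyRegion *regions, size_t n,
                                        uint64_t gpa)
{
    size_t low = 0, high = n;

    while (low < high) {
        size_t mid = low + (high - low) / 2;
        const ToyRegion *r = &regions[mid];

        if (gpa < r->gpa) {
            high = mid;               /* target lies before this region */
        } else if (gpa - r->gpa >= r->size) {
            low = mid + 1;            /* target lies after this region */
        } else {
            return r;                 /* gpa falls inside this region */
        }
    }
    return NULL;
}
```

Compared with a linear scan, this needs a number of steps logarithmic in the region count, which matters once hundreds of memslots are in use.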
2024-03-12libvhost-user: Factor out search for memory region by GPA and simplifyDavid Hildenbrand
Memory regions cannot overlap, and if we ever hit that case something would be really flawed. For example, when vhost code in QEMU decides to increase the size of memory regions to cover full huge pages, it makes sure to never create overlaps, and if there would be overlaps, it would bail out. QEMU commits 48d7c9757749 ("vhost: Merge sections added to temporary list"), c1ece84e7c93 ("vhost: Huge page align and merge") and e7b94a84b6cb ("vhost: Allow adjoining regions") added and clarified that handling and how overlaps are impossible. Consequently, each GPA can belong to at most one memory region, and everything else doesn't make sense. Let's factor out our search to prepare for further changes. Reviewed-by: Raphael Norwitz <raphael@enfabrica.net> Acked-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20240214151701.29906-10-david@redhat.com> Tested-by: Mario Casquero <mcasquer@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-12libvhost-user: Don't search for duplicates when removing memory regionsDavid Hildenbrand
We cannot have duplicate memory regions, something would be deeply flawed elsewhere. Let's just stop the search once we found an entry. We'll add more sanity checks when adding memory regions later. Reviewed-by: Raphael Norwitz <raphael@enfabrica.net> Acked-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20240214151701.29906-9-david@redhat.com> Tested-by: Mario Casquero <mcasquer@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-12libvhost-user: Don't zero out memory for memory regionsDavid Hildenbrand
dev->nregions always covers only valid entries. Stop zeroing out other array elements that are unused. Reviewed-by: Raphael Norwitz <raphael@enfabrica.net> Acked-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20240214151701.29906-8-david@redhat.com> Tested-by: Mario Casquero <mcasquer@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-12libvhost-user: No need to check for NULL when unmappingDavid Hildenbrand
We never add a memory region if mmap() failed. Therefore, no need to check for NULL. Reviewed-by: Raphael Norwitz <raphael@enfabrica.net> Acked-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20240214151701.29906-7-david@redhat.com> Tested-by: Mario Casquero <mcasquer@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-12libvhost-user: Factor out adding a memory regionDavid Hildenbrand
Let's factor it out, reducing quite some code duplication and preparing for further changes. If we fail to mmap a region and panic, we now simply don't add that (broken) region. Note that we now increment dev->nregions as we are successfully adding memory regions, and don't increment dev->nregions if anything went wrong. Reviewed-by: Raphael Norwitz <raphael@enfabrica.net> Acked-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20240214151701.29906-6-david@redhat.com> Tested-by: Mario Casquero <mcasquer@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-12libvhost-user: Merge vu_set_mem_table_exec_postcopy() into vu_set_mem_table_exec()David Hildenbrand
Let's reduce some code duplication and prepare for further changes. Reviewed-by: Raphael Norwitz <raphael@enfabrica.net> Acked-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20240214151701.29906-5-david@redhat.com> Tested-by: Mario Casquero <mcasquer@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-12libvhost-user: Factor out removing all mem regionsDavid Hildenbrand
Let's factor it out. Note that the check for MAP_FAILED was wrong as we never set mmap_addr if mmap() failed. We'll remove the NULL check separately. Reviewed-by: Raphael Norwitz <raphael@enfabrica.net> Acked-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20240214151701.29906-4-david@redhat.com> Tested-by: Mario Casquero <mcasquer@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-12libvhost-user: Bump up VHOST_USER_MAX_RAM_SLOTS to 509David Hildenbrand
Let's support up to 509 mem slots, just like vhost in the kernel usually does and the rust vhost-user implementation recently [1] started doing. This is required to properly support memory hotplug, either using multiple DIMMs (ACPI supports up to 256) or using virtio-mem. The 509 used to be the KVM limit, it supported 512, but 3 were used for internal purposes. Currently, KVM supports more than 512, but it usually doesn't make use of more than ~260 (i.e., 256 DIMMs + boot memory), except when other memory devices like PCI devices with BARs are used. So, 509 seems to work well for vhost in the kernel. Details can be found in the QEMU change that made virtio-mem consume up to 256 mem slots across all virtio-mem devices. [2] 509 mem slots implies 509 VMAs/mappings in the worst case (even though, in practice with virtio-mem we won't be seeing more than ~260 in most setups). With max_map_count under Linux defaulting to 64k, 509 mem slots still correspond to less than 1% of the maximum number of mappings. There are plenty left for the application to consume. [1] https://github.com/rust-vmm/vhost/pull/224 [2] https://lore.kernel.org/all/20230926185738.277351-1-david@redhat.com/ Reviewed-by: Raphael Norwitz <raphael@enfabrica.net> Acked-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20240214151701.29906-3-david@redhat.com> Tested-by: Mario Casquero <mcasquer@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
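The "less than 1%" estimate above can be checked with a few lines of C. This is a sketch under one assumption: the "64k" default is taken to be 65530, the usual Linux default for vm.max_map_count. The macro and function names are illustrative only.

```c
#include <assert.h>

/* Slot count from the commit message; the macro name is illustrative. */
#define SKETCH_MAX_RAM_SLOTS 509u
/* Assumed Linux default for vm.max_map_count ("64k" in the text above). */
#define SKETCH_DEFAULT_MAX_MAP_COUNT 65530u

/*
 * Worst-case share of the mapping budget consumed by the memory slots,
 * in basis points (hundredths of a percent), rounded down.
 */
static unsigned sketch_slots_share_bp(unsigned slots, unsigned max_map_count)
{
    return (slots * 10000u) / max_map_count;
}
```

With these values, sketch_slots_share_bp(SKETCH_MAX_RAM_SLOTS, SKETCH_DEFAULT_MAX_MAP_COUNT) evaluates to 77 basis points, i.e. roughly 0.77% of the default limit, even if every slot ends up as its own mapping.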
2024-03-12libvhost-user: Dynamically allocate memory for memory slotsDavid Hildenbrand
Let's prepare for increasing VHOST_USER_MAX_RAM_SLOTS by dynamically allocating dev->regions. We don't have any ABI guarantees (not dynamically linked), so we can simply change the layout of VuDev. Let's zero out the memory, just as we used to do. Reviewed-by: Raphael Norwitz <raphael@enfabrica.net> Acked-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20240214151701.29906-2-david@redhat.com> Tested-by: Mario Casquero <mcasquer@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
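A rough sketch of what moving from an embedded array to a zeroed, dynamically allocated one can look like. DynDev, DynRegion and the helper names are hypothetical and do not mirror the real VuDev layout; calloc() is used here because it returns zero-filled memory, matching the "zero out the memory" remark above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical region descriptor; the real structure has more fields. */
typedef struct {
    uint64_t gpa;
    uint64_t size;
    void *mmap_addr;
} DynRegion;

/* Hypothetical device: the array is now a pointer instead of a member array. */
typedef struct {
    DynRegion *regions;
    unsigned nregions;
} DynDev;

/* Allocate a zeroed table with room for max_slots entries. */
static int dyn_dev_init(DynDev *dev, size_t max_slots)
{
    dev->regions = calloc(max_slots, sizeof(dev->regions[0]));
    if (!dev->regions) {
        return -1;
    }
    dev->nregions = 0;
    return 0;
}

/* Release the table; safe to call more than once. */
static void dyn_dev_deinit(DynDev *dev)
{
    free(dev->regions);
    dev->regions = NULL;
    dev->nregions = 0;
}
```

Sizing the table at run time means a later change to the slot limit only affects the value passed to the init helper, not the size of the device structure itself.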
2024-03-12vdpa: fix network breakage after cancelling migrationSi-Wei Liu
Fix an issue where cancellation of ongoing migration ends up with no network connectivity. When canceling migration, SVQ will be switched back to the passthrough mode, but the right call fd is not programmed to the device and the svq's own call fd is still used. During this transition, shadow_vqs_enabled hadn't been set back to false yet, causing the installation of the call fd to be inadvertently bypassed. Message-Id: <1707910082-10243-13-git-send-email-si-wei.liu@oracle.com> Fixes: a8ac88585da1 ("vhost: Add Shadow VirtQueue call forwarding capabilities") Cc: Eugenio Pérez <eperezma@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-12vdpa: indicate transitional state for SVQ switchingSi-Wei Liu
svq_switching indicates whether SVQ mode switching is in progress, and in which direction. Add the necessary state around where the switching would take place. Message-Id: <1707910082-10243-12-git-send-email-si-wei.liu@oracle.com> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-12vdpa: define SVQ transitioning state for mode switchingSi-Wei Liu
Will be used in following patches. DISABLING(-1) means SVQ is being switched off to passthrough mode. ENABLING(1) means passthrough VQs are being switched to SVQ. DONE(0) means SVQ switching is completed. Message-Id: <1707910082-10243-11-git-send-email-si-wei.liu@oracle.com> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
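The three states listed above map naturally onto a signed C enum. The sketch below is illustrative only: the identifiers are hypothetical and not necessarily those used in the QEMU tree; only the values -1, 0 and 1 and their meanings come from the commit message.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical names; the numeric values follow the commit message. */
typedef enum {
    SVQ_SWITCH_DISABLING = -1, /* SVQ is being switched off to passthrough */
    SVQ_SWITCH_DONE = 0,       /* no switching in progress */
    SVQ_SWITCH_ENABLING = 1,   /* passthrough VQs are being switched to SVQ */
} SvqSwitchState;

/* A non-zero state means a transition is under way in either direction. */
static int svq_switch_in_progress(SvqSwitchState s)
{
    return s != SVQ_SWITCH_DONE;
}

/* Human-readable name, e.g. for trace or debug output. */
static const char *svq_switch_state_name(SvqSwitchState s)
{
    switch (s) {
    case SVQ_SWITCH_DISABLING:
        return "disabling";
    case SVQ_SWITCH_ENABLING:
        return "enabling";
    case SVQ_SWITCH_DONE:
        return "done";
    }
    return "unknown";
}
```

Encoding the direction in the sign lets a single field answer both questions: whether a switch is in progress (non-zero) and which way it is going (negative toward passthrough, positive toward SVQ).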
2024-03-12vdpa: add trace event for vhost_vdpa_net_load_mqSi-Wei Liu
For better debuggability and observability. Message-Id: <1707910082-10243-10-git-send-email-si-wei.liu@oracle.com> Reviewed-by: Eugenio Pérez <eperezma@redhat.com> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-12vdpa: add trace events for vhost_vdpa_net_load_cmdSi-Wei Liu
For better debuggability and observability. Message-Id: <1707910082-10243-9-git-send-email-si-wei.liu@oracle.com> Reviewed-by: Eugenio Pérez <eperezma@redhat.com> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-12vdpa: add vhost_vdpa_set_dev_vring_base trace for svq modeSi-Wei Liu
For better debuggability and observability. Message-Id: <1707910082-10243-8-git-send-email-si-wei.liu@oracle.com> Reviewed-by: Eugenio Pérez <eperezma@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-12vdpa: add vhost_vdpa_get_vring_base trace for svq modeSi-Wei Liu
For better debuggability and observability. Message-Id: <1707910082-10243-7-git-send-email-si-wei.liu@oracle.com> Reviewed-by: Eugenio Pérez <eperezma@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-12vdpa: add vhost_vdpa_set_address_space_id traceSi-Wei Liu
For better debuggability and observability. Message-Id: <1707910082-10243-6-git-send-email-si-wei.liu@oracle.com> Reviewed-by: Eugenio Pérez <eperezma@redhat.com> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-12vdpa: factor out vhost_vdpa_net_get_nc_vdpaSi-Wei Liu
Introduce new API. No functional change on existing API. Message-Id: <1707910082-10243-5-git-send-email-si-wei.liu@oracle.com> Reviewed-by: Eugenio Pérez <eperezma@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-03-12vdpa: factor out vhost_vdpa_last_devSi-Wei Liu
Generalize duplicated condition check for the last vq of vdpa device to a common function. Message-Id: <1707910082-10243-4-git-send-email-si-wei.liu@oracle.com> Reviewed-by: Eugenio Pérez <eperezma@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>