aboutsummaryrefslogtreecommitdiff
path: root/hw/block/virtio-blk.c
AgeCommit message (Collapse)Author
2024-06-05virtio-blk: remove SCSI passthrough functionalityPaolo Bonzini
The legacy SCSI passthrough functionality has never been enabled for VIRTIO 1.0 and was deprecated more than four years ago. Get rid of it---almost, because QEMU is advertising it unconditionally for legacy virtio-blk devices. Just parse the header and return a nonzero status. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-04block/virtio-blk: Fix memory leak from virtio_blk_zone_reportZheyu Ma
This modification ensures that in scenarios where the buffer size is insufficient for a zone report, the function will now properly set an error status and proceed to a cleanup label, instead of merely returning. The following ASAN log reveals it: ==1767400==ERROR: LeakSanitizer: detected memory leaks Direct leak of 312 byte(s) in 1 object(s) allocated from: #0 0x64ac7b3280cd in malloc llvm/compiler-rt/lib/asan/asan_malloc_linux.cpp:129:3 #1 0x735b02fb9738 in g_malloc (/lib/x86_64-linux-gnu/libglib-2.0.so.0+0x5e738) #2 0x64ac7d23be96 in virtqueue_split_pop hw/virtio/virtio.c:1612:12 #3 0x64ac7d23728a in virtqueue_pop hw/virtio/virtio.c:1783:16 #4 0x64ac7cfcaacd in virtio_blk_get_request hw/block/virtio-blk.c:228:27 #5 0x64ac7cfca7c7 in virtio_blk_handle_vq hw/block/virtio-blk.c:1123:23 #6 0x64ac7cfecb95 in virtio_blk_handle_output hw/block/virtio-blk.c:1157:5 Signed-off-by: Zheyu Ma <zheyuma97@gmail.com> Message-id: 20240404120040.1951466-1-zheyuma97@gmail.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2024-03-12block/virtio-blk: Fix missing ERRP_GUARD() for error_prepend()Zhao Liu
As the comment in qapi/error, passing @errp to error_prepend() requires ERRP_GUARD(): * = Why, when and how to use ERRP_GUARD() = * * Without ERRP_GUARD(), use of the @errp parameter is restricted: ... * - It should not be passed to error_prepend(), error_vprepend() or * error_append_hint(), because that doesn't work with &error_fatal. * ERRP_GUARD() lifts these restrictions. * * To use ERRP_GUARD(), add it right at the beginning of the function. * @errp can then be used without worrying about the argument being * NULL or &error_fatal. ERRP_GUARD() could avoid the case when @errp is &error_fatal, the user can't see this additional information, because exit() happens in error_setg earlier than information is added [1]. The virtio_blk_vq_aio_context_init() passes @errp to error_prepend(). Though its @errp points its caller's local @err variable, to follow the requirement of @errp, add missing ERRP_GUARD() at the beginning of virtio_blk_vq_aio_context_init(). [1]: Issue description in the commit message of commit ae7c80a7bd73 ("error: New macro ERRP_GUARD()"). Cc: Stefan Hajnoczi <stefanha@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Kevin Wolf <kwolf@redhat.com> Cc: Hanna Reitz <hreitz@redhat.com> Cc: qemu-block@nongnu.org Signed-off-by: Zhao Liu <zhao1.liu@intel.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Acked-by: "Michael S. Tsirkin" <mst@redhat.com> Message-ID: <20240311033822.3142585-14-zhao1.liu@linux.intel.com> Signed-off-by: Thomas Huth <thuth@redhat.com>
2024-02-08virtio-blk: avoid using ioeventfd state in irqfd conditionalStefan Hajnoczi
Requests that complete in an IOThread use irqfd to notify the guest while requests that complete in the main loop thread use the traditional qdev irq code path. The reason for this conditional is that the irq code path requires the BQL: if (s->ioeventfd_started && !s->ioeventfd_disabled) { virtio_notify_irqfd(vdev, req->vq); } else { virtio_notify(vdev, req->vq); } There is a corner case where the conditional invokes the irq code path instead of the irqfd code path: static void virtio_blk_stop_ioeventfd(VirtIODevice *vdev) { ... /* * Set ->ioeventfd_started to false before draining so that host notifiers * are not detached/attached anymore. */ s->ioeventfd_started = false; /* Wait for virtio_blk_dma_restart_bh() and in flight I/O to complete */ blk_drain(s->conf.conf.blk); During blk_drain() the conditional produces the wrong result because ioeventfd_started is false. Use qemu_in_iothread() instead of checking the ioeventfd state. Cc: qemu-stable@nongnu.org Buglink: https://issues.redhat.com/browse/RHEL-15394 Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20240122172625.415386-1-stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2024-02-07virtio-blk: Use ioeventfd_attach in start_ioeventfdHanna Czenczek
Commit d3f6f294aeadd5f88caf0155e4360808c95b3146 ("virtio-blk: always set ioeventfd during startup") has made virtio_blk_start_ioeventfd() always kick the virtqueue (set the ioeventfd), regardless of whether the BB is drained. That is no longer necessary, because attaching the host notifier will now set the ioeventfd, too; this happens either immediately right here in virtio_blk_start_ioeventfd(), or later when the drain ends, in virtio_blk_ioeventfd_attach(). With event_notifier_set() removed, the code becomes the same as the one in virtio_blk_ioeventfd_attach(), so we can reuse that function. Signed-off-by: Hanna Czenczek <hreitz@redhat.com> Message-ID: <20240202153158.788922-4-hreitz@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2024-02-07virtio-blk: do not use C99 mixed declarationsStefan Hajnoczi
QEMU's coding style generally forbids C99 mixed declarations. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20240206140410.65650-1-stefanha@redhat.com> Reviewed-by: Hanna Czenczek <hreitz@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2024-02-07virtio-blk: add vq_rq[] bounds check in virtio_blk_dma_restart_cb()Stefan Hajnoczi
Hanna Czenczek <hreitz@redhat.com> noted that the array index in virtio_blk_dma_restart_cb() is not bounds-checked: g_autofree VirtIOBlockReq **vq_rq = g_new0(VirtIOBlockReq *, num_queues); ... while (rq) { VirtIOBlockReq *next = rq->next; uint16_t idx = virtio_get_queue_index(rq->vq); rq->next = vq_rq[idx]; ^^^^^^^^^^ The code is correct because both rq->vq and vq_rq[] depend on num_queues, but this is indirect and not 100% obvious. Add an assertion. Suggested-by: Hanna Czenczek <hreitz@redhat.com> Reviewed-by: Manos Pitsidianakis <manos.pitsidianakis@linaro.org> Reviewed-by: Hanna Czenczek <hreitz@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20240206190610.107963-4-stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2024-02-07virtio-blk: clarify that there is at least 1 virtqueueStefan Hajnoczi
It is not possible to instantiate a virtio-blk device with 0 virtqueues. The following check is located in ->realize(): if (!conf->num_queues) { error_setg(errp, "num-queues property must be larger than 0"); return; } Later on we access s->vq_aio_context[0] under the assumption that there is as least one virtqueue. Hanna Czenczek <hreitz@redhat.com> noted that it would help to show that the array index is already valid. Add an assertion to document that s->vq_aio_context[0] is always safe...and catch future code changes that break this assumption. Suggested-by: Hanna Czenczek <hreitz@redhat.com> Reviewed-by: Manos Pitsidianakis <manos.pitsidianakis@linaro.org> Reviewed-by: Hanna Czenczek <hreitz@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20240206190610.107963-3-stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2024-02-07virtio-blk: enforce iothread-vq-mapping validationStefan Hajnoczi
Hanna Czenczek <hreitz@redhat.com> noticed that the safety of `vq_aio_context[vq->value] = ctx;` with user-defined vq->value inputs is not obvious. The code is structured in validate() + apply() steps so input validation is there, but it happens way earlier and there is nothing that guarantees apply() can only be called with validated inputs. This patch moves the validate() call inside the apply() function so validation is guaranteed. I also added the bounds checking assertion that Hanna suggested. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Manos Pitsidianakis <manos.pitsidianakis@linaro.org> Reviewed-by: Hanna Czenczek <hreitz@redhat.com> Message-ID: <20240206190610.107963-2-stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2024-01-26virtio-blk: always set ioeventfd during startupStefan Hajnoczi
When starting ioeventfd it is common practice to set the event notifier so that the ioeventfd handler is triggered to run immediately. There may be no requests waiting to be processed, but the idea is that if a request snuck in then we guarantee that it will be detected. One scenario where self-triggering the ioeventfd is necessary is when virtio_blk_handle_output() is called from a vCPU thread before the VIRTIO Device Status transitions to DRIVER_OK. In that case we need to self-trigger the ioeventfd so that the kick handled by the vCPU thread causes the vq AioContext thread to take over handling the request(s). Fixes: b6948ab01df0 ("virtio-blk: add iothread-vq-mapping parameter") Reported-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20240119135748.270944-7-stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2024-01-26virtio-blk: tolerate failure to set BlockBackend AioContextStefan Hajnoczi
We no longer rely on setting the AioContext since the block layer IO_CODE APIs can be called from any thread. Now it's just a hint to help block jobs and other operations co-locate themselves in a thread with the guest I/O requests. Keep going if setting the AioContext fails. Suggested-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20240119135748.270944-6-stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2024-01-26virtio-blk: restart s->rq reqs in vq AioContextsStefan Hajnoczi
A virtio-blk device with the iothread-vq-mapping parameter has per-virtqueue AioContexts. It is not thread-safe to process s->rq requests in the BlockBackend AioContext since that may be different from the virtqueue's AioContext to which this request belongs. The code currently races and could crash. Adapt virtio_blk_dma_restart_cb() to first split s->rq into per-vq lists and then schedule a BH each vq's AioContext as necessary. This way requests are safely processed in their vq's AioContext. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20240119135748.270944-5-stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2024-01-26virtio-blk: rename dataplane to ioeventfdStefan Hajnoczi
The dataplane code is really about using ioeventfd. It's used both for IOThreads (what we think of as dataplane) and for the core virtio-pci code's ioeventfd feature (which is enabled by default and used when no IOThread has been specified). Rename the code to reflect this. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20240119135748.270944-4-stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2024-01-26virtio-blk: rename dataplane create/destroy functionsStefan Hajnoczi
virtio_blk_data_plane_create() and virtio_blk_data_plane_destroy() are actually about s->vq_aio_context[] rather than managing dataplane-specific state. As a prerequisite to using s->vq_aio_context[] in all code paths (even when dataplane is not used), rename these functions to reflect that they just manage s->vq_aio_context and call them regardless of whether or not dataplane is in use. Note that virtio-blk supports running with -device virtio-blk-pci,ioevent=off where the vCPU thread enters the device emulation code. In this mode ioeventfd is not used for virtqueue processing. However, we still want to initialize s->vq_aio_context[] to qemu_aio_context in that case since I/O completion callbacks will be invoked in the main loop thread. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20240119135748.270944-3-stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2024-01-26virtio-blk: move dataplane code into virtio-blk.cStefan Hajnoczi
The dataplane code used to be significantly different from the non-dataplane code and therefore had a separate source file. Over time the difference has gotten smaller because the I/O code paths were unified. Nowadays the distinction between the VirtIOBlock and VirtIOBlockDataPlane structs is more of an inconvenience that hinders code simplification. Move hw/block/dataplane/virtio-blk.c into hw/block/virtio-blk.c, merging VirtIOBlockDataPlane's fields into VirtIOBlock. hw/block/virtio-blk.c used VirtIOBlock->dataplane to check if virtio_blk_data_plane_create() was successful. This is not necessary because ->dataplane_started and ->dataplane_disabled can be used instead. This patch makes those changes in order to drop VirtIOBlock->dataplane. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20240119135748.270944-2-stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2024-01-08Rename "QEMU global mutex" to "BQL" in comments and docsStefan Hajnoczi
The term "QEMU global mutex" is identical to the more widely used Big QEMU Lock ("BQL"). Update the code comments and documentation to use "BQL" instead of "QEMU global mutex". Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Acked-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: Akihiko Odaki <akihiko.odaki@daynix.com> Reviewed-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: Harsh Prateek Bora <harshpb@linux.ibm.com> Message-id: 20240102153529.486531-6-stefanha@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2023-12-29hw/block: Constify VMStateRichard Henderson
Signed-off-by: Richard Henderson <richard.henderson@linaro.org> Message-Id: <20231221031652.119827-25-richard.henderson@linaro.org>
2023-12-21virtio-blk: add iothread-vq-mapping parameterStefan Hajnoczi
Add the iothread-vq-mapping parameter to assign virtqueues to IOThreads. Store the vq:AioContext mapping in the new struct VirtIOBlockDataPlane->vq_aio_context[] field and refactor the code to use the per-vq AioContext instead of the BlockDriverState's AioContext. Reimplement --device virtio-blk-pci,iothread= and non-IOThread mode by assigning all virtqueues to the IOThread and main loop's AioContext in vq_aio_context[], respectively. The comment in struct VirtIOBlockDataPlane about EventNotifiers is stale. Remove it. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20231220134755.814917-5-stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-12-21block: remove AioContext lockingStefan Hajnoczi
This is the big patch that removes aio_context_acquire()/aio_context_release() from the block layer and affected block layer users. There isn't a clean way to split this patch and the reviewers are likely the same group of people, so I decided to do it in one patch. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Paul Durrant <paul@xen.org> Message-ID: <20231205182011.1976568-7-stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-12-21virtio-blk: don't lock AioContext in the submission code pathStefan Hajnoczi
There is no need to acquire the AioContext lock around blk_aio_*() or blk_get_geometry() anymore. I/O plugging (defer_call()) also does not require the AioContext lock anymore. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20230914140101.1065008-5-stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-12-21virtio-blk: don't lock AioContext in the completion code pathStefan Hajnoczi
Nothing in the completion code path relies on the AioContext lock anymore. Virtqueues are only accessed from one thread at any moment and the s->rq global state is protected by its own lock now. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20230914140101.1065008-4-stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-12-21virtio-blk: add lock to protect s->rqStefan Hajnoczi
s->rq is accessed from IO_CODE and GLOBAL_STATE_CODE. Introduce a lock to protect s->rq and eliminate reliance on the AioContext lock. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20230914140101.1065008-3-stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-10-31util/defer-call: move defer_call() to util/Stefan Hajnoczi
The networking subsystem may wish to use defer_call(), so move the code to util/ where it can be reused. As a reminder of what defer_call() does: This API defers a function call within a defer_call_begin()/defer_call_end() section, allowing multiple calls to batch up. This is a performance optimization that is used in the block layer to submit several I/O requests at once instead of individually: defer_call_begin(); <-- start of section ... defer_call(my_func, my_obj); <-- deferred my_func(my_obj) call defer_call(my_func, my_obj); <-- another defer_call(my_func, my_obj); <-- another ... defer_call_end(); <-- end of section, my_func(my_obj) is called once Suggested-by: Ilya Maximets <i.maximets@ovn.org> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20230913200045.1024233-3-stefanha@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-10-31block: rename blk_io_plug_call() API to defer_call()Stefan Hajnoczi
Prepare to move the blk_io_plug_call() API out of the block layer so that other subsystems call use this deferred call mechanism. Rename it to defer_call() but leave the code in block/plug.c. The next commit will move the code out of the block layer. Suggested-by: Ilya Maximets <i.maximets@ovn.org> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Paul Durrant <paul@xen.org> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20230913200045.1024233-2-stefanha@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-06-01block: add blk_io_plug_call() APIStefan Hajnoczi
Introduce a new API for thread-local blk_io_plug() that does not traverse the block graph. The goal is to make blk_io_plug() multi-queue friendly. Instead of having block drivers track whether or not we're in a plugged section, provide an API that allows them to defer a function call until we're unplugged: blk_io_plug_call(fn, opaque). If blk_io_plug_call() is called multiple times with the same fn/opaque pair, then fn() is only called once at the end of the function - resulting in batching. This patch introduces the API and changes blk_io_plug()/blk_io_unplug(). blk_io_plug()/blk_io_unplug() no longer require a BlockBackend argument because the plug state is now thread-local. Later patches convert block drivers to blk_io_plug_call() and then we can finally remove .bdrv_co_io_plug() once all block drivers have been converted. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Acked-by: Kevin Wolf <kwolf@redhat.com> Message-id: 20230530180959.1108766-2-stefanha@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2023-05-30virtio-blk: implement BlockDevOps->drained_begin()Stefan Hajnoczi
Detach ioeventfds during drained sections to stop I/O submission from the guest. virtio-blk is no longer reliant on aio_disable_external() after this patch. This will allow us to remove the aio_disable_external() API once all other code that relies on it is converted. Take extra care to avoid attaching/detaching ioeventfds if the data plane is started/stopped during a drained section. This should be rare, but maybe the mirror block job can trigger it. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-Id: <20230516190238.8401-18-stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-05-15virtio-blk: add some trace events for zoned emulationSam Li
Signed-off-by: Sam Li <faithilikerun@gmail.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-id: 20230508051916.178322-4-faithilikerun@gmail.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2023-05-15block: add accounting for zone append operationSam Li
Taking account of the new zone append write operation for zoned devices, BLOCK_ACCT_ZONE_APPEND enum is introduced as other I/O request type (read, write, flush). Signed-off-by: Sam Li <faithilikerun@gmail.com> Message-id: 20230508051916.178322-3-faithilikerun@gmail.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2023-05-15virtio-blk: add zoned storage emulation for zoned devicesSam Li
This patch extends virtio-blk emulation to handle zoned device commands by calling the new block layer APIs to perform zoned device I/O on behalf of the guest. It supports Report Zone, four zone oparations (open, close, finish, reset), and Append Zone. The VIRTIO_BLK_F_ZONED feature bit will only be set if the host does support zoned block devices. Regular block devices(conventional zones) will not be set. The guest os can use blktests, fio to test those commands on zoned devices. Furthermore, using zonefs to test zone append write is also supported. Signed-off-by: Sam Li <faithilikerun@gmail.com> Message-id: 20230508051916.178322-2-faithilikerun@gmail.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2023-02-09virtio-blk: add missing AioContext lockEmanuele Giuseppe Esposito
virtio_blk_update_config() calls blk_get_geometry and blk_getlength, and both functions eventually end up calling bdrv_poll_co when not running in a coroutine: - blk_getlength is a co_wrapper_mixed function - blk_get_geometry calls bdrv_get_geometry -> bdrv_nb_sectors, a co_wrapper_mixed function too Since we are not running in a coroutine, we need to take s->blk AioContext lock, otherwise bdrv_poll_co will inevitably call AIO_WAIT_WHILE and therefore try to un unlock() an AioContext lock that was never acquired. RHBZ: https://bugzilla.redhat.com/show_bug.cgi?id=2167838 Steps to reproduce the issue: simply boot a VM with -object '{"qom-type":"iothread","id":"iothread1"}' \ -blockdev '{"driver":"file","filename":"$QCOW2","aio":"native","node-name":"libvirt-1-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' \ -blockdev '{"node-name":"libvirt-1-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"qcow2","file":"libvirt-1-storage"}' \ -device virtio-blk-pci,iothread=iothread1,drive=libvirt-1-format,id=virtio-disk0,bootindex=1,write-cache=on and observe that it will fail not manage to boot with "qemu_mutex_unlock_impl: Operation not permitted" Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Lukáš Doktor <ldoktor@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-Id: <20230208111148.1040083-1-eesposit@redhat.com>
2023-01-23virtio-blk: simplify virtio_blk_dma_restart_cb()Stefan Hajnoczi
virtio_blk_dma_restart_cb() is tricky because the BH must deal with virtio_blk_data_plane_start()/virtio_blk_data_plane_stop() being called. There are two issues with the code: 1. virtio_blk_realize() should use qdev_add_vm_change_state_handler() instead of qemu_add_vm_change_state_handler(). This ensures the ordering with virtio_init()'s vm change state handler that calls virtio_blk_data_plane_start()/virtio_blk_data_plane_stop() is well-defined. Then blk's AioContext is guaranteed to be up-to-date in virtio_blk_dma_restart_cb() and it's no longer necessary to have a special case for virtio_blk_data_plane_start(). 2. Only blk_drain() waits for virtio_blk_dma_restart_cb()'s blk_inc_in_flight() to be decremented. The bdrv_drain() family of functions do not wait for BlockBackend's in_flight counter to reach zero. virtio_blk_data_plane_stop() relies on blk_set_aio_context()'s implicit drain, but that's a bdrv_drain() and not a blk_drain(). Note that virtio_blk_reset() already correctly relies on blk_drain(). If virtio_blk_data_plane_stop() switches to blk_drain() then we can properly wait for pending virtio_blk_dma_restart_bh() calls. Once these issues are taken care of the code becomes simpler. This change is in preparation for multiple IOThreads in virtio-blk where we need to clean up the multi-threading behavior. I ran the reproducer from commit 49b44549ace7 ("virtio-blk: On restart, process queued requests in the proper context") to check that there is no regression. Cc: Sergio Lopez <slp@redhat.com> Cc: Kevin Wolf <kwolf@redhat.com> Cc: Emanuele Giuseppe Esposito <eesposit@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Emanuele Giuseppe Esposito <eesposit@redhat.com> Message-id: 20221102182337.252202-1-stefanha@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2022-10-26virtio-blk: use BDRV_REQ_REGISTERED_BUF optimization hintStefan Hajnoczi
Register guest RAM using BlockRAMRegistrar and set the BDRV_REQ_REGISTERED_BUF flag so block drivers can optimize memory accesses in I/O requests. This is for vdpa-blk, vhost-user-blk, and other I/O interfaces that rely on DMA mapping/unmapping. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Message-id: 20221013185908.1297568-14-stefanha@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2022-10-07virtio-blk: move config size params to virtio-blk-commonDaniil Tatianin
This way we can reuse it for other virtio-blk devices, e.g vhost-user-blk, which currently does not control its config space size dynamically. Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru> Reviewed-by: Raphael Norwitz <raphael.norwitz@nutanix.com> Message-Id: <20220906073111.353245-3-d-tatianin@yandex-team.ru> Reviewed-by: Raphael Norwitz <raphael.norwitz@nutanix.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2022-10-07virtio: introduce VirtIOConfigSizeParams & virtio_get_config_sizeDaniil Tatianin
This is the first step towards moving all device config size calculation logic into the virtio core code. In particular, this adds a struct that contains all the necessary information for common virtio code to be able to calculate the final config size for a device. This is expected to be used with the new virtio_get_config_size helper, which calculates the final length based on the provided host features. This builds on top of already existing code like VirtIOFeature and virtio_feature_get_config_size(), but adds additional fields, as well as sanity checking so that device-specifc code doesn't have to duplicate it. An example usage would be: static const VirtIOFeature dev_features[] = { {.flags = 1ULL << FEATURE_1_BIT, .end = endof(struct virtio_dev_config, feature_1)}, {.flags = 1ULL << FEATURE_2_BIT, .end = endof(struct virtio_dev_config, feature_2)}, {} }; static const VirtIOConfigSizeParams dev_cfg_size_params = { .min_size = DEV_BASE_CONFIG_SIZE, .max_size = sizeof(struct virtio_dev_config), .feature_sizes = dev_features }; // code inside my_dev_device_realize() size_t config_size = virtio_get_config_size(&dev_cfg_size_params, host_features); virtio_init(vdev, VIRTIO_ID_MYDEV, config_size); Currently every device is expected to write its own boilerplate from the example above in device_realize(), however, the next step of this transition is moving VirtIOConfigSizeParams into VirtioDeviceClass, so that it can be done automatically by the virtio initialization code. All of the users of virtio_feature_get_config_size have been converted to use virtio_get_config_size so it's no longer needed and is removed with this commit. Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru> Message-Id: <20220906073111.353245-2-d-tatianin@yandex-team.ru> Reviewed-by: Raphael Norwitz <raphael.norwitz@nutanix.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2022-06-24block: get rid of blk->guest_block_sizeStefan Hajnoczi
Commit 1b7fd729559c ("block: rename buffer_alignment to guest_block_size") noted: At this point, the field is set by the device emulation, but completely ignored by the block layer. The last time the value of buffer_alignment/guest_block_size was actually used was before commit 339064d50639 ("block: Don't use guest sector size for qemu_blockalign()"). This value has not been used since 2013. Get rid of it. Cc: Xie Yongji <xieyongji@bytedance.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-Id: <20220518130945.2657905-1-stefanha@redhat.com> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Alberto Faria <afaria@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-05-16virtio: drop name parameter for virtio_init()Jonah Palmer
This patch drops the name parameter for the virtio_init function. The pair between the numeric device ID and the string device ID (name) of a virtio device already exists, but not in a way that lets us map between them. This patch lets us do this and removes the need for the name parameter in the virtio_init function. Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com> Message-Id: <1648819405-25696-2-git-send-email-jonah.palmer@oracle.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2022-05-12coroutine: Rename qemu_coroutine_inc/dec_pool_size()Kevin Wolf
It's true that these functions currently affect the batch size in which coroutines are reused (i.e. moved from the global release pool to the allocation pool of a specific thread), but this is a bug and will be fixed in a separate patch. In fact, the comment in the header file already just promises that it influences the pool size, so reflect this in the name of the functions. As a nice side effect, the shorter function name makes some line wrapping unnecessary. Cc: qemu-stable@nongnu.org Signed-off-by: Kevin Wolf <kwolf@redhat.com> Message-Id: <20220510151020.105528-2-kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-02-14util: adjust coroutine pool size to virtio block queueHiroki Narukawa
Coroutine pool size was 64 from long ago, and the basis was organized in the commit message in 4d68e86b. At that time, virtio-blk queue-size and num-queue were not configuable, and equivalent values were 128 and 1. Coroutine pool size 64 was fine then. Later queue-size and num-queue got configuable, and default values were increased. Coroutine pool with size 64 exhausts frequently with random disk IO in new size, and slows down. This commit adjusts coroutine pool size adaptively with new values. This commit adds 64 by default, but now coroutine is not only for block devices, and is not too much burdon comparing with new default. pool size of 128 * vCPUs. Signed-off-by: Hiroki Narukawa <hnarukaw@yahoo-corp.jp> Message-id: 20220214115302.13294-2-hnarukaw@yahoo-corp.jp Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2022-01-12virtio-blk: drop unused virtio_blk_handle_vq() return valueStefan Hajnoczi
The return value of virtio_blk_handle_vq() is no longer used. Get rid of it. This is a step towards unifying the dataplane and non-dataplane virtqueue handler functions. Prepare virtio_blk_handle_output() to be used by both dataplane and non-dataplane by making the condition for starting ioeventfd more specific. This way it won't trigger when dataplane has already been started. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Message-id: 20211207132336.36627-4-stefanha@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2021-05-14virtio-blk: Constify VirtIOFeature feature_sizes[]Philippe Mathieu-Daudé
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-Id: <20210511104157.2880306-3-philmd@redhat.com>
2021-03-15virtio-blk: Respect discard granularityAkihiko Odaki
Report the configured granularity for discard operation to the guest. If this is not set use the block size. Since until now we have ignored the configured discard granularity and always reported the block size, let's add 'report-discard-granularity' property and disable it for older machine types to avoid migration issues. Signed-off-by: Akihiko Odaki <akihiko.odaki@gmail.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-Id: <20210225001239.47046-1-akihiko.odaki@gmail.com>
2021-03-09sysemu: Let VMChangeStateHandler take boolean 'running' argumentPhilippe Mathieu-Daudé
The 'running' argument from VMChangeStateHandler does not require other value than 0 / 1. Make it a plain boolean. Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Acked-by: David Gibson <david@gibson.dropbear.id.au> Message-Id: <20210111152020.1422021-3-philmd@redhat.com> Signed-off-by: Laurent Vivier <laurent@vivier.eu>
2021-01-27block: Separate blk_is_writable() and blk_supports_write_perm()Kevin Wolf
Currently, blk_is_read_only() tells whether a given BlockBackend can only be used in read-only mode because its root node is read-only. Some callers actually try to answer a slightly different question: Is the BlockBackend configured to be writable, by taking write permissions on the root node? This can differ, for example, for CD-ROM devices which don't take write permissions, but may be backed by a writable image file. scsi-cd allows write requests to the drive if blk_is_read_only() returns false. However, the write request will immediately run into an assertion failure because the write permission is missing. This patch introduces separate functions for both questions. blk_supports_write_perm() answers the question whether the block node/image file can support writable devices, whereas blk_is_writable() tells whether the BlockBackend is currently configured to be writable. All calls of blk_is_read_only() are converted to one of the two new functions. Fixes: https://bugs.launchpad.net/bugs/1906693 Cc: qemu-stable@nongnu.org Signed-off-by: Kevin Wolf <kwolf@redhat.com> Message-Id: <20210118123448.307825-2-kwolf@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-09-23virtio-blk: undo destructive iov_discard_*() operationsStefan Hajnoczi
Fuzzing discovered that virtqueue_unmap_sg() is being called on modified req->in/out_sg iovecs. This means dma_memory_map() and dma_memory_unmap() calls do not have matching memory addresses. Fuzzing discovered that non-RAM addresses trigger a bug: void address_space_unmap(AddressSpace *as, void *buffer, hwaddr len, bool is_write, hwaddr access_len) { if (buffer != bounce.buffer) { ^^^^^^^^^^^^^^^^^^^^^^^ A modified iov->iov_base is no longer recognized as a bounce buffer and the wrong branch is taken. There are more potential bugs: dirty memory is not tracked correctly and MemoryRegion refcounts can be leaked. Use the new iov_discard_undo() API to restore elem->in/out_sg before virtqueue_push() is called. Fixes: 827805a2492c1bbf1c0712ed18ee069b4ebf3dd6 ("virtio-blk: Convert VirtIOBlockReq.out to structrue") Reported-by: Alexander Bulekov <alxndr@bu.edu> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Li Qiang <liq3ea@gmail.com> Buglink: https://bugs.launchpad.net/qemu/+bug/1890360 Message-Id: <20200917094455.822379-3-stefanha@redhat.com>
2020-08-27virtio-blk-pci: default num_queues to -smp NStefan Hajnoczi
Automatically size the number of virtio-blk-pci request virtqueues to match the number of vCPUs. Other transports continue to default to 1 request virtqueue. A 1:1 virtqueue:vCPU mapping ensures that completion interrupts are handled on the same vCPU that submitted the request. No IPI is necessary to complete an I/O request and performance is improved. The maximum number of MSI-X vectors and virtqueues limit are respected. Performance improves from 78k to 104k IOPS on a 32 vCPU guest with 101 virtio-blk-pci devices (ioengine=libaio, iodepth=1, bs=4k, rw=randread with NVMe storage). Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Message-Id: <20200818143348.310613-7-stefanha@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2020-06-17block: consolidate blocksize properties consistency checksRoman Kagan
Several block device properties related to blocksize configuration must be in certain relationship WRT each other: physical block must be no smaller than logical block; min_io_size, opt_io_size, and discard_granularity must be a multiple of a logical block. To ensure these requirements are met, add corresponding consistency checks to blkconf_blocksizes, adjusting its signature to communicate possible error to the caller. Also remove the now redundant consistency checks from the specific devices. Signed-off-by: Roman Kagan <rvkagan@yandex-team.ru> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Paul Durrant <paul@xen.org> Message-Id: <20200528225516.1676602-3-rvkagan@yandex-team.ru> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-06-17virtio-blk: store opt_io_size with correct sizeRoman Kagan
The width of opt_io_size in virtio_blk_config is 32bit. However, it's written with virtio_stw_p; this may result in value truncation, and on big-endian systems with legacy virtio in completely bogus readings in the guest. Use the appropriate accessor to store it. Signed-off-by: Roman Kagan <rvkagan@yandex-team.ru> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Message-Id: <20200528225516.1676602-2-rvkagan@yandex-team.ru> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-06-17virtio-blk: On restart, process queued requests in the proper contextSergio Lopez
On restart, we were scheduling a BH to process queued requests, which would run before starting up the data plane, leading to those requests being assigned and started on coroutines on the main context. This could cause requests to be wrongly processed in parallel from different threads (the main thread and the iothread managing the data plane), potentially leading to multiple issues. For example, stopping and resuming a VM multiple times while the guest is generating I/O on a virtio_blk device can trigger a crash with a stack tracing looking like this one: <------> Thread 2 (Thread 0x7ff736765700 (LWP 1062503)): #0 0x00005567a13b99d6 in iov_memset (iov=0x6563617073206f4e, iov_cnt=1717922848, offset=516096, fillc=0, bytes=7018105756081554803) at util/iov.c:69 #1 0x00005567a13bab73 in qemu_iovec_memset (qiov=0x7ff73ec99748, offset=516096, fillc=0, bytes=7018105756081554803) at util/iov.c:530 #2 0x00005567a12f411c in qemu_laio_process_completion (laiocb=0x7ff6512ee6c0) at block/linux-aio.c:86 #3 0x00005567a12f42ff in qemu_laio_process_completions (s=0x7ff7182e8420) at block/linux-aio.c:217 #4 0x00005567a12f480d in ioq_submit (s=0x7ff7182e8420) at block/linux-aio.c:323 #5 0x00005567a12f43d9 in qemu_laio_process_completions_and_submit (s=0x7ff7182e8420) at block/linux-aio.c:236 #6 0x00005567a12f44c2 in qemu_laio_poll_cb (opaque=0x7ff7182e8430) at block/linux-aio.c:267 #7 0x00005567a13aed83 in run_poll_handlers_once (ctx=0x5567a2b58c70, timeout=0x7ff7367645f8) at util/aio-posix.c:520 #8 0x00005567a13aee9f in run_poll_handlers (ctx=0x5567a2b58c70, max_ns=16000, timeout=0x7ff7367645f8) at util/aio-posix.c:562 #9 0x00005567a13aefde in try_poll_mode (ctx=0x5567a2b58c70, timeout=0x7ff7367645f8) at util/aio-posix.c:597 #10 0x00005567a13af115 in aio_poll (ctx=0x5567a2b58c70, blocking=true) at util/aio-posix.c:639 #11 0x00005567a109acca in iothread_run (opaque=0x5567a2b29760) at iothread.c:75 #12 0x00005567a13b2790 in qemu_thread_start (args=0x5567a2b694c0) at util/qemu-thread-posix.c:519 #13 0x00007ff73eedf2de in start_thread () at /lib64/libpthread.so.0 #14 0x00007ff73ec10e83 in clone () at /lib64/libc.so.6 Thread 1 (Thread 0x7ff743986f00 (LWP 1062500)): #0 0x00005567a13b99d6 in iov_memset (iov=0x6563617073206f4e, iov_cnt=1717922848, offset=516096, fillc=0, bytes=7018105756081554803) at util/iov.c:69 #1 0x00005567a13bab73 in qemu_iovec_memset (qiov=0x7ff73ec99748, offset=516096, fillc=0, bytes=7018105756081554803) at util/iov.c:530 #2 0x00005567a12f411c in qemu_laio_process_completion (laiocb=0x7ff6512ee6c0) at block/linux-aio.c:86 #3 0x00005567a12f42ff in qemu_laio_process_completions (s=0x7ff7182e8420) at block/linux-aio.c:217 #4 0x00005567a12f480d in ioq_submit (s=0x7ff7182e8420) at block/linux-aio.c:323 #5 0x00005567a12f4a2f in laio_do_submit (fd=19, laiocb=0x7ff5f4ff9ae0, offset=472363008, type=2) at block/linux-aio.c:375 #6 0x00005567a12f4af2 in laio_co_submit (bs=0x5567a2b8c460, s=0x7ff7182e8420, fd=19, offset=472363008, qiov=0x7ff5f4ff9ca0, type=2) at block/linux-aio.c:394 #7 0x00005567a12f1803 in raw_co_prw (bs=0x5567a2b8c460, offset=472363008, bytes=20480, qiov=0x7ff5f4ff9ca0, type=2) at block/file-posix.c:1892 #8 0x00005567a12f1941 in raw_co_pwritev (bs=0x5567a2b8c460, offset=472363008, bytes=20480, qiov=0x7ff5f4ff9ca0, flags=0) at block/file-posix.c:1925 #9 0x00005567a12fe3e1 in bdrv_driver_pwritev (bs=0x5567a2b8c460, offset=472363008, bytes=20480, qiov=0x7ff5f4ff9ca0, qiov_offset=0, flags=0) at block/io.c:1183 #10 0x00005567a1300340 in bdrv_aligned_pwritev (child=0x5567a2b5b070, req=0x7ff5f4ff9db0, offset=472363008, bytes=20480, align=512, qiov=0x7ff72c0425b8, qiov_offset=0, flags=0) at block/io.c:1980 #11 0x00005567a1300b29 in bdrv_co_pwritev_part (child=0x5567a2b5b070, offset=472363008, bytes=20480, qiov=0x7ff72c0425b8, qiov_offset=0, flags=0) at block/io.c:2137 #12 0x00005567a12baba1 in qcow2_co_pwritev_task (bs=0x5567a2b92740, file_cluster_offset=472317952, offset=487305216, bytes=20480, qiov=0x7ff72c0425b8, qiov_offset=0, l2meta=0x0) at block/qcow2.c:2444 #13 0x00005567a12bacdb in qcow2_co_pwritev_task_entry (task=0x5567a2b48540) at block/qcow2.c:2475 #14 0x00005567a13167d8 in aio_task_co (opaque=0x5567a2b48540) at block/aio_task.c:45 #15 0x00005567a13cf00c in coroutine_trampoline (i0=738245600, i1=32759) at util/coroutine-ucontext.c:115 #16 0x00007ff73eb622e0 in __start_context () at /lib64/libc.so.6 #17 0x00007ff6626f1350 in () #18 0x0000000000000000 in () <------> This is also known to cause crashes with this message (assertion failed): aio_co_schedule: Co-routine was already scheduled in 'aio_co_schedule' RHBZ: https://bugzilla.redhat.com/show_bug.cgi?id=1812765 Signed-off-by: Sergio Lopez <slp@redhat.com> Message-Id: <20200603093240.40489-3-slp@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-06-17virtio-blk: Refactor the code that processes queued requestsSergio Lopez
Move the code that processes queued requests from virtio_blk_dma_restart_bh() to its own, non-static, function. This will allow us to call it from the virtio_blk_data_plane_start() in a future patch. Signed-off-by: Sergio Lopez <slp@redhat.com> Message-Id: <20200603093240.40489-2-slp@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-05-15qdev: Unrealize must not failMarkus Armbruster
Devices may have component devices and buses. Device realization may fail. Realization is recursive: a device's realize() method realizes its components, and device_set_realized() realizes its buses (which should in turn realize the devices on that bus, except bus_set_realized() doesn't implement that, yet). When realization of a component or bus fails, we need to roll back: unrealize everything we realized so far. If any of these unrealizes failed, the device would be left in an inconsistent state. Must not happen. device_set_realized() lets it happen: it ignores errors in the roll back code starting at label child_realize_fail. Since realization is recursive, unrealization must be recursive, too. But how could a partly failed unrealize be rolled back? We'd have to re-realize, which can fail. This design is fundamentally broken. device_set_realized() does not roll back at all. Instead, it keeps unrealizing, ignoring further errors. It can screw up even for a device with no buses: if the lone dc->unrealize() fails, it still unregisters vmstate, and calls listeners' unrealize() callback. bus_set_realized() does not roll back either. Instead, it stops unrealizing. Fortunately, no unrealize method can fail, as we'll see below. To fix the design error, drop parameter @errp from all the unrealize methods. Any unrealize method that uses @errp now needs an update. This leads us to unrealize() methods that can fail. Merely passing it to another unrealize method cannot cause failure, though. Here are the ones that do other things with @errp: * virtio_serial_device_unrealize() Fails when qbus_set_hotplug_handler() fails, but still does all the other work. On failure, the device would stay realized with its resources completely gone. Oops. Can't happen, because qbus_set_hotplug_handler() can't actually fail here. Pass &error_abort to qbus_set_hotplug_handler() instead. * hw/ppc/spapr_drc.c's unrealize() Fails when object_property_del() fails, but all the other work is already done. On failure, the device would stay realized with its vmstate registration gone. Oops. Can't happen, because object_property_del() can't actually fail here. Pass &error_abort to object_property_del() instead. * spapr_phb_unrealize() Fails and bails out when remove_drcs() fails, but other work is already done. On failure, the device would stay realized with some of its resources gone. Oops. remove_drcs() fails only when chassis_from_bus()'s object_property_get_uint() fails, and it can't here. Pass &error_abort to remove_drcs() instead. Therefore, no unrealize method can fail before this patch. device_set_realized()'s recursive unrealization via bus uses object_property_set_bool(). Can't drop @errp there, so pass &error_abort. We similarly unrealize with object_property_set_bool() elsewhere, always ignoring errors. Pass &error_abort instead. Several unrealize methods no longer handle errors from other unrealize methods: virtio_9p_device_unrealize(), virtio_input_device_unrealize(), scsi_qdev_unrealize(), ... Much of the deleted error handling looks wrong anyway. One unrealize methods no longer ignore such errors: usb_ehci_pci_exit(). Several realize methods no longer ignore errors when rolling back: v9fs_device_realize_common(), pci_qdev_unrealize(), spapr_phb_realize(), usb_qdev_realize(), vfio_ccw_realize(), virtio_device_realize(). Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20200505152926.18877-17-armbru@redhat.com>