aboutsummaryrefslogtreecommitdiff
path: root/block
AgeCommit message (Collapse)Author
2023-09-21parallels: fix formatting in bdrv_parallels initializationDenis V. Lunev
Old code is ugly and contains tabulations. There are no functional changes in this patch. Signed-off-by: Denis V. Lunev <den@openvz.org> Reviewed-by: Alexander Ivanov <alexander.ivanov@virtuozzo.com>
2023-09-20block-backend: process zoned requests in the current AioContextStefan Hajnoczi
Process zoned requests in the current thread's AioContext instead of in the BlockBackend's AioContext. There is no need to use the BlockBackend's AioContext thanks to CoMutex bs->wps->colock, which protects zone metadata. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20230912231037.826804-5-stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-09-20block-backend: process I/O in the current AioContextStefan Hajnoczi
Switch blk_aio_*() APIs over to multi-queue by using qemu_get_current_aio_context() instead of blk_get_aio_context(). This change will allow devices to process I/O in multiple IOThreads in the future. I audited existing blk_aio_*() callers: - migration/block.c: blk_mig_lock() protects the data accessed by the completion callback. - The remaining emulated devices and exports run with qemu_get_aio_context() == blk_get_aio_context(). Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20230912231037.826804-4-stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-09-20block: remove AIOCBInfo->get_aio_context()Stefan Hajnoczi
The synchronous bdrv_aio_cancel() function needs the acb's AioContext so it can call aio_poll() to wait for cancellation. It turns out that all users run under the BQL in the main AioContext, so this callback is not needed. Remove the callback, mark bdrv_aio_cancel() GLOBAL_STATE_CODE just like its blk_aio_cancel() caller, and poll the main loop AioContext. The purpose of this cleanup is to identify bdrv_aio_cancel() as an API that does not work with the multi-queue block layer. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20230912231037.826804-2-stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Klaus Jensen <k.jensen@samsung.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-09-20block: add BDRV_BLOCK_COMPRESSED flag for bdrv_block_status()Andrey Drobyshev via
Functions qcow2_get_host_offset(), get_cluster_offset(), vmdk_co_block_status() explicitly report compressed cluster types when data is compressed. However, this information is never passed further. Let's make use of it by adding new BDRV_BLOCK_COMPRESSED flag for bdrv_block_status(), so that caller may know that the data range is compressed. In particular, we're going to use this flag to tweak "qemu-img map" output. This new flag is only being utilized by qcow, qcow2 and vmdk formats, as only those support compression. Reviewed-by: Denis V. Lunev <den@openvz.org> Reviewed-by: Hanna Czenczek <hreitz@redhat.com> Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com> Message-ID: <20230907210226.953821-2-andrey.drobyshev@virtuozzo.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-09-20block: Mark bdrv_add/del_child() and caller GRAPH_WRLOCKKevin Wolf
The functions read the parents list in the generic block layer, so we need to hold the graph lock already there. The BlockDriver implementations actually modify the graph, so it has to be a writer lock. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Emanuele Giuseppe Esposito <eesposit@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20230911094620.45040-22-kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-09-20block: Mark bdrv_unref_child() GRAPH_WRLOCKKevin Wolf
Instead of taking the writer lock internally, require callers to already hold it when calling bdrv_unref_child(). These callers will typically already hold the graph lock once the locking work is completed, which means that they can't call functions that take it internally. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Emanuele Giuseppe Esposito <eesposit@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20230911094620.45040-21-kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-09-20block: Mark bdrv_root_unref_child() GRAPH_WRLOCKKevin Wolf
Instead of taking the writer lock internally, require callers to already hold it when calling bdrv_root_unref_child(). These callers will typically already hold the graph lock once the locking work is completed, which means that they can't call functions that take it internally. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Emanuele Giuseppe Esposito <eesposit@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20230911094620.45040-20-kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-09-20block: Mark bdrv_child_perm() GRAPH_RDLOCKKevin Wolf
This adds GRAPH_RDLOCK annotations to declare that callers of bdrv_child_perm() need to hold a reader lock for the graph because some implementations access the children list of a node. The callers of bdrv_child_perm() conveniently already hold the lock. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Emanuele Giuseppe Esposito <eesposit@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20230911094620.45040-16-kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-09-20block: Mark bdrv_parent_perms_conflict() and callers GRAPH_RDLOCKKevin Wolf
The function reads the parents list, so it needs to hold the graph lock. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Emanuele Giuseppe Esposito <eesposit@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20230911094620.45040-14-kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-09-20block: Mark bdrv_attach_child() GRAPH_WRLOCKKevin Wolf
Instead of taking the writer lock internally, require callers to already hold it when calling bdrv_attach_child_common(). These callers will typically already hold the graph lock once the locking work is completed, which means that they can't call functions that take it internally. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Emanuele Giuseppe Esposito <eesposit@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20230911094620.45040-13-kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-09-20block: Mark bdrv_attach_child_common() GRAPH_WRLOCKKevin Wolf
Instead of taking the writer lock internally, require callers to already hold it when calling bdrv_attach_child_common(). These callers will typically already hold the graph lock once the locking work is completed, which means that they can't call functions that take it internally. Note that the transaction callbacks still take the lock internally, so tran_finalize() must be called without the lock held. This is because bdrv_append() also calls bdrv_replace_node_noperm(), which currently requires the transaction callbacks to be called unlocked. In the next step, both of them can be switched to locked tran_finalize() calls together. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20230911094620.45040-11-kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-09-20block: Introduce bdrv_schedule_unref()Kevin Wolf
bdrv_unref() is called by a lot of places that need to hold the graph lock (it naturally happens in the context of operations that change the graph). However, bdrv_unref() takes the graph writer lock internally, so it can't actually be called while already holding a graph lock without causing a deadlock. bdrv_unref() also can't just become GRAPH_WRLOCK because it drains the node before closing it, and draining requires that the graph is unlocked. The solution is to defer deleting the node until we don't hold the lock any more and draining is possible again. Note that keeping images open for longer than necessary can create problems, too: You can't open an image again before it is really closed (if image locking didn't prevent it, it would cause corruption). Reopening an image immediately happens at least during bdrv_open() and bdrv_co_create(). In order to solve this problem, make sure to run the deferred unref in bdrv_graph_wrunlock(), i.e. the first possible place where we can drain again. This is also why bdrv_schedule_unref() is marked GRAPH_WRLOCK. The output of iotest 051 is updated because the additional polling changes the order of HMP output, resulting in a new "(qemu)" prompt in the test output that was previously on a separate line and filtered out. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Message-ID: <20230911094620.45040-6-kwolf@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-09-20preallocate: Don't poll during permission updatesKevin Wolf
When the permission related BlockDriver callbacks are called, we are in the middle of an operation traversing the block graph. Polling in such a place is a very bad idea because the graph could change in unexpected ways. In the future, callers will also hold the graph lock, which is likely to turn polling into a deadlock. So we need to get rid of calls to functions like bdrv_getlength() or bdrv_truncate() there as these functions poll internally. They are currently used so that when no parent has write/resize permissions on the image any more, the preallocate filter drops the extra preallocated area in the image file and gives up write/resize permissions itself. In order to achieve this without polling in .bdrv_check_perm, don't immediately truncate the image, but only schedule a BH to do so. The filter keeps the write/resize permissions a bit longer now until the BH has executed. There is one case in which delaying doesn't work: Reopening the image read-only. In this case, bs->file will likely be reopened read-only, too, so keeping write permissions a bit longer on it doesn't work. But we can already cover this case in preallocate_reopen_prepare() and not rely on the permission updates for it. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20230911094620.45040-4-kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-09-20preallocate: Factor out preallocate_truncate_to_real_size()Kevin Wolf
It's essentially the same code in preallocate_check_perm() and preallocate_close(), except that the latter ignores errors. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Emanuele Giuseppe Esposito <eesposit@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20230911094620.45040-3-kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-09-11Merge tag 'for-upstream' of https://repo.or.cz/qemu/kevin into stagingStefan Hajnoczi
Block layer patches - Optimise reqs_lock to make multiqueue actually scale - virtio: Drop out of coroutine context in virtio_load() - iotests: Fix reference output for some tests after recent changes - vpc: Avoid dynamic stack allocation - Code cleanup, improved documentation # -----BEGIN PGP SIGNATURE----- # # iQJFBAABCAAvFiEE3D3rFZqa+V09dFb+fwmycsiPL9YFAmT7VYgRHGt3b2xmQHJl # ZGhhdC5jb20ACgkQfwmycsiPL9YfOg/7BoYF6lkB7DF/jH3XLY6f8zoI+OVM7dg1 # QFEjyVO+uZiJVh0CeBNI9WgnBe7f5vXMbiStyGbWKo3BLUsjnwoQcW/Sxpw61bR2 # jZYK6UHe0RhFqTQpbt8G1iCmlpRS+sX+Cy+lxcVcbqxcnLRXCOjT6ivyA4bGbYIC # q9BHg/9hBmjuM05NTV6Axy8qjqBGVaIWE9ALTnw8H//waBr4/ydJPTl7EWHe3+tO # Stm73evgPG7aLHM6W4qdFW4gwAQ8f+f42Q+0NH1YavB/pN3LTN1B6sLQY/51du+0 # d/JCsXex0IZQXmNPhqv1h01vhOyU9WBmlwpPG2iZv3a06SXk1ys3rQt/L7uIcsZg # Z58CpcUJ517FERnkl0BWXzYhsdcW2K+RdlaiL5PX6H1A2B9LT05ouZfD47hh7kKv # oX+Ulk05PFr3JRCKQF6QDEejRKXt169bGzInTlns/wXinD/V4sCkUnr9aWQuhoWk # KhQm7WMscTTIyHP2FznO4x9kq0ALsoX/NKqBW2wgJUtqRzsd4XxPp5CXEsAir8Vt # dpne/DaV5iDI1mGFJrvkctJN545tEoezBtUzC8/9rZGE0cxHAkhvQVZUDo7xVmrq # PlGQ1ko9cNui/Gf9B6qDqaJJwSyw0S6vHurGVQJRwbyly57Fi5aisWkr4w7Rc4eA # 7u9B1RvwF/Q= # =2wGD # -----END PGP SIGNATURE----- # gpg: Signature made Fri 08 Sep 2023 13:10:32 EDT # gpg: using RSA key DC3DEB159A9AF95D3D7456FE7F09B272C88F2FD6 # gpg: issuer "kwolf@redhat.com" # gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>" [full] # Primary key fingerprint: DC3D EB15 9A9A F95D 3D74 56FE 7F09 B272 C88F 2FD6 * tag 'for-upstream' of https://repo.or.cz/qemu/kevin: virtio: Drop out of coroutine context in virtio_load() vmstate: Mark VMStateInfo.get/put() coroutine_mixed_fn block: Make more BlockDriver definitions static block/meson.build: Restore alphabetical order of files block: Remove unnecessary variable in bdrv_block_device_info block: Remove bdrv_query_block_node_info vmdk: Clean up bdrv_open_child() return value check qemu-img: Update documentation for compressed images block: Be more verbose in create fallback block/iscsi: Document why we use raw malloc() qemu-img: omit errno value in error message block: change reqs_lock to QemuMutex block: minimize bs->reqs_lock section in tracked_request_end() iotests: adapt test output for new qemu_cleanup() behavior block/vpc: Avoid dynamic stack allocation Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2023-09-08block: Make more BlockDriver definitions staticKevin Wolf
Most block driver implementations don't have any reason for their BlockDriver to be public. The only exceptions are bdrv_file, bdrv_raw and bdrv_qcow2, which are actually used in other source files. Make all other BlockDriver definitions static if they aren't yet. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Message-ID: <20230905130607.35134-3-kwolf@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-09-08block/meson.build: Restore alphabetical order of filesKevin Wolf
When commit 5e5733e5999 created block/meson.build, the list of unconditionally added files was in alphabetical order. Later commits added new files in random places. Reorder the list to be alphabetical again. (As for ordering foo.c against foo-*.c, there are both ways used currently; standardise on having foo.c first, even though this is different from the original commit 5e5733e5999.) Signed-off-by: Kevin Wolf <kwolf@redhat.com> Message-ID: <20230905130607.35134-2-kwolf@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-09-08block: Remove unnecessary variable in bdrv_block_device_infoFabiano Rosas
The commit 5d8813593f ("block/qapi: Let bdrv_query_image_info() recurse") removed the loop where we set the 'bs0' variable, so now it is just the same as 'bs'. Signed-off-by: Fabiano Rosas <farosas@suse.de> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Message-ID: <20230901184605.32260-3-farosas@suse.de> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-09-08block: Remove bdrv_query_block_node_infoFabiano Rosas
The last call site of this function has been removed by commit c04d0ab026 ("qemu-img: Let info print block graph"). Reviewed-by: Claudio Fontana <cfontana@suse.de> Signed-off-by: Fabiano Rosas <farosas@suse.de> Message-ID: <20230901184605.32260-2-farosas@suse.de> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-09-08vmdk: Clean up bdrv_open_child() return value checkDmitry Frolov
bdrv_open_child() may return NULL. Usually return value is checked for this function. Check for return value is more reliable. Fixes: 24bc15d1f6 ("vmdk: Use BdrvChild instead of BDS for references to extents") Signed-off-by: Dmitry Frolov <frolov@swemel.ru> Message-ID: <20230831125926.796205-1-frolov@swemel.ru> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-09-08block/iscsi: Document why we use raw malloc()Peter Maydell
In block/iscsi.c we use a raw malloc() call, which is unusual given the project standard is to use the glib memory allocation functions. Document why we do so, to avoid it being converted to g_malloc() by mistake. Signed-off-by: Peter Maydell <peter.maydell@linaro.org> Message-ID: <20230727150705.2664464-1-peter.maydell@linaro.org> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-09-08block: change reqs_lock to QemuMutexStefan Hajnoczi
CoMutex has poor performance when lock contention is high. The tracked requests list is accessed frequently and performance suffers in QEMU multi-queue block layer scenarios. It is not necessary to use CoMutex for the requests lock. The lock is always released across coroutine yield operations. It is held for relatively short periods of time and it is not beneficial to yield when the lock is held by another coroutine. Change the lock type from CoMutex to QemuMutex to improve multi-queue block layer performance. fio randread bs=4k iodepth=64 with 4 IOThreads handling a virtio-blk device with 8 virtqueues improves from 254k to 517k IOPS (+203%). Full benchmark results and configuration details are available here: https://gitlab.com/stefanha/virt-playbooks/-/commit/980c40845d540e3669add1528739503c2e817b57 In the future we may wish to introduce thread-local tracked requests lists to avoid lock contention completely. That would be much more involved though. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20230808155852.2745350-3-stefanha@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-09-08block: minimize bs->reqs_lock section in tracked_request_end()Stefan Hajnoczi
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20230808155852.2745350-2-stefanha@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-09-08block/vpc: Avoid dynamic stack allocationPhilippe Mathieu-Daudé
Use autofree heap allocation instead of variable-length array on the stack. Here we don't expect the bitmap size to be enormous, and since we're about to read/write it to disk the overhead of the allocation should be fine. The codebase has very few VLAs, and if we can get rid of them all we can make the compiler error on new additions. This is a defensive measure against security bugs where an on-stack dynamic allocation isn't correctly size-checked (e.g. CVE-2021-3527). Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> [PMM: expanded commit message] Signed-off-by: Peter Maydell <peter.maydell@linaro.org> Message-ID: <20230811175229.808139-1-peter.maydell@linaro.org> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Francisco Iglesias <frasse.iglesias@gmail.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2023-09-08Merge tag 'pull-trivial-patches' of https://gitlab.com/mjt0k/qemu into stagingStefan Hajnoczi
trivial patches for 2023-09-08 # -----BEGIN PGP SIGNATURE----- # # iQFDBAABCAAtFiEEe3O61ovnosKJMUsicBtPaxppPlkFAmT68tMPHG1qdEB0bHMu # bXNrLnJ1AAoJEHAbT2saaT5ZbEwH/2XcX1f4KcEJbgUn0JVhGQ5GH2c2jepZlkTZ # 2dhvdEECbOPMg73hty0fyyWlyuLWdJ9cMpONfMtzmHTH8RKEOAbpn/zusyo3H+48 # 6cunyUpBqbmb7MHPchrN+JmvtvaSPSazsj2Zdkh+Y4WlfEYj+yVysQ4zQlBlRyHv # iOTi6OdjxXg1QcbtJxAUhp+tKaRJzagiCpLkoyW2m8DIuV9cLVHMJsE3OMgfKNgK # /S+O1fLcaDhuSCrHAbZzArF3Tr4bfLqSwDtGCJfQpqKeIQDJuI+41GLIlm1nYY70 # IFJzEWMOrX/rcMG1CQnUFZOOyDSO+NfILwNnU+eyM49MUekmY54= # =mmPS # -----END PGP SIGNATURE----- # gpg: Signature made Fri 08 Sep 2023 06:09:23 EDT # gpg: using RSA key 7B73BAD68BE7A2C289314B22701B4F6B1A693E59 # gpg: issuer "mjt@tls.msk.ru" # gpg: Good signature from "Michael Tokarev <mjt@tls.msk.ru>" [full] # gpg: aka "Michael Tokarev <mjt@corpit.ru>" [full] # gpg: aka "Michael Tokarev <mjt@debian.org>" [full] # Primary key fingerprint: 6EE1 95D1 886E 8FFB 810D 4324 457C E0A0 8044 65C5 # Subkey fingerprint: 7B73 BAD6 8BE7 A2C2 8931 4B22 701B 4F6B 1A69 3E59 * tag 'pull-trivial-patches' of https://gitlab.com/mjt0k/qemu: (22 commits) qxl: don't assert() if device isn't yet initialized hw/net/vmxnet3: Fix guest-triggerable assert() tests/qtest/usb-hcd: Remove the empty "init" tests target/ppc: use g_free() in test_opcode_table() hw/ppc: use g_free() in spapr_tce_table_post_load() trivial: Simplify the spots that use TARGET_BIG_ENDIAN as a numeric value accel/tcg: Fix typo in translator_io_start() description tests/qtest/test-hmp: Fix migrate_set_parameter xbzrle-cache-size test docs tests: Fix use of migrate_set_parameter qemu-options.hx: Rephrase the descriptions of the -hd* and -cdrom options hw/display/xlnx_dp: update comments block: spelling fixes misc/other: spelling fixes qga/: spelling fixes tests/: spelling fixes scripts/: spelling fixes include/: spelling fixes audio: spelling fixes xen: spelling fix riscv: spelling fixes ... Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2023-09-08block: spelling fixesMichael Tokarev
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru> Reviewed-by: Eric Blake <eblake@redhat.com>
2023-09-07io: follow coroutine AioContext in qio_channel_yield()Stefan Hajnoczi
The ongoing QEMU multi-queue block layer effort makes it possible for multiple threads to process I/O in parallel. The nbd block driver is not compatible with the multi-queue block layer yet because QIOChannel cannot be used easily from coroutines running in multiple threads. This series changes the QIOChannel API to make that possible. In the current API, calling qio_channel_attach_aio_context() sets the AioContext where qio_channel_yield() installs an fd handler prior to yielding: qio_channel_attach_aio_context(ioc, my_ctx); ... qio_channel_yield(ioc); // my_ctx is used here ... qio_channel_detach_aio_context(ioc); This API design has limitations: reading and writing must be done in the same AioContext and moving between AioContexts involves a cumbersome sequence of API calls that is not suitable for doing on a per-request basis. There is no fundamental reason why a QIOChannel needs to run within the same AioContext every time qio_channel_yield() is called. QIOChannel only uses the AioContext while inside qio_channel_yield(). The rest of the time, QIOChannel is independent of any AioContext. In the new API, qio_channel_yield() queries the AioContext from the current coroutine using qemu_coroutine_get_aio_context(). There is no need to explicitly attach/detach AioContexts anymore and qio_channel_attach_aio_context() and qio_channel_detach_aio_context() are gone. One coroutine can read from the QIOChannel while another coroutine writes from a different AioContext. This API change allows the nbd block driver to use QIOChannel from any thread. It's important to keep in mind that the block driver already synchronizes QIOChannel access and ensures that two coroutines never read simultaneously or write simultaneously. This patch updates all users of qio_channel_attach_aio_context() to the new API. Most conversions are simple, but vhost-user-server requires a new qemu_coroutine_yield() call to quiesce the vu_client_trip() coroutine when not attached to any AioContext. While the API is has become simpler, there is one wart: QIOChannel has a special case for the iohandler AioContext (used for handlers that must not run in nested event loops). I didn't find an elegant way preserve that behavior, so I added a new API called qio_channel_set_follow_coroutine_ctx(ioc, true|false) for opting in to the new AioContext model. By default QIOChannel uses the iohandler AioHandler. Code that formerly called qio_channel_attach_aio_context() now calls qio_channel_set_follow_coroutine_ctx(ioc, true) once after the QIOChannel is created. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Acked-by: Daniel P. Berrangé <berrange@redhat.com> Message-ID: <20230830224802.493686-5-stefanha@redhat.com> [eblake: also fix migration/rdma.c] Signed-off-by: Eric Blake <eblake@redhat.com>
2023-09-06parallels: Add data_off repairing to parallels_open()Alexander Ivanov
Place data_start/data_end calculation after reading the image header to s->header. Set s->data_start to the offset calculated in parallels_test_data_off(). Call bdrv_check() if data_off is incorrect. Signed-off-by: Alexander Ivanov <alexander.ivanov@virtuozzo.com> Reviewed-by: Denis V. Lunev <den@openvz.org> Signed-off-by: Denis V. Lunev <den@openvz.org>
2023-09-06parallels: Add data_off checkAlexander Ivanov
data_off field of the parallels image header can be corrupted. Check if this field greater than the header + BAT size and less than file size. Signed-off-by: Alexander Ivanov <alexander.ivanov@virtuozzo.com> Reviewed-by: Denis V. Lunev <den@openvz.org> Signed-off-by: Denis V. Lunev <den@openvz.org>
2023-09-06parallels: Use bdrv_co_getlength() in parallels_check_outside_image()Alexander Ivanov
bdrv_co_getlength() should be used in coroutine context. Replace bdrv_getlength() by bdrv_co_getlength() in parallels_check_outside_image(). Signed-off-by: Alexander Ivanov <alexander.ivanov@virtuozzo.com> Reviewed-by: Denis V. Lunev <den@openvz.org> Signed-off-by: Denis V. Lunev <den@openvz.org>
2023-09-06parallels: Image repairing in parallels_open()Alexander Ivanov
Repair an image at opening if the image is unclean or out-of-image corruption was detected. Signed-off-by: Alexander Ivanov <alexander.ivanov@virtuozzo.com> Reviewed-by: Denis V. Lunev <den@openvz.org> Signed-off-by: Denis V. Lunev <den@openvz.org>
2023-09-06parallels: Add checking and repairing duplicate offsets in BATAlexander Ivanov
Cluster offsets must be unique among all the BAT entries. Find duplicate offsets in the BAT and fix it by copying the content of the relevant cluster to a newly allocated cluster and set the new cluster offset to the duplicated entry. Add host_cluster_index() helper to deduplicate the code. When new clusters are allocated, the file size increases by 128 Mb. Call parallels_check_leak() to fix this leak. Signed-off-by: Alexander Ivanov <alexander.ivanov@virtuozzo.com> Reviewed-by: Denis V. Lunev <den@openvz.org> Signed-off-by: Denis V. Lunev <den@openvz.org>
2023-09-06parallels: Add data_start field to BDRVParallelsStateAlexander Ivanov
In the next patch we will need the offset of the data area for host cluster index calculation. Add this field and setting up code. Signed-off-by: Alexander Ivanov <alexander.ivanov@virtuozzo.com> Reviewed-by: Denis V. Lunev <den@openvz.org> Signed-off-by: Denis V. Lunev <den@openvz.org>
2023-09-06parallels: Add "explicit" argument to parallels_check_leak()Alexander Ivanov
In the on of the next patches we need to repair leaks without changing leaks and leaks_fixed info in res. Also we don't want to print any warning about leaks. Add "explicit" argument to skip info changing if the argument is false. Signed-off-by: Alexander Ivanov <alexander.ivanov@virtuozzo.com> Reviewed-by: Denis V. Lunev <den@openvz.org> Signed-off-by: Denis V. Lunev <den@openvz.org>
2023-09-06parallels: Check if data_end greater than the file sizeAlexander Ivanov
Initially data_end is set to the data_off image header field and must not be greater than the file size. Signed-off-by: Alexander Ivanov <alexander.ivanov@virtuozzo.com> Reviewed-by: Denis V. Lunev <den@openvz.org> Signed-off-by: Denis V. Lunev <den@openvz.org>
2023-09-06parallels: Incorrect data end calculation in parallels_open()Alexander Ivanov
The BDRVParallelsState structure contains data_end field that is measured in sectors. In parallels_open() initially this field is set by data_off field from parallels image header. According to the parallels format documentation, data_off field contains an offset, in sectors, from the start of the file to the start of the data area. For "WithoutFreeSpace" images: if data_off is zero, the offset is calculated as the end of the BAT table plus some padding to ensure sector size alignment. The parallels_open() function has code for handling zero value in data_off, but in the result data_end contains the offset in bytes. Replace the alignment to sector size by division by sector size and fix the comparision with s->header_size. Signed-off-by: Alexander Ivanov <alexander.ivanov@virtuozzo.com> Reviewed-by: Denis V. Lunev <den@openvz.org> Reviewed-by: Hanna Czenczek <hreitz@redhat.com> Signed-off-by: Denis V. Lunev <den@openvz.org>
2023-09-06parallels: Fix comments formatting inside parallels driverAlexander Ivanov
This patch is technically necessary as git patch rendering could result in moving some code from one place to the another and that hits checkpatch.pl warning. This problem specifically happens within next series. Signed-off-by: Alexander Ivanov <alexander.ivanov@virtuozzo.com> Reviewed-by: Denis V. Lunev <den@openvz.org> Signed-off-by: Denis V. Lunev <den@openvz.org>
2023-08-30block/io: align requests to subcluster_sizeAndrey Drobyshev
When target image is using subclusters, and we align the request during copy-on-read, it makes sense to align to subcluster_size rather than cluster_size. Otherwise we end up with unnecessary allocations. This commit renames bdrv_round_to_clusters() to bdrv_round_to_subclusters() and utilizes subcluster_size field of BlockDriverInfo to make necessary alignments. It affects copy-on-read as well as mirror job (which is using bdrv_round_to_clusters()). This change also fixes the following bug with failing assert (covered by the test in the subsequent commit): qemu-img create -f qcow2 base.qcow2 64K qemu-img create -f qcow2 -o extended_l2=on,backing_file=base.qcow2,backing_fmt=qcow2 img.qcow2 64K qemu-io -c "write -P 0xaa 0 2K" img.qcow2 qemu-io -C -c "read -P 0x00 2K 62K" img.qcow2 qemu-io: ../block/io.c:1236: bdrv_co_do_copy_on_readv: Assertion `skip_bytes < pnum' failed. Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Denis V. Lunev <den@openvz.org> Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20230711172553.234055-3-andrey.drobyshev@virtuozzo.com>
2023-08-30block: add subcluster_size field to BlockDriverInfoAndrey Drobyshev
This is going to be used in the subsequent commit as requests alignment (in particular, during copy-on-read). This value only makes sense for the formats which support subclusters (currently QCOW2 only). If this field isn't set by driver's own bdrv_get_info() implementation, we simply set it equal to the cluster size thus treating each cluster as having a single subcluster. Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Denis V. Lunev <den@openvz.org> Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20230711172553.234055-2-andrey.drobyshev@virtuozzo.com>
2023-08-29file-posix: Simplify raw_co_prw's 'out' zone codeHanna Czenczek
We duplicate the same condition three times here, pull it out to the top level. Signed-off-by: Hanna Czenczek <hreitz@redhat.com> Message-Id: <20230824155345.109765-5-hreitz@redhat.com> Reviewed-by: Sam Li <faithilikerun@gmail.com>
2023-08-29file-posix: Fix zone update in I/O error pathHanna Czenczek
We must check that zone information is present before running update_zones_wp(). Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2234374 Fixes: Coverity CID 1512459 Signed-off-by: Hanna Czenczek <hreitz@redhat.com> Message-Id: <20230824155345.109765-4-hreitz@redhat.com> Reviewed-by: Sam Li <faithilikerun@gmail.com>
2023-08-29file-posix: Check bs->bl.zoned for zone infoHanna Czenczek
Instead of checking bs->wps or bs->bl.zone_size for whether zone information is present, check bs->bl.zoned. That is the flag that raw_refresh_zoned_limits() reliably sets to indicate zone support. If it is set to something other than BLK_Z_NONE, other values and objects like bs->wps and bs->bl.zone_size must be non-null/zero and valid; if it is not, we cannot rely on their validity. Signed-off-by: Hanna Czenczek <hreitz@redhat.com> Message-Id: <20230824155345.109765-3-hreitz@redhat.com> Reviewed-by: Sam Li <faithilikerun@gmail.com>
2023-08-29file-posix: Clear bs->bl.zoned on errorHanna Czenczek
bs->bl.zoned is what indicates whether the zone information is present and valid; it is the only thing that raw_refresh_zoned_limits() sets if CONFIG_BLKZONED is not defined, and it is also the only thing that it sets if CONFIG_BLKZONED is defined, but there are no zones. Make sure that it is always set to BLK_Z_NONE if there is an error anywhere in raw_refresh_zoned_limits() so that we do not accidentally announce zones while our information is incomplete or invalid. This also fixes a memory leak in the last error path in raw_refresh_zoned_limits(). Signed-off-by: Hanna Czenczek <hreitz@redhat.com> Message-Id: <20230824155345.109765-2-hreitz@redhat.com> Reviewed-by: Sam Li <faithilikerun@gmail.com>
2023-08-29block/throttle-groups: Use ThrottleDirection instread of bool is_writezhenwei pi
'bool is_write' style is obsolete from throttle framework, adapt block throttle groups to the new style: - use ThrottleDirection instead of 'bool is_write'. Ex, schedule_next_request(ThrottleGroupMember *tgm, bool is_write) -> schedule_next_request(ThrottleGroupMember *tgm, ThrottleDirection direction) - use THROTTLE_MAX instead of hard code. Ex, ThrottleGroupMember *tokens[2] -> ThrottleGroupMember *tokens[THROTTLE_MAX] - use ThrottleDirection instead of hard code on iteration. Ex, (i = 0; i < 2; i++) -> for (dir = THROTTLE_READ; dir < THROTTLE_MAX; dir++) Use a simple python script to test the new style: #!/usr/bin/python3 import subprocess import random import time commands = ['virsh blkdeviotune jammy vda --write-bytes-sec ', \ 'virsh blkdeviotune jammy vda --write-iops-sec ', \ 'virsh blkdeviotune jammy vda --read-bytes-sec ', \ 'virsh blkdeviotune jammy vda --read-iops-sec '] for loop in range(1, 1000): time.sleep(random.randrange(3, 5)) command = commands[random.randrange(0, 3)] + str(random.randrange(0, 1000000)) subprocess.run(command, shell=True, check=True) This works fine. Signed-off-by: zhenwei pi <pizhenwei@bytedance.com> Message-Id: <20230728022006.1098509-10-pizhenwei@bytedance.com> Reviewed-by: Hanna Czenczek <hreitz@redhat.com> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
2023-08-29throttle: use enum ThrottleDirection instead of bool is_writezhenwei pi
enum ThrottleDirection is already there, use ThrottleDirection instead of 'bool is_write' for throttle API, also modify related codes from block, fsdev, cryptodev and tests. Reviewed-by: Hanna Czenczek <hreitz@redhat.com> Signed-off-by: zhenwei pi <pizhenwei@bytedance.com> Message-Id: <20230728022006.1098509-7-pizhenwei@bytedance.com> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
2023-08-03block/blkio: add more comments on the fd passing handlingStefano Garzarella
As Hanna pointed out, it is not clear in the code why qemu_open() can fail, and why blkio_set_int("fd") is not enough to discover the `fd` property support. Let's fix them by adding more details in the code comments. Suggested-by: Hanna Czenczek <hreitz@redhat.com> Reviewed-by: Hanna Czenczek <hreitz@redhat.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Message-id: 20230803082825.25293-3-sgarzare@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2023-08-03block/blkio: close the fd when blkio_connect() failsStefano Garzarella
libblkio drivers take ownership of `fd` only after a successful blkio_connect(), so if it fails, we are still the owners. Fixes: cad2ccc395 ("block/blkio: use qemu_open() to support fd passing for virtio-blk") Suggested-by: Hanna Czenczek <hreitz@redhat.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Hanna Czenczek <hreitz@redhat.com> Message-id: 20230803082825.25293-2-sgarzare@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2023-07-27Merge tag 'block-pull-request' of https://gitlab.com/stefanha/qemu into stagingRichard Henderson
Pull request Please include these bug fixes in QEMU 8.1. Thanks! # -----BEGIN PGP SIGNATURE----- # # iQEzBAABCAAdFiEEhpWov9P5fNqsNXdanKSrs4Grc8gFAmTCzPUACgkQnKSrs4Gr # c8g1DAf/fPUQ4zRsCn079pHIyK9TFo4COm23p4kiusxj8otfjt8LH1Zsc9pGWC2+ # bl2RlnPID8JlyJFDRN7b/RCEhj45a83GtCmhDDmqVgy1eO5vwOKm2XyyWeD+pq/U # Hf2QLPLZZ7tCD8Njpty+gB3Ux4zqthKGXSg8FpJ3w0tl4me2efLvjMa6jHMwtnHT # aAbyQ3WMpT9w4XHLqRQDHzBqrTSY4od3nl9SrM/DQ2klLIcz8ECTEZVBY9B3pq6m # QvAg24tfb0QvS14YnZv/PMCfOaVuE87M9G4f93pCynnMxMYze+XczL0sGhIAS9wp # 03NgGlhGumOix6r2kHjlG6p3xywV8A== # =jMf8 # -----END PGP SIGNATURE----- # gpg: Signature made Thu 27 Jul 2023 01:00:53 PM PDT # gpg: using RSA key 8695A8BFD3F97CDAAC35775A9CA4ABB381AB73C8 # gpg: Good signature from "Stefan Hajnoczi <stefanha@redhat.com>" [full] # gpg: aka "Stefan Hajnoczi <stefanha@gmail.com>" [full] * tag 'block-pull-request' of https://gitlab.com/stefanha/qemu: block/blkio: use blkio_set_int("fd") to check fd support block/blkio: fall back on using `path` when `fd` setting fails block/blkio: retry blkio_connect() if it fails using `fd` block/blkio: move blkio_connect() in the drivers functions block: Fix pad_request's request restriction block/file-posix: fix g_file_get_contents return path block/blkio: do not use open flags in qemu_open() block/blkio: enable the completion eventfd Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2023-07-27block/blkio: use blkio_set_int("fd") to check fd supportStefano Garzarella
Setting the `fd` property fails with virtio-blk-* libblkio drivers that do not support fd passing since https://gitlab.com/libblkio/libblkio/-/merge_requests/208. Getting the `fd` property, on the other hand, always succeeds for virtio-blk-* libblkio drivers even when they don't support fd passing. This patch switches to setting the `fd` property because it is a better mechanism for probing fd passing support than getting the `fd` property. Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Message-id: 20230727161020.84213-5-sgarzare@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>