path: root/include/block
2022-05-12  nbd/server: Allow MULTI_CONN for shared writable exports  (Eric Blake)
According to the NBD spec, a server that advertises NBD_FLAG_CAN_MULTI_CONN promises that multiple client connections will not see any cache inconsistencies: when properly separated by a single flush, actions performed by one client will be visible to another client, regardless of which client did the flush.

We always satisfy these conditions in qemu: even when we support multiple clients, ALL clients go through a single point of reference into the block layer, with no local caching. The effect of one client is instantly visible to the next client. Even if our backend were a network device, we argue that any multi-path caching effects that would prevent back-to-back actions from seeing the effect of previous actions would be a bug in that backend, and not the fault of caching in qemu. As such, it is safe to unconditionally advertise CAN_MULTI_CONN for any qemu NBD server situation that supports parallel clients.

Note, however, that we don't want to advertise CAN_MULTI_CONN when we know that a second client cannot connect (for historical reasons, qemu-nbd defaults to a single connection while nbd-server-add and QMP commands default to unlimited connections; but we already have existing means to let either style of NBD server creation alter those defaults). This is visible by no longer advertising MULTI_CONN for 'qemu-nbd -r' without -e, as in the iotest nbd-qemu-allocation.

The harder part of this patch is setting up an iotest to demonstrate the behavior of multiple NBD clients against a single server. It might be possible with parallel qemu-io processes, but I found it easier to do in python with the help of libnbd, and help from Nir and Vladimir in writing the test.

Signed-off-by: Eric Blake <eblake@redhat.com>
Suggested-by: Nir Soffer <nsoffer@redhat.com>
Suggested-by: Vladimir Sementsov-Ogievskiy <v.sementsov-og@mail.ru>
Message-Id: <20220512004924.417153-3-eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
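As a rough illustration of the decision this commit describes (not the actual qemu NBD server code; the function and its structure are assumptions, only the flag values follow the NBD spec), the advertisement boils down to: set CAN_MULTI_CONN unless the server is limited to exactly one client.

    /* Sketch only: flag bits follow the NBD spec; the function is
     * illustrative, not the real qemu server code. */
    #include <stdbool.h>
    #include <stdint.h>

    #define NBD_FLAG_HAS_FLAGS       (1 << 0)
    #define NBD_FLAG_READ_ONLY       (1 << 1)
    #define NBD_FLAG_CAN_MULTI_CONN  (1 << 8)

    static uint16_t nbd_export_flags_sketch(bool writable, int max_connections)
    {
        uint16_t flags = NBD_FLAG_HAS_FLAGS;

        if (!writable) {
            flags |= NBD_FLAG_READ_ONLY;
        }
        /* All clients share one block-layer access point with no local
         * caching, so cross-connection consistency holds; only a hard
         * limit of a single client makes the flag pointless. */
        if (max_connections != 1) {
            flags |= NBD_FLAG_CAN_MULTI_CONN;
        }
        return flags;
    }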
2022-05-12  qemu-nbd: Pass max connections to blockdev layer  (Eric Blake)
The next patch wants to adjust whether the NBD server code advertises MULTI_CONN based on whether it is known if the server limits to exactly one client. For a server started by QMP, this information is obtained through nbd_server_start (which can support more than one export); but for qemu-nbd (which supports exactly one export), it is controlled only by the command-line option -e/--shared. Since we already have a hook function used by qemu-nbd, it's easiest to just alter its signature to fit our needs.

Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20220512004924.417153-2-eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-05-11  Clean up decorations and whitespace around header guards  (Markus Armbruster)
Cleaned up with scripts/clean-header-guards.pl.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20220506134911.2856099-5-armbru@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
2022-05-11  Clean up header guards that don't match their file name  (Markus Armbruster)
Header guard symbols should match their file name to make guard collisions less likely. Cleaned up with scripts/clean-header-guards.pl, followed by some renaming of new guard symbols picked by the script to better ones.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20220506134911.2856099-2-armbru@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
[Change to generated file ebpf/rss.bpf.skeleton.h backed out]
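For reference, the convention the script enforces looks like this; the header path below is hypothetical, used only to show the guard style.

    /* include/block/example.h -- hypothetical header */
    #ifndef BLOCK_EXAMPLE_H
    #define BLOCK_EXAMPLE_H

    /* declarations ... */

    #endif /* BLOCK_EXAMPLE_H */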
2022-05-09  util/event-loop-base: Introduce options to set the thread pool size  (Nicolas Saenz Julienne)
The thread pool regulates itself: when idle, it kills threads until empty; when in demand, it creates new threads until full. This behaviour doesn't play well with latency-sensitive workloads where the price of creating a new thread is too high. For example, when paired with qemu's '-mlock', or when using safety features like SafeStack, creating a new thread has been measured to take multiple milliseconds.

In order to mitigate this, let's introduce a new 'EventLoopBase' property to set the thread pool size. The threads will be created during the pool's initialization or upon updating the property's value, remain available during its lifetime regardless of demand, and be destroyed upon freeing it. A properly characterized workload will then be able to configure the pool to avoid any latency spikes.

Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
Message-id: 20220425075723.20019-4-nsaenzju@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2022-05-04  block: Classify bdrv_get_flags() as I/O function  (Hanna Reitz)
This function is safe to call in an I/O context, and qcow2_do_open() does so (invoked in an I/O context by qcow2_co_invalidate_cache()).

Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220427114057.36651-2-hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-04-26  qapi: rename BlockDirtyBitmapMergeSource to BlockDirtyBitmapOrStr  (Vladimir Sementsov-Ogievskiy)
Rename the type so that it can be reused. The old name describes what the type is for; to be naturally reused for other needs, let's name it after what it is.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@openvz.org>
Message-Id: <20220314213226.362217-2-v.sementsov-og@mail.ru>
[eblake: Adjust S-o-b to Vladimir's new email, with permission]
Reviewed-by: Eric Blake <eblake@redhat.com>
Acked-by: John Snow <jsnow@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2022-04-21  include: move qdict_{crumple,flatten} declarations  (Marc-André Lureau)
Move them where they belong, since the functions are implemented in block-qdict.c.

Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Message-Id: <20220420132624.2439741-25-marcandre.lureau@redhat.com>
2022-03-07  block: pass desired TLS hostname through from block driver client  (Daniel P. Berrangé)
In commit a71d597b989fd701b923f09b3c20ac4fcaa55e81

  Author: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
  Date:   Thu Jun 10 13:08:00 2021 +0300

    block/nbd: reuse nbd_co_do_establish_connection() in nbd_open()

the use of the 'hostname' field from the BDRVNBDState struct was lost, and 'nbd_connect' just hardcoded it to match the IP socket address. This was a harmless bug at the time, since we block use with anything other than IP sockets.

Shortly, though, we want to allow the caller to override the hostname used in the TLS certificate checks. This is to allow for TLS when doing port forwarding or tunneling. Thus we need to reinstate the passing along of the 'hostname'.

Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Message-Id: <20220304193610.3293146-3-berrange@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2022-03-07  block: introduce snapshot-access block driver  (Vladimir Sementsov-Ogievskiy)
The new block driver simply utilizes the snapshot-access API of the underlying block node. In further patches we want to use it like this:

  [guest]                   [NBD export]
     |                           |
     | root                      | root
     v                file       v
  [copy-before-write]<------[snapshot-access]
     |           |
     | file      | target
     v           v
  [active-disk] [temp.img]

This way, the NBD client will be able to read the snapshotted state of the active disk while the active disk continues to be written by the guest. This is known as "fleecing", and it currently uses another scheme based on a qcow2 temporary image whose backing file is active-disk. The new scheme comes with benefits - see the next commit.

The other possible application is exporting internal snapshots of qcow2, like this:

  [guest]                   [NBD export]
     |                           |
     | root                      | root
     v                file       v
  [qcow2]<---------[snapshot-access]

For this, we'll need to implement snapshot-access API handlers in the qcow2 driver, and improve the snapshot-access block driver (and API) to make it possible to select a snapshot by name. Another thing to improve is the size of the snapshot: for now, for simplicity, we just use the size of bs->file, which is OK for backup, but for qcow2 snapshot export we'll need to improve the snapshot-access API to get the size of the snapshot.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20220303194349.2304213-12-vsementsov@virtuozzo.com>
[hreitz: Rebased on block GS/IO split]
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2022-03-07  block/io: introduce block driver snapshot-access API  (Vladimir Sementsov-Ogievskiy)
Add new block driver handlers and corresponding generic wrappers. They will be used to allow the copy-before-write filter to provide a rich fleecing interface in a further commit.

In the future this approach may be used to allow reading qcow2 internal snapshots, for example to export them through NBD.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220303194349.2304213-11-vsementsov@virtuozzo.com>
[hreitz: Rebased on block GS/IO split]
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2022-03-07  block/reqlist: add reqlist_wait_all()  (Vladimir Sementsov-Ogievskiy)
Add a function to wait for all intersecting requests, to be used in a further commit.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Nikita Lapshin <nikita.lapshin@virtuozzo.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220303194349.2304213-10-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2022-03-07  block/dirty-bitmap: introduce bdrv_dirty_bitmap_status()  (Vladimir Sementsov-Ogievskiy)
Add a convenience function, similar to bdrv_block_status(), to get the status of a dirty bitmap.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220303194349.2304213-9-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
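A usage sketch of the new helper, assuming a bdrv_block_status()-like prototype (bitmap, offset, bytes, out-pointer for the dirty flag, out-pointer for the byte count); the exact parameter list may differ.

    #include "qemu/osdep.h"
    #include "block/dirty-bitmap.h"

    /* Sketch: walk a range and visit its dirty/clean extents.  The
     * bdrv_dirty_bitmap_status() prototype used here is an assumption. */
    static void walk_bitmap_sketch(BdrvDirtyBitmap *bitmap,
                                   int64_t offset, int64_t end)
    {
        while (offset < end) {
            bool dirty;
            int64_t count;

            bdrv_dirty_bitmap_status(bitmap, offset, end - offset,
                                     &dirty, &count);
            /* ... act on [offset, offset + count) depending on @dirty ... */
            offset += count;
        }
    }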
2022-03-07  block: introduce reqlist  (Vladimir Sementsov-Ogievskiy)
Split the intersecting-requests functionality out of block-copy so it can be reused in the copy-before-write filter.

Note: while here, fix a tiny typo in MAINTAINERS.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220303194349.2304213-7-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2022-03-07  block/block-copy: add block_copy_reset()  (Vladimir Sementsov-Ogievskiy)
Split block_copy_reset() out of block_copy_reset_unallocated() to be used separately later.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220303194349.2304213-6-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2022-03-07  block/block-copy: block_copy_state_new(): add bitmap parameter  (Vladimir Sementsov-Ogievskiy)
This will be used in the following commit to bring "incremental" mode to the copy-before-write filter.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220303194349.2304213-4-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2022-03-07  block/dirty-bitmap: bdrv_merge_dirty_bitmap(): add return value  (Vladimir Sementsov-Ogievskiy)
This simplifies failure handling in existing code and in further new usages of bdrv_merge_dirty_bitmap().

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220303194349.2304213-3-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
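A sketch of the caller-side simplification, under the assumption that the function now returns a bool success flag while keeping its (dest, src, backup, errp) parameters.

    #include "qemu/osdep.h"
    #include "qapi/error.h"
    #include "block/dirty-bitmap.h"

    /* Before (sketch): a void return forced the usual local Error dance:
     *
     *     Error *local_err = NULL;
     *     bdrv_merge_dirty_bitmap(dst, src, NULL, &local_err);
     *     if (local_err) {
     *         error_propagate(errp, local_err);
     *         return;
     *     }
     */

    /* After (sketch): failure handling becomes direct. */
    static void merge_example(BdrvDirtyBitmap *dst,
                              const BdrvDirtyBitmap *src, Error **errp)
    {
        if (!bdrv_merge_dirty_bitmap(dst, src, NULL, errp)) {
            return;
        }
        /* ... continue with the merged bitmap ... */
    }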
2022-03-07  block: fix preallocate filter: don't do unaligned preallocate requests  (Vladimir Sementsov-Ogievskiy)
There is a bug in handling the BDRV_REQ_NO_WAIT flag: we may still wait in wait_serialising_requests() if the request is unaligned. And this is possible for the only user of this flag (the preallocate filter) if the underlying file is not aligned to its request_alignment at startup.

So, we have to fix the preallocate filter to do only aligned preallocate requests.

Next, we should fix generic block/io.c somehow. Keeping in mind that preallocate is the only user of BDRV_REQ_NO_WAIT and that we have to fix its behavior now, it seems safer to just assert that we never use BDRV_REQ_NO_WAIT with unaligned requests, and add a corresponding comment. Let's do so.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
Message-Id: <20220215121609.38570-1-vsementsov@virtuozzo.com>
[hreitz: Rebased on block GS/IO split]
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
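A sketch of the two halves of the fix, with assumed helper names: the preallocate filter widens its request to the child's request_alignment, and generic code can then assert that BDRV_REQ_NO_WAIT is only ever combined with aligned requests.

    #include "qemu/osdep.h"   /* QEMU_ALIGN_DOWN/UP, QEMU_IS_ALIGNED */

    /* Sketch: widen [offset, offset + bytes) to the request alignment
     * before issuing the preallocating write with BDRV_REQ_NO_WAIT. */
    static void preallocate_clamp_sketch(int64_t *offset, int64_t *bytes,
                                         uint64_t align)
    {
        int64_t end = QEMU_ALIGN_UP(*offset + *bytes, align);

        *offset = QEMU_ALIGN_DOWN(*offset, align);
        *bytes = end - *offset;
    }

    /* And, in generic I/O code, one possible shape of the new assertion:
     *     assert(!(flags & BDRV_REQ_NO_WAIT) ||
     *            (QEMU_IS_ALIGNED(offset, align) &&
     *             QEMU_IS_ALIGNED(bytes, align)));
     */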
2022-03-04  block_int-common.h: split function pointers in BdrvChildClass  (Emanuele Giuseppe Esposito)
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-28-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04  block_int-common.h: split function pointers in BlockDriver  (Emanuele Giuseppe Esposito)
Similar to the header split, the function pointers in BlockDriver can also be split into I/O and global state.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-26-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04  include/block/snapshot: global state API + assertions  (Emanuele Giuseppe Esposito)
Snapshots also run under the BQL, so they all belong to the global state API. The AioContext lock that they hold is currently overkill and could be removed in the future.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-23-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04  include/block/blockjob.h: global state API  (Emanuele Giuseppe Esposito)
Block job functions always run under the BQL.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-19-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04  include/block/blockjob_int.h: split header into I/O and GS API  (Emanuele Giuseppe Esposito)
Since there are not many I/O functions, keep a single file. Also split the function pointers in BlockJobDriver.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20220303151616.325444-16-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04  block: introduce assert_bdrv_graph_writable  (Emanuele Giuseppe Esposito)
We want to be sure that the functions that write the child and parent lists of a bs run under the BQL and drain: the BQL prevents concurrent writes from the GS API, while drains protect against I/O.

TODO: drains are still missing in some functions using this assertion, so a complete assertion would fail. Because adding drains requires additional discussion, they will be added in a future series.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-15-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
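Given the TODO above, only the BQL half can be asserted today; a sketch of the helper's intent follows (the real implementation may differ).

    #include "qemu/osdep.h"
    #include "qemu/main-loop.h"   /* qemu_in_main_thread() */

    /* Sketch: writing bs->children / bs->parents requires the BQL; the
     * drain requirement is the TODO mentioned above and is not asserted. */
    static void assert_bdrv_graph_writable_sketch(void)
    {
        assert(qemu_in_main_thread());
    }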
2022-03-04  IO_CODE and IO_OR_GS_CODE for block_int I/O API  (Emanuele Giuseppe Esposito)
Mark all I/O functions with IO_CODE, and all "I/O OR GS" with IO_OR_GS_CODE.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-14-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
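The markers are assertion macros placed at the top of a function body. A sketch of what a marked I/O helper looks like; the helper itself is hypothetical, and the exact include that provides IO_CODE() is not spelled out here.

    #include "qemu/osdep.h"
    #include "block/block_int.h"

    /* Hypothetical I/O-path helper: may run in any AioContext. */
    static int64_t example_nb_sectors(BlockDriverState *bs)
    {
        IO_CODE();   /* asserts the calling context is valid for I/O code */
        return bdrv_nb_sectors(bs);
    }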
2022-03-04  include/block/block_int: split header into I/O and global state API  (Emanuele Giuseppe Esposito)
Similarly to the previous patch, split block_int.h into block_int-io.h and block_int-global-state.h.

block_int-common.h contains the structures shared between the two headers, and the functions that can't be categorized as I/O or global state.

Assertions are added in the next patch.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-12-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04  IO_CODE and IO_OR_GS_CODE for block I/O API  (Emanuele Giuseppe Esposito)
Mark all I/O functions with IO_CODE, and all "I/O OR GS" with IO_OR_GS_CODE.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-6-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04  include/block/block: split header into I/O and global state API  (Emanuele Giuseppe Esposito)
block.h currently contains a mix of functions: some of them run under the BQL and modify the block layer graph, while others are thread-safe and perform I/O in iothreads. Some others can only be called by either the main loop or the iothread running the AioContext (and not other iothreads); using them in another thread would cause deadlocks, so it is not ideal to define them as I/O.

It is not easy to understand which function is part of which group (I/O vs GS vs "I/O or GS"), and this patch aims to clarify it.

The "GS" functions need the BQL, and often use aio_context_acquire/release and/or drain to be sure they can modify the graph safely. The I/O functions are instead thread-safe, and can run in any AioContext. "I/O or GS" functions run in the main loop or in a single iothread, and use BDRV_POLL_WHILE().

By splitting the header into two files, block-io.h and block-global-state.h, we get a clearer view of what needs what kind of protection. block-common.h contains common structures shared by both headers.

block.h is left there for legacy reasons and to avoid changing all includes in all C files that use the block APIs.

Assertions are added in the next patch.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-4-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04  block: rename bdrv_invalidate_cache_all, blk_invalidate_cache and test_sync_op_invalidate_cache  (Emanuele Giuseppe Esposito)
Following the bdrv_activate renaming, also change the names of the respective callers:

  bdrv_invalidate_cache_all      -> bdrv_activate_all
  blk_invalidate_cache           -> blk_activate
  test_sync_op_invalidate_cache  -> test_sync_op_activate

No functional change intended.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220209105452.1694545-5-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04  block: introduce bdrv_activate  (Emanuele Giuseppe Esposito)
This function is currently just a wrapper for bdrv_invalidate_cache(), but in the future it will contain the part of bdrv_co_invalidate_cache() that must always be protected by the BQL, leaving the rest in the I/O coroutine.

Replace all bdrv_invalidate_cache() invocations with bdrv_activate().

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220209105452.1694545-4-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
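Per the description ("currently just a wrapper"), the shape is roughly the following; the exact prototype and error handling are assumptions.

    #include "qemu/osdep.h"
    #include "block/block.h"

    /* Sketch of the thin wrapper described above. */
    void bdrv_activate(BlockDriverState *bs, Error **errp)
    {
        bdrv_invalidate_cache(bs, errp);
    }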
2022-03-04  crypto: perform permission checks under BQL  (Emanuele Giuseppe Esposito)
Move the permission API calls into driver-specific callbacks that always run under the BQL. In this case, bdrv_crypto_luks needs to perform permission checks before and after qcrypto_block_amend_options(). The problem is that the caller, block_crypto_amend_options_generic_luks(), can also run in I/O from .bdrv_co_amend(). This does not comply with the Global State-I/O API split, as the permissions API must always run under the BQL.

Firstly, introduce the .bdrv_amend_pre_run() and .bdrv_amend_clean() callbacks. These two callbacks are guaranteed to be invoked under the BQL, respectively before and after .bdrv_co_amend(). They take care of performing the permission checks in the same way as they are currently done before and after qcrypto_block_amend_options(). These callbacks are in preparation for the next patch, where we delete the original permission check; right now they just add redundant checks.

Then, call .bdrv_amend_pre_run() before job_start in qmp_x_blockdev_amend(), so that it runs before the job coroutine is created and stays in the main loop.

As a cleanup, use JobDriver's .clean() callback to call .bdrv_amend_clean(), and run amend-specific cleanup callbacks under the BQL.

After this patch, permission failures occur early in the blockdev-amend job to update a LUKS volume's keys. iotest 296 must now expect them in x-blockdev-amend's QMP reply instead of waiting for the actual job to fail later.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220209105452.1694545-2-eesposit@redhat.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220304153729.711387-6-hreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-03  hw/nvme: 64-bit pi support  (Naveen Nagar)
This adds support for one possible new protection information format introduced in TP4068 (and integrated in NVMe 2.0): the 64-bit CRC guard and 48-bit reference tag. This version does not support storage tags.

Like the CRC16 support already present, this uses a software implementation of CRC64 (so it is naturally pretty slow). But it's good enough for verification purposes.

This may go nicely hand-in-hand with the support that Keith submitted for the Linux kernel[1].

  [1]: https://lore.kernel.org/linux-nvme/20220126165214.GA1782352@dhcp-10-100-145-180.wdc.com/T/

Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Naveen Nagar <naveen.n1@samsung.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2022-03-03  hw/nvme: add support for the lbafee hbs feature  (Naveen Nagar)
Add support for up to 64 LBA formats through the LBAFEE field of the Host Behavior Support feature.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Naveen Nagar <naveen.n1@samsung.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2022-03-03  hw/nvme: add host behavior support feature  (Naveen Nagar)
Add support for getting and setting the Host Behavior Support feature.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Naveen Nagar <naveen.n1@samsung.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2022-02-14  hw/nvme: add support for zoned random write area  (Klaus Jensen)
Add support for TP 4076 ("Zoned Random Write Area"), v2021.08.23 ("Ratified").

This adds three new namespace parameters: "zoned.numzrwa" (number of zrwa resources, i.e. number of zones that can have a zrwa), "zoned.zrwas" (zrwa size in LBAs), "zoned.zrwafg" (granularity in LBAs for flushes).

Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2022-02-14  hw/nvme: add ozcs enum  (Klaus Jensen)
Add enumeration for OZCS values.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2022-02-14  hw/nvme: add struct for zone management send  (Klaus Jensen)
Add struct for Zone Management Send in preparation for more zone send flags.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2022-02-01  block.h: remove outdated comment  (Emanuele Giuseppe Esposito)
The comment "disk I/O throttling" doesn't make sense any more. It was added in commit 0563e191516 to describe bdrv_io_limits_enable()/disable(), which were removed in commit 97148076, so the comment is just a forgotten leftover.

Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220131125615.74612-1-eesposit@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2022-01-14  Merge remote-tracking branch 'remotes/kevin/tags/for-upstream' into staging  (Peter Maydell)
Block layer patches

- qemu-storage-daemon: Add vhost-user-blk help
- block-backend: Fix use-after-free for BDS pointers after aio_poll()
- qemu-img: Fix sparseness of output image with unaligned ranges
- vvfat: Fix crashes in read-write mode
- Fix device deletion events with -device JSON syntax
- Code cleanups

# gpg: Signature made Fri 14 Jan 2022 13:50:16 GMT
# gpg: using RSA key DC3DEB159A9AF95D3D7456FE7F09B272C88F2FD6
# gpg: issuer "kwolf@redhat.com"
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>" [full]
# Primary key fingerprint: DC3D EB15 9A9A F95D 3D74 56FE 7F09 B272 C88F 2FD6

* remotes/kevin/tags/for-upstream:
  iotests/testrunner.py: refactor test_field_width
  block: drop BLK_PERM_GRAPH_MOD
  qemu-img: make is_allocated_sectors() more efficient
  iotests: Test qemu-img convert of zeroed data cluster
  vvfat: Fix vvfat_write() for writes before the root directory
  vvfat: Fix size of temporary qcow file
  iotests/308: Fix for CAP_DAC_OVERRIDE
  iotests/stream-error-on-reset: New test
  block-backend: prevent dangling BDS pointers across aio_poll()
  qapi/block: Restrict vhost-user-blk to CONFIG_VHOST_USER_BLK_SERVER
  qemu-storage-daemon: Add vhost-user-blk help
  docs: Correct 'vhost-user-blk' spelling
  softmmu: fix device deletion events with -device JSON syntax
  include/sysemu/blockdev.h: remove drive_get_max_devs
  include/sysemu/blockdev.h: remove drive_mark_claimed_by_board and inline drive_def
  block_int: make bdrv_backing_overridden static

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2022-01-14  block: drop BLK_PERM_GRAPH_MOD  (Vladimir Sementsov-Ogievskiy)
First, this permission never protected a node from being changed, as generic child-replacing functions don't check it.

Second, it's a strange thing: it represents a permission of the parent node to change its child. But generally, children are replaced by different mechanisms, like jobs or QMP commands, not by nodes.

The graph-mod permission is hard to understand. All other permissions describe operations done by the parent node on its child: read, write, resize. Graph modification operations are something completely different.

The only place where BLK_PERM_GRAPH_MOD is used as "perm" (not shared perm) is mirror_start_job, for s->target. Still, modern code should use bdrv_freeze_backing_chain() to protect from graph modification; if we don't do that somewhere, it may be considered a bug.

So, it's a bit risky to drop GRAPH_MOD, and analyzing the possible loss of protection is hard. But one day we should do it; let's do it now.

One more bit of information: locking the corresponding byte in file-posix doesn't make sense at all.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210902093754.2352-1-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-01-14  block_int: make bdrv_backing_overridden static  (Emanuele Giuseppe Esposito)
bdrv_backing_overridden is only used in block.c, so there is no need to leave it in block_int.h.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20211215121140.456939-2-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-01-12  aio-posix: split poll check from ready handler  (Stefan Hajnoczi)
Adaptive polling measures the execution time of the polling check plus the handlers called when a polled event becomes ready. Handlers can take a significant amount of time, making it look like polling was running for a long time when in fact the event handler was running for a long time.

For example, on Linux the io_submit(2) syscall invoked when a virtio-blk device's virtqueue becomes ready can take 10s of microseconds. This can exceed the default polling interval (32 microseconds) and cause adaptive polling to stop polling.

By excluding the handler's execution time from the polling check we make the adaptive polling calculation more accurate. As a result, the event loop now stays in polling mode where previously it would have fallen back to file descriptor monitoring.

The following data was collected with virtio-blk num-queues=2 event_idx=off using an IOThread.

Before:

  168k IOPS, IOThread syscalls:

  9837.115 ( 0.020 ms): IO iothread1/620155  io_submit(ctx_id: 140512552468480, nr: 16, iocbpp: 0x7fcb9f937db0) = 16
  9837.158 ( 0.002 ms): IO iothread1/620155  write(fd: 103, buf: 0x556a2ef71b88, count: 8) = 8
  9837.161 ( 0.001 ms): IO iothread1/620155  write(fd: 104, buf: 0x556a2ef71b88, count: 8) = 8
  9837.163 ( 0.001 ms): IO iothread1/620155  ppoll(ufds: 0x7fcb90002800, nfds: 4, tsp: 0x7fcb9f1342d0, sigsetsize: 8) = 3
  9837.164 ( 0.001 ms): IO iothread1/620155  read(fd: 107, buf: 0x7fcb9f939cc0, count: 512) = 8
  9837.174 ( 0.001 ms): IO iothread1/620155  read(fd: 105, buf: 0x7fcb9f939cc0, count: 512) = 8
  9837.176 ( 0.001 ms): IO iothread1/620155  read(fd: 106, buf: 0x7fcb9f939cc0, count: 512) = 8
  9837.209 ( 0.035 ms): IO iothread1/620155  io_submit(ctx_id: 140512552468480, nr: 32, iocbpp: 0x7fca7d0cebe0) = 32

After:

  174k IOPS (+3.6%), IOThread syscalls:

  9809.566 ( 0.036 ms): IO iothread1/623061  io_submit(ctx_id: 140539805028352, nr: 32, iocbpp: 0x7fd0cdd62be0) = 32
  9809.625 ( 0.001 ms): IO iothread1/623061  write(fd: 103, buf: 0x5647cfba5f58, count: 8) = 8
  9809.627 ( 0.002 ms): IO iothread1/623061  write(fd: 104, buf: 0x5647cfba5f58, count: 8) = 8
  9809.663 ( 0.036 ms): IO iothread1/623061  io_submit(ctx_id: 140539805028352, nr: 32, iocbpp: 0x7fd0d0388b50) = 32

Notice that the ppoll(2) and eventfd read(2) syscalls are eliminated because the IOThread stays in polling mode instead of falling back to file descriptor monitoring.

As usual, polling is not implemented on Windows, so this patch ignores the new io_poll_ready() callback in aio-win32.c.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Message-id: 20211207132336.36627-2-stefanha@redhat.com
[Fixed up aio_set_event_notifier() calls in tests/unit/test-fdmon-epoll.c added after this series was queued. --Stefan]
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
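A sketch of how a registration looks after the split, with hypothetical types and handlers: the poll callback only answers "is there work?", while the new poll-ready callback does the potentially slow work outside the measured polling window. The aio_set_event_notifier() parameter list shown is an approximation of the post-patch API.

    /* Hypothetical device queue type and helpers, for illustration only. */
    static bool example_poll(void *opaque)
    {
        ExampleQueue *q = opaque;
        return example_queue_has_work(q);        /* cheap readiness check */
    }

    static void example_poll_ready(EventNotifier *n)
    {
        ExampleQueue *q = container_of(n, ExampleQueue, notifier);
        example_process_requests(q);             /* possibly slow work */
    }

    static void example_attach(AioContext *ctx, ExampleQueue *q)
    {
        aio_set_event_notifier(ctx, &q->notifier, false,
                               example_notifier_read,  /* fd handler */
                               example_poll,           /* poll check */
                               example_poll_ready);    /* ready handler */
    }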
2021-12-28  blockjob: drop BlockJob.blk field  (Vladimir Sementsov-Ogievskiy)
It's unused now (except for permission handling)[*]. The only reasonable user of it was the block-stream job, recently updated to use its own blk. And other block jobs prefer to use their own source-node-related objects.

So, the arguments for dropping the field are:

- block jobs prefer not to use it
- block jobs usually have more than one node to operate on, and it is better to operate symmetrically (for example, have both source and target blks in the specific block-job state structure)

*: BlockJob.blk is used to keep some permissions. We simply move the permissions to the block-job child created in block_job_create() together with blk. In mirror, we just should not care anymore about restoring the state of blk. Most probably this code could have been dropped long ago, after dropping the bs->job pointer. Now it finally goes away together with BlockJob.blk itself.

iotest 141 output is updated, as the "bdrv_has_blk(bs)" check in qmp_blockdev_del() doesn't fail (we don't have a blk now). Still, the new error message looks even better.

In iotest 283 we need to add a job id, otherwise "Invalid job ID" now happens earlier than the permission check (as permissions moved from blk to the block-job node).

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Nikita Lapshin <nikita.lapshin@virtuozzo.com>
2021-12-28  blockjob: implement and use block_job_get_aio_context  (Vladimir Sementsov-Ogievskiy)
We are going to drop BlockJob.blk. So let's retrieve the block job context from the underlying job instead of the main node.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Nikita Lapshin <nikita.lapshin@virtuozzo.com>
2021-11-02  linux-aio: add `dev_max_batch` parameter to laio_io_unplug()  (Stefano Garzarella)
Between the submission of a request and the unplug, other devices with larger limits may have queued new requests without flushing the batch.

Using the new `dev_max_batch` parameter, laio_io_unplug() can check whether the batch exceeds the device limit and flush the current batch.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Message-Id: <20211026162346.253081-4-sgarzare@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-11-02  linux-aio: add `dev_max_batch` parameter to laio_co_submit()  (Stefano Garzarella)
This new parameter can be used by block devices to limit the Linux AIO batch size more than the limit set by the AIO context. The file-posix backend supports this, passing its previously added `aio-max-batch` option.

Add a helper function to calculate the maximum batch size.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Message-Id: <20211026162346.253081-3-sgarzare@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
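A sketch of the helper mentioned above: combine the AioContext-wide aio-max-batch setting with the per-device limit. The function name, parameter names, and the default value are assumptions.

    /* Sketch: 0 means "no limit configured" for either value. */
    static unsigned int laio_max_batch_sketch(unsigned int ctx_max_batch,
                                              unsigned int dev_max_batch)
    {
        unsigned int max_batch = ctx_max_batch ? ctx_max_batch : 32;

        /* A non-zero device limit can only tighten the context-wide one. */
        if (dev_max_batch && dev_max_batch < max_batch) {
            max_batch = dev_max_batch;
        }
        return max_batch;
    }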
2021-10-06  block: introduce max_hw_iov for use in scsi-generic  (Paolo Bonzini)
Linux limits the size of iovecs to 1024 (UIO_MAXIOV in the kernel sources, IOV_MAX in POSIX). Because of this, on some host adapters requests with many iovecs are rejected with -EINVAL by the io_submit() or readv()/writev() system calls.

In fact, the same limit applies to SG_IO as well. To fix both the EINVAL and the possible performance issues from using fewer iovecs than allowed by Linux (some HBAs have max_segments as low as 128), introduce a separate entry in BlockLimits to hold the max_segments value from sysfs. This new limit is used only for SG_IO and clamped to bs->bl.max_iov anyway, just like max_hw_transfer is clamped to bs->bl.max_transfer.

Reported-by: Halil Pasic <pasic@linux.ibm.com>
Cc: Hanna Reitz <hreitz@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: qemu-block@nongnu.org
Cc: qemu-stable@nongnu.org
Fixes: 18473467d5 ("file-posix: try BLKSECTGET on block devices too, do not round to power of 2", 2021-06-25)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210923130436.1187591-1-pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
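A sketch of the clamping described above, written with plain types so it stands alone; the real code lives in the block layer and scsi-generic, so the function itself is illustrative.

    #include <stdint.h>

    /* Effective iovec limit for a passthrough (SG_IO) request: the
     * hardware-derived max_hw_iov, clamped to the generic max_iov.
     * 0 means "unset"; 1024 mirrors the Linux UIO_MAXIOV fallback. */
    static uint64_t effective_sg_max_iov(uint64_t max_hw_iov, uint64_t max_iov)
    {
        uint64_t limit = max_hw_iov ? max_hw_iov : 1024;

        if (max_iov && max_iov < limit) {
            limit = max_iov;
        }
        return limit;
    }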
2021-10-06  block: implement bdrv_new_open_driver_opts()  (Vladimir Sementsov-Ogievskiy)
Add a version of bdrv_new_open_driver() that supports QDict options. We'll use it in a further commit.

Simply adding one more argument to bdrv_new_open_driver() would be worse, as there are too many invocations of bdrv_new_open_driver() to update.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20210920115538.264372-2-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
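A sketch of the relationship the commit describes: the old entry point can stay as a thin wrapper around the new QDict-aware one. The prototypes are assumptions based on the description.

    /* Sketch: bdrv_new_open_driver() forwards to the options-aware variant
     * with no options. */
    BlockDriverState *bdrv_new_open_driver(BlockDriver *drv,
                                           const char *node_name,
                                           int flags, Error **errp)
    {
        return bdrv_new_open_driver_opts(drv, node_name, NULL, flags, errp);
    }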
2021-10-06  include/block.h: remove outdated comment  (Emanuele Giuseppe Esposito)
There are a couple of errors in the bdrv_drained_begin header comment:

- block_job_pause does not exist anymore; it was replaced with job_pause in b15de82867
- job_pause is automatically invoked as a .drained_begin callback (child_job_drained_begin) by the child_job BdrvChildClass struct in blockjob.c, so no additional pause should be required.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20210903113800.59970-1-eesposit@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-09-29  block: use int64_t instead of int in driver discard handlers  (Vladimir Sementsov-Ogievskiy)
We are generally moving to int64_t for both offset and bytes parameters on all I/O paths. The main motivation is the realization of a 64-bit write_zeroes operation for fast zeroing of large disk chunks, up to the whole disk.

We chose a signed type to be consistent with off_t (which is signed) and to allow for a signed return type (where a negative value means error).

So, convert the driver discard handlers' bytes parameter to int64_t.

The only caller of all updated functions is bdrv_co_pdiscard in block/io.c. It is already prepared to work with 64-bit requests, but passes at most max(bs->bl.max_pdiscard, INT_MAX) to the driver.

Let's look at all updated functions:

  blkdebug: all calculations are still OK, thanks to bdrv_check_qiov_request(); both rule_check and bdrv_co_pdiscard are 64bit
  blklogwrites: pass to blk_log_writes_co_log which is 64bit
  blkreplay, copy-on-read, filter-compress: pass to bdrv_co_pdiscard, OK
  copy-before-write: pass to bdrv_co_pdiscard which is 64bit and to cbw_do_copy_before_write which is 64bit
  file-posix: one handler calls raw_account_discard() which is 64bit, and both handlers call raw_do_pdiscard(). Update raw_do_pdiscard, which passes to RawPosixAIOData::aio_nbytes, which is 64bit (and calls raw_account_discard())
  gluster: somehow, the third argument of glfs_discard_async is size_t. Let's set max_pdiscard accordingly.
  iscsi: iscsi_allocmap_set_invalid is 64bit, !is_byte_request_lun_aligned is 64bit. list.num is uint32_t. Let's clarify max_pdiscard and pdiscard_alignment.
  mirror_top: pass to bdrv_mirror_top_do_write() which is 64bit
  nbd: protocol limitation. max_pdiscard is already set strictly enough, keep it as is for now.
  nvme: buf.nlb is uint32_t and we do shift. So, add corresponding limits to nvme_refresh_limits().
  preallocate: pass to bdrv_co_pdiscard() which is 64bit.
  rbd: pass to qemu_rbd_start_co() which is 64bit.
  qcow2: calculations are still OK, thanks to bdrv_check_qiov_request(); qcow2_cluster_discard() is 64bit.
  raw-format: raw_adjust_offset() is 64bit, bdrv_co_pdiscard too.
  throttle: pass to bdrv_co_pdiscard() which is 64bit and to throttle_group_co_io_limits_intercept() which is 64bit as well.
  test-block-iothread: bytes argument is unused

Great! Now all drivers are prepared to handle 64-bit discard requests, or else have explicit max_pdiscard limits.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210903102807.27127-11-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
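In terms of a driver's prototype, the change amounts to widening the byte count; "example" is a placeholder driver name used only for illustration.

    /* Before: the discard handler took an int byte count. */
    static int coroutine_fn example_co_pdiscard_old(BlockDriverState *bs,
                                                    int64_t offset, int bytes);

    /* After: the byte count is widened to int64_t; drivers with narrower
     * native limits advertise them via bs->bl.max_pdiscard instead. */
    static int coroutine_fn example_co_pdiscard(BlockDriverState *bs,
                                                int64_t offset, int64_t bytes);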