aboutsummaryrefslogtreecommitdiff
path: root/block
AgeCommit message (Collapse)Author
2018-07-03Merge remote-tracking branch 'remotes/ericb/tags/pull-nbd-2018-07-02' into ↵Peter Maydell
staging nbd patches for 2018-07-02 Bug fixes and iotest exposure of fleecing via NBD (serving a read-only point-in-time view via blockdev-backup sync:none, as well as serving dirty bitmaps over NBD), including a new x-dirty-bitmap parameter when opening NBD clients as the counterpart to x-nbd-server-add-bitmap. Also a random fix for iscsi block_status spotted by Coverity that missed other miscellaneous trees. - Eric Blake: nbd/server: Fix dirty bitmap logic regression - Eric Blake: iscsi: Avoid potential for get_status overflow - John Snow/Vladimir Sementsov-Ogievskiy: 0/2 block: formalize and test fleecing - Eric Blake: 0/2 test NBD bitmap export # gpg: Signature made Tue 03 Jul 2018 02:33:03 BST # gpg: using RSA key A7A16B4A2527436A # gpg: Good signature from "Eric Blake <eblake@redhat.com>" # gpg: aka "Eric Blake (Free Software Programmer) <ebb9@byu.net>" # gpg: aka "[jpeg image of size 6874]" # Primary key fingerprint: 71C2 CC22 B1C4 6029 27D2 F3AA A7A1 6B4A 2527 436A * remotes/ericb/tags/pull-nbd-2018-07-02: iotests: New test 223 for exporting dirty bitmap over NBD nbd/client: Add x-dirty-bitmap to query bitmap from server iotests: add 222 to test basic fleecing blockdev: enable non-root nodes for backup source iscsi: Avoid potential for get_status overflow nbd/server: Fix dirty bitmap logic regression Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2018-07-02backup: Use copy offloadingFam Zheng
The implementation is similar to the 'qemu-img convert'. In the beginning of the job, offloaded copy is attempted. If it fails, further I/O will go through the existing bounce buffer code path. Then, as Kevin pointed out, both this and qemu-img convert can benefit from a local check if one request fails because of, for example, the offset is beyond EOF, but another may well be accepted by the protocol layer. This will be implemented separately. Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Fam Zheng <famz@redhat.com> Message-id: 20180703023758.14422-4-famz@redhat.com Signed-off-by: Jeff Cody <jcody@redhat.com>
2018-07-02block: Honour BDRV_REQ_NO_SERIALISING in copy rangeFam Zheng
This semantics is needed by drive-backup so implement it before using this API there. Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Fam Zheng <famz@redhat.com> Message-id: 20180703023758.14422-3-famz@redhat.com Signed-off-by: Jeff Cody <jcody@redhat.com>
2018-07-02block: Fix parameter checking in bdrv_co_copy_range_internalFam Zheng
src may be NULL if BDRV_REQ_ZERO_WRITE flag is set, in this case only check dst and dst->bs. This bug was introduced when moving in the request tracking code from bdrv_co_copy_range, in 37aec7d75eb. This especially fixes the possible segfault when initializing src_bs with a NULL src. Signed-off-by: Fam Zheng <famz@redhat.com> Message-id: 20180703023758.14422-2-famz@redhat.com Reviewed-by: Jeff Cody <jcody@redhat.com> Signed-off-by: Jeff Cody <jcody@redhat.com>
2018-07-02nbd/client: Add x-dirty-bitmap to query bitmap from serverEric Blake
In order to test that the NBD server is properly advertising dirty bitmaps, we need a bare minimum client that can request and read the context. Since feature freeze for 3.0 is imminent, this is the smallest workable patch, which replaces the qemu block status report with the results of the NBD server's dirty bitmap (making it very easy to use 'qemu-img map --output=json' to learn where the dirty portions are). Note that the NBD protocol defines a dirty section with the same bit but opposite sense that normal "base:allocation" uses to report an allocated section; so in qemu-img map output, "data":true corresponds to clean, "data":false corresponds to dirty. A more complete solution that allows dirty bitmaps to be queried at the same time as normal block status will be required before this addition can lose the x- prefix. Until then, the fact that this replaces normal status with dirty status means actions like 'qemu-img convert' will likely misbehave due to treating dirty regions of the file as if they are unallocated. The next patch adds an iotest to exercise this new code. Signed-off-by: Eric Blake <eblake@redhat.com> Message-Id: <20180702191458.28741-2-eblake@redhat.com>
2018-07-02iscsi: Avoid potential for get_status overflowEric Blake
Detected by Coverity: Multiplying two 32-bit int and assigning the result to a 64-bit number is a risk of overflow. Prior to the conversion to byte-based interfaces, the block layer took care of ensuring that a status request never exceeded 2G in the driver; but after that conversion, the block layer expects drivers to deal with any size request (the driver can always truncate the request size back down, as long as it makes progress). So, in the off-chance that someone makes a large request, we are at the mercy of whether iscsi_get_lba_status_task() will cap things to at most INT_MAX / iscsilun->block_size when it populates lbasd->num_blocks; since I could not easily audit that, it's better to be safe than sorry by just forcing a 64-bit multiply. Fixes: 92809c36 CC: qemu-stable@nongnu.org Signed-off-by: Eric Blake <eblake@redhat.com> Message-Id: <20180508212718.1482663-1-eblake@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
2018-07-02vdi: Use definitions from "qemu/units.h"Philippe Mathieu-Daudé
Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org> Reviewed-by: Stefan Weil <sw@weilnetz.de> Message-Id: <20180625124238.25339-3-f4bug@amsat.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-06-29block: Remove unused sector-based vectored I/OEric Blake
We are gradually moving away from sector-based interfaces, towards byte-based. Now that all callers of vectored I/O have been converted to use our preferred byte-based bdrv_co_p{read,write}v(), we can delete the unused bdrv_co_{read,write}v(). Furthermore, this gets rid of the signature difference between the public bdrv_co_writev() and the callback .bdrv_co_writev (the latter still exists, because some drivers still need more work before they are fully byte-based). Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Jeff Cody <jcody@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-06-29vhdx: Switch to byte-based callsEric Blake
We are gradually moving away from sector-based interfaces, towards byte-based. Make the change for the last few sector-based calls into the block layer from the vhdx driver. Ideally, the vhdx driver should switch to doing everything byte-based, but that's a more invasive change that requires a bit more auditing. Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Jeff Cody <jcody@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-06-29replication: Switch to byte-based callsEric Blake
We are gradually moving away from sector-based interfaces, towards byte-based. Make the change for the last few sector-based calls into the block layer from the replication driver. Ideally, the replication driver should switch to doing everything byte-based, but that's a more invasive change that requires a bit more auditing. Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Jeff Cody <jcody@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-06-29qcow: Switch to a byte-based driverEric Blake
We are gradually moving away from sector-based interfaces, towards byte-based. The qcow driver is now ready to fully utilize the byte-based callback interface, as long as we override the default alignment to still be 512 (needed at least for asserts present because of encryption, but easier to do everywhere than to audit which sub-sector requests are handled correctly, especially since we no longer recommend qcow for new disk images). Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Jeff Cody <jcody@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-06-29qcow: Switch qcow_co_writev to byte-based callsEric Blake
We are gradually moving away from sector-based interfaces, towards byte-based. Make the change for the internals of the qcow driver write function, by iterating over offset/bytes instead of sector_num/nb_sectors, and with a rename of index_in_cluster and repurposing of n to track bytes instead of sectors. A later patch will then switch the qcow driver as a whole over to byte-based operation. Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Jeff Cody <jcody@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-06-29qcow: Switch qcow_co_readv to byte-based callsEric Blake
We are gradually moving away from sector-based interfaces, towards byte-based. Make the change for the internals of the qcow driver read function, by iterating over offset/bytes instead of sector_num/nb_sectors, and with a rename of index_in_cluster and repurposing of n to track bytes instead of sectors. A later patch will then switch the qcow driver as a whole over to byte-based operation. Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Jeff Cody <jcody@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-06-29qcow: Switch get_cluster_offset to be byte-basedEric Blake
We are gradually moving away from sector-based interfaces, towards byte-based. Make the change for the internal helper function get_cluster_offset(), by changing n_start and n_end to be byte offsets rather than sector indices within the cluster being allocated. However, assert that these values are still sector-aligned (at least qcrypto_block_encrypt() still wants that). For now we get that alignment for free because we still use sector-based driver callbacks. A later patch will then switch the qcow driver as a whole over to byte-based operation; but will still leave things at sector alignments as it is not worth auditing the qcow image format to worry about sub-sector requests. Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Jeff Cody <jcody@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-06-29parallels: Switch to byte-based callsEric Blake
We are gradually moving away from sector-based interfaces, towards byte-based. Make the change for the last few sector-based calls into the block layer from the parallels driver. Ideally, the parallels driver should switch to doing everything byte-based, but that's a more invasive change that requires a bit more auditing. Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Denis V. Lunev <den@openvz.org> Reviewed-by: Jeff Cody <jcody@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-06-29file-posix: Fix EINTR handlingFam Zheng
EINTR should be checked against errno, not ret. While fixing the bug, collect the branches with a switch block. Also, change the return value from -ENOSTUP to -ENOSPC when the actual issue is request range passes EOF, which should be distinguishable from the case of error == ENOSYS by the caller, so that it could still retry with other byte ranges, whereas it shouldn't retry anymore upon ENOSYS. Signed-off-by: Fam Zheng <famz@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-06-29iscsi: Don't blindly use designator length in response for memcpyFam Zheng
Per SCSI definition the designator_length we receive from INQUIRY is 8, 12 or at most 16, but we should be careful because the remote iscsi target may misbehave, otherwise we could have a buffer overflow. Reported-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Fam Zheng <famz@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-06-29qcow2: Fix src_offset in copy offloadingFam Zheng
Not updating src_offset will result in wrong data being written to dst image. Reported-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Fam Zheng <famz@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-06-29file-posix: Implement co versions of discard/flushKevin Wolf
This simplifies file-posix by implementing the coroutine variants of the discard and flush BlockDriver callbacks. These were the last remaining users of paio_submit(), which can be removed now. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com>
2018-06-29qcow2: Free allocated clusters on write errorKevin Wolf
If we managed to allocate the clusters, but then failed to write the data, there's a good chance that we'll still be able to free the clusters again in order to avoid cluster leaks (the refcounts are cached, so even if we can't write them out right now, we may be able to do so when the VM is resumed after a werror=stop/enospc pause). Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Tested-by: Eric Blake <eblake@redhat.com>
2018-06-29block/crypto: Simplify block_crypto_{open,create}_opts_init()Markus Armbruster
block_crypto_open_opts_init() and block_crypto_create_opts_init() contain a virtual visit of QCryptoBlockOptions and QCryptoBlockCreateOptions less member "format", respectively. Change their callers to put member "format" in the QDict, so they can use the generated visitors for these types instead. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-06-29block: Move request tracking to children in copy offloadingFam Zheng
in_flight and tracked requests need to be tracked in every layer during recursion. For now the only user is qemu-img convert where overlapping requests and IOThreads don't exist, therefore this change doesn't make much difference form user point of view, but it is incorrect as part of the API. Fix it. Reported-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Fam Zheng <famz@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-06-29qcow2: Remove dead check on !retFam Zheng
In the beginning of the function, we initialize the local variable to 0, and in the body of the function, we check the assigned values and exit the loop immediately. So here it can never be non-zero. Reported-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Fam Zheng <famz@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-06-29file-posix: Make .bdrv_co_truncate asynchronousKevin Wolf
This moves the code to resize an image file to the thread pool to avoid blocking. Creating large images with preallocation with blockdev-create is now actually a background job instead of blocking the monitor (and most other things) until the preallocation has completed. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2018-06-29block: Use tracked request for truncateKevin Wolf
When growing an image, block drivers (especially protocol drivers) may initialise the newly added area. I/O requests to the same area need to wait for this initialisation to be completed so that data writes don't get overwritten and reads don't read uninitialised data. To avoid overhead in the fast I/O path by adding new locking in the protocol drivers and to restrict the impact to requests that actually touch the new area, reuse the existing tracked request infrastructure in block/io.c and mark all discard requests as serialising. With this change, it is safe for protocol drivers to make .bdrv_co_truncate actually asynchronous. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2018-06-29block: Move bdrv_truncate() implementation to io.cKevin Wolf
This moves the bdrv_truncate() implementation from block.c to block/io.c so it can have access to the tracked requests infrastructure. This involves making refresh_total_sectors() public (in block_int.h). Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2018-06-29qcow2: Remove coroutine trampoline for preallocate_co()Kevin Wolf
All callers are coroutine_fns now, so we can just directly call preallocate_co(). Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2018-06-29block: Convert .bdrv_truncate callback to coroutine_fnKevin Wolf
bdrv_truncate() is an operation that can block (even for a quite long time, depending on the PreallocMode) in I/O paths that shouldn't block. Convert it to a coroutine_fn so that we have the infrastructure for drivers to make their .bdrv_co_truncate implementation asynchronous. This change could potentially introduce new race conditions because bdrv_truncate() isn't necessarily executed atomically any more. Whether this is a problem needs to be evaluated for each block driver that supports truncate: * file-posix/win32, gluster, iscsi, nfs, rbd, ssh, sheepdog: The protocol drivers are trivially safe because they don't actually yield yet, so there is no change in behaviour. * copy-on-read, crypto, raw-format: Essentially just filter drivers that pass the request to a child node, no problem. * qcow2: The implementation modifies metadata, so it needs to hold s->lock to be safe with concurrent I/O requests. In order to avoid double locking, this requires pulling the locking out into preallocate_co() and using qcow2_write_caches() instead of bdrv_flush(). * qed: Does a single header update, this is fine without locking. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2018-06-29qcow2: Fix qcow2_truncate() error return valueKevin Wolf
If qcow2_alloc_clusters_at() returns an error, we do need to negate it to get back the positive errno code for error_setg_errno(), but we still need to return the negative error code. Fixes: 772d1f973f87269f6a4a4ea4b880680f3779bbdf Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2018-06-29block/crypto: Pacify Coverity after commit f853465aacbMarkus Armbruster
Coverity can't see that qobject_input_visitor_new_flat_confused() returns non-null when it doesn't set @local_err. Check the return value instead, like all the other callers do. Fixes: CID 1393615 Fixes: CID 1393616 Signed-off-by: Markus Armbruster <armbru@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-06-27linux-aio: properly bubble up errors from initializationNishanth Aravamudan
laio_init() can fail for a couple of reasons, which will lead to a NULL pointer dereference in laio_attach_aio_context(). To solve this, add a aio_setup_linux_aio() function which is called early in raw_open_common. If this fails, propagate the error up. The signature of aio_get_linux_aio() was not modified, because it seems preferable to return the actual errno from the possible failing initialization calls. Additionally, when the AioContext changes, we need to associate a LinuxAioState with the new AioContext. Use the bdrv_attach_aio_context callback and call the new aio_setup_linux_aio(), which will allocate a new AioContext if needed, and return errors on failures. If it fails for any reason, fallback to threaded AIO with an error message, as the device is already in-use by the guest. Add an assert that aio_get_linux_aio() cannot return NULL. Signed-off-by: Nishanth Aravamudan <naravamudan@digitalocean.com> Message-id: 20180622193700.6523-1-naravamudan@digitalocean.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2018-06-22qapi: remove empty flat union branches and typesAnton Nefedov
Flat unions may now have uncovered branches, so it is possible to get rid of empty types defined for that purpose only. Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com> Reviewed-by: Markus Armbruster <armbru@redhat.com> Message-Id: <1529311206-76847-3-git-send-email-anton.nefedov@virtuozzo.com> Signed-off-by: Markus Armbruster <armbru@redhat.com>
2018-06-18block/mirror: Add copy mode QAPI interfaceMax Reitz
This patch allows the user to specify whether to use active or only background mode for mirror block jobs. Currently, this setting will remain constant for the duration of the entire block job. Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Message-id: 20180613181823.13618-14-mreitz@redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2018-06-18block/mirror: Add active mirroringMax Reitz
This patch implements active synchronous mirroring. In active mode, the passive mechanism will still be in place and is used to copy all initially dirty clusters off the source disk; but every write request will write data both to the source and the target disk, so the source cannot be dirtied faster than data is mirrored to the target. Also, once the block job has converged (BLOCK_JOB_READY sent), source and target are guaranteed to stay in sync (unless an error occurs). Active mode is completely optional and currently disabled at runtime. A later patch will add a way for users to enable it. Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Fam Zheng <famz@redhat.com> Message-id: 20180613181823.13618-13-mreitz@redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2018-06-18block/mirror: Add MirrorBDSOpaqueMax Reitz
This will allow us to access the block job data when the mirror block driver becomes more complex. Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Fam Zheng <famz@redhat.com> Message-id: 20180613181823.13618-11-mreitz@redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2018-06-18block/dirty-bitmap: Add bdrv_dirty_iter_next_areaMax Reitz
This new function allows to look for a consecutively dirty area in a dirty bitmap. Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Fam Zheng <famz@redhat.com> Reviewed-by: John Snow <jsnow@redhat.com> Message-id: 20180613181823.13618-10-mreitz@redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2018-06-18hbitmap: Add @advance param to hbitmap_iter_next()Max Reitz
This new parameter allows the caller to just query the next dirty position without moving the iterator. Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Fam Zheng <famz@redhat.com> Reviewed-by: John Snow <jsnow@redhat.com> Message-id: 20180613181823.13618-8-mreitz@redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2018-06-18block/mirror: Use source as a BdrvChildMax Reitz
With this, the mirror_top_bs is no longer just a technically required node in the BDS graph but actually represents the block job operation. Also, drop MirrorBlockJob.source, as we can reach it through mirror_top_bs->backing. Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Fam Zheng <famz@redhat.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Message-id: 20180613181823.13618-6-mreitz@redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2018-06-18block/mirror: Wait for in-flight op conflictsMax Reitz
This patch makes the mirror code differentiate between simply waiting for any operation to complete (mirror_wait_for_free_in_flight_slot()) and specifically waiting for all operations touching a certain range of the virtual disk to complete (mirror_wait_on_conflicts()). Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Fam Zheng <famz@redhat.com> Message-id: 20180613181823.13618-5-mreitz@redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2018-06-18block/mirror: Use CoQueue to wait on in-flight opsMax Reitz
Attach a CoQueue to each in-flight operation so if we need to wait for any we can use it to wait instead of just blindly yielding and hoping for some operation to wake us. A later patch will use this infrastructure to allow requests accessing the same area of the virtual disk to specifically wait for each other. Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Fam Zheng <famz@redhat.com> Message-id: 20180613181823.13618-4-mreitz@redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2018-06-18block/mirror: Convert to coroutinesMax Reitz
In order to talk to the source BDS (and maybe in the future to the target BDS as well) directly, we need to convert our existing AIO requests into coroutine I/O requests. Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Fam Zheng <famz@redhat.com> Message-id: 20180613181823.13618-3-mreitz@redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2018-06-18block/mirror: Pull out mirror_perform()Max Reitz
When converting mirror's I/O to coroutines, we are going to need a point where these coroutines are created. mirror_perform() is going to be that point. Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Fam Zheng <famz@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Jeff Cody <jcody@redhat.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Message-id: 20180613181823.13618-2-mreitz@redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2018-06-18block: fix QEMU crash with scsi-hd and drive_delGreg Kurz
Removing a drive with drive_del while it is being used to run an I/O intensive workload can cause QEMU to crash. An AIO flush can yield at some point: blk_aio_flush_entry() blk_co_flush(blk) bdrv_co_flush(blk->root->bs) ... qemu_coroutine_yield() and let the HMP command to run, free blk->root and give control back to the AIO flush: hmp_drive_del() blk_remove_bs() bdrv_root_unref_child(blk->root) child_bs = blk->root->bs bdrv_detach_child(blk->root) bdrv_replace_child(blk->root, NULL) blk->root->bs = NULL g_free(blk->root) <============== blk->root becomes stale bdrv_unref(child_bs) bdrv_delete(child_bs) bdrv_close() bdrv_drained_begin() bdrv_do_drained_begin() bdrv_drain_recurse() aio_poll() ... qemu_coroutine_switch() and the AIO flush completion ends up dereferencing blk->root: blk_aio_complete() scsi_aio_complete() blk_get_aio_context(blk) bs = blk_bs(blk) ie, bs = blk->root ? blk->root->bs : NULL ^^^^^ stale The problem is that we should avoid making block driver graph changes while we have in-flight requests. Let's drain all I/O for this BB before calling bdrv_root_unref_child(). Signed-off-by: Greg Kurz <groug@kaod.org> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-06-18block: Allow graph changes in bdrv_drain_all_begin/end sectionsKevin Wolf
bdrv_drain_all_*() used bdrv_next() to iterate over all root nodes and did a subtree drain for each of them. This works fine as long as the graph is static, but sadly, reality looks different. If the graph changes so that root nodes are added or removed, we would have to compensate for this. bdrv_next() returns each root node only once even if it's the root node for multiple BlockBackends or for a monitor-owned block driver tree, which would only complicate things. The much easier and more obviously correct way is to fundamentally change the way the functions work: Iterate over all BlockDriverStates, no matter who owns them, and drain them individually. Compensation is only necessary when a new BDS is created inside a drain_all section. Removal of a BDS doesn't require any action because it's gone afterwards anyway. Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-06-18block: ignore_bds_parents parameter for drain functionsKevin Wolf
In the future, bdrv_drained_all_begin/end() will drain all invidiual nodes separately rather than whole subtrees. This means that we don't want to propagate the drain to all parents any more: If the parent is a BDS, it will already be drained separately. Recursing to all parents is unnecessary work and would make it an O(n²) operation. Prepare the drain function for the changed drain_all by adding an ignore_bds_parents parameter to the internal implementation that prevents the propagation of the drain to BDS parents. We still (have to) propagate it to non-BDS parents like BlockBackends or Jobs because those are not drained separately. Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-06-18block: Move bdrv_drain_all_begin() out of coroutine contextKevin Wolf
Before we can introduce a single polling loop for all nodes in bdrv_drain_all_begin(), we must make sure to run it outside of coroutine context like we already do for bdrv_do_drained_begin(). Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-06-18block: Defer .bdrv_drain_begin callback to polling phaseKevin Wolf
We cannot allow aio_poll() in bdrv_drain_invoke(begin=true) until we're done with propagating the drain through the graph and are doing the single final BDRV_POLL_WHILE(). Just schedule the coroutine with the callback and increase bs->in_flight to make sure that the polling phase will wait for it. Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-06-18block: Don't poll in parent drain callbacksKevin Wolf
bdrv_do_drained_begin() is only safe if we have a single BDRV_POLL_WHILE() after quiescing all affected nodes. We cannot allow that parent callbacks introduce a nested polling loop that could cause graph changes while we're traversing the graph. Split off bdrv_do_drained_begin_quiesce(), which only quiesces a single node without waiting for its requests to complete. These requests will be waited for in the BDRV_POLL_WHILE() call down the call chain. Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-06-18block: Drain recursively with a single BDRV_POLL_WHILE()Kevin Wolf
Anything can happen inside BDRV_POLL_WHILE(), including graph changes that may interfere with its callers (e.g. child list iteration in recursive callers of bdrv_do_drained_begin). Switch to a single BDRV_POLL_WHILE() call for the whole subtree at the end of bdrv_do_drained_begin() to avoid such effects. The recursion happens now inside the loop condition. As the graph can only change between bdrv_drain_poll() calls, but not inside of it, doing the recursion here is safe. Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-06-18block: Remove bdrv_drain_recurse()Kevin Wolf
For bdrv_drain(), recursively waiting for child node requests is pointless because we didn't quiesce their parents, so new requests could come in anyway. Letting the function work only on a single node makes it more consistent. For subtree drains and drain_all, we already have the recursion in bdrv_do_drained_begin(), so the extra recursion doesn't add anything either. Remove the useless code. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>