aboutsummaryrefslogtreecommitdiff
path: root/block
AgeCommit message (Collapse)Author
2019-10-10block/backup: use backup-top instead of write notifiersVladimir Sementsov-Ogievskiy
Drop write notifiers and use filter node instead. = Changes = 1. Add filter-node-name argument for backup qmp api. We have to do it in this commit, as 257 needs to be fixed. 2. There are no more write notifiers here, so is_write_notifier parameter is dropped from block-copy paths. 3. To sync with in-flight requests at job finish we now have drained removing of the filter, we don't need rw-lock. 4. Block-copy is now using BdrvChildren instead of BlockBackends 5. As backup-top owns these children, we also move block-copy state into backup-top's ownership. = Iotest changes = 56: op-blocker doesn't shoot now, as we set it on source, but then check on filter, when trying to start second backup. To keep the test we instead can catch another collision: both jobs will get 'drive0' job-id, as job-id parameter is unspecified. To prevent interleaving with file-posix locks (as they are dependent on config) let's use another target for second backup. Also, it's obvious now that we'd like to drop this op-blocker at all and add a test-case for two backups from one node (to different destinations) actually works. But not in these series. 141: Output changed: prepatch, "Node is in use" comes from bdrv_has_blk check inside qmp_blockdev_del. But we've dropped block-copy blk objects, so no more blk objects on source bs (job blk is on backup-top filter bs). New message is from op-blocker, which is the next check in qmp_blockdev_add. 257: The test wants to emulate guest write during backup. They should go to filter node, not to original source node, of course. Therefore we need to specify filter node name and use it. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-id: 20191001131409.14202-6-vsementsov@virtuozzo.com Reviewed-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-10-10block: introduce backup-top filter driverVladimir Sementsov-Ogievskiy
Backup-top filter caches write operations and does copy-before-write operations. The driver will be used in backup instead of write-notifiers. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-id: 20191001131409.14202-5-vsementsov@virtuozzo.com Reviewed-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-10-10block/block-copy: split block_copy_set_callbacks functionVladimir Sementsov-Ogievskiy
Split block_copy_set_callbacks out of block_copy_state_new. It's needed for further commit: block-copy will use BdrvChildren of backup-top filter, so it will be created from backup-top filter creation function. But callbacks will still belong to backup job and will be set in separate. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-id: 20191001131409.14202-4-vsementsov@virtuozzo.com Reviewed-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-10-10block/backup: move write_flags calculation inside backup_job_createVladimir Sementsov-Ogievskiy
This is logic-less refactoring, which simplifies further patch, as we'll need write_flags for backup-top filter creation and backup-top should be created before block job creation. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-id: 20191001131409.14202-3-vsementsov@virtuozzo.com Reviewed-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-10-10block/backup: move in-flight requests handling from backup to block-copyVladimir Sementsov-Ogievskiy
Move synchronization mechanism to block-copy, to be able to use one block-copy instance from backup job and backup-top filter in parallel. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-id: 20191001131409.14202-2-vsementsov@virtuozzo.com Reviewed-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-10-10qapi: query-blockstat: add driver specific file-posix statsAnton Nefedov
A block driver can provide a callback to report driver-specific statistics. file-posix driver now reports discard statistics Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Acked-by: Markus Armbruster <armbru@redhat.com> Message-id: 20190923121737.83281-10-anton.nefedov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-10-10file-posix: account discard operationsAnton Nefedov
This will help to identify how many of the user-issued discard operations (accounted on a device level) have actually suceeded down on the host file (even though the numbers will not be exactly the same if non-raw format driver is used (e.g. qcow2 sending metadata discards)). Note that these numbers will not include discards triggered by write-zeroes + MAY_UNMAP calls. Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-id: 20190923121737.83281-9-anton.nefedov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-10-10block: add empty account cookie typeAnton Nefedov
Each block_acct_done/failed call is designed to correspond to a previous block_acct_start call, which initializes the stats cookie. However sometimes it is not the case, e.g. some error paths might report the same cookie twice because it is hard to accurately track if the cookie was reported yet or not. This patch cleans the cookie after report. (Note: block_acct_failed/done without a previous block_acct_start at all should be avoided. Uninitialized cookie might hold a garbage value and there is still "< BLOCK_MAX_IOTYPE" assertion for that) It will be particularly useful in ide code where it's hard to keep track whether the request done its accounting or not: in the following patch of the series, trim requests will do the accounting separately. Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-id: 20190923121737.83281-4-anton.nefedov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-10-10qapi: add unmap to BlockDeviceStatsAnton Nefedov
Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Reviewed-by: Eric Blake <eblake@redhat.com> Message-id: 20190923121737.83281-3-anton.nefedov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-10-10block: move block_copy from block/backup.c to separate fileVladimir Sementsov-Ogievskiy
Split block_copy to separate file, to be cleanly shared with backup-top filter driver in further commits. It's a clean movement, the only change is drop "static" from interface functions. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Message-id: 20190920142056.12778-8-vsementsov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-10-10block/backup: fix block-comment styleVladimir Sementsov-Ogievskiy
We need to fix comment style around block-copy functions before further moving them to separate file to satisfy checkpatch. But do more: fix all comments style. Also, seems like doubled first asterisk is not forbidden, but drop it too for consistency. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Message-id: 20190920142056.12778-7-vsementsov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-10-10block/backup: introduce BlockCopyStateVladimir Sementsov-Ogievskiy
Split copying code part from backup to "block-copy", including separate state structure and function renaming. This is needed to share it with backup-top filter driver in further commits. Notes: 1. As BlockCopyState keeps own BlockBackend objects, remaining job->common.blk users only use it to get bs by blk_bs() call, so clear job->commen.blk permissions set in block_job_create and add job->source_bs to be used instead of blk_bs(job->common.blk), to keep it more clear which bs we use when introduce backup-top filter in further commit. 2. Rename s/initializing_bitmap/skip_unallocated/ to sound a bit better as interface to BlockCopyState 3. Split is not very clean: there left some duplicated fields, backup code uses some BlockCopyState fields directly, let's postpone it for further improvements and keep this comment simpler for review. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-id: 20190920142056.12778-6-vsementsov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-10-10block/backup: improve comment about image fleecingVladimir Sementsov-Ogievskiy
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Message-id: 20190920142056.12778-5-vsementsov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-10-10block/backup: split shareable copying part from backup_do_cowVladimir Sementsov-Ogievskiy
Split copying logic which will be shared with backup-top filter. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Message-id: 20190920142056.12778-4-vsementsov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-10-10block/backup: fix backup_cow_with_offload for last clusterVladimir Sementsov-Ogievskiy
We shouldn't try to copy bytes beyond EOF. Fix it. Fixes: 9ded4a0114968e Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Reviewed-by: John Snow <jsnow@redhat.com> Message-id: 20190920142056.12778-3-vsementsov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-10-10block/backup: fix max_transfer handling for copy_rangeVladimir Sementsov-Ogievskiy
Of course, QEMU_ALIGN_UP is a typo, it should be QEMU_ALIGN_DOWN, as we are trying to find aligned size which satisfy both source and target. Also, don't ignore too small max_transfer. In this case seems safer to disable copy_range. Fixes: 9ded4a0114968e Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-id: 20190920142056.12778-2-vsementsov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-10-10block/qcow2: introduce parallel subrequest handling in read and writeVladimir Sementsov-Ogievskiy
It improves performance for fragmented qcow2 images. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-id: 20190916175324.18478-6-vsementsov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-10-10block/qcow2: refactor qcow2_co_pwritev_partVladimir Sementsov-Ogievskiy
Similarly to previous commit, prepare for parallelizing write-loop iterations. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Message-id: 20190916175324.18478-5-vsementsov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-10-10block/qcow2: refactor qcow2_co_preadv_partVladimir Sementsov-Ogievskiy
Further patch will run partial requests of iterations of qcow2_co_preadv in parallel for performance reasons. To prepare for this, separate part which may be parallelized into separate function (qcow2_co_preadv_task). While being here, also separate encrypted clusters reading to own function, like it is done for compressed reading. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Message-id: 20190916175324.18478-4-vsementsov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-10-10block: introduce aio task poolVladimir Sementsov-Ogievskiy
Common interface for aio task loops. To be used for improving performance of synchronous io loops in qcow2, block-stream, copy-on-read, and may be other places. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Message-id: 20190916175324.18478-3-vsementsov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-10-08block: Skip COR for inactive nodesMax Reitz
We must not write data to inactive nodes, and a COR is certainly something we can simply not do without upsetting anyone. So skip COR operations on inactive nodes. Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Message-id: 20191001174827.11081-2-mreitz@redhat.com Message-Id: <20191001174827.11081-2-mreitz@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2019-10-04block/snapshot: Restrict set of snapshot nodesKevin Wolf
Nodes involved in internal snapshots were those that were returned by bdrv_next(), inserted and not read-only. bdrv_next() in turn returns all nodes that are either the root node of a BlockBackend or monitor-owned nodes. With the typical -drive use, this worked well enough. However, in the typical -blockdev case, the user defines one node per option, making all nodes monitor-owned nodes. This includes protocol nodes etc. which often are not snapshottable, so "savevm" only returns an error. Change the conditions so that internal snapshot still include all nodes that have a BlockBackend attached (we definitely want to snapshot anything attached to a guest device and probably also the built-in NBD server; snapshotting block job BlockBackends is more of an accident, but a preexisting one), but other monitor-owned nodes are only included if they have no parents. This makes internal snapshots usable again with typical -blockdev configurations. Cc: qemu-stable@nongnu.org Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Peter Krempa <pkrempa@redhat.com> Tested-by: Peter Krempa <pkrempa@redhat.com>
2019-09-19cutils: Move size_to_str() from "qemu-common.h" to "qemu/cutils.h"Philippe Mathieu-Daudé
"qemu/cutils.h" contains various qemu_strtosz_*() functions useful to convert strings to size. It seems natural to have the opposite usage (from size to string) there too. The function definition is already in util/cutils.c. Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Reviewed-by: John Snow <jsnow@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Message-Id: <20190903120555.7551-1-philmd@redhat.com> Signed-off-by: Laurent Vivier <laurent@vivier.eu>
2019-09-16block/qcow2: refactor encryption codeMaxim Levitsky
* Change the qcow2_co_{encrypt|decrypt} to just receive full host and guest offsets and use this function directly instead of calling do_perform_cow_encrypt (which is removed by that patch). * Adjust qcow2_co_encdec to take full host and guest offsets as well. * Document the qcow2_co_{encrypt|decrypt} arguments to prevent the bug fixed in former commit from hopefully happening again. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-id: 20190915203655.21638-3-mlevitsk@redhat.com Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> [mreitz: Let perform_cow() return the error value returned by qcow2_co_encrypt(), as proposed by Vladimir] Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-09-16block/qcow2: Fix corruption introduced by commit 8ac0f15f335Maxim Levitsky
This fixes subtle corruption introduced by luks threaded encryption in commit 8ac0f15f335 Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1745922 The corruption happens when we do a write that * writes to two or more unallocated clusters at once * doesn't fully cover the first sector * doesn't fully cover the last sector * uses luks encryption In this case, when allocating the new clusters we COW both areas prior to the write and after the write, and we encrypt them. The above mentioned commit accidentally made it so we encrypt the second COW area using the physical cluster offset of the first area. The problem is that offset_in_cluster in do_perform_cow_encrypt can be larger that the cluster size, thus cluster_offset will no longer point to the start of the cluster at which encrypted area starts. Next patch in this series will refactor the code to avoid all these assumptions. In the bugreport that was triggered by rebasing a luks image to new, zero filled base, which lot of such writes, and causes some files with zero areas to contain garbage there instead. But as described above it can happen elsewhere as well Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-id: 20190915203655.21638-2-mlevitsk@redhat.com Reviewed-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-09-16curl: Check curl_multi_add_handle()'s return codeMax Reitz
If we had done that all along, debugging would have been much simpler. (Also, I/O errors are better than hangs.) Signed-off-by: Max Reitz <mreitz@redhat.com> Message-id: 20190910124136.10565-8-mreitz@redhat.com Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: John Snow <jsnow@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-09-16curl: Handle success in multi_check_completionMax Reitz
Background: As of cURL 7.59.0, it verifies that several functions are not called from within a callback. Among these functions is curl_multi_add_handle(). curl_read_cb() is a callback from cURL and not a coroutine. Waking up acb->co will lead to entering it then and there, which means the current request will settle and the caller (if it runs in the same coroutine) may then issue the next request. In such a case, we will enter curl_setup_preadv() effectively from within curl_read_cb(). Calling curl_multi_add_handle() will then fail and the new request will not be processed. Fix this by not letting curl_read_cb() wake up acb->co. Instead, leave the whole business of settling the AIOCB objects to curl_multi_check_completion() (which is called from our timer callback and our FD handler, so not from any cURL callbacks). Reported-by: Natalie Gavrielov <ngavrilo@redhat.com> Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1740193 Cc: qemu-stable@nongnu.org Signed-off-by: Max Reitz <mreitz@redhat.com> Message-id: 20190910124136.10565-7-mreitz@redhat.com Reviewed-by: John Snow <jsnow@redhat.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-09-16curl: Report only ready socketsMax Reitz
Instead of reporting all sockets to cURL, only report the one that has caused curl_multi_do_locked() to be called. This lets us get rid of the QLIST_FOREACH_SAFE() list, which was actually wrong: SAFE foreaches are only safe when the current element is removed in each iteration. If it possible for the list to be concurrently modified, we cannot guarantee that only the current element will be removed. Therefore, we must not use QLIST_FOREACH_SAFE() here. Fixes: ff5ca1664af85b24a4180d595ea6873fd3deac57 Cc: qemu-stable@nongnu.org Signed-off-by: Max Reitz <mreitz@redhat.com> Message-id: 20190910124136.10565-6-mreitz@redhat.com Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: John Snow <jsnow@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-09-16curl: Pass CURLSocket to curl_multi_do()Max Reitz
curl_multi_do_locked() currently marks all sockets as ready. That is not only inefficient, but in fact unsafe (the loop is). A follow-up patch will change that, but to do so, curl_multi_do_locked() needs to know exactly which socket is ready; and that is accomplished by this patch here. Cc: qemu-stable@nongnu.org Signed-off-by: Max Reitz <mreitz@redhat.com> Message-id: 20190910124136.10565-5-mreitz@redhat.com Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: John Snow <jsnow@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-09-16curl: Check completion in curl_multi_do()Max Reitz
While it is more likely that transfers complete after some file descriptor has data ready to read, we probably should not rely on it. Better be safe than sorry and call curl_multi_check_completion() in curl_multi_do(), too, just like it is done in curl_multi_read(). With this change, curl_multi_do() and curl_multi_read() are actually the same, so drop curl_multi_read() and use curl_multi_do() as the sole FD handler. Signed-off-by: Max Reitz <mreitz@redhat.com> Message-id: 20190910124136.10565-4-mreitz@redhat.com Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: John Snow <jsnow@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-09-16curl: Keep *socket until the end of curl_sock_cb()Max Reitz
This does not really change anything, but it makes the code a bit easier to follow once we use @socket as the opaque pointer for aio_set_fd_handler(). Cc: qemu-stable@nongnu.org Signed-off-by: Max Reitz <mreitz@redhat.com> Message-id: 20190910124136.10565-3-mreitz@redhat.com Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: John Snow <jsnow@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-09-16curl: Keep pointer to the CURLState in CURLSocketMax Reitz
A follow-up patch will make curl_multi_do() and curl_multi_read() take a CURLSocket instead of the CURLState. They still need the latter, though, so add a pointer to it to the former. Cc: qemu-stable@nongnu.org Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: John Snow <jsnow@redhat.com> Message-id: 20190910124136.10565-2-mreitz@redhat.com Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-09-16block: Use QEMU_IS_ALIGNEDNir Soffer
Replace instances of: (n & (BDRV_SECTOR_SIZE - 1)) == 0 And: (n & ~BDRV_SECTOR_MASK) == 0 With: QEMU_IS_ALIGNED(n, BDRV_SECTOR_SIZE) Which reveals the intent of the code better, and makes it easier to locate the code checking alignment. Signed-off-by: Nir Soffer <nsoffer@redhat.com> Message-id: 20190827185913.27427-2-nsoffer@redhat.com Reviewed-by: John Snow <jsnow@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-09-13qcow2: Stop overwriting compressed clusters one by oneAlberto Garcia
handle_alloc() tries to find as many contiguous clusters that need copy-on-write as possible in order to allocate all of them at the same time. However, compressed clusters are only overwritten one by one, so let's say that we have an image with 1024 consecutive compressed clusters: qemu-img create -f qcow2 hd.qcow2 64M for f in `seq 0 64 65472`; do qemu-io -c "write -c ${f}k 64k" hd.qcow2 done In this case trying to overwrite the whole image with one large write request results in 1024 separate allocations: qemu-io -c "write 0 64M" hd.qcow2 This restriction comes from commit 095a9c58ce12afeeb90c2 from 2008. Nowadays QEMU can overwrite multiple compressed clusters just fine, and in fact it already does: as long as the first cluster that handle_alloc() finds is not compressed, all other compressed clusters in the same batch will be overwritten in one go: qemu-img create -f qcow2 hd.qcow2 64M qemu-io -c "write -z 0 64k" hd.qcow2 for f in `seq 64 64 65472`; do qemu-io -c "write -c ${f}k 64k" hd.qcow2 done Compared to the previous one, overwriting this image on my computer goes from 8.35s down to 230ms. Signed-off-by: Alberto Garcia <berto@igalia.com> Reviewed-by: John Snow <jsnow@redhat.com Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-09-13block/create: Do not abort if a block driver is not availablePhilippe Mathieu-Daudé
The 'blockdev-create' QMP command was introduced as experimental feature in commit b0292b851b8, using the assert() debug call. It got promoted to 'stable' command in 3fb588a0f2c, but the assert call was not removed. Some block drivers are optional, and bdrv_find_format() might return a NULL value, triggering the assertion. Stable code is not expected to abort, so return an error instead. This is easily reproducible when libnfs is not installed: ./configure [...] module support no Block whitelist (rw) Block whitelist (ro) libiscsi support yes libnfs support no [...] Start QEMU: $ qemu-system-x86_64 -S -qmp unix:/tmp/qemu.qmp,server,nowait Send the 'blockdev-create' with the 'nfs' driver: $ ( cat << 'EOF' {'execute': 'qmp_capabilities'} {'execute': 'blockdev-create', 'arguments': {'job-id': 'x', 'options': {'size': 0, 'driver': 'nfs', 'location': {'path': '/', 'server': {'host': '::1', 'type': 'inet'}}}}, 'id': 'x'} EOF ) | socat STDIO UNIX:/tmp/qemu.qmp {"QMP": {"version": {"qemu": {"micro": 50, "minor": 1, "major": 4}, "package": "v4.1.0-733-g89ea03a7dc"}, "capabilities": ["oob"]}} {"return": {}} QEMU crashes: $ gdb qemu-system-x86_64 core Program received signal SIGSEGV, Segmentation fault. (gdb) bt #0 0x00007ffff510957f in raise () at /lib64/libc.so.6 #1 0x00007ffff50f3895 in abort () at /lib64/libc.so.6 #2 0x00007ffff50f3769 in _nl_load_domain.cold.0 () at /lib64/libc.so.6 #3 0x00007ffff5101a26 in .annobin_assert.c_end () at /lib64/libc.so.6 #4 0x0000555555d7e1f1 in qmp_blockdev_create (job_id=0x555556baee40 "x", options=0x555557666610, errp=0x7fffffffc770) at block/create.c:69 #5 0x0000555555c96b52 in qmp_marshal_blockdev_create (args=0x7fffdc003830, ret=0x7fffffffc7f8, errp=0x7fffffffc7f0) at qapi/qapi-commands-block-core.c:1314 #6 0x0000555555deb0a0 in do_qmp_dispatch (cmds=0x55555645de70 <qmp_commands>, request=0x7fffdc005c70, allow_oob=false, errp=0x7fffffffc898) at qapi/qmp-dispatch.c:131 #7 0x0000555555deb2a1 in qmp_dispatch (cmds=0x55555645de70 <qmp_commands>, request=0x7fffdc005c70, allow_oob=false) at qapi/qmp-dispatch.c:174 With this patch applied, QEMU returns a QMP error: {'execute': 'blockdev-create', 'arguments': {'job-id': 'x', 'options': {'size': 0, 'driver': 'nfs', 'location': {'path': '/', 'server': {'host': '::1', 'type': 'inet'}}}}, 'id': 'x'} {"id": "x", "error": {"class": "GenericError", "desc": "Block driver 'nfs' not found or not supported"}} Cc: qemu-stable@nongnu.org Reported-by: Xu Tian <xutian@redhat.com> Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: John Snow <jsnow@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-09-13block/nfs: add support for nfs_umountPeter Lieven
libnfs recently added support for unmounting. Add support in Qemu too. Signed-off-by: Peter Lieven <pl@kamp.de> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-09-13block/nfs: tear down aio before nfs_closePeter Lieven
nfs_close is a sync call from libnfs and has its own event handler polling on the nfs FD. Avoid that both QEMU and libnfs are intefering here. CC: qemu-stable@nongnu.org Signed-off-by: Peter Lieven <pl@kamp.de> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-09-10vpc: Return 0 from vpc_co_create() on successMax Reitz
blockdev_create_run() directly uses .bdrv_co_create()'s return value as the job's return value. Jobs must return 0 on success, not just any nonnegative value. Therefore, using blockdev-create for VPC images may currently fail as the vpc driver may return a positive integer. Because there is no point in returning a positive integer anywhere in the block layer (all non-negative integers are generally treated as complete success), we probably do not want to add more such cases. Therefore, fix this problem by making the vpc driver always return 0 in case of success. Suggested-by: Kevin Wolf <kwolf@redhat.com> Cc: qemu-stable@nongnu.org Signed-off-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-09-10file-posix: Fix has_write_zeroes after NO_FALLBACKKevin Wolf
If QEMU_AIO_NO_FALLBACK is given, we always return failure and don't even try to use the BLKZEROOUT ioctl. In this failure case, we shouldn't disable has_write_zeroes because we didn't learn anything about the ioctl. The next request might not set QEMU_AIO_NO_FALLBACK and we can still use the ioctl then. Fixes: 738301e1175 Reported-by: Eric Blake <eblake@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com>
2019-09-10block/file-posix: Reduce xfsctl() useMax Reitz
This patch removes xfs_write_zeroes() and xfs_discard(). Both functions have been added just before the same feature was present through fallocate(): - fallocate() has supported PUNCH_HOLE for XFS since Linux 2.6.38 (March 2011); xfs_discard() was added in December 2010. - fallocate() has supported ZERO_RANGE for XFS since Linux 3.15 (June 2014); xfs_write_zeroes() was added in November 2013. Nowadays, all systems that qemu runs on should support both fallocate() features (RHEL 7's kernel does). xfsctl() is still useful for getting the request alignment for O_DIRECT, so this patch does not remove our dependency on it completely. Note that xfs_write_zeroes() had a bug: It calls ftruncate() when the file is shorter than the specified range (because ZERO_RANGE does not increase the file length). ftruncate() may yield and then discard data that parallel write requests have written past the EOF in the meantime. Dropping the function altogether fixes the bug. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Fixes: 50ba5b2d994853b38fed10e0841b119da0f8b8e5 Reported-by: Lukáš Doktor <ldoktor@redhat.com> Cc: qemu-stable@nongnu.org Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: John Snow <jsnow@redhat.com> Tested-by: Stefano Garzarella <sgarzare@redhat.com> Tested-by: John Snow <jsnow@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-09-10job: drop job_drainVladimir Sementsov-Ogievskiy
In job_finish_sync job_enter should be enough for a job to make some progress and draining is a wrong tool for it. So use job_enter directly here and drop job_drain with all related staff not used more. Suggested-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Tested-by: John Snow <jsnow@redhat.com> Reviewed-by: John Snow <jsnow@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-09-10qcow2: Fix the calculation of the maximum L2 cache sizeAlberto Garcia
The size of the qcow2 L2 cache defaults to 32 MB, which can be easily larger than the maximum amount of L2 metadata that the image can have. For example: with 64 KB clusters the user would need a qcow2 image with a virtual size of 256 GB in order to have 32 MB of L2 metadata. Because of that, since commit b749562d9822d14ef69c9eaa5f85903010b86c30 we forbid the L2 cache to become larger than the maximum amount of L2 metadata for the image, calculated using this formula: uint64_t max_l2_cache = virtual_disk_size / (s->cluster_size / 8); The problem with this formula is that the result should be rounded up to the cluster size because an L2 table on disk always takes one full cluster. For example, a 1280 MB qcow2 image with 64 KB clusters needs exactly 160 KB of L2 metadata, but we need 192 KB on disk (3 clusters) even if the last 32 KB of those are not going to be used. However QEMU rounds the numbers down and only creates 2 cache tables (128 KB), which is not enough for the image. A quick test doing 4KB random writes on a 1280 MB image gives me around 500 IOPS, while with the correct cache size I get 16K IOPS. Cc: qemu-stable@nongnu.org Signed-off-by: Alberto Garcia <berto@igalia.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-09-05nbd: Implement client use of NBD FAST_ZEROEric Blake
The client side is fairly straightforward: if the server advertised fast zero support, then we can map that to BDRV_REQ_NO_FALLBACK support. A server that advertises FAST_ZERO but not WRITE_ZEROES is technically broken, but we can ignore that situation as it does not change our behavior. Signed-off-by: Eric Blake <eblake@redhat.com> Message-Id: <20190823143726.27062-4-eblake@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2019-09-05block: workaround for unaligned byte range in fallocate()Andrey Shinkevich
Revert the commit 118f99442d 'block/io.c: fix for the allocation failure' and use better error handling for file systems that do not support fallocate() for an unaligned byte range. Allow falling back to pwrite in case fallocate() returns EINVAL. Suggested-by: Kevin Wolf <kwolf@redhat.com> Suggested-by: Eric Blake <eblake@redhat.com> Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Denis V. Lunev <den@openvz.org> Message-Id: <1566913973-15490-1-git-send-email-andrey.shinkevich@virtuozzo.com> Signed-off-by: Eric Blake <eblake@redhat.com>
2019-09-05nbd: Use g_autofree in a few placesEric Blake
Thanks to our recent move to use glib's g_autofree, I can join the bandwagon. Getting rid of gotos is fun ;) There are probably more places where we could register cleanup functions and get rid of more gotos; this patch just focuses on the labels that existed merely to call g_free. Signed-off-by: Eric Blake <eblake@redhat.com> Message-Id: <20190824172813.29720-2-eblake@redhat.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2019-09-03file-posix: fix request_alignment typoStefan Hajnoczi
Fixes: a6b257a08e3d72219f03e461a52152672fec0612 ("file-posix: Handle undetectable alignment") Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-id: 20190827101328.4062-1-stefanha@redhat.com Reviewed-by: Thomas Huth <thuth@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-09-03vmdk: Reject invalid compressed writesMax Reitz
Compressed writes generally have to write full clusters, not just in theory but also in practice when it comes to vmdk's streamOptimized subformat. It currently is just silently broken for writes with non-zero in-cluster offsets: $ qemu-img create -f vmdk -o subformat=streamOptimized foo.vmdk 1M $ qemu-io -c 'write 4k 4k' -c 'read 4k 4k' foo.vmdk wrote 4096/4096 bytes at offset 4096 4 KiB, 1 ops; 00.01 sec (443.724 KiB/sec and 110.9309 ops/sec) read failed: Invalid argument (The technical reason is that vmdk_write_extent() just writes the incomplete compressed data actually to offset 4k. When reading the data, vmdk_read_extent() looks at offset 0 and finds the compressed data size to be 0, because that is what it reads from there. This yields an error.) For incomplete writes with zero in-cluster offsets, the error path when reading the rest of the cluster is a bit different, but the result is the same: $ qemu-img create -f vmdk -o subformat=streamOptimized foo.vmdk 1M $ qemu-io -c 'write 0k 4k' -c 'read 4k 4k' foo.vmdk wrote 4096/4096 bytes at offset 0 4 KiB, 1 ops; 00.01 sec (362.641 KiB/sec and 90.6603 ops/sec) read failed: Invalid argument (Here, vmdk_read_extent() finds the data and then sees that the uncompressed data is short.) It is better to reject invalid writes than to make the user believe they might have succeeded and then fail when trying to read it back. Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: John Snow <jsnow@redhat.com> Message-id: 20190815153638.4600-5-mreitz@redhat.com Reviewed-by: John Snow <jsnow@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-09-03vmdk: Use bdrv_dirname() for relative extent pathsMax Reitz
This makes iotest 033 pass with e.g. subformat=monolithicFlat. It also turns a former error in 059 into success. Signed-off-by: Max Reitz <mreitz@redhat.com> Message-id: 20190815153638.4600-3-mreitz@redhat.com Reviewed-by: John Snow <jsnow@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-09-03block: posix: Always allocate the first blockNir Soffer
When creating an image with preallocation "off" or "falloc", the first block of the image is typically not allocated. When using Gluster storage backed by XFS filesystem, reading this block using direct I/O succeeds regardless of request length, fooling alignment detection. In this case we fallback to a safe value (4096) instead of the optimal value (512), which may lead to unneeded data copying when aligning requests. Allocating the first block avoids the fallback. Since we allocate the first block even with preallocation=off, we no longer create images with zero disk size: $ ./qemu-img create -f raw test.raw 1g Formatting 'test.raw', fmt=raw size=1073741824 $ ls -lhs test.raw 4.0K -rw-r--r--. 1 nsoffer nsoffer 1.0G Aug 16 23:48 test.raw And converting the image requires additional cluster: $ ./qemu-img measure -f raw -O qcow2 test.raw required size: 458752 fully allocated size: 1074135040 When using format like vmdk with multiple files per image, we allocate one block per file: $ ./qemu-img create -f vmdk -o subformat=twoGbMaxExtentFlat test.vmdk 4g Formatting 'test.vmdk', fmt=vmdk size=4294967296 compat6=off hwversion=undefined subformat=twoGbMaxExtentFlat $ ls -lhs test*.vmdk 4.0K -rw-r--r--. 1 nsoffer nsoffer 2.0G Aug 27 03:23 test-f001.vmdk 4.0K -rw-r--r--. 1 nsoffer nsoffer 2.0G Aug 27 03:23 test-f002.vmdk 4.0K -rw-r--r--. 1 nsoffer nsoffer 353 Aug 27 03:23 test.vmdk I did quick performance test for copying disks with qemu-img convert to new raw target image to Gluster storage with sector size of 512 bytes: for i in $(seq 10); do rm -f dst.raw sleep 10 time ./qemu-img convert -f raw -O raw -t none -T none src.raw dst.raw done Here is a table comparing the total time spent: Type Before(s) After(s) Diff(%) --------------------------------------- real 530.028 469.123 -11.4 user 17.204 10.768 -37.4 sys 17.881 7.011 -60.7 We can see very clear improvement in CPU usage. Signed-off-by: Nir Soffer <nsoffer@redhat.com> Message-id: 20190827010528.8818-2-nsoffer@redhat.com Reviewed-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-08-27block/qcow2: implement .bdrv_co_pwritev(_compressed)_partVladimir Sementsov-Ogievskiy
Implement and use new interface to get rid of hd_qiov. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Acked-by: Stefan Hajnoczi <stefanha@redhat.com> Message-id: 20190604161514.262241-13-vsementsov@virtuozzo.com Message-Id: <20190604161514.262241-13-vsementsov@virtuozzo.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>