slackcoder/qemu - QEMU is a generic and open source machine & userspace emulator and virtualizer

Age	Commit message (Collapse)	Author
2019-06-24	ssh: switch from libssh2 to libssh	Pino Toscano
	Rewrite the implementation of the ssh block driver to use libssh instead of libssh2. The libssh library has various advantages over libssh2: - easier API for authentication (for example for using ssh-agent) - easier API for known_hosts handling - supports newer types of keys in known_hosts Use APIs/features available in libssh 0.8 conditionally, to support older versions (which are not recommended though). Adjust the iotest 207 according to the different error message, and to find the default key type for localhost (to properly compare the fingerprint with). Contributed-by: Max Reitz <mreitz@redhat.com> Adjust the various Docker/Travis scripts to use libssh when available instead of libssh2. The mingw/mxe testing is dropped for now, as there are no packages for it. Signed-off-by: Pino Toscano <ptoscano@redhat.com> Tested-by: Philippe Mathieu-Daudé <philmd@redhat.com> Acked-by: Alex Bennée <alex.bennee@linaro.org> Message-id: 20190620200840.17655-1-ptoscano@redhat.com Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-id: 5873173.t2JhDm7DL7@lindworm.usersys.redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-06-24	vmdk: Add read-only support for seSparse snapshots	Sam Eiderman
	Until ESXi 6.5 VMware used the vmfsSparse format for snapshots (VMDK3 in QEMU). This format was lacking in the following: * Grain directory (L1) and grain table (L2) entries were 32-bit, allowing access to only 2TB (slightly less) of data. * The grain size (default) was 512 bytes - leading to data fragmentation and many grain tables. * For space reclamation purposes, it was necessary to find all the grains which are not pointed to by any grain table - so a reverse mapping of "offset of grain in vmdk" to "grain table" must be constructed - which takes large amounts of CPU/RAM. The format specification can be found in VMware's documentation: https://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf In ESXi 6.5, to support snapshot files larger than 2TB, a new format was introduced: SESparse (Space Efficient). This format fixes the above issues: * All entries are now 64-bit. * The grain size (default) is 4KB. * Grain directory and grain tables are now located at the beginning of the file. + seSparse format reserves space for all grain tables. + Grain tables can be addressed using an index. + Grains are located in the end of the file and can also be addressed with an index. - seSparse vmdks of large disks (64TB) have huge preallocated headers - mainly due to L2 tables, even for empty snapshots. * The header contains a reverse mapping ("backmap") of "offset of grain in vmdk" to "grain table" and a bitmap ("free bitmap") which specifies for each grain - whether it is allocated or not. Using these data structures we can implement space reclamation efficiently. * Due to the fact that the header now maintains two mappings: * The regular one (grain directory & grain tables) * A reverse one (backmap and free bitmap) These data structures can lose consistency upon crash and result in a corrupted VMDK. Therefore, a journal is also added to the VMDK and is replayed when the VMware reopens the file after a crash. Since ESXi 6.7 - SESparse is the only snapshot format available. Unfortunately, VMware does not provide documentation regarding the new seSparse format. This commit is based on black-box research of the seSparse format. Various in-guest block operations and their effect on the snapshot file were tested. The only VMware provided source of information (regarding the underlying implementation) was a log file on the ESXi: /var/log/hostd.log Whenever an seSparse snapshot is created - the log is being populated with seSparse records. Relevant log records are of the form: [...] Const Header: [...] constMagic = 0xcafebabe [...] version = 2.1 [...] capacity = 204800 [...] grainSize = 8 [...] grainTableSize = 64 [...] flags = 0 [...] Extents: [...] Header : <1 : 1> [...] JournalHdr : <2 : 2> [...] Journal : <2048 : 2048> [...] GrainDirectory : <4096 : 2048> [...] GrainTables : <6144 : 2048> [...] FreeBitmap : <8192 : 2048> [...] BackMap : <10240 : 2048> [...] Grain : <12288 : 204800> [...] Volatile Header: [...] volatileMagic = 0xcafecafe [...] FreeGTNumber = 0 [...] nextTxnSeqNumber = 0 [...] replayJournal = 0 The sizes that are seen in the log file are in sectors. Extents are of the following format: <offset : size> This commit is a strict implementation which enforces: * magics * version number 2.1 * grain size of 8 sectors (4KB) * grain table size of 64 sectors * zero flags * extent locations Additionally, this commit proivdes only a subset of the functionality offered by seSparse's format: * Read-only * No journal replay * No space reclamation * No unmap support Hence, journal header, journal, free bitmap and backmap extents are unused, only the "classic" (L1 -> L2 -> data) grain access is implemented. However there are several differences in the grain access itself. Grain directory (L1): * Grain directory entries are indexes (not offsets) to grain tables. * Valid grain directory entries have their highest nibble set to 0x1. * Since grain tables are always located in the beginning of the file - the index can fit into 32 bits - so we can use its low part if it's valid. Grain table (L2): * Grain table entries are indexes (not offsets) to grains. * If the highest nibble of the entry is: 0x0: The grain in not allocated. The rest of the bytes are 0. 0x1: The grain is unmapped - guest sees a zero grain. The rest of the bits point to the previously mapped grain, see 0x3 case. 0x2: The grain is zero. 0x3: The grain is allocated - to get the index calculate: ((entry & 0x0fff000000000000) >> 48) \| ((entry & 0x0000ffffffffffff) << 12) * The difference between 0x1 and 0x2 is that 0x1 is an unallocated grain which results from the guest using sg_unmap to unmap the grain - but the grain itself still exists in the grain extent - a space reclamation procedure should delete it. Unmapping a zero grain has no effect (0x2 will not change to 0x1) but unmapping an unallocated grain will (0x0 to 0x1) - naturally. In order to implement seSparse some fields had to be changed to support both 32-bit and 64-bit entry sizes. Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com> Reviewed-by: Eyal Moscovici <eyal.moscovici@oracle.com> Reviewed-by: Arbel Moshe <arbel.moshe@oracle.com> Signed-off-by: Sam Eiderman <shmuel.eiderman@oracle.com> Message-id: 20190620091057.47441-4-shmuel.eiderman@oracle.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-06-24	vmdk: Reduce the max bound for L1 table size	Sam Eiderman
	512M of L1 entries is a very loose bound, only 32M are required to store the maximal supported VMDK file size of 2TB. Fixed qemu-iotest 59# - now failure occures before on impossible L1 table size. Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com> Reviewed-by: Eyal Moscovici <eyal.moscovici@oracle.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Arbel Moshe <arbel.moshe@oracle.com> Signed-off-by: Sam Eiderman <shmuel.eiderman@oracle.com> Message-id: 20190620091057.47441-3-shmuel.eiderman@oracle.com Reviewed-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-06-24	vmdk: Fix comment regarding max l1_size coverage	Sam Eiderman
	Commit b0651b8c246d ("vmdk: Move l1_size check into vmdk_add_extent") extended the l1_size check from VMDK4 to VMDK3 but did not update the default coverage in the moved comment. The previous vmdk4 calculation: (512 * 1024 * 1024) * 512(l2 entries) * 65536(grain) = 16PB The added vmdk3 calculation: (512 * 1024 * 1024) * 4096(l2 entries) * 512(grain) = 1PB Adding the calculation of vmdk3 to the comment. In any case, VMware does not offer virtual disks more than 2TB for vmdk4/vmdk3 or 64TB for the new undocumented seSparse format which is not implemented yet in qemu. Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com> Reviewed-by: Eyal Moscovici <eyal.moscovici@oracle.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Arbel Moshe <arbel.moshe@oracle.com> Signed-off-by: Sam Eiderman <shmuel.eiderman@oracle.com> Message-id: 20190620091057.47441-2-shmuel.eiderman@oracle.com Reviewed-by: yuchenlin <yuchenlin@synology.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-06-18	block/commit: Drop bdrv_child_try_set_perm()	Max Reitz
	commit_top_bs never requests or unshares any permissions. There is no reason to make this so explicit here. Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-06-18	block/mirror: Fix child permissions	Max Reitz
	We cannot use bdrv_child_try_set_perm() to give up all restrictions on the child edge, and still have bdrv_mirror_top_child_perm() request BLK_PERM_WRITE. Fix this by making bdrv_mirror_top_child_perm() return 0/BLK_PERM_ALL when we want to give up all permissions, and replacing bdrv_child_try_set_perm() by bdrv_child_refresh_perms(). The bdrv_child_try_set_perm() before removing the node with bdrv_replace_node() is then unnecessary. No permissions have changed since the previous invocation of bdrv_child_try_set_perm(). Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-06-18	file-posix: Update open_flags in raw_set_perm()	Max Reitz
	raw_check_perm() + raw_set_perm() can change the flags associated with the current FD. If so, we have to update BDRVRawState.open_flags accordingly. Otherwise, we may keep reopening the FD even though the current one already has the correct flags. Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-06-18	block: drop bs->job	Vladimir Sementsov-Ogievskiy
	Drop remaining users of bs->job: 1. assertions actually duplicated by assert(!bs->refcnt) 2. trace-point seems not enough reason to change stream_start to return BlockJob pointer 3. Restricting creation of two jobs based on same bs is bad idea, as 3.1 Some jobs creates filters to be their main node, so, this check don't actually prevent creating second job on same real node (which will create another filter node) (but I hope it is restricted by other mechanisms) 3.2 Even without bs->job we have two systems of permissions: op-blockers and BLK_PERM 3.3 We may want to run several jobs on one node one day And finally, drop bs->job pointer itself. Hurrah! Suggested-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-06-18	block/block-backend: blk_iostatus_reset: drop usage of bs->job	Vladimir Sementsov-Ogievskiy
	We are going to remove bs->job pointer. Drop it's usage in blk_iostatus_reset. blk_iostatus_reset() has only two callers: 1. blk_attach_dev(). This doesn't have anything to do with jobs and attaching a new guest device won't solve any problem the job encountered, so no reason to reset the iostatus for the job. 2. qmp_cont(). This resets the iostatus for everything. We can just call block_job_iostatus_reset() for all block jobs instead of going through BlockBackend. Suggested-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-06-18	block/replication: drop usage of bs->job	Vladimir Sementsov-Ogievskiy
	We are going to remove bs->job pointer. Drop it's usage in replication code. Additionally we have to return job pointer from some mirror APIs. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-06-14	blkdebug: Inject errors on .bdrv_co_block_status()	Max Reitz
	Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-id: 20190507203508.18026-6-mreitz@redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-06-14	blkdebug: Add "none" event	Max Reitz
	Together with @iotypes and @sector, this can be used to trap e.g. the first read or write access to a certain sector without having to know what happens internally in the block layer, i.e. which "real" events happen right before such an access. Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-id: 20190507203508.18026-5-mreitz@redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-06-14	blkdebug: Add @iotype error option	Max Reitz
	This new error option allows users of blkdebug to inject errors only on certain kinds of I/O operations. Users usually want to make a very specific operation fail, not just any; but right now they simply hope that the event that triggers the error injection is followed up with that very operation. That may not be true, however, because the block layer is changing (including blkdebug, which may increase the number of types of I/O operations on which to inject errors). The new option's default has been chosen to keep backwards compatibility. Note that similar to the internal representation, we could choose to expose this option as a list of I/O types. But there is no practical use for this, because as described above, users usually know exactly which kind of operation they want to make fail, so there is no need to specify multiple I/O types at once. In addition, exposing this option as a list would require non-trivial changes to qemu_opts_absorb_qdict(). Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-id: 20190507203508.18026-4-mreitz@redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-06-13	block/nbd: merge NBDClientSession struct back to BDRVNBDState	Vladimir Sementsov-Ogievskiy
	No reason to keep it separate, it differs from others block driver behavior and therefore confuses. Instead of generic 'state = (State*)bs->opaque' we have to use special helper. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-Id: <20190611102720.86114-4-vsementsov@virtuozzo.com> Reviewed-by: Eric Blake <eblake@redhat.com> Signed-off-by: Eric Blake <eblake@redhat.com>
2019-06-13	block/nbd: merge nbd-client.* to nbd.c	Vladimir Sementsov-Ogievskiy
	No reason for keeping driver handlers realization separate from driver structure. We can get rid of extra header file. While being here, fix comments style, restore forgotten comments for NBD_FOREACH_REPLY_CHUNK and nbd_reply_chunk_iter_receive, remove extra includes. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-Id: <20190611102720.86114-3-vsementsov@virtuozzo.com> Reviewed-by: Eric Blake <eblake@redhat.com> Signed-off-by: Eric Blake <eblake@redhat.com>
2019-06-13	block/nbd-client: drop stale logout	Vladimir Sementsov-Ogievskiy
	Drop one on failure path (we have errp) and turn two others into trace points. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-Id: <20190611102720.86114-2-vsementsov@virtuozzo.com> Reviewed-by: Eric Blake <eblake@redhat.com> Signed-off-by: Eric Blake <eblake@redhat.com>
2019-06-12	block/gluster: update .help of BLOCK_OPT_PREALLOC option	Stefano Garzarella
	Add missing 'falloc' among the allowed values of 'preallocation' option; show it and 'full' only when they are supported. ('falloc' is supported if defined CONFIG_GLUSTERFS_FALLOCATE, 'full' is supported if defined CONFIG_GLUSTERFS_ZEROFILL) Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Markus Armbruster <armbru@redhat.com> Message-Id: <20190524075848.23781-4-sgarzare@redhat.com> Signed-off-by: Markus Armbruster <armbru@redhat.com>
2019-06-12	block/file-posix: update .help of BLOCK_OPT_PREALLOC option	Stefano Garzarella
	Show 'falloc' among the allowed values of 'preallocation' option, only when it is supported (if defined CONFIG_POSIX_FALLOCATE) Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Markus Armbruster <armbru@redhat.com> Message-Id: <20190524075848.23781-3-sgarzare@redhat.com> Signed-off-by: Markus Armbruster <armbru@redhat.com>
2019-06-12	Include qemu-common.h exactly where needed	Markus Armbruster
	No header includes qemu-common.h after this commit, as prescribed by qemu-common.h's file comment. Signed-off-by: Markus Armbruster <armbru@redhat.com> Message-Id: <20190523143508.25387-5-armbru@redhat.com> [Rebased with conflicts resolved automatically, except for include/hw/arm/xlnx-zynqmp.h hw/arm/nrf51_soc.c hw/arm/msf2-soc.c block/qcow2-refcount.c block/qcow2-cluster.c block/qcow2-cache.c target/arm/cpu.h target/lm32/cpu.h target/m68k/cpu.h target/mips/cpu.h target/moxie/cpu.h target/nios2/cpu.h target/openrisc/cpu.h target/riscv/cpu.h target/tilegx/cpu.h target/tricore/cpu.h target/unicore32/cpu.h target/xtensa/cpu.h; bsd-user/main.c and net/tap-bsd.c fixed up]
2019-06-12	Include qemu/module.h where needed, drop it from qemu-common.h	Markus Armbruster
	Signed-off-by: Markus Armbruster <armbru@redhat.com> Message-Id: <20190523143508.25387-4-armbru@redhat.com> [Rebased with conflicts resolved automatically, except for hw/usb/dev-hub.c hw/misc/exynos4210_rng.c hw/misc/bcm2835_rng.c hw/misc/aspeed_scu.c hw/display/virtio-vga.c hw/arm/stm32f205_soc.c; ui/cocoa.m fixed up]
2019-06-11	qemu-common: Move qemu_isalnum() etc. to qemu/ctype.h	Markus Armbruster
	Signed-off-by: Markus Armbruster <armbru@redhat.com> Message-Id: <20190523143508.25387-3-armbru@redhat.com> Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
2019-06-04	block/io: bdrv_pdiscard: support int64_t bytes parameter	Vladimir Sementsov-Ogievskiy
	This fixes at least one overflow in qcow2_process_discards, which passes 64bit region length to bdrv_pdiscard where bytes (or sectors in the past) parameter is int since its introduction in 0b919fae. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-06-04	block/qcow2-refcount: add trace-point to qcow2_process_discards	Vladimir Sementsov-Ogievskiy
	Let's at least trace ignored failure. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Eric Blake <eblake@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-06-04	block: Remove wrong bdrv_set_aio_context() calls	Kevin Wolf
	The mirror and commit block jobs use bdrv_set_aio_context() to move their filter node into the right AioContext before hooking it up in the graph. Similarly, bdrv_open_backing_file() explicitly moves the backing file node into the right AioContext first. This isn't necessary any more, they get automatically moved into the right context now when attaching them. However, in the case of bdrv_open_backing_file() with a node reference, it's actually not only unnecessary, but even wrong: The unchecked bdrv_set_aio_context() changes the AioContext of the child node even if other parents require it to retain the old context. So this is not only a simplification, but a bug fix, too. Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1684342 Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-06-04	block: Adjust AioContexts when attaching nodes	Kevin Wolf
	So far, we only made sure that updating the AioContext of a node affected the whole subtree. However, if a node is newly attached to a new parent, we also need to make sure that both the subtree of the node and the parent are in the same AioContext. This tries to move the new child node to the parent AioContext and returns an error if this isn't possible. BlockBackends now actually apply their AioContext to their root node. Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-06-04	block: Add BlockBackend.ctx	Kevin Wolf
	This adds a new parameter to blk_new() which requires its callers to declare from which AioContext this BlockBackend is going to be used (or the locks of which AioContext need to be taken anyway). The given context is only stored and kept up to date when changing AioContexts. Actually applying the stored AioContext to the root node is saved for another commit. Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-06-04	block: Add Error to blk_set_aio_context()	Kevin Wolf
	Add an Error parameter to blk_set_aio_context() and use bdrv_child_try_set_aio_context() internally to check whether all involved nodes can actually support the AioContext switch. Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-06-04	block/linux-aio: Drop unused BlockAIOCB submission method	Julia Suvorova
	Callback-based laio_submit() and laio_cancel() were left after rewriting Linux AIO backend to coroutines in hope that they would be used in other code that could bypass coroutines. They can be safely removed because they have not been used since that time. Signed-off-by: Julia Suvorova <jusual@mail.ru> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-06-04	block/io: Delay decrementing the quiesce_counter	Max Reitz
	When ending a drained section, bdrv_do_drained_end() currently first decrements the quiesce_counter, and only then actually ends the drain. The bdrv_drain_invoke(bs, false) call may cause graph changes. Say the graph change involves replacing an existing BB's ("blk") BDS (blk_bs(blk)) by @bs. Let us introducing the following values: - bs_oqc = old_quiesce_counter (so bs->quiesce_counter == bs_oqc - 1) - obs_qc = blk_bs(blk)->quiesce_counter (before bdrv_drain_invoke()) Let us assume there is no blk_pread_unthrottled() involved, so blk->quiesce_counter == obs_qc (before bdrv_drain_invoke()). Now replacing blk_bs(blk) by @bs will reduce blk->quiesce_counter by obs_qc (making it 0) and increase it by bs_oqc-1 (making it bs_oqc-1). bdrv_drain_invoke() returns and we invoke bdrv_parent_drained_end(). This will decrement blk->quiesce_counter by one, so it would be -1 -- were there not an assertion against that in blk_root_drained_end(). We therefore have to keep the quiesce_counter up at least until bdrv_drain_invoke() returns, so that bdrv_parent_drained_end() does the right thing for the parents @bs got during bdrv_drain_invoke(). But let us delay it even further, namely until bdrv_parent_drained_end() returns, because then it mirrors bdrv_do_drained_begin(): There, we first increment the quiesce_counter, then begin draining the parents, and then call bdrv_drain_invoke(). It makes sense to let bdrv_do_drained_end() unravel this exactly in reverse. Signed-off-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-06-04	block: avoid recursive block_status call if possible	Vladimir Sementsov-Ogievskiy
	drv_co_block_status digs bs->file for additional, more accurate search for hole inside region, reported as DATA by bs since 5daa74a6ebc. This accuracy is not free: assume we have qcow2 disk. Actually, qcow2 knows, where are holes and where is data. But every block_status request calls lseek additionally. Assume a big disk, full of data, in any iterative copying block job (or img convert) we'll call lseek(HOLE) on every iteration, and each of these lseeks will have to iterate through all metadata up to the end of file. It's obviously ineffective behavior. And for many scenarios we don't need this lseek at all. However, lseek is needed when we have metadata-preallocated image. So, let's detect metadata-preallocation case and don't dig qcow2's protocol file in other cases. The idea is to compare allocation size in POV of filesystem with allocations size in POV of Qcow2 (by refcounts). If allocation in fs is significantly lower, consider it as metadata-preallocation case. 102 iotest changed, as our detector can't detect shrinked file as metadata-preallocation, which don't seem to be wrong, as with metadata preallocation we always have valid file length. Two other iotests have a slight change in their QMP output sequence: Active 'block-commit' returns earlier because the job coroutine yields earlier on a blocking operation. This operation is loading the refcount blocks in qcow2_detect_metadata_preallocation(). Suggested-by: Denis V. Lunev <den@openvz.org> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-05-30	Merge remote-tracking branch 'remotes/jnsnow/tags/bitmaps-pull-request' into ↵	Peter Maydell
	staging Pull request # gpg: Signature made Wed 29 May 2019 00:58:33 BST # gpg: using RSA key F9B7ABDBBCACDF95BE76CBD07DEF8106AAFC390E # gpg: Good signature from "John Snow (John Huston) <jsnow@redhat.com>" [full] # Primary key fingerprint: FAEB 9711 A12C F475 812F 18F2 88A9 064D 1835 61EB # Subkey fingerprint: F9B7 ABDB BCAC DF95 BE76 CBD0 7DEF 8106 AAFC 390E * remotes/jnsnow/tags/bitmaps-pull-request: iotests: test external snapshot with bitmap copying qapi: support external bitmaps in block-dirty-bitmap-merge migration/dirty-bitmaps: change bitmap enumeration method Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2019-05-28	qapi: support external bitmaps in block-dirty-bitmap-merge	Vladimir Sementsov-Ogievskiy
	Add new optional parameter making possible to merge bitmaps from different nodes. It is needed to maintain external snapshots during incremental backup chain history. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: John Snow <jsnow@redhat.com> Message-id: 20190517152111.206494-2-vsementsov@virtuozzo.com Signed-off-by: John Snow <jsnow@redhat.com>
2019-05-28	qcow2-bitmap: initialize bitmap directory alignment	Andrey Shinkevich
	Valgrind detects multiple issues in QEMU iotests when the memory is used without being initialized. Valgrind may dump lots of unnecessary reports what makes the memory issue analysis harder. Particularly, that is true for the aligned bitmap directory and can be seen while running the iotest #169. Padding the aligned space with zeros eases the pain. Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com> Message-id: 1558961521-131620-1-git-send-email-andrey.shinkevich@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-05-28	qcow2: skip writing zero buffers to empty COW areas	Anton Nefedov
	If COW areas of the newly allocated clusters are zeroes on the backing image, efficient bdrv_write_zeroes(flags=BDRV_REQ_NO_FALLBACK) can be used on the whole cluster instead of writing explicit zero buffers later in perform_cow(). iotest 060: write to the discarded cluster does not trigger COW anymore. Use a backing image instead. Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com> Message-id: 20190516142749.81019-2-anton.nefedov@virtuozzo.com Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-05-28	block: Make bdrv_root_attach_child() unref child_bs on failure	Alberto Garcia
	A consequence of the previous patch is that bdrv_attach_child() transfers the reference to child_bs from the caller to parent_bs, which will drop it on bdrv_close() or when someone calls bdrv_unref_child(). But this only happens when bdrv_attach_child() succeeds. If it fails then the caller is responsible for dropping the reference to child_bs. This patch makes bdrv_attach_child() take the reference also when there is an error, freeing the caller for having to do it. A similar situation happens with bdrv_root_attach_child(), so the changes on this patch affect both functions. Signed-off-by: Alberto Garcia <berto@igalia.com> Message-id: 20dfb3d9ccec559cdd1a9690146abad5d204a186.1557754872.git.berto@igalia.com [mreitz: Removed now superfluous BdrvChild * variable in bdrv_open_child()] Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-05-28	block/backup: refactor: split out backup_calculate_cluster_size	Vladimir Sementsov-Ogievskiy
	Split out cluster_size calculation. Move copy-bitmap creation above block-job creation, as we are going to share it with upcoming backup-top filter, which also should be created before actual block job creation. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-id: 20190429090842.57910-6-vsementsov@virtuozzo.com [mreitz: Dropped a paragraph from the commit message that was left over from a previous version] Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-05-28	block/backup: unify different modes code path	Vladimir Sementsov-Ogievskiy
	Do full, top and incremental mode copying all in one place. This unifies the code path and helps further improvements. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Message-id: 20190429090842.57910-5-vsementsov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-05-28	block/backup: refactor and tolerate unallocated cluster skipping	Vladimir Sementsov-Ogievskiy
	Split allocation checking to separate function and reduce nesting. Consider bdrv_is_allocated() fail as allocated area, as copying more than needed is not wrong (and we do it anyway) and seems better than fail the whole job. And, most probably we will fail on the next read, if there are real problem with source. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Message-id: 20190429090842.57910-4-vsementsov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-05-28	block/backup: move to copy_bitmap with granularity	Vladimir Sementsov-Ogievskiy
	We are going to share this bitmap between backup and backup-top filter driver, so let's share something more meaningful. It also simplifies some calculations. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Message-id: 20190429090842.57910-3-vsementsov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-05-28	block/backup: simplify backup_incremental_init_copy_bitmap	Vladimir Sementsov-Ogievskiy
	Simplify backup_incremental_init_copy_bitmap using the function bdrv_dirty_bitmap_next_dirty_area. Note: move to job->len instead of bitmap size: it should not matter but less code. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Message-id: 20190429090842.57910-2-vsementsov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-05-28	qcow2: do encryption in threads	Vladimir Sementsov-Ogievskiy
	Do encryption/decryption in threads, like it is already done for compression. This improves asynchronous encrypted io. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Message-id: 20190506142741.41731-9-vsementsov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-05-28	qcow2: bdrv_co_pwritev: move encryption code out of the lock	Vladimir Sementsov-Ogievskiy
	Encryption will be done in threads, to take benefit of it, we should move it out of the lock first. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Message-id: 20190506142741.41731-8-vsementsov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-05-28	qcow2: qcow2_co_preadv: improve locking	Vladimir Sementsov-Ogievskiy
	Background: decryption will be done in threads, to take benefit of it, we should move it out of the lock first. But let's go further: it turns out, that only qcow2_get_cluster_offset() needs locking, so reduce locking to it. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-id: 20190506142741.41731-7-vsementsov@virtuozzo.com Reviewed-by: Alberto Garcia <berto@igalia.com> Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-05-28	qcow2-threads: split out generic path	Vladimir Sementsov-Ogievskiy
	Move generic part out of qcow2_co_do_compress, to reuse it for encryption and rename things that would be shared with encryption path. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Message-id: 20190506142741.41731-6-vsementsov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-05-28	qcow2-threads: qcow2_co_do_compress: protect queuing by mutex	Vladimir Sementsov-Ogievskiy
	Drop dependence on AioContext lock. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Message-id: 20190506142741.41731-5-vsementsov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-05-28	qcow2-threads: use thread_pool_submit_co	Vladimir Sementsov-Ogievskiy
	Use thread_pool_submit_co, instead of reinventing it here. Note, that thread_pool_submit_aio() never returns NULL, so checking it was an extra thing. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Message-id: 20190506142741.41731-4-vsementsov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-05-28	qcow2: add separate file for threaded data processing functions	Vladimir Sementsov-Ogievskiy
	Move compression-on-threads to separate file. Encryption will be in it too. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Message-id: 20190506142741.41731-3-vsementsov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-05-28	qcow2.h: add missing include	Vladimir Sementsov-Ogievskiy
	qcow2.h depends on block_int.h. Compilation isn't broken currently only due to block_int.h always included before qcow2.h. Though, it seems better to directly include block_int.h in qcow2.h. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Message-id: 20190506142741.41731-2-vsementsov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-05-20	block/file-posix: Unaligned O_DIRECT block-status	Max Reitz
	Currently, qemu crashes whenever someone queries the block status of an unaligned image tail of an O_DIRECT image: $ echo > foo $ qemu-img map --image-opts driver=file,filename=foo,cache.direct=on Offset Length Mapped to File qemu-img: block/io.c:2093: bdrv_co_block_status: Assertion `pnum && QEMU_IS_ALIGNED(pnum, align) && align > offset - aligned_offset' failed. This is because bdrv_co_block_status() checks that the result returned by the driver's implementation is aligned to the request_alignment, but file-posix can fail to do so, which is actually mentioned in a comment there: "[...] possibly including a partial sector at EOF". Fix this by rounding up those partial sectors. There are two possible alternative fixes: (1) We could refuse to open unaligned image files with O_DIRECT altogether. That sounds reasonable until you realize that qcow2 does necessarily not fill up its metadata clusters, and that nobody runs qemu-img create with O_DIRECT. Therefore, unpreallocated qcow2 files usually have an unaligned image tail. (2) bdrv_co_block_status() could ignore unaligned tails. It actually throws away everything past the EOF already, so that sounds reasonable. Unfortunately, the block layer knows file lengths only with a granularity of BDRV_SECTOR_SIZE, so bdrv_co_block_status() usually would have to guess whether its file length information is inexact or whether the driver is broken. Fixing what raw_co_block_status() returns is the safest thing to do. There seems to be no other block driver that sets request_alignment and does not make sure that it always returns aligned values. Cc: qemu-stable@nongnu.org Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-05-20	blockjob: Propagate AioContext change to all job nodes	Kevin Wolf
	Block jobs require that all of the nodes the job is using are in the same AioContext. Therefore all BdrvChild objects of the job propagate .(can_)set_aio_context to all other job nodes, so that the switch is checked and performed consistently even if both nodes are in different subtrees. Signed-off-by: Kevin Wolf <kwolf@redhat.com>