aboutsummaryrefslogtreecommitdiff
path: root/block/file-posix.c
AgeCommit message (Collapse)Author
2020-02-20file-posix: Drop hdev_co_create_opts()Max Reitz
The generic fallback implementation effectively does the same. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com> Message-Id: <20200122164532.178040-4-mreitz@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com>
2020-01-30block/file-posix.c: extend to use io_uringAarushi Mehta
Signed-off-by: Aarushi Mehta <mehta.aaru20@gmail.com> Reviewed-by: Maxim Levitsky <maximlevitsky@gmail.com> Acked-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-id: 20200120141858.587874-9-stefanha@redhat.com Message-Id: <20200120141858.587874-9-stefanha@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2020-01-30block/io: wait for serialising requests when a request becomes serialisingPaolo Bonzini
Marking without waiting would not result in actual serialising behavior. Thus, make a call bdrv_mark_request_serialising sufficient for serialisation to happen. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-id: 1578495356-46219-3-git-send-email-pbonzini@redhat.com Message-Id: <1578495356-46219-3-git-send-email-pbonzini@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2019-12-02block/file-posix: Fix laio_init() error handling crash bugMarkus Armbruster
raw_aio_attach_aio_context() passes uninitialized Error *local_err by reference to laio_init() via aio_setup_linux_aio(). When laio_init() fails, it passes it on to error_setg_errno(), tripping error_setv()'s assertion unless @local_err is null by dumb luck. Fix by initializing @local_err properly. Fixes: ed6e2161715c527330f936d44af4c547f25f687e Cc: Nishanth Aravamudan <naravamudan@digitalocean.com> Cc: Stefan Hajnoczi <stefanha@redhat.com> Cc: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Markus Armbruster <armbru@redhat.com> Message-Id: <20191130194240.10517-4-armbru@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com>
2019-11-04block/file-posix: Let post-EOF fallocate serializeMax Reitz
The XFS kernel driver has a bug that may cause data corruption for qcow2 images as of qemu commit c8bb23cbdbe32f. We can work around it by treating post-EOF fallocates as serializing up until infinity (INT64_MAX in practice). Cc: qemu-stable@nongnu.org Signed-off-by: Max Reitz <mreitz@redhat.com> Message-id: 20191101152510.11719-4-mreitz@redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-10-28Merge remote-tracking branch 'remotes/maxreitz/tags/pull-block-2019-10-28' ↵Peter Maydell
into staging Block patches for softfreeze: - iotest patches - Improve performance of the mirror block job in write-blocking mode - Limit memory usage for the backup block job - Add discard and write-zeroes support to the NVMe host block driver - Fix a bug in the mirror job - Prevent the qcow2 driver from creating technically non-compliant qcow2 v3 images (where there is not enough extra data for snapshot table entries) - Allow callers of bdrv_truncate() (etc.) to determine whether the file must be resized to the exact given size or whether it is OK for block devices not to shrink # gpg: Signature made Mon 28 Oct 2019 12:13:53 GMT # gpg: using RSA key 91BEB60A30DB3E8857D11829F407DB0061D5CF40 # gpg: issuer "mreitz@redhat.com" # gpg: Good signature from "Max Reitz <mreitz@redhat.com>" [full] # Primary key fingerprint: 91BE B60A 30DB 3E88 57D1 1829 F407 DB00 61D5 CF40 * remotes/maxreitz/tags/pull-block-2019-10-28: (69 commits) qemu-iotests: restrict 264 to qcow2 only Revert "qemu-img: Check post-truncation size" block: Pass truncate exact=true where reasonable block: Let format drivers pass @exact block: Evaluate @exact in protocol drivers block: Add @exact parameter to bdrv_co_truncate() block: Do not truncate file node when formatting block/cor: Drop cor_co_truncate() block: Handle filter truncation like native impl. iotests: Test qcow2's snapshot table handling iotests: Add peek_file* functions qcow2: Fix v3 snapshot table entry compliancy qcow2: Repair snapshot table with too many entries qcow2: Fix overly long snapshot tables qcow2: Keep track of the snapshot table length qcow2: Fix broken snapshot table entries qcow2: Add qcow2_check_fix_snapshot_table() qcow2: Separate qcow2_check_read_snapshot_table() qcow2: Write v3-compliant snapshot list on upgrade qcow2: Put qcow2_upgrade() into its own function ... Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2019-10-28block: Evaluate @exact in protocol driversMax Reitz
We have two protocol drivers that return success when trying to shrink a block device even though they cannot shrink it. This behavior is now only allowed with exact=false, so they should return an error with exact=true. Signed-off-by: Max Reitz <mreitz@redhat.com> Message-id: 20190918095144.955-6-mreitz@redhat.com Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-10-28block: Add @exact parameter to bdrv_co_truncate()Max Reitz
We have two drivers (iscsi and file-posix) that (in some cases) return success from their .bdrv_co_truncate() implementation if the block device is larger than the requested offset, but cannot be shrunk. Some callers do not want that behavior, so this patch adds a new parameter that they can use to turn off that behavior. This patch just adds the parameter and lets the block/io.c and block/block-backend.c functions pass it around. All other callers always pass false and none of the implementations evaluate it, so that this patch does not change existing behavior. Future patches take care of that. Suggested-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com> Message-id: 20190918095144.955-5-mreitz@redhat.com Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-10-26core: replace getpagesize() with qemu_real_host_page_sizeWei Yang
There are three page size in qemu: real host page size host page size target page size All of them have dedicate variable to represent. For the last two, we use the same form in the whole qemu project, while for the first one we use two forms: qemu_real_host_page_size and getpagesize(). qemu_real_host_page_size is defined to be a replacement of getpagesize(), so let it serve the role. [Note] Not fully tested for some arch or device. Signed-off-by: Wei Yang <richardw.yang@linux.intel.com> Message-Id: <20191013021145.16011-3-richardw.yang@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-10-10qapi: query-blockstat: add driver specific file-posix statsAnton Nefedov
A block driver can provide a callback to report driver-specific statistics. file-posix driver now reports discard statistics Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Acked-by: Markus Armbruster <armbru@redhat.com> Message-id: 20190923121737.83281-10-anton.nefedov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-10-10file-posix: account discard operationsAnton Nefedov
This will help to identify how many of the user-issued discard operations (accounted on a device level) have actually suceeded down on the host file (even though the numbers will not be exactly the same if non-raw format driver is used (e.g. qcow2 sending metadata discards)). Note that these numbers will not include discards triggered by write-zeroes + MAY_UNMAP calls. Signed-off-by: Anton Nefedov <anton.nefedov@virtuozzo.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-id: 20190923121737.83281-9-anton.nefedov@virtuozzo.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-09-10file-posix: Fix has_write_zeroes after NO_FALLBACKKevin Wolf
If QEMU_AIO_NO_FALLBACK is given, we always return failure and don't even try to use the BLKZEROOUT ioctl. In this failure case, we shouldn't disable has_write_zeroes because we didn't learn anything about the ioctl. The next request might not set QEMU_AIO_NO_FALLBACK and we can still use the ioctl then. Fixes: 738301e1175 Reported-by: Eric Blake <eblake@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com>
2019-09-10block/file-posix: Reduce xfsctl() useMax Reitz
This patch removes xfs_write_zeroes() and xfs_discard(). Both functions have been added just before the same feature was present through fallocate(): - fallocate() has supported PUNCH_HOLE for XFS since Linux 2.6.38 (March 2011); xfs_discard() was added in December 2010. - fallocate() has supported ZERO_RANGE for XFS since Linux 3.15 (June 2014); xfs_write_zeroes() was added in November 2013. Nowadays, all systems that qemu runs on should support both fallocate() features (RHEL 7's kernel does). xfsctl() is still useful for getting the request alignment for O_DIRECT, so this patch does not remove our dependency on it completely. Note that xfs_write_zeroes() had a bug: It calls ftruncate() when the file is shorter than the specified range (because ZERO_RANGE does not increase the file length). ftruncate() may yield and then discard data that parallel write requests have written past the EOF in the meantime. Dropping the function altogether fixes the bug. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Fixes: 50ba5b2d994853b38fed10e0841b119da0f8b8e5 Reported-by: Lukáš Doktor <ldoktor@redhat.com> Cc: qemu-stable@nongnu.org Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: John Snow <jsnow@redhat.com> Tested-by: Stefano Garzarella <sgarzare@redhat.com> Tested-by: John Snow <jsnow@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-09-05block: workaround for unaligned byte range in fallocate()Andrey Shinkevich
Revert the commit 118f99442d 'block/io.c: fix for the allocation failure' and use better error handling for file systems that do not support fallocate() for an unaligned byte range. Allow falling back to pwrite in case fallocate() returns EINVAL. Suggested-by: Kevin Wolf <kwolf@redhat.com> Suggested-by: Eric Blake <eblake@redhat.com> Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Denis V. Lunev <den@openvz.org> Message-Id: <1566913973-15490-1-git-send-email-andrey.shinkevich@virtuozzo.com> Signed-off-by: Eric Blake <eblake@redhat.com>
2019-09-03file-posix: fix request_alignment typoStefan Hajnoczi
Fixes: a6b257a08e3d72219f03e461a52152672fec0612 ("file-posix: Handle undetectable alignment") Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-id: 20190827101328.4062-1-stefanha@redhat.com Reviewed-by: Thomas Huth <thuth@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-09-03block: posix: Always allocate the first blockNir Soffer
When creating an image with preallocation "off" or "falloc", the first block of the image is typically not allocated. When using Gluster storage backed by XFS filesystem, reading this block using direct I/O succeeds regardless of request length, fooling alignment detection. In this case we fallback to a safe value (4096) instead of the optimal value (512), which may lead to unneeded data copying when aligning requests. Allocating the first block avoids the fallback. Since we allocate the first block even with preallocation=off, we no longer create images with zero disk size: $ ./qemu-img create -f raw test.raw 1g Formatting 'test.raw', fmt=raw size=1073741824 $ ls -lhs test.raw 4.0K -rw-r--r--. 1 nsoffer nsoffer 1.0G Aug 16 23:48 test.raw And converting the image requires additional cluster: $ ./qemu-img measure -f raw -O qcow2 test.raw required size: 458752 fully allocated size: 1074135040 When using format like vmdk with multiple files per image, we allocate one block per file: $ ./qemu-img create -f vmdk -o subformat=twoGbMaxExtentFlat test.vmdk 4g Formatting 'test.vmdk', fmt=vmdk size=4294967296 compat6=off hwversion=undefined subformat=twoGbMaxExtentFlat $ ls -lhs test*.vmdk 4.0K -rw-r--r--. 1 nsoffer nsoffer 2.0G Aug 27 03:23 test-f001.vmdk 4.0K -rw-r--r--. 1 nsoffer nsoffer 2.0G Aug 27 03:23 test-f002.vmdk 4.0K -rw-r--r--. 1 nsoffer nsoffer 353 Aug 27 03:23 test.vmdk I did quick performance test for copying disks with qemu-img convert to new raw target image to Gluster storage with sector size of 512 bytes: for i in $(seq 10); do rm -f dst.raw sleep 10 time ./qemu-img convert -f raw -O raw -t none -T none src.raw dst.raw done Here is a table comparing the total time spent: Type Before(s) After(s) Diff(%) --------------------------------------- real 530.028 469.123 -11.4 user 17.204 10.768 -37.4 sys 17.881 7.011 -60.7 We can see very clear improvement in CPU usage. Signed-off-by: Nir Soffer <nsoffer@redhat.com> Message-id: 20190827010528.8818-2-nsoffer@redhat.com Reviewed-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-08-19block: Implement .bdrv_has_zero_init_truncate()Max Reitz
We need to implement .bdrv_has_zero_init_truncate() for every block driver that supports truncation and has a .bdrv_has_zero_init() implementation. Implement it the same way each driver implements .bdrv_has_zero_init(). This is at least not any more unsafe than what we had before. Signed-off-by: Max Reitz <mreitz@redhat.com> Message-id: 20190724171239.8764-5-mreitz@redhat.com Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-08-17block: fix NetBSD qemu-iotests failurePaolo Bonzini
Opening a block device on NetBSD has an additional step compared to other OSes, corresponding to raw_normalize_devicepath. The error message in that function is slightly different from that in raw_open_common and this was causing spurious failures in qemu-iotests. However, in general it is not important to know what exact step was failing, for example in the qemu-iotests case the error message contains the fairly unequivocal "No such file or directory" text from strerror. We can thus fix the failures by standardizing on a single error message for both raw_open_common and raw_normalize_devicepath; in fact, we can even use error_setg_file_open to make sure the error message is the same as in the rest of QEMU. Message-Id: <20190725095920.28419-1-pbonzini@redhat.com> Tested-by: Thomas Huth <thuth@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com> Signed-off-by: Thomas Huth <thuth@redhat.com>
2019-08-16file-posix: Handle undetectable alignmentNir Soffer
In some cases buf_align or request_alignment cannot be detected: 1. With Gluster, buf_align cannot be detected since the actual I/O is done on Gluster server, and qemu buffer alignment does not matter. Since we don't have alignment requirement, buf_align=1 is the best value. 2. With local XFS filesystem, buf_align cannot be detected if reading from unallocated area. In this we must align the buffer, but we don't know what is the correct size. Using the wrong alignment results in I/O error. 3. With Gluster backed by XFS, request_alignment cannot be detected if reading from unallocated area. In this case we need to use the correct alignment, and failing to do so results in I/O errors. 4. With NFS, the server does not use direct I/O, so both buf_align cannot be detected. In this case we don't need any alignment so we can use buf_align=1 and request_alignment=1. These cases seems to work when storage sector size is 512 bytes, because the current code starts checking align=512. If the check succeeds because alignment cannot be detected we use 512. But this does not work for storage with 4k sector size. To determine if we can detect the alignment, we probe first with align=1. If probing succeeds, maybe there are no alignment requirement (cases 1, 4) or we are probing unallocated area (cases 2, 3). Since we don't have any way to tell, we treat this as undetectable alignment. If probing with align=1 fails with EINVAL, but probing with one of the expected alignments succeeds, we know that we found a working alignment. Practically the alignment requirements are the same for buffer alignment, buffer length, and offset in file. So in case we cannot detect buf_align, we can use request alignment. If we cannot detect request alignment, we can fallback to a safe value. To use this logic, we probe first request alignment instead of buf_align. Here is a table showing the behaviour with current code (the value in parenthesis is the optimal value). Case Sector buf_align (opt) request_alignment (opt) result ====================================================================== 1 512 512 (1) 512 (512) OK 1 4096 512 (1) 4096 (4096) FAIL ---------------------------------------------------------------------- 2 512 512 (512) 512 (512) OK 2 4096 512 (4096) 4096 (4096) FAIL ---------------------------------------------------------------------- 3 512 512 (1) 512 (512) OK 3 4096 512 (1) 512 (4096) FAIL ---------------------------------------------------------------------- 4 512 512 (1) 512 (1) OK 4 4096 512 (1) 512 (1) OK Same cases with this change: Case Sector buf_align (opt) request_alignment (opt) result ====================================================================== 1 512 512 (1) 512 (512) OK 1 4096 4096 (1) 4096 (4096) OK ---------------------------------------------------------------------- 2 512 512 (512) 512 (512) OK 2 4096 4096 (4096) 4096 (4096) OK ---------------------------------------------------------------------- 3 512 4096 (1) 4096 (512) OK 3 4096 4096 (1) 4096 (4096) OK ---------------------------------------------------------------------- 4 512 4096 (1) 4096 (1) OK 4 4096 4096 (1) 4096 (1) OK I tested that provisioning VMs and copying disks on local XFS and Gluster with 4k bytes sector size work now, resolving bugs [1],[2]. I tested also on XFS, NFS, Gluster with 512 bytes sector size. [1] https://bugzilla.redhat.com/1737256 [2] https://bugzilla.redhat.com/1738657 Signed-off-by: Nir Soffer <nsoffer@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-07-12file-posix: Use max transfer length/segment count only for SCSI passthroughMaxim Levitsky
Regular kernel block devices (/dev/sda*, /dev/nvme*, etc) don't have max segment size/max segment count hardware requirements exposed to the userspace, but rather the kernel block layer takes care to split the incoming requests that violate these requirements. Allowing the kernel to do the splitting allows qemu to avoid various overheads that arise otherwise from this. This is especially visible in nbd server, exposing as a raw file, a mostly empty qcow2 image over the net. In this case most of the reads by the remote user won't even hit the underlying kernel block device, and therefore most of the overhead will be in the nbd traffic which increases significantly with lower max transfer size. In addition to that even for local block device access the peformance improves a bit due to less traffic between qemu and the kernel when large transfer sizes are used (e.g for image conversion) More info can be found at: https://bugzilla.redhat.com/show_bug.cgi?id=1647104 Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Pankaj Gupta <pagupta@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-06-18file-posix: Update open_flags in raw_set_perm()Max Reitz
raw_check_perm() + raw_set_perm() can change the flags associated with the current FD. If so, we have to update BDRVRawState.open_flags accordingly. Otherwise, we may keep reopening the FD even though the current one already has the correct flags. Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-06-12block/file-posix: update .help of BLOCK_OPT_PREALLOC optionStefano Garzarella
Show 'falloc' among the allowed values of 'preallocation' option, only when it is supported (if defined CONFIG_POSIX_FALLOCATE) Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Markus Armbruster <armbru@redhat.com> Message-Id: <20190524075848.23781-3-sgarzare@redhat.com> Signed-off-by: Markus Armbruster <armbru@redhat.com>
2019-06-12Include qemu-common.h exactly where neededMarkus Armbruster
No header includes qemu-common.h after this commit, as prescribed by qemu-common.h's file comment. Signed-off-by: Markus Armbruster <armbru@redhat.com> Message-Id: <20190523143508.25387-5-armbru@redhat.com> [Rebased with conflicts resolved automatically, except for include/hw/arm/xlnx-zynqmp.h hw/arm/nrf51_soc.c hw/arm/msf2-soc.c block/qcow2-refcount.c block/qcow2-cluster.c block/qcow2-cache.c target/arm/cpu.h target/lm32/cpu.h target/m68k/cpu.h target/mips/cpu.h target/moxie/cpu.h target/nios2/cpu.h target/openrisc/cpu.h target/riscv/cpu.h target/tilegx/cpu.h target/tricore/cpu.h target/unicore32/cpu.h target/xtensa/cpu.h; bsd-user/main.c and net/tap-bsd.c fixed up]
2019-05-20block/file-posix: Unaligned O_DIRECT block-statusMax Reitz
Currently, qemu crashes whenever someone queries the block status of an unaligned image tail of an O_DIRECT image: $ echo > foo $ qemu-img map --image-opts driver=file,filename=foo,cache.direct=on Offset Length Mapped to File qemu-img: block/io.c:2093: bdrv_co_block_status: Assertion `*pnum && QEMU_IS_ALIGNED(*pnum, align) && align > offset - aligned_offset' failed. This is because bdrv_co_block_status() checks that the result returned by the driver's implementation is aligned to the request_alignment, but file-posix can fail to do so, which is actually mentioned in a comment there: "[...] possibly including a partial sector at EOF". Fix this by rounding up those partial sectors. There are two possible alternative fixes: (1) We could refuse to open unaligned image files with O_DIRECT altogether. That sounds reasonable until you realize that qcow2 does necessarily not fill up its metadata clusters, and that nobody runs qemu-img create with O_DIRECT. Therefore, unpreallocated qcow2 files usually have an unaligned image tail. (2) bdrv_co_block_status() could ignore unaligned tails. It actually throws away everything past the EOF already, so that sounds reasonable. Unfortunately, the block layer knows file lengths only with a granularity of BDRV_SECTOR_SIZE, so bdrv_co_block_status() usually would have to guess whether its file length information is inexact or whether the driver is broken. Fixing what raw_co_block_status() returns is the safest thing to do. There seems to be no other block driver that sets request_alignment and does not make sure that it always returns aligned values. Cc: qemu-stable@nongnu.org Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-05-20block/file-posix: Truncate in xfs_write_zeroes()Max Reitz
XFS_IOC_ZERO_RANGE does not increase the file length: $ touch foo $ xfs_io -c 'zero 0 65536' foo $ stat -c "size=%s, blocks=%b" foo size=0, blocks=128 We do want writes beyond the EOF to automatically increase the file length, however. This is evidenced by the fact that iotest 061 is broken on XFS since qcow2's check implementation checks for blocks beyond the EOF. Reported-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-04-02block/file-posix: do not fail on unlock bytesVladimir Sementsov-Ogievskiy
bdrv_replace_child() calls bdrv_check_perm() with error_abort on loosening permissions. However file-locking operations may fail even in this case, for example on NFS. And this leads to Qemu crash. Let's avoid such errors. Note, that we ignore such things anyway on permission update commit and abort. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-26file-posix: Support BDRV_REQ_NO_FALLBACK for zero writesKevin Wolf
We know that the kernel implements a slow fallback code path for BLKZEROOUT, so if BDRV_REQ_NO_FALLBACK is given, we shouldn't call it. The other operations we call in the context of .bdrv_co_pwrite_zeroes should usually be quick, so no modification should be needed for them. If we ever notice that there are additional problematic cases, we can still make these conditional as well. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Acked-by: Eric Blake <eblake@redhat.com>
2019-03-14Merge remote-tracking branch 'remotes/stefanha/tags/block-pull-request' into ↵Peter Maydell
staging Pull request * Add 'drop-cache=on|off' option to file-posix.c. The default is on. Disabling the option fixes a QEMU 3.0.0 performance regression when live migrating on the same host with cache.direct=off. # gpg: Signature made Wed 13 Mar 2019 11:07:48 GMT # gpg: using RSA key 9CA4ABB381AB73C8 # gpg: Good signature from "Stefan Hajnoczi <stefanha@redhat.com>" [full] # gpg: aka "Stefan Hajnoczi <stefanha@gmail.com>" [full] # Primary key fingerprint: 8695 A8BF D3F9 7CDA AC35 775A 9CA4 ABB3 81AB 73C8 * remotes/stefanha/tags/block-pull-request: file-posix: add drop-cache=on|off option Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2019-03-13file-posix: add drop-cache=on|off optionStefan Hajnoczi
Commit dd577a26ff03b6829721b1ffbbf9e7c411b72378 ("block/file-posix: implement bdrv_co_invalidate_cache() on Linux") introduced page cache invalidation so that cache.direct=off live migration is safe on Linux. The invalidation takes a significant amount of time when the file is large and present in the page cache. Normally this is not the case for cross-host live migration but it can happen when migrating between QEMU processes on the same host. On same-host migration we don't need to invalidate pages for correctness anyway, so an option to skip page cache invalidation is useful. I investigated optimizing invalidation and detecting same-host migration, but both are hard to achieve so a user-visible option will suffice. As a bonus this option means that the cache invalidation feature will now be detectable by libvirt via QMP schema introspection. Suggested-by: Neil Skrypuch <neil@tembosocial.com> Tested-by: Neil Skrypuch <neil@tembosocial.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-id: 20190307164941.3322-1-stefanha@redhat.com Message-Id: <20190307164941.3322-1-stefanha@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2019-03-12block: Add a 'mutable_opts' field to BlockDriverAlberto Garcia
If we reopen a BlockDriverState and there is an option that is present in bs->options but missing from the new set of options then we have to return an error unless the driver is able to reset it to its default value. This patch adds a new 'mutable_opts' field to BlockDriver. This is a list of runtime options that can be modified during reopen. If an option in this list is unspecified on reopen then it must be reset (or return an error). Signed-off-by: Alberto Garcia <berto@igalia.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-12file-posix: Make auto-read-only dynamicKevin Wolf
Until now, with auto-read-only=on we tried to open the file read-write first and if that failed, read-only was tried. This is actually not good enough for libvirt, which gives QEMU SELinux permissions for read-write only as soon as it actually intends to write to the image. So we need to be able to switch between read-only and read-write at runtime. This patch makes auto-read-only dynamic, i.e. the file is opened read-only as long as no user of the node has requested write permissions, but it is automatically reopened read-write as soon as the first writer is attached. Conversely, if the last writer goes away, the file is reopened read-only again. bs->read_only is no longer set for auto-read-only=on files even if the file descriptor is opened read-only because it will be transparently upgraded as soon as a writer is attached. This changes the output of qemu-iotests 232. Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-12file-posix: Prepare permission code for fd switchingKevin Wolf
In order to be able to dynamically reopen the file read-only or read-write, depending on the users that are attached, we need to be able to switch to a different file descriptor during the permission change. This interacts with reopen, which also creates a new file descriptor and performs permission changes internally. In this case, the permission change code must reuse the reopen file descriptor instead of creating a third one. In turn, reopen can drop its code to copy file locks to the new file descriptor because that is now done when applying the new permissions. Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-12file-posix: Lock new fd in raw_reopen_prepare()Kevin Wolf
There is no reason why we can take locks on the new file descriptor only in raw_reopen_commit() where error handling isn't possible any more. Instead, we can already do this in raw_reopen_prepare(). Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-12file-posix: Store BDRVRawState.reopen_state during reopenKevin Wolf
We'll want to access the file descriptor in the reopen_state while processing permission changes in the context of the repoen. Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-03-12file-posix: Factor out raw_reconfigure_getfd()Kevin Wolf
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2019-01-31block/file-posix: Convert from DPRINTF() macro to trace eventsLaurent Vivier
Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-id: 20181213162727.17438-4-lvivier@redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2019-01-11avoid TABs in files that only contain a fewPaolo Bonzini
Most files that have TABs only contain a handful of them. Change them to spaces so that we don't confuse people. disas, standard-headers, linux-headers and libdecnumber are imported from other projects and probably should be exempted from the check. Outside those, after this patch the following files still contain both 8-space and TAB sequences at the beginning of the line. Many of them have a majority of TABs, or were initially committed with all tabs. bsd-user/i386/target_syscall.h bsd-user/x86_64/target_syscall.h crypto/aes.c hw/audio/fmopl.c hw/audio/fmopl.h hw/block/tc58128.c hw/display/cirrus_vga.c hw/display/xenfb.c hw/dma/etraxfs_dma.c hw/intc/sh_intc.c hw/misc/mst_fpga.c hw/net/pcnet.c hw/sh4/sh7750.c hw/timer/m48t59.c hw/timer/sh_timer.c include/crypto/aes.h include/disas/bfd.h include/hw/sh4/sh.h libdecnumber/decNumber.c linux-headers/asm-generic/unistd.h linux-headers/linux/kvm.h linux-user/alpha/target_syscall.h linux-user/arm/nwfpe/double_cpdo.c linux-user/arm/nwfpe/fpa11_cpdt.c linux-user/arm/nwfpe/fpa11_cprt.c linux-user/arm/nwfpe/fpa11.h linux-user/flat.h linux-user/flatload.c linux-user/i386/target_syscall.h linux-user/ppc/target_syscall.h linux-user/sparc/target_syscall.h linux-user/syscall.c linux-user/syscall_defs.h linux-user/x86_64/target_syscall.h slirp/cksum.c slirp/if.c slirp/ip.h slirp/ip_icmp.c slirp/ip_icmp.h slirp/ip_input.c slirp/ip_output.c slirp/mbuf.c slirp/misc.c slirp/sbuf.c slirp/socket.c slirp/socket.h slirp/tcp_input.c slirp/tcpip.h slirp/tcp_output.c slirp/tcp_subr.c slirp/tcp_timer.c slirp/tftp.c slirp/udp.c slirp/udp.h target/cris/cpu.h target/cris/mmu.c target/cris/op_helper.c target/sh4/helper.c target/sh4/op_helper.c target/sh4/translate.c tcg/sparc/tcg-target.inc.c tests/tcg/cris/check_addo.c tests/tcg/cris/check_moveq.c tests/tcg/cris/check_swap.c tests/tcg/multiarch/test-mmap.c ui/vnc-enc-hextile-template.h ui/vnc-enc-zywrle.h util/envlist.c util/readline.c The following have only TABs: bsd-user/i386/target_signal.h bsd-user/sparc64/target_signal.h bsd-user/sparc64/target_syscall.h bsd-user/sparc/target_signal.h bsd-user/sparc/target_syscall.h bsd-user/x86_64/target_signal.h crypto/desrfb.c hw/audio/intel-hda-defs.h hw/core/uboot_image.h hw/sh4/sh7750_regnames.c hw/sh4/sh7750_regs.h include/hw/cris/etraxfs_dma.h linux-user/alpha/termbits.h linux-user/arm/nwfpe/fpopcode.h linux-user/arm/nwfpe/fpsr.h linux-user/arm/syscall_nr.h linux-user/arm/target_signal.h linux-user/cris/target_signal.h linux-user/i386/target_signal.h linux-user/linux_loop.h linux-user/m68k/target_signal.h linux-user/microblaze/target_signal.h linux-user/mips64/target_signal.h linux-user/mips/target_signal.h linux-user/mips/target_syscall.h linux-user/mips/termbits.h linux-user/ppc/target_signal.h linux-user/sh4/target_signal.h linux-user/sh4/termbits.h linux-user/sparc64/target_syscall.h linux-user/sparc/target_signal.h linux-user/x86_64/target_signal.h linux-user/x86_64/termbits.h pc-bios/optionrom/optionrom.h slirp/mbuf.h slirp/misc.h slirp/sbuf.h slirp/tcp.h slirp/tcp_timer.h slirp/tcp_var.h target/i386/svm.h target/sparc/asi.h target/xtensa/core-dc232b/xtensa-modules.inc.c target/xtensa/core-dc233c/xtensa-modules.inc.c target/xtensa/core-de212/core-isa.h target/xtensa/core-de212/xtensa-modules.inc.c target/xtensa/core-fsf/xtensa-modules.inc.c target/xtensa/core-sample_controller/core-isa.h target/xtensa/core-sample_controller/xtensa-modules.inc.c target/xtensa/core-test_kc705_be/core-isa.h target/xtensa/core-test_kc705_be/xtensa-modules.inc.c tests/tcg/cris/check_abs.c tests/tcg/cris/check_addc.c tests/tcg/cris/check_addcm.c tests/tcg/cris/check_addoq.c tests/tcg/cris/check_bound.c tests/tcg/cris/check_ftag.c tests/tcg/cris/check_int64.c tests/tcg/cris/check_lz.c tests/tcg/cris/check_openpf5.c tests/tcg/cris/check_sigalrm.c tests/tcg/cris/crisutils.h tests/tcg/cris/sys.c tests/tcg/i386/test-i386-ssse3.c ui/vgafont.h Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20181213223737.11793-3-pbonzini@redhat.com> Reviewed-by: Aleksandar Markovic <amarkovic@wavecomp.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Wainer dos Santos Moschetta <wainersm@redhat.com> Acked-by: Richard Henderson <richard.henderson@linaro.org> Acked-by: Eric Blake <eblake@redhat.com> Acked-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Stefan Markovic <smarkovic@wavecomp.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-12-14file-posix: Avoid aio_worker() for QEMU_AIO_IOCTLKevin Wolf
aio_worker() doesn't add anything interesting, it's only a useless indirection. Call the handler function directly instead. As we know that this handler function is only called from coroutine context and the coroutine stays around until the worker thread finishes, we can keep RawPosixAIOData on the stack. This was the last user of aio_worker(), so the function goes away now. Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-12-14file-posix: Switch to .bdrv_co_ioctlKevin Wolf
No real reason to keep using the callback based mechanism here when the rest of the file-posix driver is coroutine based. Changing it brings ioctls more in line with how other request types work. Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-12-14file-posix: Remove paio_submit_co()Kevin Wolf
The function is not used any more, remove it. Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-12-14file-posix: Avoid aio_worker() for QEMU_AIO_READ/WRITEKevin Wolf
aio_worker() doesn't add anything interesting, it's only a useless indirection. Call the handler function directly instead. As we know that this handler function is only called from coroutine context and the coroutine stays around until the worker thread finishes, we can keep RawPosixAIOData on the stack. Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-12-14file-posix: Move read/write operation logic out of aio_worker()Kevin Wolf
aio_worker() for reads and writes isn't boring enough yet. It still does some postprocessing for handling short reads and turning the result into the right return value. However, there is no reason why handle_aiocb_rw() couldn't do the same, and even without duplicating code between the read and write path. So move the code there. Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-12-14file-posix: Avoid aio_worker() for QEMU_AIO_FLUSHKevin Wolf
aio_worker() doesn't add anything interesting, it's only a useless indirection. Call the handler function directly instead. As we know that this handler function is only called from coroutine context and the coroutine stays around until the worker thread finishes, we can keep RawPosixAIOData on the stack. Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-12-14file-posix: Avoid aio_worker() for QEMU_AIO_DISCARDKevin Wolf
aio_worker() doesn't add anything interesting, it's only a useless indirection. Call the handler function directly instead. As we know that this handler function is only called from coroutine context and the coroutine stays around until the worker thread finishes, we can keep RawPosixAIOData on the stack. Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-12-14file-posix: Avoid aio_worker() for QEMU_AIO_WRITE_ZEROESKevin Wolf
aio_worker() doesn't add anything interesting, it's only a useless indirection. Call the handler function directly instead. As we know that this handler function is only called from coroutine context and the coroutine stays around until the worker thread finishes, we can keep RawPosixAIOData on the stack. Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-12-14file-posix: Avoid aio_worker() for QEMU_AIO_COPY_RANGEKevin Wolf
aio_worker() doesn't add anything interesting, it's only a useless indirection. Call the handler function directly instead. As we know that this handler function is only called from coroutine context and the coroutine stays around until the worker thread finishes, we can keep RawPosixAIOData on the stack. Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-12-14file-posix: Avoid aio_worker() for QEMU_AIO_TRUNCATEKevin Wolf
aio_worker() doesn't add anything interesting, it's only a useless indirection. Call the handler function directly instead. As we know that this handler function is only called from coroutine context and the coroutine stays around until the worker thread finishes, we can keep RawPosixAIOData on the stack. Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-12-14file-posix: Factor out raw_thread_pool_submit()Kevin Wolf
Getting the thread pool of the AioContext of a block node and scheduling some work in it is an operation that is already done twice, and we'll get more instances. Factor it out into a separate function. Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-12-14file-posix: Reorganise RawPosixAIODataKevin Wolf
RawPosixAIOData contains a lot of fields for several separate operations that are to be processed in a worker thread and that need different parameters. The struct is currently rather unorganised, with unions that cover some, but not all operations, and even one #define for field names instead of a union. Clean this up to have some common fields and a single union. As a side effect, on x86_64 the struct shrinks from 72 to 48 bytes. Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2018-11-19file-posix: Fix shared locks on reopen commitMax Reitz
s->locked_shared_perm is the set of bits locked in the file, which is the inverse of the permissions actually shared. So we need to pass them as they are to raw_apply_lock_bytes() instead of inverting them again. Reported-by: Alberto Garcia <berto@igalia.com> Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>