aboutsummaryrefslogtreecommitdiff
path: root/include/block
AgeCommit message (Collapse)Author
2017-03-17block: Always call bdrv_child_check_perm firstFam Zheng
bdrv_child_set_perm alone is not very usable because the caller must call bdrv_child_check_perm first. This is already encapsulated conveniently in bdrv_child_try_set_perm, so remove the other prototypes from the header and fix the one wrong caller, block/mirror.c. Signed-off-by: Fam Zheng <famz@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2017-03-07block: Fix error handling in bdrv_replace_in_backing_chain()Kevin Wolf
When adding an Error parameter, bdrv_replace_in_backing_chain() would become nothing more than a wrapper around change_parent_backing_link(). So make the latter public, renamed as bdrv_replace_node(), and remove bdrv_replace_in_backing_chain(). Most of the callers just remove a node from the graph that they just inserted, so they can use &error_abort, but completion of a mirror job with 'replaces' set can actually fail. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Fam Zheng <famz@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com>
2017-03-07block: Ignore multiple children in bdrv_check_update_perm()Kevin Wolf
change_parent_backing_link() will need to update multiple BdrvChild objects at once. Checking permissions reference by reference doesn't work because permissions need to be consistent only with all parents moved to the new child. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Fam Zheng <famz@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com>
2017-02-28block: Add Error parameter to bdrv_append()Kevin Wolf
Aborting on error in bdrv_append() isn't correct. This patch fixes it and lets the callers handle failures. Test case 085 needs a reference output update. This is caused by the reversed order of bdrv_set_backing_hd() and change_parent_backing_link() in bdrv_append(): When the backing file of the new node is set, the parent nodes are still pointing to the old top, so the backing blocker is now initialised with the node name rather than the BlockBackend name. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Acked-by: Fam Zheng <famz@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com>
2017-02-28block: Add Error parameter to bdrv_set_backing_hd()Kevin Wolf
Not all callers of bdrv_set_backing_hd() know for sure that attaching the backing file will be allowed by the permission system. Return the error from the function rather than aborting. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Acked-by: Fam Zheng <famz@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com>
2017-02-28commit: Add filter-node-name to block-commitKevin Wolf
Management tools need to be able to know about every node in the graph and need a way to address them. Changing the graph structure was okay because libvirt doesn't really manage the node level yet, but future libvirt versions need to deal with both new and old version of qemu. This new option to blockdev-commit allows the client to set a node-name for the automatically inserted filter driver, and at the same time serves as a witness for a future libvirt that this version of qemu does automatically insert a filter driver. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Acked-by: Fam Zheng <famz@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com>
2017-02-28mirror: Add filter-node-name to blockdev-mirrorKevin Wolf
Management tools need to be able to know about every node in the graph and need a way to address them. Changing the graph structure was okay because libvirt doesn't really manage the node level yet, but future libvirt versions need to deal with both new and old version of qemu. This new option to blockdev-mirror allows the client to set a node-name for the automatically inserted filter driver, and at the same time serves as a witness for a future libvirt that this version of qemu does automatically insert a filter driver. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Acked-by: Fam Zheng <famz@redhat.com>
2017-02-28blockjob: Factor out block_job_remove_all_bdrv()Kevin Wolf
In some cases, we want to remove op blockers on intermediate nodes before the whole block job transaction has completed (because they block restoring the final graph state during completion). Provide a function for this. The whole block job lifecycle is a bit messed up and it's hard to actually do all things in the right order, but I'll leave simplifying this for another day. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Acked-by: Fam Zheng <famz@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com>
2017-02-28block: BdrvChildRole.attach/detach() callbacksKevin Wolf
Backing files are somewhat special compared to other kinds of children because they are attached and detached using bdrv_set_backing_hd() rather than the normal set of functions, which does a few more things like setting backing blockers, toggling the BDRV_O_NO_BACKING flag, setting parent_bs->backing_file, etc. These special features are a reason why change_parent_backing_link() can't handle backing files yet. With abstracting the additional features into .attach/.detach callbacks, we get a step closer to a function that can actually deal with this. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Acked-by: Fam Zheng <famz@redhat.com>
2017-02-28blockjob: Add permissions to block_job_add_bdrv()Kevin Wolf
Block jobs don't actually do I/O through the the reference they create with block_job_add_bdrv(), but they might want to use the permisssion system to express what the block job does to intermediate nodes. This adds permissions to block_job_add_bdrv() to provide the means to request permissions. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Acked-by: Fam Zheng <famz@redhat.com>
2017-02-28block: Add BdrvChildRole.stay_at_nodeKevin Wolf
When the parents' child links are updated in bdrv_append() or bdrv_replace_in_backing_chain(), this should affect all child links of BlockBackends or other nodes, but not on child links held for other purposes (like for setting permissions). This patch allows to control the behaviour per BdrvChildRole. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Acked-by: Fam Zheng <famz@redhat.com>
2017-02-28block: Add BdrvChildRole.get_parent_desc()Kevin Wolf
For meaningful error messages in the permission system, we need to get some human-readable description of the parent of a BdrvChild. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Acked-by: Fam Zheng <famz@redhat.com>
2017-02-28blockjob: Add permissions to block_job_create()Kevin Wolf
This functions creates a BlockBackend internally, so the block jobs need to tell it what they want to do with the BB. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Acked-by: Fam Zheng <famz@redhat.com>
2017-02-28block: Allow error return in BlockDevOps.change_media_cb()Kevin Wolf
Some devices allow a media change between read-only and read-write media. They need to adapt the permissions in their .change_media_cb() implementation, which can fail. So add an Error parameter to the function. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Acked-by: Fam Zheng <famz@redhat.com>
2017-02-28block: Add BDRV_O_RESIZE for blk_new_open()Kevin Wolf
blk_new_open() is a convenience function that processes flags rather than QDict options as a simple way to just open an image file. In order to keep it convenient in the future, it must automatically request the necessary permissions. This can easily be inferred from the flags for read and write, but we need another flag that tells us whether to get the resize permission. We can't just always request it because that means that no block jobs can run on the resulting BlockBackend (which is something that e.g. qemu-img commit wants to do), but we also can't request it never because most of the .bdrv_create() implementations call blk_truncate(). The solution is to introduce another flag that is passed by all users that want to resize the image. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Acked-by: Fam Zheng <famz@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com>
2017-02-28vvfat: Implement .bdrv_child_perm()Kevin Wolf
vvfat is the last remaining driver that can have children, but doesn't implement .bdrv_child_perm() yet. The default handlers aren't suitable here, so let's implement a very simple driver-specific one that protects the internal child from being used by other users as good as our permissions permit. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Acked-by: Fam Zheng <famz@redhat.com>
2017-02-28block: Default .bdrv_child_perm() for format driversKevin Wolf
Almost all format drivers have the same characteristics as far as permissions are concerned: They have one or more children for storing their own data and, more importantly, metadata (can be written to and grow even without external write requests, must be protected against other writers and present consistent data) and optionally a backing file (this is just data, so like for a filter, it only depends on what the parent nodes need). This provides a default implementation that can be shared by most of our format drivers. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Acked-by: Fam Zheng <famz@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com>
2017-02-28block: Default .bdrv_child_perm() for filter driversKevin Wolf
Most filters need permissions related to read and write for their children, but only if the node has a parent that wants to use the same operation on the filter. The same is true for resize. This adds a default implementation that simply forwards all necessary permissions to all children of the node and leaves the other permissions unchanged. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Acked-by: Fam Zheng <famz@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com>
2017-02-28block: Involve block drivers in permission grantingKevin Wolf
In many cases, the required permissions of one node on its children depend on what its parents require from it. For example, the raw format or most filter drivers only need to request consistent reads if that's something that one of their parents wants. In order to achieve this, this patch introduces two new BlockDriver callbacks. The first one lets drivers first check (recursively) whether the requested permissions can be set; the second one actually sets the new permission bitmask. Also add helper functions that drivers can use in their implementation of the callbacks to update their permissions on a specific child. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Acked-by: Fam Zheng <famz@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com>
2017-02-28block: Let callers request permissions when attaching a child nodeKevin Wolf
When attaching a node as a child to a new parent, the required and shared permissions for this parent are checked against all other parents of the node now, and an error is returned if there is a conflict. This allows error returns to a function that previously always succeeded, and the same is true for quite a few callers and their callers. Converting all of them within the same patch would be too much, so for now everyone tells that they don't need any permissions and allow everyone else to do anything. This way we can use &error_abort initially and convert caller by caller to pass actual permission requirements and implement error handling. All these places are marked with FIXME comments and it will be the job of the next patches to clean them up again. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Acked-by: Fam Zheng <famz@redhat.com>
2017-02-28block: Add Error argument to bdrv_attach_child()Kevin Wolf
It will have to return an error soon, so prepare the callers for it. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Acked-by: Fam Zheng <famz@redhat.com>
2017-02-28block: Add op blocker permission constantsKevin Wolf
This patch defines the permission categories that will be used by the new op blocker system. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Acked-by: Fam Zheng <famz@redhat.com>
2017-02-24block: Add bdrv_new_open_driver()Kevin Wolf
This function allows to create more or less normal BlockDriverStates even for BlockDrivers that aren't globally registered (e.g. helper filters for block jobs). Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com>
2017-02-24block: Pass BdrvChild to bdrv_truncate()Kevin Wolf
Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com>
2017-02-21block: document fields protected by AioContext lockPaolo Bonzini
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Fam Zheng <famz@redhat.com> Reviewed-by: Daniel P. Berrange <berrange@redhat.com> Message-id: 20170213135235.12274-19-pbonzini@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2017-02-21aio-posix: partially inline aio_dispatch into aio_pollPaolo Bonzini
This patch prepares for the removal of unnecessary lockcnt inc/dec pairs. Extract the dispatching loop for file descriptor handlers into a new function aio_dispatch_handlers, and then inline aio_dispatch into aio_poll. aio_dispatch can now become void. Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Fam Zheng <famz@redhat.com> Reviewed-by: Daniel P. Berrange <berrange@redhat.com> Message-id: 20170213135235.12274-17-pbonzini@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2017-02-21aio: introduce aio_co_schedule and aio_co_wakePaolo Bonzini
aio_co_wake provides the infrastructure to start a coroutine on a "home" AioContext. It will be used by CoMutex and CoQueue, so that coroutines don't jump from one context to another when they go to sleep on a mutex or waitqueue. However, it can also be used as a more efficient alternative to one-shot bottom halves, and saves the effort of tracking which AioContext a coroutine is running on. aio_co_schedule is the part of aio_co_wake that starts a coroutine on a remove AioContext, but it is also useful to implement e.g. bdrv_set_aio_context callbacks. The implementation of aio_co_schedule is based on a lock-free multiple-producer, single-consumer queue. The multiple producers use cmpxchg to add to a LIFO stack. The consumer (a per-AioContext bottom half) grabs all items added so far, inverts the list to make it FIFO, and goes through it one item at a time until it's empty. The data structure was inspired by OSv, which uses it in the very code we'll "port" to QEMU for the thread-safe CoMutex. Most of the new code is really tests. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Fam Zheng <famz@redhat.com> Message-id: 20170213135235.12274-3-pbonzini@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2017-01-27block: Fix target variable of BLKSECTGET ioctlEric Farman
Commit 6f6071745bd0 ("raw-posix: Fetch max sectors for host block device") introduced a routine to call the kernel BLKSECTGET ioctl, which stores the result back to user space. However, the size of the data returned depends on the routine handling the ioctl. The (compat_)blkdev_ioctl returns a short, while sg_ioctl returns an int. Thus, on big-endian systems, we can find ourselves accidentally shifting the result to a much larger value. (On s390x, a short is 16 bits while an int is 32 bits.) Also, the two ioctl handlers return values in different scales (block returns sectors, while sg returns bytes), so some tweaking of the outputs is required such that hdev_get_max_transfer_length returns a value in a consistent set of units. Signed-off-by: Eric Farman <farman@linux.vnet.ibm.com> Message-Id: <20170120162527.66075-3-farman@linux.vnet.ibm.com> Reviewed-by: Fam Zheng <famz@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-16aio: document lockingPaolo Bonzini
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Fam Zheng <famz@redhat.com> Message-id: 20170112180800.21085-10-pbonzini@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2017-01-16aio: make ctx->list_lock a QemuLockCnt, subsuming ctx->walking_bhPaolo Bonzini
This will make it possible to walk the list of bottom halves without holding the AioContext lock---and in turn to call bottom half handlers without holding the lock. Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Fam Zheng <famz@redhat.com> Message-id: 20170112180800.21085-4-pbonzini@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2017-01-16aio: rename bh_lock to list_lockPaolo Bonzini
This will be used for AioHandlers too. There is going to be little or no contention, so it is better to reuse the same lock. Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Fam Zheng <famz@redhat.com> Message-id: 20170112180800.21085-2-pbonzini@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2017-01-16block: get rid of bdrv_io_unplugged_begin/endPaolo Bonzini
bdrv_io_plug and bdrv_io_unplug are only called (via their BlockBackend equivalents) after starting asynchronous I/O. bdrv_drain is not going to be called while they are running, because---even if a coroutine runs for some reason---it will only drain in the next iteration of the event loop through bdrv_co_yield_to_drain. So this mechanism is unnecessary, get rid of it. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-id: 20161129113334.605-1-pbonzini@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2017-01-09block: Rename raw-{posix,win32} to file-*.cEric Blake
These files deal with the file protocol, not the raw format (the file protocol is often used with other formats, and the raw format is not forced to use the file protocol). Rename things to make it a bit easier to follow. Suggested-by: Daniel P. Berrange <berrange@redhat.com> Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: John Snow <jsnow@redhat.com> Reviewed-by: Laszlo Ersek <lersek@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2017-01-03aio: self-tune polling timeStefan Hajnoczi
This patch is based on the algorithm for the kvm.ko halt_poll_ns parameter in Linux. The initial polling time is zero. If the event loop is woken up within the maximum polling time it means polling could be effective, so grow polling time. If the event loop is woken up beyond the maximum polling time it means polling is not effective, so shrink polling time. If the event loop makes progress within the current polling time then the sweet spot has been reached. This algorithm adjusts the polling time so it can adapt to variations in workloads. The goal is to reach the sweet spot while also recognizing when polling would hurt more than help. Two new trace events, poll_grow and poll_shrink, are added for observing polling time adjustment. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-id: 20161201192652.9509-13-stefanha@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2017-01-03aio: add .io_poll_begin/end() callbacksStefan Hajnoczi
The begin and end callbacks can be used to prepare for the polling loop and clean up when polling stops. Note that they may only be called once for multiple aio_poll() calls if polling continues to succeed. Once polling fails the end callback is invoked before aio_poll() resumes file descriptor monitoring. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-id: 20161201192652.9509-11-stefanha@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2017-01-03aio: add polling mode to AioContextStefan Hajnoczi
The AioContext event loop uses ppoll(2) or epoll_wait(2) to monitor file descriptors or until a timer expires. In cases like virtqueues, Linux AIO, and ThreadPool it is technically possible to wait for events via polling (i.e. continuously checking for events without blocking). Polling can be faster than blocking syscalls because file descriptors, the process scheduler, and system calls are bypassed. The main disadvantage to polling is that it increases CPU utilization. In classic polling configuration a full host CPU thread might run at 100% to respond to events as quickly as possible. This patch implements a timeout so we fall back to blocking syscalls if polling detects no activity. After the timeout no CPU cycles are wasted on polling until the next event loop iteration. The run_poll_handlers_begin() and run_poll_handlers_end() trace events are added to aid performance analysis and troubleshooting. If you need to know whether polling mode is being used, trace these events to find out. Note that the AioContext is now re-acquired before disabling notify_me in the non-polling case. This makes the code cleaner since notify_me was enabled outside the non-polling AioContext release region. This change is correct since it's safe to keep notify_me enabled longer (disabling is an optimization) but potentially causes unnecessary event_notifer_set() calls. I think the chance of performance regression is small here. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-id: 20161201192652.9509-4-stefanha@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2017-01-03aio: add AioPollFn and io_poll() interfaceStefan Hajnoczi
The new AioPollFn io_poll() argument to aio_set_fd_handler() and aio_set_event_handler() is used in the next patch. Keep this code change separate due to the number of files it touches. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-id: 20161201192652.9509-3-stefanha@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2017-01-03aio: add flag to skip fds to aio_dispatch()Stefan Hajnoczi
Polling mode will not call ppoll(2)/epoll_wait(2). Therefore we know there are no fds ready and should avoid looping over fd handlers in aio_dispatch(). Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-id: 20161201192652.9509-2-stefanha@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2016-12-22block: drop remaining legacy aio functions in commentYaowei Bai
Commit 87f68d318222563822b5c6b28192215fc4b4e441 (block: drop aio functions that operate on the main AioContext) drops qemu_aio_wait function references mostly while leaves these behind, clean up them. Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com> Message-Id: <1480566640-27264-3-git-send-email-baiyaowei@cmss.chinamobile.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-14blockjob: refactor backup_start as backup_job_createJohn Snow
Refactor backup_start as backup_job_create, which only creates the job, but does not automatically start it. The old interface, 'backup_start', is not kept in favor of limiting the number of nearly-identical interfaces that would have to be edited to keep up with QAPI changes in the future. Callers that wish to synchronously start the backup_block_job can instead just call block_job_start immediately after calling backup_job_create. Transactions are updated to use the new interface, calling block_job_start only during the .commit phase, which helps prevent race conditions where jobs may finish before we even finish building the transaction. This may happen, for instance, during empty block backup jobs. Reported-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Signed-off-by: John Snow <jsnow@redhat.com> Message-id: 1478587839-9834-6-git-send-email-jsnow@redhat.com Signed-off-by: Jeff Cody <jcody@redhat.com>
2016-11-14blockjob: add block_job_startJohn Snow
Instead of automatically starting jobs at creation time via backup_start et al, we'd like to return a job object pointer that can be started manually at later point in time. For now, add the block_job_start mechanism and start the jobs automatically as we have been doing, with conversions job-by-job coming in later patches. Of note: cancellation of unstarted jobs will perform all the normal cleanup as if the job had started, particularly abort and clean. The only difference is that we will not emit any events, because the job never actually started. Signed-off-by: John Snow <jsnow@redhat.com> Message-id: 1478587839-9834-5-git-send-email-jsnow@redhat.com Signed-off-by: Jeff Cody <jcody@redhat.com>
2016-11-14blockjob: add .start fieldJohn Snow
Add an explicit start field to specify the entrypoint. We already have ownership of the coroutine itself AND managing the lifetime of the coroutine, let's take control of creation of the coroutine, too. This will allow us to delay creation of the actual coroutine until we know we'll actually start a BlockJob in block_job_start. This avoids the sticky question of how to "un-create" a Coroutine that hasn't been started yet. Signed-off-by: John Snow <jsnow@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Message-id: 1478587839-9834-4-git-send-email-jsnow@redhat.com Signed-off-by: Jeff Cody <jcody@redhat.com>
2016-11-14blockjob: add .clean propertyJohn Snow
Cleaning up after we have deferred to the main thread but before the transaction has converged can be dangerous and result in deadlocks if the job cleanup invokes any BH polling loops. A job may attempt to begin cleaning up, but may induce another job to enter its cleanup routine. The second job, part of our same transaction, will block waiting for the first job to finish, so neither job may now make progress. To rectify this, allow jobs to register a cleanup operation that will always run regardless of if the job was in a transaction or not, and if the transaction job group completed successfully or not. Move sensitive cleanup to this callback instead which is guaranteed to be run only after the transaction has converged, which removes sensitive timing constraints from said cleanup. Furthermore, in future patches these cleanup operations will be performed regardless of whether or not we actually started the job. Therefore, cleanup callbacks should essentially confine themselves to undoing create operations, e.g. setup actions taken in what is now backup_start. Reported-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Signed-off-by: John Snow <jsnow@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Message-id: 1478587839-9834-3-git-send-email-jsnow@redhat.com Signed-off-by: Jeff Cody <jcody@redhat.com>
2016-11-03Merge remote-tracking branch 'remotes/bonzini/tags/for-upstream' into stagingStefan Hajnoczi
* NBD bugfix (Changlong) * NBD write zeroes support (Eric) * Memory backend fixes (Haozhong) * Atomics fix (Alex) * New AVX512 features (Luwei) * "make check" logging fix (Paolo) * Chardev refactoring fallout (Paolo) * Small checkpatch improvements (Paolo, Jeff) # gpg: Signature made Wed 02 Nov 2016 08:31:11 AM GMT # gpg: using RSA key 0xBFFBD25F78C7AE83 # gpg: Good signature from "Paolo Bonzini <bonzini@gnu.org>" # gpg: aka "Paolo Bonzini <pbonzini@redhat.com>" # Primary key fingerprint: 46F5 9FBD 57D6 12E7 BFD4 E2F7 7E15 100C CD36 69B1 # Subkey fingerprint: F133 3857 4B66 2389 866C 7682 BFFB D25F 78C7 AE83 * remotes/bonzini/tags/for-upstream: (30 commits) main-loop: Suppress I/O thread warning under qtest docs/rcu.txt: Fix minor typo vl: exit qemu on guest panic if -no-shutdown is not set checkpatch: allow spaces before parenthesis for 'coroutine_fn' x86: add AVX512_4VNNIW and AVX512_4FMAPS features slirp: fix CharDriver breakage qemu-char: do not forward events through the mux until QEMU has started nbd: Implement NBD_CMD_WRITE_ZEROES on client nbd: Implement NBD_CMD_WRITE_ZEROES on server nbd: Improve server handling of shutdown requests nbd: Refactor conversion to errno to silence checkpatch nbd: Support shorter handshake nbd: Less allocation during NBD_OPT_LIST nbd: Let client skip portions of server reply nbd: Let server know when client gives up negotiation nbd: Share common option-sending code in client nbd: Send message along with server NBD_REP_ERR errors nbd: Share common reply-sending code in server nbd: Rename struct nbd_request and nbd_reply nbd: Rename NbdClientSession to NBDClientSession ... Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2016-11-02nbd: Implement NBD_CMD_WRITE_ZEROES on serverEric Blake
Upstream NBD protocol recently added the ability to efficiently write zeroes without having to send the zeroes over the wire, along with a flag to control whether the client wants to allow a hole. Note that when it comes to requiring full allocation, vs. permitting optimizations, the NBD spec intentionally picked a different sense for the flag; the rules in qemu are: MAY_UNMAP == 0: must write zeroes MAY_UNMAP == 1: may use holes if reads will see zeroes while in NBD, the rules are: FLAG_NO_HOLE == 1: must write zeroes FLAG_NO_HOLE == 0: may use holes if reads will see zeroes In all cases, the 'may use holes' scenario is optional (the server need not use a hole, and must not use a hole if subsequent reads would not see zeroes). Signed-off-by: Eric Blake <eblake@redhat.com> Message-Id: <1476469998-28592-16-git-send-email-eblake@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-02nbd: Improve server handling of shutdown requestsEric Blake
NBD commit 6d34500b clarified how clients and servers are supposed to behave before closing a connection. It added NBD_REP_ERR_SHUTDOWN (for the server to announce it is about to go away during option haggling, so the client should quit sending NBD_OPT_* other than NBD_OPT_ABORT) and ESHUTDOWN (for the server to announce it is about to go away during transmission, so the client should quit sending NBD_CMD_* other than NBD_CMD_DISC). It also clarified that NBD_OPT_ABORT gets a reply, while NBD_CMD_DISC does not. This patch merely adds the missing reply to NBD_OPT_ABORT and teaches the client to recognize server errors. Actually teaching the server to send NBD_REP_ERR_SHUTDOWN or ESHUTDOWN would require knowing that the server has been requested to shut down soon (maybe we could do that by installing a SIGINT handler in qemu-nbd, which transitions from RUNNING to a new state that waits for the client to react, rather than just out-right quitting - but that's a bigger task for another day). Signed-off-by: Eric Blake <eblake@redhat.com> Message-Id: <1476469998-28592-15-git-send-email-eblake@redhat.com> [Move dummy ESHUTDOWN to include/qemu/osdep.h. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-02nbd: Support shorter handshakeEric Blake
The NBD Protocol allows the server and client to mutually agree on a shorter handshake (omit the 124 bytes of reserved 0), via the server advertising NBD_FLAG_NO_ZEROES and the client acknowledging with NBD_FLAG_C_NO_ZEROES (only possible in newstyle, whether or not it is fixed newstyle). It doesn't shave much off the wire, but we might as well implement it. Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Alex Bligh <alex@alex.org.uk> Message-Id: <1476469998-28592-13-git-send-email-eblake@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-02nbd: Share common option-sending code in clientEric Blake
Rather than open-coding each option request, it's easier to have common helper functions do the work. That in turn requires having convenient packed types for handling option requests and replies. Signed-off-by: Eric Blake <eblake@redhat.com> Message-Id: <1476469998-28592-9-git-send-email-eblake@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-02nbd: Rename struct nbd_request and nbd_replyEric Blake
Our coding convention prefers CamelCase names, and we already have other existing structs with NBDFoo naming. Let's be consistent, before later patches add even more structs. Signed-off-by: Eric Blake <eblake@redhat.com> Message-Id: <1476469998-28592-6-git-send-email-eblake@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-02nbd: Treat flags vs. command type as separate fieldsEric Blake
Current upstream NBD documents that requests have a 16-bit flags, followed by a 16-bit type integer; although older versions mentioned only a 32-bit field with masking to find flags. Since the protocol is in network order (big-endian over the wire), the ABI is unchanged; but dealing with the flags as a separate field rather than masking will make it easier to add support for upcoming NBD extensions that increase the number of both flags and commands. Improve some comments in nbd.h based on the current upstream NBD protocol (https://github.com/yoe/nbd/blob/master/doc/proto.md), and touch some nearby code to keep checkpatch.pl happy. Signed-off-by: Eric Blake <eblake@redhat.com> Message-Id: <1476469998-28592-3-git-send-email-eblake@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>