aboutsummaryrefslogtreecommitdiff
path: root/migration/migration.c
AgeCommit message (Collapse)Author
2018-12-18qmp hmp: Make system_wakeup check wake-up support and run stateDaniel Henrique Barboza
The qmp/hmp command 'system_wakeup' is simply a direct call to 'qemu_system_wakeup_request' from vl.c. This function verifies if runstate is SUSPENDED and if the wake up reason is valid before proceeding. However, no error or warning is thrown if any of those pre-requirements isn't met. There is no way for the caller to differentiate between a successful wakeup or an error state caused when trying to wake up a guest that wasn't suspended. This means that system_wakeup is silently failing, which can be considered a bug. Adding error handling isn't an API break in this case - applications that didn't check the result will remain broken, the ones that check it will have a chance to deal with it. Adding to that, the commit before previous created a new QMP API called query-current-machine, with a new flag called wakeup-suspend-support, that indicates if the guest has the capability of waking up from suspended state. Although such guest will never reach SUSPENDED state and erroring it out in this scenario would suffice, it is more informative for the user to differentiate between a failure because the guest isn't suspended versus a failure because the guest does not have support for wake up at all. All this considered, this patch changes qmp_system_wakeup to check if the guest is capable of waking up from suspend, and if it is suspended. After this patch, this is the output of system_wakeup in a guest that does not have wake-up from suspend support (ppc64): (qemu) system_wakeup wake-up from suspend is not supported by this guest (qemu) And this is the output of system_wakeup in a x86 guest that has the support but isn't suspended: (qemu) system_wakeup Unable to wake up: guest is not in suspended state (qemu) Reported-by: Balamuruhan S <bala24@linux.vnet.ibm.com> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> Message-Id: <20181205194701.17836-4-danielhb413@gmail.com> Reviewed-by: Markus Armbruster <armbru@redhat.com> Acked-by: Eduardo Habkost <ehabkost@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Markus Armbruster <armbru@redhat.com>
2018-11-21migration/migration.c: Add COLO dependency checksZhang Chen
Current COLO mode(independent disk mode) need replication module work together. Suggested by Dr. David Alan Gilbert <dgilbert@redhat.com>. Signed-off-by: Zhang Chen <chen.zhang@intel.com> Message-Id: <20181114190912.7242-1-chen.zhang@intel.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2018-10-31migration: avoid segmentfault when take a snapshot of a VM which being migratedJia Lina
During an active background migration, snapshot will trigger a segmentfault. As snapshot clears the "current_migration" struct and updates "to_dst_file" before it finds out that there is a migration task, Migration accesses the null pointer in "current_migration" struct and qemu crashes eventually. Signed-off-by: Jia Lina <jialina01@baidu.com> Signed-off-by: Chai Wen <chaiwen@baidu.com> Signed-off-by: Zhang Yu <zhangyu31@baidu.com> Message-Id: <20181026083620.10172-1-jialina01@baidu.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2018-10-23Merge remote-tracking branch 'remotes/armbru/tags/pull-error-2018-10-22' ↵Peter Maydell
into staging Error reporting patches for 2018-10-22 # gpg: Signature made Mon 22 Oct 2018 13:20:23 BST # gpg: using RSA key 3870B400EB918653 # gpg: Good signature from "Markus Armbruster <armbru@redhat.com>" # gpg: aka "Markus Armbruster <armbru@pond.sub.org>" # Primary key fingerprint: 354B C8B3 D7EB 2A6B 6867 4E5F 3870 B400 EB91 8653 * remotes/armbru/tags/pull-error-2018-10-22: (40 commits) error: Drop bogus "use error_setg() instead" admonitions vpc: Fail open on bad header checksum block: Clean up bdrv_img_create()'s error reporting vl: Simplify call of parse_name() vl: Fix exit status for -drive format=help blockdev: Convert drive_new() to Error vl: Assert drive_new() does not fail in default_drive() fsdev: Clean up error reporting in qemu_fsdev_add() spice: Clean up error reporting in add_channel() tpm: Clean up error reporting in tpm_init_tpmdev() numa: Clean up error reporting in parse_numa() vnc: Clean up error reporting in vnc_init_func() ui: Convert vnc_display_init(), init_keyboard_layout() to Error ui/keymaps: Fix handling of erroneous include files vl: Clean up error reporting in device_init_func() vl: Clean up error reporting in parse_fw_cfg() vl: Clean up error reporting in mon_init_func() vl: Clean up error reporting in machine_set_property() vl: Clean up error reporting in chardev_init_func() qom: Clean up error reporting in user_creatable_add_opts_foreach() ... Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2018-10-19error: Fix use of error_prepend() with &error_fatal, &error_abortMarkus Armbruster
From include/qapi/error.h: * Pass an existing error to the caller with the message modified: * error_propagate(errp, err); * error_prepend(errp, "Could not frobnicate '%s': ", name); Fei Li pointed out that doing error_propagate() first doesn't work well when @errp is &error_fatal or &error_abort: the error_prepend() is never reached. Since I doubt fixing the documentation will stop people from getting it wrong, introduce error_propagate_prepend(), in the hope that it lures people away from using its constituents in the wrong order. Update the instructions in error.h accordingly. Convert existing error_prepend() next to error_propagate to error_propagate_prepend(). If any of these get reached with &error_fatal or &error_abort, the error messages improve. I didn't check whether that's the case anywhere. Cc: Fei Li <fli@suse.com> Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Message-Id: <20181017082702.5581-2-armbru@redhat.com>
2018-10-19COLO: Load dirty pages into SVM's RAM cache firstlyZhang Chen
We should not load PVM's state directly into SVM, because there maybe some errors happen when SVM is receving data, which will break SVM. We need to ensure receving all data before load the state into SVM. We use an extra memory to cache these data (PVM's ram). The ram cache in secondary side is initially the same as SVM/PVM's memory. And in the process of checkpoint, we cache the dirty pages of PVM into this ram cache firstly, so this ram cache always the same as PVM's memory at every checkpoint, then we flush this cached ram to SVM after we receive all PVM's state. Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com> Signed-off-by: Zhang Chen <zhangckid@gmail.com> Signed-off-by: Zhang Chen <chen.zhang@intel.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com>
2018-10-19COLO: Remove colo_state migration structZhang Chen
We need to know if migration is going into COLO state for incoming side before start normal migration. Instead by using the VMStateDescription to send colo_state from source side to destination side, we use MIG_CMD_ENABLE_COLO to indicate whether COLO is enabled or not. Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com> Signed-off-by: Zhang Chen <zhangckid@gmail.com> Signed-off-by: Zhang Chen <chen.zhang@intel.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com>
2018-10-19COLO: Add block replication into colo processZhang Chen
Make sure master start block replication after slave's block replication started. Besides, we need to activate VM's blocks before goes into COLO state. Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com> Signed-off-by: Zhang Chen <zhangckid@gmail.com> Signed-off-by: Zhang Chen <chen.zhang@intel.com> Signed-off-by: Jason Wang <jasowang@redhat.com>
2018-10-19COLO: integrate colo compare with colo frameZhang Chen
For COLO FT, both the PVM and SVM run at the same time, only sync the state while it needs. So here, let SVM runs while not doing checkpoint, change DEFAULT_MIGRATE_X_CHECKPOINT_DELAY to 200*100. Besides, we forgot to release colo_checkpoint_semd and colo_delay_timer, fix them here. Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com> Signed-off-by: Zhang Chen <zhangckid@gmail.com> Signed-off-by: Zhang Chen <chen.zhang@intel.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com>
2018-09-26migration: fix QEMUFile leakMarc-André Lureau
Spotted by ASAN while running: $ tests/migration-test -p /x86_64/migration/postcopy/recovery ================================================================= ==18034==ERROR: LeakSanitizer: detected memory leaks Direct leak of 33864 byte(s) in 1 object(s) allocated from: #0 0x7f3da7f31e50 in calloc (/lib64/libasan.so.5+0xeee50) #1 0x7f3da644441d in g_malloc0 (/lib64/libglib-2.0.so.0+0x5241d) #2 0x55af9db15440 in qemu_fopen_channel_input /home/elmarco/src/qemu/migration/qemu-file-channel.c:183 #3 0x55af9db15413 in channel_get_output_return_path /home/elmarco/src/qemu/migration/qemu-file-channel.c:159 #4 0x55af9db0d4ac in qemu_file_get_return_path /home/elmarco/src/qemu/migration/qemu-file.c:78 #5 0x55af9dad5e4f in open_return_path_on_source /home/elmarco/src/qemu/migration/migration.c:2295 #6 0x55af9dadb3bf in migrate_fd_connect /home/elmarco/src/qemu/migration/migration.c:3111 #7 0x55af9dae1bf3 in migration_channel_connect /home/elmarco/src/qemu/migration/channel.c:91 #8 0x55af9daddeca in socket_outgoing_migration /home/elmarco/src/qemu/migration/socket.c:108 #9 0x55af9e13d3db in qio_task_complete /home/elmarco/src/qemu/io/task.c:158 #10 0x55af9e13ca03 in qio_task_thread_result /home/elmarco/src/qemu/io/task.c:89 #11 0x7f3da643b1ca in g_idle_dispatch gmain.c:5535 Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com> Message-Id: <20180925092245.29565-1-marcandre.lureau@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2018-09-26migration: show the statistics of compressionXiao Guangrong
Currently, it includes: pages: amount of pages compressed and transferred to the target VM busy: amount of count that no free thread to compress data busy-rate: rate of thread busy compressed-size: amount of bytes after compression compression-rate: rate of compressed size Reviewed-by: Juan Quintela <quintela@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com> Message-Id: <20180906070101.27280-3-xiaoguangrong@tencent.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2018-08-28qapi: Drop qapi_event_send_FOO()'s Error ** argumentPeter Xu
The generated qapi_event_send_FOO() take an Error ** argument. They can't actually fail, because all they do with the argument is passing it to functions that can't fail: the QObject output visitor, and the @qmp_emit callback, which is either monitor_qapi_event_queue() or event_test_emit(). Drop the argument, and pass &error_abort to the QObject output visitor and @qmp_emit instead. Suggested-by: Eric Blake <eblake@redhat.com> Suggested-by: Markus Armbruster <armbru@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20180815133747.25032-4-peterx@redhat.com> Reviewed-by: Markus Armbruster <armbru@redhat.com> [Commit message rewritten, update to qapi-code-gen.txt corrected] Signed-off-by: Markus Armbruster <armbru@redhat.com>
2018-08-22migration: do not wait for free threadXiao Guangrong
Instead of putting the main thread to sleep state to wait for free compression thread, we can directly post it out as normal page that reduces the latency and uses CPUs more efficiently A parameter, compress-wait-thread, is introduced, it can be enabled if the user really wants the old behavior Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>
2018-08-22migration: poll the cm event for destination qemuLidong Chen
The destination qemu only poll the comp_channel->fd in qemu_rdma_wait_comp_channel. But when source qemu disconnnect the rdma connection, the destination qemu should be notified. Signed-off-by: Lidong Chen <lidongchen@tencent.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>
2018-08-22migration: implement bi-directional RDMA QIOChannelLidong Chen
This patch implements bi-directional RDMA QIOChannel. Because different threads may access RDMAQIOChannel currently, this patch use RCU to protect it. Signed-off-by: Lidong Chen <lidongchen@tencent.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>
2018-08-22migrate/cpu-throttle: Add max-cpu-throttle migration parameterLi Qiang
Currently, the default maximum CPU throttle for migration is 99(CPU_THROTTLE_PCT_MAX). This is too big and can make a remarkable performance effect for the guest. We see a lot of packets latency exceed 500ms when the CPU_THROTTLE_PCT_MAX reached. This patch set adds a new max-cpu-throttle parameter to limit the CPU throttle. Signed-off-by: Li Qiang <liq3ea@gmail.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>
2018-07-24migration: fix duplicate initialization for expected_downtime and cleanup_bhLidong Chen
migrate_fd_connect duplicate initialize expected_downtime and cleanup_bh. Signed-off-by: Lidong Chen <lidongchen@tencent.com> Message-Id: <1532434585-14732-2-git-send-email-lidongchen@tencent.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2018-07-24migration: disallow recovery for release-ramPeter Xu
Postcopy recovery won't work well with release-ram capability since release-ram will drop the page buffer as long as the page is put into the send buffer. So if there is a network failure happened, any page buffers that have not yet reached the destination VM but have already been sent from the source VM will be lost forever. Let's refuse the client from resuming such a postcopy migration. Luckily release-ram was designed to only be used when src and destination VMs are on the same host, so it should be fine. Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20180723123305.24792-3-peterx@redhat.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2018-07-24migrate: Fix cancelling state warningDr. David Alan Gilbert
We've been getting the warning: migration_iteration_finish: Unknown ending state 2 on a cancel. I think that's originally due to 39b9e17905c; although I've only seen the warning, I think that in some cases that we could find the VM stays paused after a cancel where it should restart. Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Message-Id: <20180719092257.12703-1-dgilbert@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2018-07-10migration: show pause/recover state on dst hostPeter Xu
These two states will be missing when doing "query-migrate" on destination VM. Add these states so that we can get the query results as expected. Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20180710091902.28780-5-peterx@redhat.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2018-07-10migration: unify incoming processingPeter Xu
This is the 2nd patch to unbreak postcopy recovery. Let's unify the migration_incoming_process() call at a single place rather than calling it in connection setup codes. This fixes a problem that we will go into incoming migration procedure even if we are trying to recovery from a paused postcopy migration. Fixes: 36c2f8be2c ("migration: Delay start of migration main routines") Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20180627132246.5576-5-peterx@redhat.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2018-07-10migration: unbreak postcopy recoveryPeter Xu
The whole postcopy recovery logic was accidentally broken. We need to fix it in two steps. This is the first step that we should do the recovery when needed. It was bypassed before after commit 36c2f8be2c. Introduce postcopy_try_recovery() helper for the postcopy recovery logic. Call it both in migration_fd_process_incoming() and migration_ioc_process_incoming(). Fixes: 36c2f8be2c ("migration: Delay start of migration main routines") Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20180627132246.5576-4-peterx@redhat.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2018-07-10migration: move income process out of multifdPeter Xu
Move the call to migration_incoming_process() out of multifd code. It's a bit strange that we can migration generic calls in multifd code. Instead, let multifd_recv_new_channel() return a boolean showing whether it's ready to continue the incoming migration. Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20180627132246.5576-3-peterx@redhat.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2018-06-27migration: fix crash in when incoming client channel setup failsDaniel P. Berrangé
The way we determine if we can start the incoming migration was changed to use migration_has_all_channels() in: commit 428d89084c709e568f9cd301c2f6416a54c53d6d Author: Juan Quintela <quintela@redhat.com> Date: Mon Jul 24 13:06:25 2017 +0200 migration: Create migration_has_all_channels This method in turn calls multifd_recv_all_channels_created() which is hardcoded to always return 'true' when multifd is not in use. This is a latent bug... ...activated in a following commit where that return result ends up acting as the flag to indicate whether it is possible to start processing the migration: commit 36c2f8be2c4eb0003ac77a14910842b7ddd7337e Author: Juan Quintela <quintela@redhat.com> Date: Wed Mar 7 08:40:52 2018 +0100 migration: Delay start of migration main routines This means that if channel initialization fails with normal migration, it'll never notice and attempt to start the incoming migration regardless and crash on a NULL pointer. This can be seen, for example, if a client connects to a server requiring TLS, but has an invalid x509 certificate: qemu-system-x86_64: The certificate hasn't got a known issuer qemu-system-x86_64: migration/migration.c:386: process_incoming_migration_co: Assertion `mis->from_src_file' failed. #0 0x00007fffebd24f2b in raise () at /lib64/libc.so.6 #1 0x00007fffebd0f561 in abort () at /lib64/libc.so.6 #2 0x00007fffebd0f431 in _nl_load_domain.cold.0 () at /lib64/libc.so.6 #3 0x00007fffebd1d692 in () at /lib64/libc.so.6 #4 0x0000555555ad027e in process_incoming_migration_co (opaque=<optimized out>) at migration/migration.c:386 #5 0x0000555555c45e8b in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at util/coroutine-ucontext.c:116 #6 0x00007fffebd3a6a0 in __start_context () at /lib64/libc.so.6 #7 0x0000000000000000 in () To handle the non-multifd case, we check whether mis->from_src_file is non-NULL. With this in place, the migration server drops the rejected client and stays around waiting for another, hopefully valid, client to arrive. Signed-off-by: Daniel P. Berrangé <berrange@redhat.com> Message-Id: <20180619163552.18206-1-berrange@redhat.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>
2018-06-27migration: Create ram_save_multifd_pageJuan Quintela
The function still don't use multifd, but we have simplified ram_save_page, xbzrle and RDMA stuff is gone. We have added a new counter. Signed-off-by: Juan Quintela <quintela@redhat.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> -- Add last_page parameter Add commets for done and address Remove multifd field, it is the same than normal pages Merge next patch, now we send multiple pages at a time Remove counter for multifd pages, it is identical to normal pages Use iovec's instead of creating the equivalent. Clear memory used by pages (dave) Use g_new0(danp) define MULTIFD_CONTINUE now pages member is a pointer Fix off-by-one in number of pages in one packet Remove RAM_SAVE_FLAG_MULTIFD_PAGE s/multifd_pages_t/MultiFDPages_t/ add comment explaining what it means
2018-06-27migration: Create multifd_bytes ram_counterJuan Quintela
This will include how many bytes they are sent through multifd. Signed-off-by: Juan Quintela <quintela@redhat.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2018-06-27migration: Abstract the number of bytes sentJuan Quintela
Right now we use the "position" inside the QEMUFile, but things like RDMA already do weird things to be able to maintain that counter right, and multifd will have some similar problems. Signed-off-by: Juan Quintela <quintela@redhat.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2018-06-27migration: Calculate mbps only during transfer timeJuan Quintela
We used to include in this calculation the setup time, but that can be quite big in rdma or multifd. Signed-off-by: Juan Quintela <quintela@redhat.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2018-06-15migration: calculate expected_downtime with ram_bytes_remaining()Balamuruhan S
expected_downtime value is not accurate with dirty_pages_rate * page_size, using ram_bytes_remaining() would yeild it resonable. consider to read the remaining ram just after having updated the dirty pages count later migration_bitmap_sync_range() in migration_bitmap_sync() and reuse the `remaining` field in ram_counters to hold ram_bytes_remaining() for calculating expected_downtime. Reported-by: Michael Roth <mdroth@linux.vnet.ibm.com> Signed-off-by: Balamuruhan S <bala24@linux.vnet.ibm.com> Signed-off-by: Laurent Vivier <lvivier@redhat.com> Message-Id: <20180612085009.17594-2-bala24@linux.vnet.ibm.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2018-06-15migration: Wake rate limiting for urgent requestsDr. David Alan Gilbert
Rate limiting sleeps the migration thread for a while when it runs out of bandwidth; but sometimes we want to wake up to get on with something more urgent (like a postcopy request). Here we use a semaphore with a timedwait instead of a simple sleep; Incrementing the sempahore will wake it up sooner. Anything that consumes these urgent events must decrement the sempahore. Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Message-Id: <20180613102642.23995-3-dgilbert@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2018-06-15migration/postcopy: Add max-postcopy-bandwidth parameterDr. David Alan Gilbert
Limit the background transfer bandwidth during the postcopy phase to the value set on this new parameter. The default, 0, corresponds to the existing behaviour which is unlimited bandwidth. Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Message-Id: <20180613102642.23995-2-dgilbert@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2018-06-04migration: Don't activate block devices if using -SDr. David Alan Gilbert
Activating the block devices causes the locks to be taken on the backing file. If we're running with -S and the destination libvirt hasn't started the destination with 'cont', it's expecting the locks are still untaken. Don't activate the block devices if we're not going to autostart the VM; 'cont' already will do that anyway. This change is tied to the new migration capability 'late-block-activate' that defaults to off, keeping the old behaviour by default. bz: https://bugzilla.redhat.com/show_bug.cgi?id=1560854 Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>
2018-06-04migration: introduce decompress-error-checkXiao Guangrong
QEMU 3.0 enables strict check for compression & decompression to make the migration more robust, that depends on the source to fix the internal design which triggers the unexpected error conditions To make it work for migrating old version QEMU to 2.13 QEMU, we introduce this parameter to disable the error check on the destination which is the default behavior of the machine type which is older than 2.13, alternately, the strict check can be enabled explicitly as followings: -M pc-q35-2.11 -global migration.decompress-error-check=true Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>
2018-05-15migration/qmp: add command migrate-pausePeter Xu
It pauses an ongoing migration. Currently it only supports postcopy. Note that this command will work on either side of the migration. Basically when we trigger this on one side, it'll interrupt the other side as well since the other side will get notified on the disconnect event. However, it's still possible that the other side is not notified, for example, when the network is totally broken, or due to some firewall configuration changes. In that case, we will also need to run the same command on the other side so both sides will go into the paused state. Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20180502104740.12123-24-peterx@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> --- s/2.12/2.13/
2018-05-15migration: introduce lock for to_dst_filePeter Xu
Let's introduce a lock for that QEMUFile since we are going to operate on it in multiple threads. Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20180502104740.12123-23-peterx@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>
2018-05-15qmp/migration: new command migrate-recoverPeter Xu
The first allow-oob=true command. It's used on destination side when the postcopy migration is paused and ready for a recovery. After execution, a new migration channel will be established for postcopy to continue. Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20180502104740.12123-21-peterx@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> --- s/2.12/2.13/
2018-05-15migration: init dst in migration_object_init tooPeter Xu
Though we may not need it, now we init both the src/dst migration objects in migration_object_init() so that even incoming migration object would be thread safe (it was not). Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20180502104740.12123-20-peterx@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>
2018-05-15migration: final handshake for the resumePeter Xu
Finish the last step to do the final handshake for the recovery. First source sends one MIG_CMD_RESUME to dst, telling that source is ready to resume. Then, dest replies with MIG_RP_MSG_RESUME_ACK to source, telling that dest is ready to resume (after switch to postcopy-active state). When source received the RESUME_ACK, it switches its state to postcopy-active, and finally the recovery is completed. Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20180502104740.12123-19-peterx@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>
2018-05-15migration: synchronize dirty bitmap for resumePeter Xu
This patch implements the first part of core RAM resume logic for postcopy. ram_resume_prepare() is provided for the work. When the migration is interrupted by network failure, the dirty bitmap on the source side will be meaningless, because even the dirty bit is cleared, it is still possible that the sent page was lost along the way to destination. Here instead of continue the migration with the old dirty bitmap on source, we ask the destination side to send back its received bitmap, then invert it to be our initial dirty bitmap. The source side send thread will issue the MIG_CMD_RECV_BITMAP requests, once per ramblock, to ask for the received bitmap. On destination side, MIG_RP_MSG_RECV_BITMAP will be issued, along with the requested bitmap. Data will be received on the return-path thread of source, and the main migration thread will be notified when all the ramblock bitmaps are synchronized. Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20180502104740.12123-17-peterx@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>
2018-05-15migration: introduce SaveVMHandlers.resume_preparePeter Xu
This is hook function to be called when a postcopy migration wants to resume from a failure. For each module, it should provide its own recovery logic before we switch to the postcopy-active state. Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20180502104740.12123-16-peterx@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>
2018-05-15migration: new message MIG_RP_MSG_RESUME_ACKPeter Xu
Creating new message to reply for MIG_CMD_POSTCOPY_RESUME. One uint32_t is used as payload to let the source know whether destination is ready to continue the migration. Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20180502104740.12123-15-peterx@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>
2018-05-15migration: new message MIG_RP_MSG_RECV_BITMAPPeter Xu
Introducing new return path message MIG_RP_MSG_RECV_BITMAP to send received bitmap of ramblock back to source. This is the reply message of MIG_CMD_RECV_BITMAP, it contains not only the header (including the ramblock name), and it was appended with the whole ramblock received bitmap on the destination side. When the source receives such a reply message (MIG_RP_MSG_RECV_BITMAP), it parses it, convert it to the dirty bitmap by inverting the bits. One thing to mention is that, when we send the recv bitmap, we are doing these things in extra: - converting the bitmap to little endian, to support when hosts are using different endianess on src/dst. - do proper alignment for 8 bytes, to support when hosts are using different word size (32/64 bits) on src/dst. Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20180502104740.12123-13-peterx@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>
2018-05-15migration: wakeup dst ram-load-thread for recoverPeter Xu
On the destination side, we cannot wake up all the threads when we got reconnected. The first thing to do is to wake up the main load thread, so that we can continue to receive valid messages from source again and reply when needed. At this point, we switch the destination VM state from postcopy-paused back to postcopy-recover. Now we are finally ready to do the resume logic. Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20180502104740.12123-11-peterx@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>
2018-05-15migration: new state "postcopy-recover"Peter Xu
Introducing new migration state "postcopy-recover". If a migration procedure is paused and the connection is rebuilt afterward successfully, we'll switch the source VM state from "postcopy-paused" to the new state "postcopy-recover", then we'll do the resume logic in the migration thread (along with the return path thread). This patch only do the state switch on source side. Another following up patch will handle the state switching on destination side using the same status bit. Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20180502104740.12123-10-peterx@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> --- s/2.11/2.13/
2018-05-15migration: rebuild channel on sourcePeter Xu
This patch detects the "resume" flag of migration command, rebuild the channels only if the flag is set. Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20180502104740.12123-9-peterx@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>
2018-05-15qmp: hmp: add migrate "resume" optionPeter Xu
It will be used when we want to resume one paused migration. Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20180502104740.12123-8-peterx@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> --- s/2.12/2.13/
2018-05-15migration: allow fault thread to pausePeter Xu
Allows the fault thread to stop handling page faults temporarily. When network failure happened (and if we expect a recovery afterwards), we should not allow the fault thread to continue sending things to source, instead, it should halt for a while until the connection is rebuilt. When the dest main thread noticed the failure, it kicks the fault thread to switch to pause state. Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20180502104740.12123-7-peterx@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>
2018-05-15migration: allow src return path to pausePeter Xu
Let the thread pause for network issues. Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20180502104740.12123-6-peterx@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>
2018-05-15migration: allow dst vm pause on postcopyPeter Xu
When there is IO error on the incoming channel (e.g., network down), instead of bailing out immediately, we allow the dst vm to switch to the new POSTCOPY_PAUSE state. Currently it is still simple - it waits the new semaphore, until someone poke it for another attempt. One note is that here on ram loading thread we cannot detect the POSTCOPY_ACTIVE state, but we need to detect the more specific POSTCOPY_INCOMING_RUNNING state, to make sure we have already loaded all the device states. Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20180502104740.12123-5-peterx@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>
2018-05-15migration: implement "postcopy-pause" src logicPeter Xu
Now when network down for postcopy, the source side will not fail the migration. Instead we convert the status into this new paused state, and we will try to wait for a rescue in the future. If a recovery is detected, migration_thread() will reset its local variables to prepare for that. Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20180502104740.12123-4-peterx@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com>