author     Peter Xu <peterx@redhat.com>            2023-10-17 16:26:30 -0400
committer  Juan Quintela <quintela@redhat.com>     2023-11-02 11:35:03 +0100
commit     f8c543e808f20ba936d94cfa5fc592627d8d100c
tree       c7c2bf15e8aa8473ddc1bb45fc4cbd34d6968b56 /migration/ram.c
parent     7aa6070d09c5a6c83490599a3c564c64d7e2520a
migration: Allow network to fail even during recovery
Normally the postcopy recover phase should only exist for a very short
period: it is the window in which QEMU tries to recover from an interrupted
postcopy migration, carrying out the handshake that moves the state from
PAUSED -> RECOVER -> POSTCOPY_ACTIVE again.
The RECOVER phase should therefore be very short; it happens right after the
admin specifies a new, working network link for the source QEMU to reconnect
to the destination QEMU.
However, there can still be cases where the channel breaks within this small
RECOVER window.
When that happens, the current code gives the source QEMU no way to get kicked
out of the RECOVER stage, and no way to retry the recovery over another
channel once one is established.
This patch allows the RECOVER phase to fail too.  The code is mostly ready
for that; only a few small pieces are missing, e.g. properly kicking the main
migration thread out of its sleep on rp_sem when we find that we are in the
RECOVER stage.  When that happens, the RECOVER attempt itself fails and the
state rolls back to PAUSED, so the user can retry another round of recovery.
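As an aside, below is a minimal generic-C sketch of the "flag the error and
kick the sleeper" pattern described above.  It uses POSIX semaphores and
hypothetical names (RecoverState, rp_fail_and_kick, rp_wait); it only
illustrates the idea and is not the QEMU code, where the equivalent logic
sits around rp_sem and migration_rp_wait() in migration/migration.c.

    /*
     * Illustrative sketch only (hypothetical names, not QEMU code):
     * the return-path thread records the failure and posts the
     * semaphore, so the main thread wakes up and can fail RECOVER
     * instead of sleeping forever.
     */
    #include <semaphore.h>
    #include <stdbool.h>

    typedef struct {
        sem_t rp_sem;      /* main thread sleeps here waiting for the rp thread */
        bool  rp_error;    /* set by the return-path thread when it fails */
    } RecoverState;

    /* Return-path thread side: flag the error, then wake the sleeper. */
    static void rp_fail_and_kick(RecoverState *s)
    {
        s->rp_error = true;
        sem_post(&s->rp_sem);
    }

    /* Main migration thread side: wait, but report failure to the caller. */
    static int rp_wait(RecoverState *s)
    {
        sem_wait(&s->rp_sem);
        return s->rp_error ? -1 : 0;
    }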
To make it even stronger, teach the QMP command migrate-pause to explicitly
kick the source/destination QEMU out when needed, so that even if for some
reason the migration thread was not already kicked out by a failing
return-path thread, the admin can still kick it out.
This is a super corner case, but try to cover it anyway.
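Continuing the sketch above, the "admin kick" via migrate-pause boils down to
breaking any blocking I/O on the channel and waking the sleeping thread; the
snippet below is again illustrative only, assuming a socket-like channel fd,
and is not the actual QMP handler.

    #include <sys/socket.h>

    /*
     * Illustrative sketch only: an explicit pause request from the admin
     * both makes pending reads/writes on the channel fail and wakes the
     * main thread, so the migration falls back to PAUSED even if the
     * return-path thread never noticed the breakage by itself.
     */
    static void admin_pause_kick(RecoverState *s, int channel_fd)
    {
        /* Force any blocked send()/recv() on the channel to error out */
        shutdown(channel_fd, SHUT_RDWR);

        /* Wake the main migration thread if it sleeps on rp_sem */
        rp_fail_and_kick(s);
    }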
One can try to test this with two proxy channels for migration:
(a) socat unix-listen:/tmp/src.sock,reuseaddr,fork tcp:localhost:10000
(b) socat tcp-listen:10000,reuseaddr,fork unix:/tmp/dst.sock
So the migration channel will be:
                  (a)               (b)
src -> /tmp/src.sock -> tcp:10000 -> /tmp/dst.sock -> dst
Then, to make QEMU hang in the RECOVER stage, one can do the following:
(1) stop the postcopy using the QMP command migrate-pause
(2) kill the 2nd proxy (b)
(3) try to recover the postcopy using /tmp/src.sock on src
(4) src QEMU will go into RECOVER stage but won't be able to continue
from there, because the channel is actually broken at (b)
Before this patch, step (4) leaves the source QEMU stuck in the RECOVER stage,
with no way to kick QEMU out or continue the postcopy again.  After this
patch, (4) fails quickly and bounces back to the PAUSED stage.
The admin can also kick QEMU from (4) into PAUSED using migrate-pause when
needed.
After bouncing back to the PAUSED stage, one can recover again.
Reported-by: Xiaohui Li <xiaohli@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2111332
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
Message-ID: <20231017202633.296756-3-peterx@redhat.com>
Diffstat (limited to 'migration/ram.c')
-rw-r--r--   migration/ram.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/migration/ram.c b/migration/ram.c
index d05ffddbc8..929cba08f4 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -4099,7 +4099,9 @@ static int ram_dirty_bitmap_sync_all(MigrationState *s, RAMState *rs)
     /* Wait until all the ramblocks' dirty bitmap synced */
     while (qatomic_read(&rs->postcopy_bmap_sync_requested)) {
-        migration_rp_wait(s);
+        if (migration_rp_wait(s)) {
+            return -1;
+        }
     }
 
     trace_ram_dirty_bitmap_sync_complete();