diff options
author | Richard W.M. Jones <rjones@redhat.com> | 2017-03-31 21:51:33 +0100 |
---|---|---|
committer | Paolo Bonzini <pbonzini@redhat.com> | 2017-04-03 19:13:12 +0200 |
commit | ecbddbb106114f90008024b4e6c3ba1c38d7ca0e (patch) | |
tree | 158447602b6ade78397ec309e63872cbed85e65d /util/main-loop.c | |
parent | 90c4fe5fc517a045e7a7cf2f23472e114042ca29 (diff) |
main-loop: Acquire main_context lock around os_host_main_loop_wait.
When running virt-rescue the serial console hangs from time to time.
Virt-rescue runs an ordinary Linux kernel "appliance", but there is
only a single idle process running inside, so the qemu main loop is
largely idle. With virt-rescue >= 1.37 you may be able to observe the
hang by doing:
$ virt-rescue -e ^] --scratch
><rescue> while true; do ls -l /usr/bin; done
The hang in virt-rescue can be resolved by pressing a key on the
serial console.
Possibly with the same root cause, we also observed hangs during very
early boot of regular Linux VMs with a serial console. Those hangs
are extremely rare, but you may be able to observe them by running
this command on baremetal for a sufficiently long time:
$ while libguestfs-test-tool -t 60 >& /tmp/log ; do echo -n . ; done
(Check in /tmp/log that the failure was caused by a hang during early
boot, and not some other reason)
During investigation of this bug, Paolo Bonzini wrote:
> glib is expecting QEMU to use g_main_context_acquire around accesses to
> GMainContext. However QEMU is not doing that, instead it is taking its
> own mutex. So we should add g_main_context_acquire and
> g_main_context_release in the two implementations of
> os_host_main_loop_wait; these should undo the effect of Frediano's
> glib patch.
This patch exactly implements Paolo's suggestion in that paragraph.
This fixes the serial console hang in my testing, across 3 different
physical machines (AMD, Intel Core i7 and Intel Xeon), over many hours
of automated testing. I wasn't able to reproduce the early boot hangs
(but as noted above, these are extremely rare in any case).
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1435432
Reported-by: Richard W.M. Jones <rjones@redhat.com>
Tested-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Richard W.M. Jones <rjones@redhat.com>
Message-Id: <20170331205133.23906-1-rjones@redhat.com>
[Paolo: this is actually a glib bug: recent glib versions are also
expecting g_main_context_acquire around g_poll---but that is not
documented and probably not even intended].
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Diffstat (limited to 'util/main-loop.c')
-rw-r--r-- | util/main-loop.c | 11 |
1 files changed, 11 insertions, 0 deletions
diff --git a/util/main-loop.c b/util/main-loop.c index 4534c89308..19cad6b8b6 100644 --- a/util/main-loop.c +++ b/util/main-loop.c @@ -218,9 +218,12 @@ static void glib_pollfds_poll(void) static int os_host_main_loop_wait(int64_t timeout) { + GMainContext *context = g_main_context_default(); int ret; static int spin_counter; + g_main_context_acquire(context); + glib_pollfds_fill(&timeout); /* If the I/O thread is very busy or we are incorrectly busy waiting in @@ -256,6 +259,9 @@ static int os_host_main_loop_wait(int64_t timeout) } glib_pollfds_poll(); + + g_main_context_release(context); + return ret; } #else @@ -412,12 +418,15 @@ static int os_host_main_loop_wait(int64_t timeout) fd_set rfds, wfds, xfds; int nfds; + g_main_context_acquire(context); + /* XXX: need to suppress polling by better using win32 events */ ret = 0; for (pe = first_polling_entry; pe != NULL; pe = pe->next) { ret |= pe->func(pe->opaque); } if (ret != 0) { + g_main_context_release(context); return ret; } @@ -472,6 +481,8 @@ static int os_host_main_loop_wait(int64_t timeout) g_main_context_dispatch(context); } + g_main_context_release(context); + return select_ret || g_poll_ret; } #endif |