aboutsummaryrefslogtreecommitdiff
path: root/accel/tcg
AgeCommit message (Collapse)Author
2018-02-08tcg: Add generic vector helpers with a scalar operandRichard Henderson
Use dup to convert a non-constant scalar to a third vector. Add addition, multiplication, and logical operations with an immediate. Add addition, subtraction, multiplication, and logical operations with a non-constant scalar. Allow for the front-end to build operations in which the scalar operand comes first. Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-02-08tcg: Add generic helpers for saturating arithmeticRichard Henderson
No vector ops as yet. SSE only has direct support for 8- and 16-bit saturation; handling 32- and 64-bit saturation is much more expensive. Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-02-08tcg: Add generic vector ops for multiplicationRichard Henderson
Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-02-08tcg: Add generic vector ops for comparisonsRichard Henderson
Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-02-08tcg: Add generic vector ops for constant shiftsRichard Henderson
Opcodes are added for scalar and vector shifts, but considering the varied semantics of these do not expose them to the front ends. Do go ahead and provide them in case they are needed for backend expansion. Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-02-08tcg: Add generic vector expandersRichard Henderson
Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2018-02-05Drop remaining bits of ia64 host supportPeter Maydell
We dropped support for ia64 host CPUs in the 2.11 release (removing the TCG backend for it, and advertising the support as being completely removed in the changelog). However there are a few bits and pieces of code still floating about. Remove those, too. We can drop the check in configure for "ia64 or hppa host?" entirely, because we don't support hppa hosts either any more. Signed-off-by: Peter Maydell <peter.maydell@linaro.org> Message-Id: <1516897189-11035-1-git-send-email-peter.maydell@linaro.org> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-01-25accel/tcg: add size paremeter in tlb_fill()Laurent Vivier
The MC68040 MMU provides the size of the access that triggers the page fault. This size is set in the Special Status Word which is written in the stack frame of the access fault exception. So we need the size in m68k_cpu_unassigned_access() and m68k_cpu_handle_mmu_fault(). To be able to do that, this patch modifies the prototype of handle_mmu_fault handler, tlb_fill() and probe_write(). do_unassigned_access() already includes a size parameter. This patch also updates handle_mmu_fault handlers and tlb_fill() of all targets (only parameter, no code change). Signed-off-by: Laurent Vivier <laurent@vivier.eu> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Message-Id: <20180118193846.24953-2-laurent@vivier.eu>
2018-01-23page_unprotect(): handle calls to pages that are PAGE_WRITEPeter Maydell
If multiple guest threads in user-mode emulation write to a page which QEMU has marked read-only because of cached TCG translations, the threads can race in page_unprotect: * threads A & B both try to do a write to a page with code in it at the same time (ie which we've made non-writeable, so SEGV) * they race into the signal handler with this faulting address * thread A happens to get to page_unprotect() first and takes the mmap lock, so thread B sits waiting for it to be done * A then finds the page, marks it PAGE_WRITE and mprotect()s it writable * A can then continue OK (returns from signal handler to retry the memory access) * ...but when B gets the mmap lock it finds that the page is already PAGE_WRITE, and so it exits page_unprotect() via the "not due to protected translation" code path, and wrongly delivers the signal to the guest rather than just retrying the access In particular, this meant that trying to run 'javac' in user-mode emulation would fail with a spurious guest SIGSEGV. Handle this by making page_unprotect() assume that a call for a page which is already PAGE_WRITE is due to a race of this sort and return a "fault handled" indication. Since this would cause an infinite loop if we ever called page_unprotect() for some other kind of fault than "write failed due to bad access permissions", tighten the condition in handle_cpu_signal() to check the signal number and si_code, and add a comment so that if somebody does ever find themselves debugging an infinite loop of faults they have some clue about why. (The trick for identifying the correct setting for current_tb_invalidated for thread B (needed to handle the precise-SMC case) is due to Richard Henderson. Paolo Bonzini suggested just relying on si_code rather than trying anything more complicated.) Signed-off-by: Peter Maydell <peter.maydell@linaro.org> Message-Id: <1511879725-9576-3-git-send-email-peter.maydell@linaro.org> Signed-off-by: Laurent Vivier <laurent@vivier.eu>
2018-01-23linux-user: Propagate siginfo_t through to handle_cpu_signal()Peter Maydell
Currently all the architecture/OS specific cpu_signal_handler() functions call handle_cpu_signal() without passing it the siginfo_t. We're going to want that so we can look at the si_code to determine whether this is a SEGV_ACCERR access violation or some other kind of fault, so change the functions to pass through the pointer to the siginfo_t rather than just the si_addr value. Signed-off-by: Peter Maydell <peter.maydell@linaro.org> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Message-Id: <1511879725-9576-2-git-send-email-peter.maydell@linaro.org> Signed-off-by: Laurent Vivier <laurent@vivier.eu>
2017-12-29tcg: add cs_base and flags to -d exec outputPaolo Bonzini
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20171217055023.29225-1-pbonzini@redhat.com> [rth: Also change the Chain logging in helper_lookup_tb_ptr.] Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-12-21cpu-exec: fix missed CPU kick during interrupt injectionDavid Hildenbrand
The conditional memory barrier not only looks strange but actually is wrong. On s390x, I can reproduce interrupts via cpu_interrupt() not leading to a proper kick out of emulation every now and then. cpu_interrupt() is especially used for inter CPU communication via SIGP (esp. external calls and emergency interrupts). With this patch, I was not able to reproduce. (esp. no stalls or hangs in the guest). My setup is s390x MTTCG with 16 VCPUs on 8 CPU host, running make -j16. Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20171129191319.11483-1-david@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-12-18misc: remove duplicated includesPhilippe Mathieu-Daudé
exec: housekeeping (funny since 02d0e095031) applied using ./scripts/clean-includes Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org> Reviewed-by: Peter Maydell <peter.maydell@linaro.org> Acked-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com> Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
2017-12-18accel/tcg/cpu-exec-common.c: Remove unnecessary include of memory-internal.hPeter Maydell
The cpu-exec-common.c file includes memory-internal.h, but it doesn't actually use anything from that header. Remove the unnecessary include. Signed-off-by: Peter Maydell <peter.maydell@linaro.org> Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org> Tested-by: Philippe Mathieu-Daudé <f4bug@amsat.org> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
2017-12-18translate-all: fix 'consisits' typo in commentEmilio G. Cota
Signed-off-by: Emilio G. Cota <cota@braap.org> Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
2017-11-21accel/tcg: Handle atomic accesses to notdirty memory correctlyPeter Maydell
To do a write to memory that is marked as notdirty, we need to invalidate any TBs we have cached for that memory, and update the cpu physical memory dirty flags for VGA and migration. The slowpath code in notdirty_mem_write() does all this correctly, but the new atomic handling code in atomic_mmu_lookup() doesn't do anything at all, it just clears the dirty bit in the TLB. The effect of this bug is that if the first write to a notdirty page for which we have cached TBs is by a guest atomic access, we fail to invalidate the TBs and subsequently will execute incorrect code. This can be seen by trying to run 'javac' on AArch64. Use the new notdirty_call_before() and notdirty_call_after() functions to correctly handle the update to notdirty memory in the atomic codepath. Cc: qemu-stable@nongnu.org Signed-off-by: Peter Maydell <peter.maydell@linaro.org> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Message-id: 1511201308-23580-3-git-send-email-peter.maydell@linaro.org
2017-11-20Revert "cpu-exec: don't overwrite exception_index"Peter Maydell
This reverts commit e01cecabf3e04d22340d7e8b3616ef051c42c891, which breaks booting of aarch64 Linux images. Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2017-11-16Merge remote-tracking branch 'remotes/bonzini/tags/for-upstream' into stagingPeter Maydell
Miscellaneous bugfixes # gpg: Signature made Wed 15 Nov 2017 15:27:25 GMT # gpg: using RSA key 0xBFFBD25F78C7AE83 # gpg: Good signature from "Paolo Bonzini <bonzini@gnu.org>" # gpg: aka "Paolo Bonzini <pbonzini@redhat.com>" # Primary key fingerprint: 46F5 9FBD 57D6 12E7 BFD4 E2F7 7E15 100C CD36 69B1 # Subkey fingerprint: F133 3857 4B66 2389 866C 7682 BFFB D25F 78C7 AE83 * remotes/bonzini/tags/for-upstream: fix scripts/update-linux-headers.sh here document exec: Do not resolve subpage in mru_section util/stats64: Fix min/max comparisons cpu-exec: avoid cpu_exec_nocache infinite loop with record/replay cpu-exec: don't overwrite exception_index vhost-user-scsi: add missing virtqueue_size param target-i386: adds PV_TLB_FLUSH CPUID feature bit thread-posix: fix qemu_rec_mutex_trylock macro Makefile: simpler/faster "make help" ioapic/tracing: Remove last DPRINTFs Enable 8-byte wide MMIO for 16550 serial devices Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2017-11-15tcg: Record code_gen_buffer address for user-only memory helpersRichard Henderson
When we handle a signal from a fault within a user-only memory helper, we cannot cpu_restore_state with the PC found within the signal frame. Use a TLS variable, helper_retaddr, to record the unwind start point to find the faulting guest insn. Tested-by: Alex Bennée <alex.bennee@linaro.org> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Reported-by: Peter Maydell <peter.maydell@linaro.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-11-14cpu-exec: avoid cpu_exec_nocache infinite loop with record/replayPavel Dovgalyuk
This patch ensures that icount_decr.u32.high is clear before calling cpu_exec_nocache when exception is pending. Because the exception is caused by the first instruction in the block and it cannot be executed without resetting the flag. There are two parts in the fix. First, clear icount_decr.u32.high in cpu_handle_interrupt (just before processing the "dependent" request, stored in cpu->interrupt_request or cpu->exit_request) rather than cpu_loop_exec_tb; this ensures that cpu_handle_exception is always reached with zero icount_decr.u32.high unless another interrupt has happened in the meanwhile. Second, try to cause the exception at the beginning of cpu_handle_exception, and exit immediately if the TB cannot execute. With this change, interrupts are processed and cpu_exec_nocache can make process. Signed-off-by: Maria Klimushenkova <maria.klimushenkova@ispras.ru> Signed-off-by: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru> Message-Id: <20171114081818.27640.33165.stgit@pasha-VirtualBox> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-11-14cpu-exec: don't overwrite exception_indexPavel Dovgalyuk
This patch adds a condition before overwriting exception_index fiels. It is needed when exception_index is already set to some meaningful value. Signed-off-by: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru> Message-Id: <20171114081812.27640.26372.stgit@pasha-VirtualBox> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-11-13accel/tcg/translate-all: expand cpu_restore_state addr checkAlex Bennée
We are still seeing signals during translation time when we walk over a page protection boundary. This expands the check to ensure the host PC is inside the code generation buffer. The original suggestion was to check versus tcg_ctx.code_gen_ptr but as we now segment the translation buffer we have to settle for just a general check for being inside. I've also fixed up the declaration to make it clear it can deal with invalid addresses. A later patch will fix up the call sites. Signed-off-by: Alex Bennée <alex.bennee@linaro.org> Reported-by: Peter Maydell <peter.maydell@linaro.org> Reviewed-by: Laurent Vivier <laurent@vivier.eu> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Message-id: 20171108153245.20740-2-alex.bennee@linaro.org Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Cc: Richard Henderson <rth@twiddle.net> Tested-by: Peter Maydell <peter.maydell@linaro.org> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2017-11-03cpu-exec: Exit exclusive region on longjmp from step_atomicPeter Maydell
Commit ac03ee5331612e44be narrowed the scope of the exclusive region so it only covers when we're executing the TB, not when we're generating it. However it missed that there is more than one execution path out of cpu_tb_exec -- if the atomic insn causes an exception then the code will longjmp out, skipping the code to end the exclusive region. This causes QEMU to hang the next time the CPU calls start_exclusive(), waiting for itself to exit the region. Move the "end the region" code out to the end of the function so that it is run for both normal exit and also for exit-via-longjmp. We have to use a volatile bool flag to decide whether we need to end the region, because we can longjump out of the codegen as well as the execution. (For some reason this only reproduces for me with a clang optimized build, not a gcc debug build.) Reviewed-by: Emilio G. Cota <cota@braap.org> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Fixes: ac03ee5331612e44beb393df2b578c951d27dc0d Signed-off-by: Peter Maydell <peter.maydell@linaro.org> Message-Id: <1509640536-32160-1-git-send-email-peter.maydell@linaro.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-24translate-all: exit from tb_phys_invalidate if qht_remove failsEmilio G. Cota
Two or more threads might race while invalidating the same TB. We currently do not check for this at all despite taking tb_lock, which means we would wrongly invalidate the same TB more than once. This bug has actually been hit by users: I recently saw a report on IRC, although I have yet to see the corresponding test case. Fix this by using qht_remove as the synchronization point; if it fails, that means the TB has already been invalidated, and therefore there is nothing left to do in tb_phys_invalidate. Note that this solution works now that we still have tb_lock, and will continue working once we remove tb_lock. Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Signed-off-by: Emilio G. Cota <cota@braap.org> Message-Id: <1508445114-4717-1-git-send-email-cota@braap.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-24tcg: enable multiple TCG contexts in softmmuEmilio G. Cota
This enables parallel TCG code generation. However, we do not take advantage of it yet since tb_lock is still held during tb_gen_code. In user-mode we use a single TCG context; see the documentation added to tcg_region_init for the rationale. Note that targets do not need any conversion: targets initialize a TCGContext (e.g. defining TCG globals), and after this initialization has finished, the context is cloned by the vCPU threads, each of them keeping a separate copy. TCG threads claim one entry in tcg_ctxs[] by atomically increasing n_tcg_ctxs. Do not be too annoyed by the subsequent atomic_read's of that variable and tcg_ctxs; they are there just to play nice with analysis tools such as thread sanitizer. Note that we do not allocate an array of contexts (we allocate an array of pointers instead) because when tcg_context_init is called, we do not know yet how many contexts we'll use since the bool behind qemu_tcg_mttcg_enabled() isn't set yet. Previous patches folded some TCG globals into TCGContext. The non-const globals remaining are only set at init time, i.e. before the TCG threads are spawned. Here is a list of these set-at-init-time globals under tcg/: Only written by tcg_context_init: - indirect_reg_alloc_order - tcg_op_defs Only written by tcg_target_init (called from tcg_context_init): - tcg_target_available_regs - tcg_target_call_clobber_regs - arm: arm_arch, use_idiv_instructions - i386: have_cmov, have_bmi1, have_bmi2, have_lzcnt, have_movbe, have_popcnt - mips: use_movnz_instructions, use_mips32_instructions, use_mips32r2_instructions, got_sigill (tcg_target_detect_isa) - ppc: have_isa_2_06, have_isa_3_00, tb_ret_addr - s390: tb_ret_addr, s390_facilities - sparc: qemu_ld_trampoline, qemu_st_trampoline (build_trampolines), use_vis3_instructions Only written by tcg_prologue_init: - 'struct jit_code_entry one_entry' - aarch64: tb_ret_addr - arm: tb_ret_addr - i386: tb_ret_addr, guest_base_flags - ia64: tb_ret_addr - mips: tb_ret_addr, bswap32_addr, bswap32u_addr, bswap64_addr Reviewed-by: Richard Henderson <rth@twiddle.net> Signed-off-by: Emilio G. Cota <cota@braap.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-24tcg: introduce regions to split code_gen_bufferEmilio G. Cota
This is groundwork for supporting multiple TCG contexts. The naive solution here is to split code_gen_buffer statically among the TCG threads; this however results in poor utilization if translation needs are different across TCG threads. What we do here is to add an extra layer of indirection, assigning regions that act just like pages do in virtual memory allocation. (BTW if you are wondering about the chosen naming, I did not want to use blocks or pages because those are already heavily used in QEMU). We use a global lock to serialize allocations as well as statistics reporting (we now export the size of the used code_gen_buffer with tcg_code_size()). Note that for the allocator we could just use a counter and atomic_inc; however, that would complicate the gathering of tcg_code_size()-like stats. So given that the region operations are not a fast path, a lock seems the most reasonable choice. The effectiveness of this approach is clear after seeing some numbers. I used the bootup+shutdown of debian-arm with '-tb-size 80' as a benchmark. Note that I'm evaluating this after enabling per-thread TCG (which is done by a subsequent commit). * -smp 1, 1 region (entire buffer): qemu: flush code_size=83885014 nb_tbs=154739 avg_tb_size=357 qemu: flush code_size=83884902 nb_tbs=153136 avg_tb_size=363 qemu: flush code_size=83885014 nb_tbs=152777 avg_tb_size=364 qemu: flush code_size=83884950 nb_tbs=150057 avg_tb_size=373 qemu: flush code_size=83884998 nb_tbs=150234 avg_tb_size=373 qemu: flush code_size=83885014 nb_tbs=154009 avg_tb_size=360 qemu: flush code_size=83885014 nb_tbs=151007 avg_tb_size=370 qemu: flush code_size=83885014 nb_tbs=151816 avg_tb_size=367 That is, 8 flushes. * -smp 8, 32 regions (80/32 MB per region) [i.e. this patch]: qemu: flush code_size=76328008 nb_tbs=141040 avg_tb_size=356 qemu: flush code_size=75366534 nb_tbs=138000 avg_tb_size=361 qemu: flush code_size=76864546 nb_tbs=140653 avg_tb_size=361 qemu: flush code_size=76309084 nb_tbs=135945 avg_tb_size=375 qemu: flush code_size=74581856 nb_tbs=132909 avg_tb_size=375 qemu: flush code_size=73927256 nb_tbs=135616 avg_tb_size=360 qemu: flush code_size=78629426 nb_tbs=142896 avg_tb_size=365 qemu: flush code_size=76667052 nb_tbs=138508 avg_tb_size=368 Again, 8 flushes. Note how buffer utilization is not 100%, but it is close. Smaller region sizes would yield higher utilization, but we want region allocation to be rare (it acquires a lock), so we do not want to go too small. * -smp 8, static partitioning of 8 regions (10 MB per region): qemu: flush code_size=21936504 nb_tbs=40570 avg_tb_size=354 qemu: flush code_size=11472174 nb_tbs=20633 avg_tb_size=370 qemu: flush code_size=11603976 nb_tbs=21059 avg_tb_size=365 qemu: flush code_size=23254872 nb_tbs=41243 avg_tb_size=377 qemu: flush code_size=28289496 nb_tbs=52057 avg_tb_size=358 qemu: flush code_size=43605160 nb_tbs=78896 avg_tb_size=367 qemu: flush code_size=45166552 nb_tbs=82158 avg_tb_size=364 qemu: flush code_size=63289640 nb_tbs=116494 avg_tb_size=358 qemu: flush code_size=51389960 nb_tbs=93937 avg_tb_size=362 qemu: flush code_size=59665928 nb_tbs=107063 avg_tb_size=372 qemu: flush code_size=38380824 nb_tbs=68597 avg_tb_size=374 qemu: flush code_size=44884568 nb_tbs=79901 avg_tb_size=376 qemu: flush code_size=50782632 nb_tbs=90681 avg_tb_size=374 qemu: flush code_size=39848888 nb_tbs=71433 avg_tb_size=372 qemu: flush code_size=64708840 nb_tbs=119052 avg_tb_size=359 qemu: flush code_size=49830008 nb_tbs=90992 avg_tb_size=362 qemu: flush code_size=68372408 nb_tbs=123442 avg_tb_size=368 qemu: flush code_size=33555560 nb_tbs=59514 avg_tb_size=378 qemu: flush code_size=44748344 nb_tbs=80974 avg_tb_size=367 qemu: flush code_size=37104248 nb_tbs=67609 avg_tb_size=364 That is, 20 flushes. Note how a static partitioning approach uses the code buffer poorly, leading to many unnecessary flushes. Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Signed-off-by: Emilio G. Cota <cota@braap.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-24translate-all: use qemu_protect_rwx/none helpersEmilio G. Cota
The helpers require the address and size to be page-aligned, so do that before calling them. Reviewed-by: Richard Henderson <rth@twiddle.net> Signed-off-by: Emilio G. Cota <cota@braap.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-24tcg: distribute profiling counters across TCGContext'sEmilio G. Cota
This is groundwork for supporting multiple TCG contexts. To avoid scalability issues when profiling info is enabled, this patch makes the profiling info counters distributed via the following changes: 1) Consolidate profile info into its own struct, TCGProfile, which TCGContext also includes. Note that tcg_table_op_count is brought into TCGProfile after dropping the tcg_ prefix. 2) Iterate over the TCG contexts in the system to obtain the total counts. This change also requires updating the accessors to TCGProfile fields to use atomic_read/set whenever there may be conflicting accesses (as defined in C11) to them. Reviewed-by: Richard Henderson <rth@twiddle.net> Signed-off-by: Emilio G. Cota <cota@braap.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-24tcg: define tcg_init_ctx and make tcg_ctx a pointerEmilio G. Cota
Groundwork for supporting multiple TCG contexts. The core of this patch is this change to tcg/tcg.h: > -extern TCGContext tcg_ctx; > +extern TCGContext tcg_init_ctx; > +extern TCGContext *tcg_ctx; Note that for now we set *tcg_ctx to whatever TCGContext is passed to tcg_context_init -- in this case &tcg_init_ctx. Reviewed-by: Richard Henderson <rth@twiddle.net> Signed-off-by: Emilio G. Cota <cota@braap.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-24tcg: take tb_ctx out of TCGContextEmilio G. Cota
Groundwork for supporting multiple TCG contexts. Reviewed-by: Richard Henderson <rth@twiddle.net> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Emilio G. Cota <cota@braap.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-24translate-all: report correct avg host TB sizeEmilio G. Cota
Since commit 6e3b2bfd6 ("tcg: allocate TB structs before the corresponding translated code") we are not fully utilizing code_gen_buffer for translated code, and therefore are incorrectly reporting the amount of translated code as well as the average host TB size. Address this by: - Making the conscious choice of misreporting the total translated code; doing otherwise would mislead users into thinking "-tb-size" is not honoured. - Expanding tb_tree_stats to accurately count the bytes of translated code on the host, and using this for reporting the average tb host size, as well as the expansion ratio. In the future we might want to consider reporting the accurate numbers for the total translated code, together with a "bookkeeping/overhead" field to account for the TB structs. Reviewed-by: Richard Henderson <rth@twiddle.net> Signed-off-by: Emilio G. Cota <cota@braap.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-24exec-all: rename tb_free to tb_removeEmilio G. Cota
We don't really free anything in this function anymore; we just remove the TB from the binary search tree. Suggested-by: Alex Bennée <alex.bennee@linaro.org> Reviewed-by: Richard Henderson <rth@twiddle.net> Signed-off-by: Emilio G. Cota <cota@braap.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-24translate-all: use a binary search tree to track TBs in TBContextEmilio G. Cota
This is a prerequisite for supporting multiple TCG contexts, since we will have threads generating code in separate regions of code_gen_buffer. For this we need a new field (.size) in struct tb_tc to keep track of the size of the translated code. This field uses a size_t to avoid adding a hole to the struct, although really an unsigned int would have been enough. The comparison function we use is optimized for the common case: insertions. Profiling shows that upon booting debian-arm, 98% of comparisons are between existing tb's (i.e. a->size and b->size are both !0), which happens during insertions (and removals, but those are rare). The remaining cases are lookups. From reading the glib sources we see that the first key is always the lookup key. However, the code does not assume this to always be the case because this behaviour is not guaranteed in the glib docs. However, we embed this knowledge in the code as a branch hint for the compiler. Note that tb_free does not free space in the code_gen_buffer anymore, since we cannot easily know whether the tb is the last one inserted in code_gen_buffer. The next patch in this series renames tb_free to tb_remove to reflect this. Performance-wise, lookups in tb_find_pc are the same as before: O(log n). However, insertions are O(log n) instead of O(1), which results in a small slowdown when booting debian-arm: Performance counter stats for 'build/arm-softmmu/qemu-system-arm \ -machine type=virt -nographic -smp 1 -m 4096 \ -netdev user,id=unet,hostfwd=tcp::2222-:22 \ -device virtio-net-device,netdev=unet \ -drive file=img/arm/jessie-arm32.qcow2,id=myblock,index=0,if=none \ -device virtio-blk-device,drive=myblock \ -kernel img/arm/aarch32-current-linux-kernel-only.img \ -append console=ttyAMA0 root=/dev/vda1 \ -name arm,debug-threads=on -smp 1' (10 runs): - Before: 8048.598422 task-clock (msec) # 0.931 CPUs utilized ( +- 0.28% ) 16,974 context-switches # 0.002 M/sec ( +- 0.12% ) 0 cpu-migrations # 0.000 K/sec 10,125 page-faults # 0.001 M/sec ( +- 1.23% ) 35,144,901,879 cycles # 4.367 GHz ( +- 0.14% ) <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend 65,758,252,643 instructions # 1.87 insns per cycle ( +- 0.33% ) 10,871,298,668 branches # 1350.707 M/sec ( +- 0.41% ) 192,322,212 branch-misses # 1.77% of all branches ( +- 0.32% ) 8.640869419 seconds time elapsed ( +- 0.57% ) - After: 8146.242027 task-clock (msec) # 0.923 CPUs utilized ( +- 1.23% ) 17,016 context-switches # 0.002 M/sec ( +- 0.40% ) 0 cpu-migrations # 0.000 K/sec 18,769 page-faults # 0.002 M/sec ( +- 0.45% ) 35,660,956,120 cycles # 4.378 GHz ( +- 1.22% ) <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend 65,095,366,607 instructions # 1.83 insns per cycle ( +- 1.73% ) 10,803,480,261 branches # 1326.192 M/sec ( +- 1.95% ) 195,601,289 branch-misses # 1.81% of all branches ( +- 0.39% ) 8.828660235 seconds time elapsed ( +- 0.38% ) Reviewed-by: Richard Henderson <rth@twiddle.net> Signed-off-by: Emilio G. Cota <cota@braap.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-24tcg: Remove CF_IGNORE_ICOUNTRichard Henderson
Now that we have curr_cflags, we can include CF_USE_ICOUNT early and then remove it as necessary. Reviewed-by: Emilio G. Cota <cota@braap.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-24cpu-exec: lookup/generate TB outside exclusive region during step_atomicEmilio G. Cota
Now that all code generation has been converted to check CF_PARALLEL, we can generate !CF_PARALLEL code without having yet set !parallel_cpus -- and therefore without having to be in the exclusive region during cpu_exec_step_atomic. While at it, merge cpu_exec_step into cpu_exec_step_atomic. Reviewed-by: Richard Henderson <rth@twiddle.net> Signed-off-by: Emilio G. Cota <cota@braap.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-24tcg: check CF_PARALLEL instead of parallel_cpusEmilio G. Cota
Thereby decoupling the resulting translated code from the current state of the system. The tb->cflags field is not passed to tcg generation functions. So we add a field to TCGContext, storing there a copy of tb->cflags. Most architectures have <= 32 registers, which results in a 4-byte hole in TCGContext. Use this hole for the new field. Reviewed-by: Richard Henderson <rth@twiddle.net> Signed-off-by: Emilio G. Cota <cota@braap.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-24tcg: convert tb->cflags reads to tb_cflags(tb)Emilio G. Cota
Convert all existing readers of tb->cflags to tb_cflags, so that we use atomic_read and therefore avoid undefined behaviour in C11. Note that the remaining setters/getters of the field are protected by tb_lock, and therefore do not need conversion. Luckily all readers access the field via 'tb->cflags' (so no foo.cflags, bar->cflags in the code base), which makes the conversion easily scriptable: FILES=$(git grep 'tb->cflags' target include/exec/gen-icount.h \ accel/tcg/translator.c | cut -f1 -d':' | sort | uniq) perl -pi -e 's/([^.>])tb->cflags/$1tb_cflags(tb)/g' $FILES perl -pi -e 's/([a-z->.]*)(->|\.)tb->cflags/tb_cflags($1$2tb)/g' $FILES Then manually fixed the few errors that checkpatch reported. Compile-tested for all targets. Suggested-by: Richard Henderson <rth@twiddle.net> Reviewed-by: Richard Henderson <rth@twiddle.net> Signed-off-by: Emilio G. Cota <cota@braap.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-24tcg: Add CPUState cflags_next_tbRichard Henderson
We were generating code during tb_invalidate_phys_page_range, check_watchpoint, cpu_io_recompile, and (seemingly) discarding the TB, assuming that it would magically be picked up during the next iteration through the cpu_exec loop. Instead, record the desired cflags in CPUState so that we request the proper TB so that there is no more magic. Reviewed-by: Emilio G. Cota <cota@braap.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-24tcg: define CF_PARALLEL and use it for TB hashing along with CF_COUNT_MASKEmilio G. Cota
This will enable us to decouple code translation from the value of parallel_cpus at any given time. It will also help us minimize TB flushes when generating code via EXCP_ATOMIC. Note that the declaration of parallel_cpus is brought to exec-all.h to be able to define there the "curr_cflags" inline. Signed-off-by: Emilio G. Cota <cota@braap.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-20accel/tcg: allow to invalidate a write TLB entry immediatelyDavid Hildenbrand
Background: s390x implements Low-Address Protection (LAP). If LAP is enabled, writing to effective addresses (before any translation) 0-511 and 4096-4607 triggers a protection exception. So we have subpage protection on the first two pages of every address space (where the lowcore - the CPU private data resides). By immediately invalidating the write entry but allowing the caller to continue, we force every write access onto these first two pages into the slow path. we will get a tlb fault with the specific accessed addresses and can then evaluate if protection applies or not. We have to make sure to ignore the invalid bit if tlb_fill() succeeds. Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20171016202358.3633-2-david@redhat.com> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
2017-10-16tcg: Fix off-by-one in assert in page_set_flagsRichard Henderson
Most of the users of page_set_flags offset (page, page + len) as the end points. One might consider this an error, since the other users do supply an endpoint as the last byte of the region. However, the first thing that page_set_flags does is round end UP to the start of the next page. Which means computing page + len - 1 is in the end pointless. Therefore, accept this usage and do not assert when given the exact size of the vm as the endpoint. Signed-off-by: Richard Henderson <rth@twiddle.net> Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org> Message-Id: <20170708025030.15845-2-rth@twiddle.net> Signed-off-by: Riku Voipio <riku.voipio@linaro.org>
2017-10-10exec-all: extract tb->tc_* into a separate struct tc_tbEmilio G. Cota
In preparation for adding tc.size to be able to keep track of TB's using the binary search tree implementation from glib. Reviewed-by: Richard Henderson <rth@twiddle.net> Signed-off-by: Emilio G. Cota <cota@braap.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-10translate-all: define and use DEBUG_TB_CHECK_GATEEmilio G. Cota
This prevents bit rot by ensuring the debug code is compiled when building a user-mode target. Unfortunately the helpers are user-mode-only so we cannot fully get rid of the ifdef checks. Add a comment to explain this. Suggested-by: Alex Bennée <alex.bennee@linaro.org> Reviewed-by: Richard Henderson <rth@twiddle.net> Signed-off-by: Emilio G. Cota <cota@braap.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-10translate-all: define and use DEBUG_TB_INVALIDATE_GATEEmilio G. Cota
This gets rid of an ifdef check while ensuring that the debug code is compiled, which prevents bit rot. Suggested-by: Alex Bennée <alex.bennee@linaro.org> Reviewed-by: Richard Henderson <rth@twiddle.net> Signed-off-by: Emilio G. Cota <cota@braap.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-10exec-all: introduce TB_PAGE_ADDR_FMTEmilio G. Cota
And fix the following warning when DEBUG_TB_INVALIDATE is enabled in translate-all.c: CC mipsn32-linux-user/accel/tcg/translate-all.o /data/src/qemu/accel/tcg/translate-all.c: In function ‘tb_alloc_page’: /data/src/qemu/accel/tcg/translate-all.c:1201:16: error: format ‘%lx’ expects argument of type ‘long unsigned int’, but argument 2 has type ‘tb_page_addr_t {aka unsigned int}’ [-Werror=format=] printf("protecting code page: 0x" TARGET_FMT_lx "\n", ^ cc1: all warnings being treated as errors /data/src/qemu/rules.mak:66: recipe for target 'accel/tcg/translate-all.o' failed make[1]: *** [accel/tcg/translate-all.o] Error 1 Makefile:328: recipe for target 'subdir-mipsn32-linux-user' failed make: *** [subdir-mipsn32-linux-user] Error 2 cota@flamenco:/data/src/qemu/build ((18f3fe1...) *$)$ Reviewed-by: Richard Henderson <rth@twiddle.net> Signed-off-by: Emilio G. Cota <cota@braap.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-10translate-all: define and use DEBUG_TB_FLUSH_GATEEmilio G. Cota
This gets rid of some ifdef checks while ensuring that the debug code is compiled, which prevents bit rot. Suggested-by: Alex Bennée <alex.bennee@linaro.org> Reviewed-by: Richard Henderson <rth@twiddle.net> Signed-off-by: Emilio G. Cota <cota@braap.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-10exec-all: bring tb->invalid into tb->cflagsEmilio G. Cota
This gets rid of a hole in struct TranslationBlock. Reviewed-by: Richard Henderson <rth@twiddle.net> Signed-off-by: Emilio G. Cota <cota@braap.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-10tcg: consolidate TB lookups in tb_lookup__cpu_stateEmilio G. Cota
This avoids duplicating code. cpu_exec_step will also use the new common function once we integrate parallel_cpus into tb->cflags. Note that in this commit we also fix a race, described by Richard Henderson during review. Think of this scenario with threads A and B: (A) Lookup succeeds for TB in hash without tb_lock (B) Sets the TB's tb->invalid flag (B) Removes the TB from tb_htable (B) Clears all CPU's tb_jmp_cache (A) Store TB into local tb_jmp_cache Given that order of events, (A) will keep executing that invalid TB until another flush of its tb_jmp_cache happens, which in theory might never happen. We can fix this by checking the tb->invalid flag every time we look up a TB from tb_jmp_cache, so that in the above scenario, next time we try to find that TB in tb_jmp_cache, we won't, and will therefore be forced to look it up in tb_htable. Performance-wise, I measured a small improvement when booting debian-arm. Note that inlining pays off: Performance counter stats for 'taskset -c 0 qemu-system-arm \ -machine type=virt -nographic -smp 1 -m 4096 \ -netdev user,id=unet,hostfwd=tcp::2222-:22 \ -device virtio-net-device,netdev=unet \ -drive file=jessie.qcow2,id=myblock,index=0,if=none \ -device virtio-blk-device,drive=myblock \ -kernel kernel.img -append console=ttyAMA0 root=/dev/vda1 \ -name arm,debug-threads=on -smp 1' (10 runs): Before: 18714.917392 task-clock # 0.952 CPUs utilized ( +- 0.95% ) 23,142 context-switches # 0.001 M/sec ( +- 0.50% ) 1 CPU-migrations # 0.000 M/sec 10,558 page-faults # 0.001 M/sec ( +- 0.95% ) 53,957,727,252 cycles # 2.883 GHz ( +- 0.91% ) [83.33%] 24,440,599,852 stalled-cycles-frontend # 45.30% frontend cycles idle ( +- 1.20% ) [83.33%] 16,495,714,424 stalled-cycles-backend # 30.57% backend cycles idle ( +- 0.95% ) [66.66%] 76,267,572,582 instructions # 1.41 insns per cycle # 0.32 stalled cycles per insn ( +- 0.87% ) [83.34%] 12,692,186,323 branches # 678.186 M/sec ( +- 0.92% ) [83.35%] 263,486,879 branch-misses # 2.08% of all branches ( +- 0.73% ) [83.34%] 19.648474449 seconds time elapsed ( +- 0.82% ) After, w/ inline (this patch): 18471.376627 task-clock # 0.955 CPUs utilized ( +- 0.96% ) 23,048 context-switches # 0.001 M/sec ( +- 0.48% ) 1 CPU-migrations # 0.000 M/sec 10,708 page-faults # 0.001 M/sec ( +- 0.81% ) 53,208,990,796 cycles # 2.881 GHz ( +- 0.98% ) [83.34%] 23,941,071,673 stalled-cycles-frontend # 44.99% frontend cycles idle ( +- 0.95% ) [83.34%] 16,161,773,848 stalled-cycles-backend # 30.37% backend cycles idle ( +- 0.76% ) [66.67%] 75,786,269,766 instructions # 1.42 insns per cycle # 0.32 stalled cycles per insn ( +- 1.24% ) [83.34%] 12,573,617,143 branches # 680.708 M/sec ( +- 1.34% ) [83.33%] 260,235,550 branch-misses # 2.07% of all branches ( +- 0.66% ) [83.33%] 19.340502161 seconds time elapsed ( +- 0.56% ) After, w/o inline: 18791.253967 task-clock # 0.954 CPUs utilized ( +- 0.78% ) 23,230 context-switches # 0.001 M/sec ( +- 0.42% ) 1 CPU-migrations # 0.000 M/sec 10,563 page-faults # 0.001 M/sec ( +- 1.27% ) 54,168,674,622 cycles # 2.883 GHz ( +- 0.80% ) [83.34%] 24,244,712,629 stalled-cycles-frontend # 44.76% frontend cycles idle ( +- 1.37% ) [83.33%] 16,288,648,572 stalled-cycles-backend # 30.07% backend cycles idle ( +- 0.95% ) [66.66%] 77,659,755,503 instructions # 1.43 insns per cycle # 0.31 stalled cycles per insn ( +- 0.97% ) [83.34%] 12,922,780,045 branches # 687.702 M/sec ( +- 1.06% ) [83.34%] 261,962,386 branch-misses # 2.03% of all branches ( +- 0.71% ) [83.35%] 19.700174670 seconds time elapsed ( +- 0.56% ) Reviewed-by: Richard Henderson <rth@twiddle.net> Signed-off-by: Emilio G. Cota <cota@braap.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-10tcg: remove addr argument from lookup_tb_ptrEmilio G. Cota
It is unlikely that we will ever want to call this helper passing an argument other than the current PC. So just remove the argument, and use the pc we already get from cpu_get_tb_cpu_state. This change paves the way to having a common "tb_lookup" function. Reviewed-by: Richard Henderson <rth@twiddle.net> Signed-off-by: Emilio G. Cota <cota@braap.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2017-10-10cpu-exec: rename have_tb_lock to acquired_tb_lock in tb_findEmilio G. Cota
Reusing the have_tb_lock name, which is also defined in translate-all.c, makes code reviewing unnecessarily harder. Avoid potential confusion by renaming the local have_tb_lock variable to something else. Reviewed-by: Richard Henderson <rth@twiddle.net> Signed-off-by: Emilio G. Cota <cota@braap.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>