Age | Commit message (Collapse) | Author |
|
The general ternary logic operation can implement BITSEL.
Funnel the 4-operand operation into three variants of the
3-operand instruction, depending on input operand overlap.
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
|
|
AVX512VL has a general ternary logic operation, VPTERNLOGQ,
which can implement NOT, ORC, NAND, NOR, EQV.
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
|
|
AVX512VL has VPROLVD and VPRORVQ.
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
|
|
AVX512VL has VPROLD and VPROLQ, layered onto the same
opcode as PSHIFTD, but requires EVEX encoding and W1.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
|
|
There are some operation sizes in some subsets of AVX512 that
are missing from previous iterations of AVX. Detect them.
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
|
|
We've had placeholders for these opcodes for a while,
and should have support on ppc, s390x and avx512 hosts.
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
|
|
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
|
|
Since 6eea04347eb6, all tcg backends support goto_ptr.
Remove the conditional, making support mandatory.
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
|
|
Remove the ifdef ladder and move each define into the
appropriate header file.
Reviewed-by: Luis Pires <luis.pires@eldorado.org.br>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
|
|
Now that all native tcg hosts support splitwx, remove the define.
Replace the one use with a test for CONFIG_TCG_INTERPRETER.
Reviewed-by: Joelle van Dyne <j@getutm.app>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
|
|
Reviewed-by: Joelle van Dyne <j@getutm.app>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
|
|
Plumb the value through to alloc_code_gen_buffer. This is not
supported by any os or tcg backend, so for now enabling it will
result in an error.
Reviewed-by: Joelle van Dyne <j@getutm.app>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
|
|
Pass both rx and rw addresses to tb_target_set_jmp_target.
Reviewed-by: Joelle van Dyne <j@getutm.app>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
|
|
Enable this on i386 to restrict the set of input registers
for an 8-bit store, as required by the architecture. This
removes the last use of scratch registers for user-only mode.
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
|
|
Always true when movbe is available, otherwise leave
this to generic code.
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
|
|
This has been a tcg-specific function, but is also in use
by hardware accelerators via physmem.c. This can cause
link errors when tcg is disabled.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Joelle van Dyne <j@getutm.app>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Message-Id: <20201214140314.18544-3-richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The cmp_vec opcode is mandatory; this symbol is unused.
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
|
|
clang's C11 atomic_fetch_*() functions only take a C11 atomic type
pointer argument. QEMU uses direct types (int, etc) and this causes a
compiler error when a QEMU code calls these functions in a source file
that also included <stdatomic.h> via a system header file:
$ CC=clang CXX=clang++ ./configure ... && make
../util/async.c:79:17: error: address argument to atomic operation must be a pointer to _Atomic type ('unsigned int *' invalid)
Avoid using atomic_*() names in QEMU's atomic.h since that namespace is
used by <stdatomic.h>. Prefix QEMU's APIs with 'q' so that atomic.h
and <stdatomic.h> can co-exist. I checked /usr/include on my machine and
searched GitHub for existing "qatomic_" users but there seem to be none.
This patch was generated using:
$ git grep -h -o '\<atomic\(64\)\?_[a-z0-9_]\+' include/qemu/atomic.h | \
sort -u >/tmp/changed_identifiers
$ for identifier in $(</tmp/changed_identifiers); do
sed -i "s%\<$identifier\>%q$identifier%g" \
$(git grep -I -l "\<$identifier\>")
done
I manually fixed line-wrap issues and misaligned rST tables.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20200923105646.47864-1-stefanha@redhat.com>
|
|
No host backend support yet, but the interfaces for rotls
are in place. Only implement left-rotate for now, as the
only known use of vector rotate by scalar is s390x, so any
right-rotate would be unused and untestable.
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
|
|
No host backend support yet, but the interfaces for rotlv
and rotrv are in place.
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
v3: Drop the generic expansion from rot to shift; we can do better
for each backend, and then this code becomes unused.
|
|
No host backend support yet, but the interfaces for rotli
are in place. Canonicalize immediate rotate to the left,
based on a survey of architectures, but provide both left
and right shift interfaces to the translators.
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
|
|
We currently search both the root and the tcg/ directories for tcg
files:
$ git grep '#include "tcg/' | wc -l
28
$ git grep '#include "tcg[^/]' | wc -l
94
To simplify the preprocessor search path, unify by expliciting the
tcg/ directory.
Patch created mechanically by running:
$ for x in \
tcg.h tcg-mo.h tcg-op.h tcg-opc.h \
tcg-op-gvec.h tcg-gvec-desc.h; do \
sed -i "s,#include \"$x\",#include \"tcg/$x\"," \
$(git grep -l "#include \"$x\""); \
done
Acked-by: David Gibson <david@gibson.dropbear.id.au> (ppc parts)
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Reviewed-by: Stefan Weil <sw@weilnetz.de>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20200101112303.20724-2-philmd@redhat.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
|
|
We already had backend support for this feature. Expand the new
cmpsel opcode using vpblendb. The combination allows us to avoid
an extra NOT for some comparison codes.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
|
|
Perform a per-element conditional move. This combination operation is
easier to implement on some host vector units than plain cmp+bitsel.
Omit the usual gvec interface, as this is intended to be used by
target-specific gvec expansion call-backs.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
|
|
This operation performs d = (b & a) | (c & ~a), and is present
on a majority of host vector units. Include gvec expanders.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
|
|
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
|
|
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
|
|
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
|
|
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
|
|
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
|
|
This will let backends implement the double-word shift operation.
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
|
|
Now that all tcg backends support TCG_TARGET_IMPLEMENTS_DYN_TLB,
remove the define and the old code.
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
|
|
As the following experiments show, this series is a net perf gain,
particularly for memory-heavy workloads. Experiments are run on an
Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz.
1. System boot + shudown, debian aarch64:
- Before (v3.1.0):
Performance counter stats for './die.sh v3.1.0' (10 runs):
9019.797015 task-clock (msec) # 0.993 CPUs utilized ( +- 0.23% )
29,910,312,379 cycles # 3.316 GHz ( +- 0.14% )
54,699,252,014 instructions # 1.83 insn per cycle ( +- 0.08% )
10,061,951,686 branches # 1115.541 M/sec ( +- 0.08% )
172,966,530 branch-misses # 1.72% of all branches ( +- 0.07% )
9.084039051 seconds time elapsed ( +- 0.23% )
- After:
Performance counter stats for './die.sh tlb-dyn-v5' (10 runs):
8624.084842 task-clock (msec) # 0.993 CPUs utilized ( +- 0.23% )
28,556,123,404 cycles # 3.311 GHz ( +- 0.13% )
51,755,089,512 instructions # 1.81 insn per cycle ( +- 0.05% )
9,526,513,946 branches # 1104.641 M/sec ( +- 0.05% )
166,578,509 branch-misses # 1.75% of all branches ( +- 0.19% )
8.680540350 seconds time elapsed ( +- 0.24% )
That is, a 4.4% perf increase.
2. System boot + shutdown, ubuntu 18.04 x86_64:
- Before (v3.1.0):
56100.574751 task-clock (msec) # 1.016 CPUs utilized ( +- 4.81% )
200,745,466,128 cycles # 3.578 GHz ( +- 5.24% )
431,949,100,608 instructions # 2.15 insn per cycle ( +- 5.65% )
77,502,383,330 branches # 1381.490 M/sec ( +- 6.18% )
844,681,191 branch-misses # 1.09% of all branches ( +- 3.82% )
55.221556378 seconds time elapsed ( +- 5.01% )
- After:
56603.419540 task-clock (msec) # 1.019 CPUs utilized ( +- 10.19% )
202,217,930,479 cycles # 3.573 GHz ( +- 10.69% )
439,336,291,626 instructions # 2.17 insn per cycle ( +- 14.14% )
80,538,357,447 branches # 1422.853 M/sec ( +- 16.09% )
776,321,622 branch-misses # 0.96% of all branches ( +- 3.77% )
55.549661409 seconds time elapsed ( +- 10.44% )
No improvement (within noise range). Note that for this workload,
increasing the time window too much can lead to perf degradation,
since it flushes the TLB *very* frequently.
3. x86_64 SPEC06int:
x86_64-softmmu speedup vs. v3.1.0 for SPEC06int (test set)
Host: Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz (Skylake)
5.5 +------------------------------------------------------------------------+
| +-+ |
5 |-+.................+-+...............................tlb-dyn-v5.......+-|
| * * |
4.5 |-+.................*.*................................................+-|
| * * |
4 |-+.................*.*................................................+-|
| * * |
3.5 |-+.................*.*................................................+-|
| * * |
3 |-+......+-+*.......*.*................................................+-|
| * * * * |
2.5 |-+......*..*.......*.*.................................+-+*...........+-|
| * * * * * * |
2 |-+......*..*.......*.*.................................*..*...........+-|
| * * * * * * +-+ |
1.5 |-+......*..*.......*.*.................................*..*.*+-+.*+-+.+-|
| * * *+-+ * * +-+ *+-+ +-+ +-+ * * * * * * |
1 |++++-+*+*++*+*++*++*+*++*+*+++-+*+*+-++*+-++++-++++-+++*++*+*++*+*++*+++|
| * * * * * * * * * * * * * * * * * * * * * * * * * * |
0.5 +------------------------------------------------------------------------+
400.perlb401.bzip403.g429445.g456.hm462.libq464.h471.omn47483.xalancbgeomean
png: https://imgur.com/YRF90f7
That is, a 1.51x average speedup over the baseline, with a max speedup
of 5.17x.
Here's a different look at the SPEC06int results, using KVM as the baseline:
x86_64-softmmu slowdown vs. KVM for SPEC06int (test set)
Host: Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz (Skylake)
25 +---------------------------------------------------------------------------+
| +-+ +-+ |
| * * +-+ v3.1.0 |
| * * +-+ tlb-dyn-v5 |
| * * * * +-+ |
20 |-+.................*.*.............................*.+-+......*.*........+-|
| * * * # # * * |
| +-+ * * * # # * * |
| * * * * * # # * * |
15 |-+......*.*........*.*.............................*.#.#......*.+-+......+-|
| * * * * * # # * #|# |
| * * * * +-+ * # # * +-+ |
| * * +-+ * * ++-+ +-+ * # # * # # +-+ |
| * * +-+ * * * ## *| +-+ * # # * # # +-+ |
10 |-+......*.*..*.+-+.*.*........*.##.......++-+.*.+-+*.#.#......*.#.#.*.*..+-|
| * * * +-+ * * * ## +-+ *# # * # #* # # +-+ * # # * * |
| * * * # # * * +-+ * ## * +-+ *# # * # #* # # * * * # # *+-+ |
| * * * # # * * * +-+ * ## * # # *# # * # #* # # * * * # # * ## |
5 |-+......*.+-+*.#.#.*.*..*.#.#.*.##.*.#.#.*#.#.*.#.#*.#.#.*.*..*.#.#.*.##.+-|
| * # #* # # * +-+* # # * ## * # # *# # * # #* # # * * * # # * ## |
| * # #* # # * # #* # # * ## * # # *# # * # #* # # * +-+* # # * ## |
| ++-+ * # #* # # * # #* # # * ## * # # *# # * # #* # # * # #* # # * ## |
|+++*#+#+*+#+#*+#+#+*+#+#*+#+#+*+##+*+#+#+*#+#+*+#+#*+#+#+*+#+#*+#+#+*+##+++|
0 +---------------------------------------------------------------------------+
400.perlbe401.bzi403.gc429445.go456.h462.libqu464.h471.omne4483.xalancbmgeomean
png: https://imgur.com/YzAMNEV
After this series, we bring down the average SPEC06int slowdown vs KVM
from 11.47x to 7.58x.
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Message-Id: <20190116170114.26802-4-cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
|
|
Disabled in all TCG backends for now.
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Message-Id: <20190116170114.26802-3-cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
|
|
The avx instruction set does not directly provide MO_64.
We can still implement 64-bit with comparison and vpblendvb.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
|
|
Only MO_8 and MO_16 are implemented, since that's all the
instruction set provides.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
|
|
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
|
|
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
|
|
For now, defined universally as true, since we previously required
backends to implement swapped memory operations. Future patches
may now remove that support where it is onerous.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
|
|
This preserves the invariant that all TCG_TYPE_I32 values are
zero-extended in the 64-bit host register.
Reviewed-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
|
|
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
|
|
For x86_64, this can remove a REX prefix resulting in smaller code
when manipulating globals of type i32, as we move them between backing
store via cpu_env, aka TCG_AREG0.
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
|
|
The x86 vector instruction set is extremely irregular. With newer
editions, Intel has filled in some of the blanks. However, we don't
get many 64-bit operations until SSE4.2, introduced in 2009.
The subsequent edition was for AVX1, introduced in 2011, which added
three-operand addressing, and adjusts how all instructions should be
encoded.
Given the relatively narrow 2 year window between possible to support
and desirable to support, and to vastly simplify code maintainence,
I am only planning to support AVX1 and later cpus.
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
|
|
Already it saves 2 bytes per call, but also the constant pool
entry may well be shared across multiple calls.
Signed-off-by: Richard Henderson <rth@twiddle.net>
|
|
Dispense with TCGBackendData, as it has never been used for more than
holding a single pointer. Use a define in the cpu/tcg-target.h to
signal requirement for TCGLabelQemuLdst, so that we can drop the no-op
tcg-be-null.h stubs. Rename tcg-be-ldst.h to tcg-ldst.inc.c.
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Richard Henderson <rth@twiddle.net>
|
|
Replace the USE_DIRECT_JUMP ifdef with a TCG_TARGET_HAS_direct_jump
boolean test. Replace the tb_set_jmp_target1 ifdef with an unconditional
function tb_target_set_jmp_target.
While we're touching all backends, add a parameter for tb->tc_ptr;
we're going to need it shortly for some backends.
Move tb_set_jmp_target and tb_add_jump from exec-all.h to cpu-exec.c.
This opens the possibility for TCG_TARGET_HAS_direct_jump to be
a runtime decision -- based on host cpu capabilities, the size of
code_gen_buffer, or a future debugging switch.
Signed-off-by: Richard Henderson <rth@twiddle.net>
|
|
Suggested-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Message-Id: <1493263764-18657-6-git-send-email-cota@braap.org>
[rth: Reuse goto_ptr epilogue for exit_tb 0.]
Signed-off-by: Richard Henderson <rth@twiddle.net>
|
|
Instead of exporting goto_ptr directly to TCG frontends, export
tcg_gen_lookup_and_goto_ptr(), which calls goto_ptr with the pointer
returned by the lookup_tb_ptr() helper. This is the only use case
we have for goto_ptr and lookup_tb_ptr, so having this function is
very convenient. Furthermore, it trivially allows us to avoid calling
the lookup helper if goto_ptr is not implemented by the backend.
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
Message-Id: <1493263764-18657-2-git-send-email-cota@braap.org>
Message-Id: <1493263764-18657-3-git-send-email-cota@braap.org>
Message-Id: <1493263764-18657-4-git-send-email-cota@braap.org>
Message-Id: <1493263764-18657-5-git-send-email-cota@braap.org>
[rth: Squashed 4 related commits.]
Signed-off-by: Richard Henderson <rth@twiddle.net>
|
|
This enables the multi-threaded system emulation by default for ARMv7
and ARMv8 guests using the x86_64 TCG backend. This is because on the
guest side:
- The ARM translate.c/translate-64.c have been converted to
- use MTTCG safe atomic primitives
- emit the appropriate barrier ops
- The ARM machine has been updated to
- hold the BQL when modifying shared cross-vCPU state
- defer powerctl changes to async safe work
All the host backends support the barrier and atomic primitives but
need to provide same-or-better support for normal load/store
operations.
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Acked-by: Peter Maydell <peter.maydell@linaro.org>
Tested-by: Pranith Kumar <bobby.prani@gmail.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
|
|
Signed-off-by: Richard Henderson <rth@twiddle.net>
|