aboutsummaryrefslogtreecommitdiff
path: root/util/bufferiszero.c
AgeCommit message (Collapse)Author
2024-06-05host/i386: assume presence of SSE2Paolo Bonzini
QEMU now requires an x86-64-v2 host, which has SSE2. Use it freely in buffer_is_zero. Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-03util/bufferiszero: Add simd acceleration for aarch64Richard Henderson
Because non-embedded aarch64 is expected to have AdvSIMD enabled, merely double-check with the compiler flags for __ARM_NEON and don't bother with a runtime check. Otherwise, model the loop after the x86 SSE2 function. Use UMAXV for the vector reduction. This is 3 cycles on cortex-a76 and 2 cycles on neoverse-n1. Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2024-05-03util/bufferiszero: Simplify test_buffer_is_zero_next_accelRichard Henderson
Because the three alternatives are monotonic, we don't need to keep a couple of bitmasks, just identify the strongest alternative at startup. Generalize test_buffer_is_zero_next_accel and init_accel by always defining an accel_table array. Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2024-05-03util/bufferiszero: Introduce biz_accel_fn typedefRichard Henderson
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2024-05-03util/bufferiszero: Improve scalar variantRichard Henderson
Split less-than and greater-than 256 cases. Use unaligned accesses for head and tail. Avoid using out-of-bounds pointers in loop boundary conditions. Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2024-05-03util/bufferiszero: Optimize SSE2 and AVX2 variantsAlexander Monakov
Increase unroll factor in SIMD loops from 4x to 8x in order to move their bottlenecks from ALU port contention to load issue rate (two loads per cycle on popular x86 implementations). Avoid using out-of-bounds pointers in loop boundary conditions. Follow SSE2 implementation strategy in the AVX2 variant. Avoid use of PTEST, which is not profitable there (like in the removed SSE4 variant). Signed-off-by: Alexander Monakov <amonakov@ispras.ru> Signed-off-by: Mikhail Romanov <mmromanov@ispras.ru> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Message-Id: <20240206204809.9859-6-amonakov@ispras.ru>
2024-05-03util/bufferiszero: Remove useless prefetchesAlexander Monakov
Use of prefetching in bufferiszero.c is quite questionable: - prefetches are issued just a few CPU cycles before the corresponding line would be hit by demand loads; - they are done for simple access patterns, i.e. where hardware prefetchers can perform better; - they compete for load ports in loops that should be limited by load port throughput rather than ALU throughput. Signed-off-by: Alexander Monakov <amonakov@ispras.ru> Signed-off-by: Mikhail Romanov <mmromanov@ispras.ru> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Message-Id: <20240206204809.9859-5-amonakov@ispras.ru>
2024-05-03util/bufferiszero: Reorganize for early test for accelerationAlexander Monakov
Test for length >= 256 inline, where is is often a constant. Before calling into the accelerated routine, sample three bytes from the buffer, which handles most non-zero buffers. Signed-off-by: Alexander Monakov <amonakov@ispras.ru> Signed-off-by: Mikhail Romanov <mmromanov@ispras.ru> Message-Id: <20240206204809.9859-3-amonakov@ispras.ru> [rth: Use __builtin_constant_p; move the indirect call out of line.] Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2024-05-03util/bufferiszero: Remove AVX512 variantAlexander Monakov
Thanks to early checks in the inline buffer_is_zero wrapper, the SIMD routines are invoked much more rarely in normal use when most buffers are non-zero. This makes use of AVX512 unprofitable, as it incurs extra frequency and voltage transition periods during which the CPU operates at reduced performance, as described in https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html Signed-off-by: Mikhail Romanov <mmromanov@ispras.ru> Signed-off-by: Alexander Monakov <amonakov@ispras.ru> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Message-Id: <20240206204809.9859-4-amonakov@ispras.ru> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2024-05-03util/bufferiszero: Remove SSE4.1 variantAlexander Monakov
The SSE4.1 variant is virtually identical to the SSE2 variant, except for using 'PTEST+JNZ' in place of 'PCMPEQB+PMOVMSKB+CMP+JNE' for testing if an SSE register is all zeroes. The PTEST instruction decodes to two uops, so it can be handled only by the complex decoder, and since CMP+JNE are macro-fused, both sequences decode to three uops. The uops comprising the PTEST instruction dispatch to p0 and p5 on Intel CPUs, so PCMPEQB+PMOVMSKB is comparatively more flexible from dispatch standpoint. Hence, the use of PTEST brings no benefit from throughput standpoint. Its latency is not important, since it feeds only a conditional jump, which terminates the dependency chain. I never observed PTEST variants to be faster on real hardware. Signed-off-by: Alexander Monakov <amonakov@ispras.ru> Signed-off-by: Mikhail Romanov <mmromanov@ispras.ru> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Message-Id: <20240206204809.9859-2-amonakov@ispras.ru>
2023-05-23util/bufferiszero: Use i386 host/cpuinfo.hRichard Henderson
Use cpuinfo_init() during init_accel(), and the variable cpuinfo during test_buffer_is_zero_next_accel(). Adjust the logic that cycles through the set of accelerators for testing. Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2023-03-05include/qemu/cpuid: Introduce xgetbv_lowRichard Henderson
Replace the two uses of asm to expand xgetbv with an inline function. Since one of the two has been using the mnemonic, assume that the comment about "older versions of the assember" is obsolete, as even that is 4 years old. Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2023-01-16util/bufferiszero: Use __attribute__((target)) for avx2/avx512Richard Henderson
Use the attribute, which is supported by clang, instead of the #pragma, which is not supported and, for some reason, also not detected by the meson probe, so we fail by -Werror. Include only <immintrin.h> as that is the outermost "official" header for these intrinsics -- emmintrin.h and smmintrin -- are older SSE2 and SSE4 specific headers, while the immintrin.h includes all of the Intel intrinsics. Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
2022-02-04cpuid: use unsigned for max cpuidMichael S. Tsirkin
__get_cpuid_max returns an unsigned value. For consistency, store the result in an unsigned variable. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Richard Henderson <rth@twiddle.net> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
2020-04-01util/bufferiszero: improve avx2 acceleratorRobert Hoo
By increasing avx2 length_to_accel to 128, we can simplify its logic and reduce a branch. The authorship of this patch actually belongs to Richard Henderson <richard.henderson@linaro.org>, I just fixed a boundary case on his original patch. Suggested-by: Richard Henderson <richard.henderson@linaro.org> Signed-off-by: Robert Hoo <robert.hu@linux.intel.com> Message-Id: <1585119021-46593-2-git-send-email-robert.hu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-01util/bufferiszero: assign length_to_accel value for each accelerator caseRobert Hoo
Because in unit test, init_accel() will be called several times, each with different accelerator type. Signed-off-by: Robert Hoo <robert.hu@linux.intel.com> Message-Id: <1585119021-46593-1-git-send-email-robert.hu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-03-16util: add util function buffer_zero_avx512()Robert Hoo
And intialize buffer_is_zero() with it, when Intel AVX512F is available on host. This function utilizes Intel AVX512 fundamental instructions which is faster than its implementation with AVX2 (in my unit test, with 4K buffer, on CascadeLake SP, ~36% faster, buffer_zero_avx512() V.S. buffer_zero_avx2()). Signed-off-by: Robert Hoo <robert.hu@linux.intel.com> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-12Include qemu-common.h exactly where neededMarkus Armbruster
No header includes qemu-common.h after this commit, as prescribed by qemu-common.h's file comment. Signed-off-by: Markus Armbruster <armbru@redhat.com> Message-Id: <20190523143508.25387-5-armbru@redhat.com> [Rebased with conflicts resolved automatically, except for include/hw/arm/xlnx-zynqmp.h hw/arm/nrf51_soc.c hw/arm/msf2-soc.c block/qcow2-refcount.c block/qcow2-cluster.c block/qcow2-cache.c target/arm/cpu.h target/lm32/cpu.h target/m68k/cpu.h target/mips/cpu.h target/moxie/cpu.h target/nios2/cpu.h target/openrisc/cpu.h target/riscv/cpu.h target/tilegx/cpu.h target/tricore/cpu.h target/unicore32/cpu.h target/xtensa/cpu.h; bsd-user/main.c and net/tap-bsd.c fixed up]
2017-07-24util: Introduce include/qemu/cpuid.hRichard Henderson
Clang 3.9 passes the CONFIG_AVX2_OPT configure test. However, the supplied <cpuid.h> does not contain the bit_AVX2 define that we use when detecting whether the routine can be enabled. Introduce a qemu-specific header that uses the compiler's definition of __cpuid et al, but supplies any missing bit_* definitions needed. This avoids introducing any extra ifdefs to util/bufferiszero.c, and allows quite a few to be removed from tcg/i386/tcg-target.inc.c. Signed-off-by: Richard Henderson <rth@twiddle.net> Reviewed-by: Peter Maydell <peter.maydell@linaro.org> Message-id: 20170719044018.18063-1-rth@twiddle.net Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2016-09-14cutils: Rewrite x86 buffer zero checkingRichard Henderson
Handle alignment of buffers, so that the vector paths can be used more often. Signed-off-by: Richard Henderson <rth@twiddle.net> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <1473800239-13841-1-git-send-email-rth@twiddle.net> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-13cutils: Add generic prefetchRichard Henderson
There's no real knowledge of the cacheline size, just prefetching one loop ahead. Signed-off-by: Richard Henderson <rth@twiddle.net> Message-Id: <1472496380-19706-7-git-send-email-rth@twiddle.net> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-13cutils: Add SSE4 versionPaolo Bonzini
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-13cutils: Add test for buffer_is_zeroRichard Henderson
Signed-off-by: Richard Henderson <rth@twiddle.net> Message-Id: <1472496380-19706-6-git-send-email-rth@twiddle.net> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-13cutils: Remove ppc buffer zero checkingRichard Henderson
For ppc64le, gcc6 does extremely poorly with the Altivec code. Moreover, on POWER7 and POWER8, a hand-optimized Altivec version turns out to be no faster than the revised integer version, and therefore not worth the effort. Signed-off-by: Richard Henderson <rth@twiddle.net> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-13cutils: Remove aarch64 buffer zero checkingRichard Henderson
The revised integer version is 4 times faster than the neon version on an AppliedMicro Mustang. Even with hand scheduling and additional unrolling I cannot make any neon version run as fast as the integer. Signed-off-by: Richard Henderson <rth@twiddle.net> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-13cutils: Rearrange buffer_is_zero accelerationRichard Henderson
Allow selection of several acceleration functions based on the size and alignment of the buffer. Do not require ifunc support for AVX2 acceleration. Signed-off-by: Richard Henderson <rth@twiddle.net> Message-Id: <1472496380-19706-5-git-send-email-rth@twiddle.net> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-13cutils: Export only buffer_is_zeroRichard Henderson
Since the two users don't make use of the returned offset, beyond ensuring that the entire buffer is zero, consider the can_use_buffer_find_nonzero_offset and buffer_find_nonzero_offset functions internal. Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Richard Henderson <rth@twiddle.net> Message-Id: <1472496380-19706-4-git-send-email-rth@twiddle.net> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-13cutils: Remove SPLAT macroRichard Henderson
This is unused and complicates the vector interface. Signed-off-by: Richard Henderson <rth@twiddle.net> Message-Id: <1472496380-19706-3-git-send-email-rth@twiddle.net> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-13cutils: Move buffer_is_zero and subroutines to a new fileRichard Henderson
Signed-off-by: Richard Henderson <rth@twiddle.net> Message-Id: <1472496380-19706-2-git-send-email-rth@twiddle.net> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>