author     Paolo Bonzini <pbonzini@redhat.com>  2020-04-06 10:30:05 +0200
committer  Paolo Bonzini <pbonzini@redhat.com>  2020-04-11 08:49:25 -0400
commit     15e8699f009f7feeab7d9ab406bf62882958e4d9
tree       c36536d19b1248251bc4d66362e00802652adbb2 /docs/devel/atomics.rst
parent     278fb1627351218b23dd33403f08d7521643fda2
atomics: convert to reStructuredText
No attempt to fix or update the text; that is left for the next
patch in the series.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Diffstat (limited to 'docs/devel/atomics.rst')
-rw-r--r--  docs/devel/atomics.rst  446
1 file changed, 446 insertions, 0 deletions
diff --git a/docs/devel/atomics.rst b/docs/devel/atomics.rst
new file mode 100644
index 0000000000..83ed3d6981
--- /dev/null
+++ b/docs/devel/atomics.rst
@@ -0,0 +1,446 @@

=========================
Atomic operations in QEMU
=========================

CPUs perform independent memory operations effectively in random order,
but this can be a problem for CPU-CPU interaction (including interactions
between QEMU and the guest).  Multi-threaded programs use various tools
to instruct the compiler and the CPU to restrict the order to something
that is consistent with the expectations of the programmer.

The most basic tool is locking.  Mutexes, condition variables and
semaphores are used in QEMU, and should be the default approach to
synchronization.  Anything else is considerably harder, but it's
also justified more often than one would like.  The two tools that
are provided by ``qemu/atomic.h`` are memory barriers and atomic operations.

Macros defined by ``qemu/atomic.h`` fall into three camps:

- compiler barriers: ``barrier()``;

- weak atomic access and manual memory barriers: ``atomic_read()``,
  ``atomic_set()``, ``smp_rmb()``, ``smp_wmb()``, ``smp_mb()``, ``smp_mb_acquire()``,
  ``smp_mb_release()``, ``smp_read_barrier_depends()``;

- sequentially consistent atomic access: everything else.


Compiler memory barrier
=======================

``barrier()`` prevents the compiler from moving the memory accesses on either
side of it to the other side.  The compiler barrier has no direct effect
on the CPU, which may then reorder things however it wishes.

``barrier()`` is mostly used within ``qemu/atomic.h`` itself.  On some
architectures, CPU guarantees are strong enough that blocking compiler
optimizations already ensures the correct order of execution.  In this
case, ``qemu/atomic.h`` will reduce stronger memory barriers to simple
compiler barriers.

Still, ``barrier()`` can be useful when writing code that can be interrupted
by signal handlers.


Sequentially consistent atomic access
=====================================

Most of the operations in the ``qemu/atomic.h`` header ensure *sequential
consistency*, where "the result of any execution is the same as if the
operations of all the processors were executed in some sequential order,
and the operations of each individual processor appear in this sequence
in the order specified by its program".

``qemu/atomic.h`` provides the following set of atomic read-modify-write
operations::

  void atomic_inc(ptr)
  void atomic_dec(ptr)
  void atomic_add(ptr, val)
  void atomic_sub(ptr, val)
  void atomic_and(ptr, val)
  void atomic_or(ptr, val)

  typeof(*ptr) atomic_fetch_inc(ptr)
  typeof(*ptr) atomic_fetch_dec(ptr)
  typeof(*ptr) atomic_fetch_add(ptr, val)
  typeof(*ptr) atomic_fetch_sub(ptr, val)
  typeof(*ptr) atomic_fetch_and(ptr, val)
  typeof(*ptr) atomic_fetch_or(ptr, val)
  typeof(*ptr) atomic_fetch_xor(ptr, val)
  typeof(*ptr) atomic_fetch_inc_nonzero(ptr)
  typeof(*ptr) atomic_xchg(ptr, val)
  typeof(*ptr) atomic_cmpxchg(ptr, old, new)

The ``atomic_fetch_*`` operations, ``atomic_xchg()`` and ``atomic_cmpxchg()``
return the old value of ``*ptr``.  These operations are polymorphic; they
operate on any type that is as wide as a pointer.
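
For illustration, here is a minimal sketch of a reference counter built on
these primitives.  It is not taken from the QEMU tree: the ``MyObject`` type
and the ``my_object_ref()``/``my_object_unref()`` helpers are invented for
this example, only the ``atomic_*`` primitives are real::

  #include "qemu/osdep.h"    /* QEMU sources always include this first */
  #include "qemu/atomic.h"

  /* Hypothetical reference-counted object. */
  typedef struct MyObject {
      int refcount;
      /* ... payload ... */
  } MyObject;

  static void my_object_ref(MyObject *obj)
  {
      atomic_inc(&obj->refcount);
  }

  static void my_object_unref(MyObject *obj)
  {
      /* atomic_fetch_dec() returns the value *before* the decrement and,
       * like the other read-modify-write operations above, implies a full
       * memory barrier, so the thread that frees the object sees all
       * writes made by threads that dropped their references earlier.
       */
      if (atomic_fetch_dec(&obj->refcount) == 1) {
          g_free(obj);
      }
  }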

Similar operations return the new value of ``*ptr``::

  typeof(*ptr) atomic_inc_fetch(ptr)
  typeof(*ptr) atomic_dec_fetch(ptr)
  typeof(*ptr) atomic_add_fetch(ptr, val)
  typeof(*ptr) atomic_sub_fetch(ptr, val)
  typeof(*ptr) atomic_and_fetch(ptr, val)
  typeof(*ptr) atomic_or_fetch(ptr, val)
  typeof(*ptr) atomic_xor_fetch(ptr, val)

Sequentially consistent loads and stores can be done using::

  atomic_fetch_add(ptr, 0) for loads
  atomic_xchg(ptr, val) for stores

However, they are quite expensive on some platforms, notably POWER and
Arm.  Therefore, ``qemu/atomic.h`` provides two primitives with slightly
weaker constraints::

  typeof(*ptr) atomic_mb_read(ptr)
  void atomic_mb_set(ptr, val)

The semantics of these primitives map to Java volatile variables,
and are strongly related to memory barriers as used in the Linux
kernel (see below).

As long as you use ``atomic_mb_read()`` and ``atomic_mb_set()``, accesses
cannot be reordered with each other, and it is also not possible to reorder
"normal" accesses around them.

However, and this is the important difference between
``atomic_mb_read()``/``atomic_mb_set()`` and sequential consistency, it is
important for both threads to access the same volatile variable.  It is not
the case that everything visible to thread A when it writes volatile field f
becomes visible to thread B after it reads volatile field g.  The store
and load have to "match" (i.e., be performed on the same volatile
field) to achieve the right semantics.

These operations operate on any type that is as wide as an int or smaller.


Weak atomic access and manual memory barriers
=============================================

Compared to sequentially consistent atomic access, programming with
weaker consistency models can be considerably more complicated.
In general, if the algorithm you are writing includes both writes
and reads on the same side, it is generally simpler to use sequentially
consistent primitives.

When using this model, variables are accessed with:

- ``atomic_read()`` and ``atomic_set()``; these prevent the compiler from
  optimizing accesses out of existence and creating unsolicited
  accesses, but do not otherwise impose any ordering on loads and
  stores: both the compiler and the processor are free to reorder
  them.

- ``atomic_load_acquire()``, which guarantees the LOAD to appear to
  happen, with respect to the other components of the system,
  before all the LOAD or STORE operations specified afterwards.
  Operations coming before ``atomic_load_acquire()`` can still be
  reordered after it.

- ``atomic_store_release()``, which guarantees the STORE to appear to
  happen, with respect to the other components of the system,
  after all the LOAD or STORE operations specified before it.
  Operations coming after ``atomic_store_release()`` can still be
  reordered before it.  (A short publication sketch follows this list.)
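
The acquire/release pair is typically used to publish data from one thread
to another.  Here is a minimal sketch; the variable and function names are
invented for this example, only ``atomic_load_acquire()`` and
``atomic_store_release()`` come from ``qemu/atomic.h``::

  /* Assumes qemu/osdep.h and qemu/atomic.h are included, as above. */

  static int payload;       /* plain data, published via data_ready       */
  static int data_ready;    /* 0 = nothing to read, 1 = payload is valid  */

  static void producer(void)
  {
      payload = 42;                          /* ordinary store */
      /* Orders the payload store before the flag store. */
      atomic_store_release(&data_ready, 1);
  }

  static void consumer(void)
  {
      if (atomic_load_acquire(&data_ready)) {
          /* The acquire load orders the flag load before the payload
           * load, so seeing data_ready == 1 guarantees payload == 42.
           */
          printf("%d\n", payload);
      }
  }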

Restrictions to the ordering of accesses can also be specified
using the memory barrier macros: ``smp_rmb()``, ``smp_wmb()``, ``smp_mb()``,
``smp_mb_acquire()``, ``smp_mb_release()``, ``smp_read_barrier_depends()``.

Memory barriers control the order of references to shared memory.
They come in six kinds:

- ``smp_rmb()`` guarantees that all the LOAD operations specified before
  the barrier will appear to happen before all the LOAD operations
  specified after the barrier with respect to the other components of
  the system.

  In other words, ``smp_rmb()`` puts a partial ordering on loads, but is not
  required to have any effect on stores.

- ``smp_wmb()`` guarantees that all the STORE operations specified before
  the barrier will appear to happen before all the STORE operations
  specified after the barrier with respect to the other components of
  the system.

  In other words, ``smp_wmb()`` puts a partial ordering on stores, but is not
  required to have any effect on loads.

- ``smp_mb_acquire()`` guarantees that all the LOAD operations specified before
  the barrier will appear to happen before all the LOAD or STORE operations
  specified after the barrier with respect to the other components of
  the system.

- ``smp_mb_release()`` guarantees that all the STORE operations specified *after*
  the barrier will appear to happen after all the LOAD or STORE operations
  specified *before* the barrier with respect to the other components of
  the system.

- ``smp_mb()`` guarantees that all the LOAD and STORE operations specified
  before the barrier will appear to happen before all the LOAD and
  STORE operations specified after the barrier with respect to the other
  components of the system.

  ``smp_mb()`` puts a partial ordering on both loads and stores.  It is
  stronger than both a read and a write memory barrier; it implies both
  ``smp_mb_acquire()`` and ``smp_mb_release()``, but it also prevents STOREs
  coming before the barrier from overtaking LOADs coming after the
  barrier and vice versa.

- ``smp_read_barrier_depends()`` is a weaker kind of read barrier.  On
  most processors, whenever two loads are performed such that the
  second depends on the result of the first (e.g., the first load
  retrieves the address to which the second load will be directed),
  the processor will guarantee that the first LOAD will appear to happen
  before the second with respect to the other components of the system.
  However, this is not always true---for example, it was not true on
  Alpha processors.  Whenever this kind of access happens to shared
  memory (that is not protected by a lock), a read barrier is needed,
  and ``smp_read_barrier_depends()`` can be used instead of ``smp_rmb()``.

  Note that the first load really has to have a *data* dependency and not
  a control dependency.  If the address for the second load is dependent
  on the first load, but the dependency is through a conditional rather
  than actually loading the address itself, then it's a *control*
  dependency and a full read barrier or better is required.


This is the set of barriers that is required *between* two ``atomic_read()``
and ``atomic_set()`` operations to achieve sequential consistency:

  +----------------+-------------------------------------------------------+
  |                | 2nd operation                                         |
  |                +------------------+-----------------+------------------+
  | 1st operation  | (after last)     | atomic_read     | atomic_set       |
  +----------------+------------------+-----------------+------------------+
  | (before first) | ..               | none            | smp_mb_release() |
  +----------------+------------------+-----------------+------------------+
  | atomic_read    | smp_mb_acquire() | smp_rmb() [1]_  | [2]_             |
  +----------------+------------------+-----------------+------------------+
  | atomic_set     | none             | smp_mb() [3]_   | smp_wmb()        |
  +----------------+------------------+-----------------+------------------+

  .. [1] Or smp_read_barrier_depends().

  .. [2] This requires a load-store barrier.  This is achieved by
         either smp_mb_acquire() or smp_mb_release().

  .. [3] This requires a store-load barrier.  On most machines, the only
         way to achieve this is a full barrier.
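
To illustrate the store-load case of note [3], here is a sketch of the
classic two-flag protocol; the flags and functions are invented for this
example.  Without the ``smp_mb()``, each CPU could reorder its load before
its own store, and both threads could then observe the other's flag as
zero::

  /* Assumes qemu/osdep.h and qemu/atomic.h are included, as above.
   * Both flags start out as 0.
   */
  static int flag_a, flag_b;

  static void thread_a(void)
  {
      atomic_set(&flag_a, 1);
      smp_mb();                          /* store-load barrier, note [3] */
      if (atomic_read(&flag_b) == 0) {
          /* At most one of the two threads can reach this point. */
      }
  }

  static void thread_b(void)
  {
      atomic_set(&flag_b, 1);
      smp_mb();
      if (atomic_read(&flag_a) == 0) {
          /* At most one of the two threads can reach this point. */
      }
  }

This is exactly the ``atomic_set`` followed by ``atomic_read`` entry in the
table above.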

You can see that the two possible definitions of ``atomic_mb_read()``
and ``atomic_mb_set()`` are the following:

1) | atomic_mb_read(p)   = atomic_read(p); smp_mb_acquire()
   | atomic_mb_set(p, v) = smp_mb_release(); atomic_set(p, v); smp_mb()

2) | atomic_mb_read(p)   = smp_mb(); atomic_read(p); smp_mb_acquire()
   | atomic_mb_set(p, v) = smp_mb_release(); atomic_set(p, v)

Usually the former is used, because ``smp_mb()`` is expensive and a program
normally has more reads than writes.  Therefore it makes more sense to
make ``atomic_mb_set()`` the more expensive operation.
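
Spelled out as C, definition (1) looks roughly like the sketch below.  This
is only an illustration of where the barriers go, not the actual
``qemu/atomic.h`` implementation (the real macros are polymorphic); the
``example_mb_read()``/``example_mb_set()`` helpers are invented::

  /* Assumes qemu/osdep.h and qemu/atomic.h are included, as above. */

  static inline int example_mb_read(int *p)
  {
      int val = atomic_read(p);
      smp_mb_acquire();       /* order the load before later accesses     */
      return val;
  }

  static inline void example_mb_set(int *p, int v)
  {
      smp_mb_release();       /* order earlier accesses before the store  */
      atomic_set(p, v);
      smp_mb();               /* store-load barrier, see note [3] above   */
  }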

There are two common cases in which ``atomic_mb_read()`` and ``atomic_mb_set()``
generate too many memory barriers, and thus it can be useful to manually
place barriers, or use ``atomic_load_acquire()``/``atomic_store_release()``
instead:

- when a data structure has one thread that is always a writer
  and one thread that is always a reader, manual placement of
  memory barriers makes the write side faster.  Furthermore,
  correctness is easy to check for in this case using the "pairing"
  trick that is explained below:

  +----------------------------------------------------------------------+
  | thread 1                                                             |
  +-----------------------------------+----------------------------------+
  | before                            | after                            |
  +===================================+==================================+
  | ::                                | ::                               |
  |                                   |                                  |
  |   (other writes)                  |                                  |
  |   atomic_mb_set(&a, x)            |   atomic_store_release(&a, x)    |
  |   atomic_mb_set(&b, y)            |   atomic_store_release(&b, y)    |
  +-----------------------------------+----------------------------------+

  +----------------------------------------------------------------------+
  | thread 2                                                             |
  +-----------------------------------+----------------------------------+
  | before                            | after                            |
  +===================================+==================================+
  | ::                                | ::                               |
  |                                   |                                  |
  |   y = atomic_mb_read(&b)          |   y = atomic_load_acquire(&b)    |
  |   x = atomic_mb_read(&a)          |   x = atomic_load_acquire(&a)    |
  |   (other reads)                   |                                  |
  +-----------------------------------+----------------------------------+

  Note that the barrier between the stores in thread 1, and between
  the loads in thread 2, has been optimized here to a write or a
  read memory barrier respectively.  On some architectures, notably
  ARMv7, ``smp_mb_acquire()`` and ``smp_mb_release()`` are just as
  expensive as ``smp_mb()``, but ``smp_rmb()`` and/or ``smp_wmb()`` are
  more efficient.

- sometimes, a thread is accessing many variables that are otherwise
  unrelated to each other (for example because, apart from the current
  thread, exactly one other thread will read or write each of these
  variables).  In this case, it is possible to "hoist" the implicit
  barriers provided by ``atomic_mb_read()`` and ``atomic_mb_set()`` outside
  a loop.  For example, the above definition of ``atomic_mb_read()`` gives
  the following transformation:

  +-----------------------------------+----------------------------------+
  | before                            | after                            |
  +===================================+==================================+
  | ::                                | ::                               |
  |                                   |                                  |
  |   n = 0;                          |   n = 0;                         |
  |   for (i = 0; i < 10; i++)        |   for (i = 0; i < 10; i++)       |
  |     n += atomic_mb_read(&a[i]);   |     n += atomic_read(&a[i]);     |
  |                                   |   smp_mb_acquire();              |
  +-----------------------------------+----------------------------------+

  Similarly, ``atomic_mb_set()`` can be transformed as follows:

  +-----------------------------------+----------------------------------+
  | before                            | after                            |
  +===================================+==================================+
  | ::                                | ::                               |
  |                                   |                                  |
  |                                   |   smp_mb_release();              |
  |   for (i = 0; i < 10; i++)        |   for (i = 0; i < 10; i++)       |
  |     atomic_mb_set(&a[i], false);  |     atomic_set(&a[i], false);    |
  |                                   |   smp_mb();                      |
  +-----------------------------------+----------------------------------+

  The other thread can still use ``atomic_mb_read()``/``atomic_mb_set()``.

The two tricks can be combined.  In this case, splitting a loop in
two lets you hoist the barriers out of the loops *and* eliminate the
expensive ``smp_mb()``:

  +-----------------------------------+----------------------------------+
  | before                            | after                            |
  +===================================+==================================+
  | ::                                | ::                               |
  |                                   |                                  |
  |                                   |   smp_mb_release();              |
  |   for (i = 0; i < 10; i++) {      |   for (i = 0; i < 10; i++)       |
  |     atomic_mb_set(&a[i], false);  |     atomic_set(&a[i], false);    |
  |     atomic_mb_set(&b[i], false);  |   smp_wmb();                     |
  |   }                               |   for (i = 0; i < 10; i++)       |
  |                                   |     atomic_set(&b[i], false);    |
  |                                   |   smp_mb();                      |
  +-----------------------------------+----------------------------------+


Memory barrier pairing
----------------------

A useful rule of thumb is that memory barriers should always, or almost
always, be paired with another barrier.  In the case of QEMU, however,
note that the other barrier may actually be in a driver that runs in
the guest!

For the purposes of pairing, ``smp_read_barrier_depends()`` and ``smp_rmb()``
both count as read barriers.  A read barrier shall pair with a write
barrier or a full barrier; a write barrier shall pair with a read
barrier or a full barrier.  A full barrier can pair with anything.
For example:

  +--------------------+------------------------------+
  | thread 1           | thread 2                     |
  +====================+==============================+
  | ::                 | ::                           |
  |                    |                              |
  |   a = 1;           |                              |
  |   smp_wmb();       |                              |
  |   b = 2;           |   x = b;                     |
  |                    |   smp_rmb();                 |
  |                    |   y = a;                     |
  +--------------------+------------------------------+

Note that the "writing" thread is accessing the variables in the
opposite order from the "reading" thread.  This is expected: stores
before the write barrier will normally match the loads after the
read barrier, and vice versa.  The same is true for more than two
accesses and for data dependency barriers:

  +----------------------+------------------------------+
  | thread 1             | thread 2                     |
  +======================+==============================+
  | ::                   | ::                           |
  |                      |                              |
  |   b[2] = 1;          |                              |
  |   smp_wmb();         |                              |
  |   x->i = 2;          |                              |
  |   smp_wmb();         |                              |
  |   a = x;             |   x = a;                     |
  |                      |   smp_read_barrier_depends();|
  |                      |   y = x->i;                  |
  |                      |   smp_read_barrier_depends();|
  |                      |   z = b[y];                  |
  +----------------------+------------------------------+

``smp_wmb()`` also pairs with ``atomic_mb_read()`` and ``smp_mb_acquire()``,
and ``smp_rmb()`` also pairs with ``atomic_mb_set()`` and ``smp_mb_release()``.
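
As a QEMU-flavoured illustration of pairing across the host/guest boundary,
here is a much simplified, virtio-style sketch; the ``ExampleRing``
structure and the ``example_ring_pop()`` function are invented for this
example.  The guest stores a ring entry, executes its own write barrier
(for instance the Linux kernel's ``wmb()``), and then stores the new index;
that write barrier pairs with the ``smp_rmb()`` on the QEMU side::

  /* Assumes qemu/osdep.h and qemu/atomic.h are included, as above. */

  typedef struct ExampleRing {
      unsigned int idx;          /* written last by the guest  */
      int entries[16];           /* written first by the guest */
  } ExampleRing;

  /* QEMU (reader) side: returns true if *val was filled in. */
  static bool example_ring_pop(ExampleRing *ring, unsigned int last_seen,
                               int *val)
  {
      if (atomic_read(&ring->idx) == last_seen) {
          return false;                      /* nothing new             */
      }
      smp_rmb();                             /* read idx before entries */
      *val = ring->entries[last_seen % 16];
      return true;
  }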

Comparison with Linux kernel memory barriers
============================================

Here is a list of differences between Linux kernel atomic operations
and memory barriers, and the equivalents in QEMU:

- atomic operations in Linux are always on a 32-bit int type and
  use a boxed ``atomic_t`` type; atomic operations in QEMU are polymorphic
  and use normal C types.

- Originally, ``atomic_read`` and ``atomic_set`` in Linux gave no guarantee
  at all.  Linux 4.1 updated them to implement volatile
  semantics via ``ACCESS_ONCE`` (or the more recent ``READ_ONCE``/``WRITE_ONCE``).

  QEMU's ``atomic_read`` and ``atomic_set`` implement C11 atomic relaxed
  semantics if the compiler supports it, and volatile semantics otherwise.
  Both semantics prevent the compiler from doing certain transformations;
  the difference is that atomic accesses are guaranteed to be atomic,
  while volatile accesses aren't.  Thus, in the volatile case we just cross
  our fingers hoping that the compiler will generate atomic accesses,
  since we assume the variables passed are machine-word sized and
  properly aligned.

  No barriers are implied by ``atomic_read`` and ``atomic_set`` in either Linux
  or QEMU.

- atomic read-modify-write operations in Linux are of three kinds:

  ===================== =========================================
  ``atomic_OP``         returns void
  ``atomic_OP_return``  returns new value of the variable
  ``atomic_fetch_OP``   returns the old value of the variable
  ``atomic_cmpxchg``    returns the old value of the variable
  ===================== =========================================

  In QEMU, the second kind does not exist.  Currently Linux has
  atomic_fetch_or only.  QEMU provides and, or, inc, dec, add, sub.

- different atomic read-modify-write operations in Linux imply
  a different set of memory barriers; in QEMU, all of them enforce
  sequential consistency, which means they imply full memory barriers
  before and after the operation.

- Linux does not have an equivalent of ``atomic_mb_set()``.  In particular,
  note that ``smp_store_mb()`` is a little weaker than ``atomic_mb_set()``.
  ``atomic_mb_read()`` compiles to the same instructions as Linux's
  ``smp_load_acquire()``, but this should be treated as an implementation
  detail.

Sources
=======

- ``Documentation/memory-barriers.txt`` from the Linux kernel