aboutsummaryrefslogtreecommitdiff
path: root/docs/devel/atomics.rst
diff options
context:
space:
mode:
authorPaolo Bonzini <pbonzini@redhat.com>2020-04-06 10:30:05 +0200
committerPaolo Bonzini <pbonzini@redhat.com>2020-04-11 08:49:25 -0400
commit15e8699f009f7feeab7d9ab406bf62882958e4d9 (patch)
treec36536d19b1248251bc4d66362e00802652adbb2 /docs/devel/atomics.rst
parent278fb1627351218b23dd33403f08d7521643fda2 (diff)
atomics: convert to reStructuredText
No attempts to fix or update the text; these are left for the next patch in the series. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Diffstat (limited to 'docs/devel/atomics.rst')
-rw-r--r--docs/devel/atomics.rst446
1 files changed, 446 insertions, 0 deletions
diff --git a/docs/devel/atomics.rst b/docs/devel/atomics.rst
new file mode 100644
index 0000000000..83ed3d6981
--- /dev/null
+++ b/docs/devel/atomics.rst
@@ -0,0 +1,446 @@
+=========================
+Atomic operations in QEMU
+=========================
+
+CPUs perform independent memory operations effectively in random order.
+but this can be a problem for CPU-CPU interaction (including interactions
+between QEMU and the guest). Multi-threaded programs use various tools
+to instruct the compiler and the CPU to restrict the order to something
+that is consistent with the expectations of the programmer.
+
+The most basic tool is locking. Mutexes, condition variables and
+semaphores are used in QEMU, and should be the default approach to
+synchronization. Anything else is considerably harder, but it's
+also justified more often than one would like. The two tools that
+are provided by ``qemu/atomic.h`` are memory barriers and atomic operations.
+
+Macros defined by ``qemu/atomic.h`` fall in three camps:
+
+- compiler barriers: ``barrier()``;
+
+- weak atomic access and manual memory barriers: ``atomic_read()``,
+ ``atomic_set()``, ``smp_rmb()``, ``smp_wmb()``, ``smp_mb()``, ``smp_mb_acquire()``,
+ ``smp_mb_release()``, ``smp_read_barrier_depends()``;
+
+- sequentially consistent atomic access: everything else.
+
+
+Compiler memory barrier
+=======================
+
+``barrier()`` prevents the compiler from moving the memory accesses either
+side of it to the other side. The compiler barrier has no direct effect
+on the CPU, which may then reorder things however it wishes.
+
+``barrier()`` is mostly used within ``qemu/atomic.h`` itself. On some
+architectures, CPU guarantees are strong enough that blocking compiler
+optimizations already ensures the correct order of execution. In this
+case, ``qemu/atomic.h`` will reduce stronger memory barriers to simple
+compiler barriers.
+
+Still, ``barrier()`` can be useful when writing code that can be interrupted
+by signal handlers.
+
+
+Sequentially consistent atomic access
+=====================================
+
+Most of the operations in the ``qemu/atomic.h`` header ensure *sequential
+consistency*, where "the result of any execution is the same as if the
+operations of all the processors were executed in some sequential order,
+and the operations of each individual processor appear in this sequence
+in the order specified by its program".
+
+``qemu/atomic.h`` provides the following set of atomic read-modify-write
+operations::
+
+ void atomic_inc(ptr)
+ void atomic_dec(ptr)
+ void atomic_add(ptr, val)
+ void atomic_sub(ptr, val)
+ void atomic_and(ptr, val)
+ void atomic_or(ptr, val)
+
+ typeof(*ptr) atomic_fetch_inc(ptr)
+ typeof(*ptr) atomic_fetch_dec(ptr)
+ typeof(*ptr) atomic_fetch_add(ptr, val)
+ typeof(*ptr) atomic_fetch_sub(ptr, val)
+ typeof(*ptr) atomic_fetch_and(ptr, val)
+ typeof(*ptr) atomic_fetch_or(ptr, val)
+ typeof(*ptr) atomic_fetch_xor(ptr, val)
+ typeof(*ptr) atomic_fetch_inc_nonzero(ptr)
+ typeof(*ptr) atomic_xchg(ptr, val)
+ typeof(*ptr) atomic_cmpxchg(ptr, old, new)
+
+all of which return the old value of ``*ptr``. These operations are
+polymorphic; they operate on any type that is as wide as a pointer.
+
+Similar operations return the new value of ``*ptr``::
+
+ typeof(*ptr) atomic_inc_fetch(ptr)
+ typeof(*ptr) atomic_dec_fetch(ptr)
+ typeof(*ptr) atomic_add_fetch(ptr, val)
+ typeof(*ptr) atomic_sub_fetch(ptr, val)
+ typeof(*ptr) atomic_and_fetch(ptr, val)
+ typeof(*ptr) atomic_or_fetch(ptr, val)
+ typeof(*ptr) atomic_xor_fetch(ptr, val)
+
+Sequentially consistent loads and stores can be done using::
+
+ atomic_fetch_add(ptr, 0) for loads
+ atomic_xchg(ptr, val) for stores
+
+However, they are quite expensive on some platforms, notably POWER and
+Arm. Therefore, qemu/atomic.h provides two primitives with slightly
+weaker constraints::
+
+ typeof(*ptr) atomic_mb_read(ptr)
+ void atomic_mb_set(ptr, val)
+
+The semantics of these primitives map to Java volatile variables,
+and are strongly related to memory barriers as used in the Linux
+kernel (see below).
+
+As long as you use atomic_mb_read and atomic_mb_set, accesses cannot
+be reordered with each other, and it is also not possible to reorder
+"normal" accesses around them.
+
+However, and this is the important difference between
+atomic_mb_read/atomic_mb_set and sequential consistency, it is important
+for both threads to access the same volatile variable. It is not the
+case that everything visible to thread A when it writes volatile field f
+becomes visible to thread B after it reads volatile field g. The store
+and load have to "match" (i.e., be performed on the same volatile
+field) to achieve the right semantics.
+
+
+These operations operate on any type that is as wide as an int or smaller.
+
+
+Weak atomic access and manual memory barriers
+=============================================
+
+Compared to sequentially consistent atomic access, programming with
+weaker consistency models can be considerably more complicated.
+In general, if the algorithm you are writing includes both writes
+and reads on the same side, it is generally simpler to use sequentially
+consistent primitives.
+
+When using this model, variables are accessed with:
+
+- ``atomic_read()`` and ``atomic_set()``; these prevent the compiler from
+ optimizing accesses out of existence and creating unsolicited
+ accesses, but do not otherwise impose any ordering on loads and
+ stores: both the compiler and the processor are free to reorder
+ them.
+
+- ``atomic_load_acquire()``, which guarantees the LOAD to appear to
+ happen, with respect to the other components of the system,
+ before all the LOAD or STORE operations specified afterwards.
+ Operations coming before ``atomic_load_acquire()`` can still be
+ reordered after it.
+
+- ``atomic_store_release()``, which guarantees the STORE to appear to
+ happen, with respect to the other components of the system,
+ after all the LOAD or STORE operations specified afterwards.
+ Operations coming after ``atomic_store_release()`` can still be
+ reordered after it.
+
+Restrictions to the ordering of accesses can also be specified
+using the memory barrier macros: ``smp_rmb()``, ``smp_wmb()``, ``smp_mb()``,
+``smp_mb_acquire()``, ``smp_mb_release()``, ``smp_read_barrier_depends()``.
+
+Memory barriers control the order of references to shared memory.
+They come in six kinds:
+
+- ``smp_rmb()`` guarantees that all the LOAD operations specified before
+ the barrier will appear to happen before all the LOAD operations
+ specified after the barrier with respect to the other components of
+ the system.
+
+ In other words, ``smp_rmb()`` puts a partial ordering on loads, but is not
+ required to have any effect on stores.
+
+- ``smp_wmb()`` guarantees that all the STORE operations specified before
+ the barrier will appear to happen before all the STORE operations
+ specified after the barrier with respect to the other components of
+ the system.
+
+ In other words, ``smp_wmb()`` puts a partial ordering on stores, but is not
+ required to have any effect on loads.
+
+- ``smp_mb_acquire()`` guarantees that all the LOAD operations specified before
+ the barrier will appear to happen before all the LOAD or STORE operations
+ specified after the barrier with respect to the other components of
+ the system.
+
+- ``smp_mb_release()`` guarantees that all the STORE operations specified *after*
+ the barrier will appear to happen after all the LOAD or STORE operations
+ specified *before* the barrier with respect to the other components of
+ the system.
+
+- ``smp_mb()`` guarantees that all the LOAD and STORE operations specified
+ before the barrier will appear to happen before all the LOAD and
+ STORE operations specified after the barrier with respect to the other
+ components of the system.
+
+ ``smp_mb()`` puts a partial ordering on both loads and stores. It is
+ stronger than both a read and a write memory barrier; it implies both
+ ``smp_mb_acquire()`` and ``smp_mb_release()``, but it also prevents STOREs
+ coming before the barrier from overtaking LOADs coming after the
+ barrier and vice versa.
+
+- ``smp_read_barrier_depends()`` is a weaker kind of read barrier. On
+ most processors, whenever two loads are performed such that the
+ second depends on the result of the first (e.g., the first load
+ retrieves the address to which the second load will be directed),
+ the processor will guarantee that the first LOAD will appear to happen
+ before the second with respect to the other components of the system.
+ However, this is not always true---for example, it was not true on
+ Alpha processors. Whenever this kind of access happens to shared
+ memory (that is not protected by a lock), a read barrier is needed,
+ and ``smp_read_barrier_depends()`` can be used instead of ``smp_rmb()``.
+
+ Note that the first load really has to have a _data_ dependency and not
+ a control dependency. If the address for the second load is dependent
+ on the first load, but the dependency is through a conditional rather
+ than actually loading the address itself, then it's a _control_
+ dependency and a full read barrier or better is required.
+
+
+This is the set of barriers that is required *between* two ``atomic_read()``
+and ``atomic_set()`` operations to achieve sequential consistency:
+
+ +----------------+-------------------------------------------------------+
+ | | 2nd operation |
+ | +------------------+-----------------+------------------+
+ | 1st operation | (after last) | atomic_read | atomic_set |
+ +----------------+------------------+-----------------+------------------+
+ | (before first) | .. | none | smp_mb_release() |
+ +----------------+------------------+-----------------+------------------+
+ | atomic_read | smp_mb_acquire() | smp_rmb() [1]_ | [2]_ |
+ +----------------+------------------+-----------------+------------------+
+ | atomic_set | none | smp_mb() [3]_ | smp_wmb() |
+ +----------------+------------------+-----------------+------------------+
+
+ .. [1] Or smp_read_barrier_depends().
+
+ .. [2] This requires a load-store barrier. This is achieved by
+ either smp_mb_acquire() or smp_mb_release().
+
+ .. [3] This requires a store-load barrier. On most machines, the only
+ way to achieve this is a full barrier.
+
+
+You can see that the two possible definitions of ``atomic_mb_read()``
+and ``atomic_mb_set()`` are the following:
+
+ 1) | atomic_mb_read(p) = atomic_read(p); smp_mb_acquire()
+ | atomic_mb_set(p, v) = smp_mb_release(); atomic_set(p, v); smp_mb()
+
+ 2) | atomic_mb_read(p) = smp_mb() atomic_read(p); smp_mb_acquire()
+ | atomic_mb_set(p, v) = smp_mb_release(); atomic_set(p, v);
+
+Usually the former is used, because ``smp_mb()`` is expensive and a program
+normally has more reads than writes. Therefore it makes more sense to
+make ``atomic_mb_set()`` the more expensive operation.
+
+There are two common cases in which atomic_mb_read and atomic_mb_set
+generate too many memory barriers, and thus it can be useful to manually
+place barriers, or use atomic_load_acquire/atomic_store_release instead:
+
+- when a data structure has one thread that is always a writer
+ and one thread that is always a reader, manual placement of
+ memory barriers makes the write side faster. Furthermore,
+ correctness is easy to check for in this case using the "pairing"
+ trick that is explained below:
+
+ +----------------------------------------------------------------------+
+ | thread 1 |
+ +-----------------------------------+----------------------------------+
+ | before | after |
+ +===================================+==================================+
+ | :: | :: |
+ | | |
+ | (other writes) | |
+ | atomic_mb_set(&a, x) | atomic_store_release(&a, x) |
+ | atomic_mb_set(&b, y) | atomic_store_release(&b, y) |
+ +-----------------------------------+----------------------------------+
+
+ +----------------------------------------------------------------------+
+ | thread 2 |
+ +-----------------------------------+----------------------------------+
+ | before | after |
+ +===================================+==================================+
+ | :: | :: |
+ | | |
+ | y = atomic_mb_read(&b) | y = atomic_load_acquire(&b) |
+ | x = atomic_mb_read(&a) | x = atomic_load_acquire(&a) |
+ | (other reads) | |
+ +-----------------------------------+----------------------------------+
+
+ Note that the barrier between the stores in thread 1, and between
+ the loads in thread 2, has been optimized here to a write or a
+ read memory barrier respectively. On some architectures, notably
+ ARMv7, smp_mb_acquire and smp_mb_release are just as expensive as
+ smp_mb, but smp_rmb and/or smp_wmb are more efficient.
+
+- sometimes, a thread is accessing many variables that are otherwise
+ unrelated to each other (for example because, apart from the current
+ thread, exactly one other thread will read or write each of these
+ variables). In this case, it is possible to "hoist" the implicit
+ barriers provided by ``atomic_mb_read()`` and ``atomic_mb_set()`` outside
+ a loop. For example, the above definition ``atomic_mb_read()`` gives
+ the following transformation:
+
+ +-----------------------------------+----------------------------------+
+ | before | after |
+ +===================================+==================================+
+ | :: | :: |
+ | | |
+ | n = 0; | n = 0; |
+ | for (i = 0; i < 10; i++) | for (i = 0; i < 10; i++) |
+ | n += atomic_mb_read(&a[i]); | n += atomic_read(&a[i]); |
+ | | smp_mb_acquire(); |
+ +-----------------------------------+----------------------------------+
+
+ Similarly, atomic_mb_set() can be transformed as follows:
+
+ +-----------------------------------+----------------------------------+
+ | before | after |
+ +===================================+==================================+
+ | :: | :: |
+ | | |
+ | | smp_mb_release(); |
+ | for (i = 0; i < 10; i++) | for (i = 0; i < 10; i++) |
+ | atomic_mb_set(&a[i], false); | atomic_set(&a[i], false); |
+ | | smp_mb(); |
+ +-----------------------------------+----------------------------------+
+
+
+ The other thread can still use ``atomic_mb_read()``/``atomic_mb_set()``.
+
+The two tricks can be combined. In this case, splitting a loop in
+two lets you hoist the barriers out of the loops _and_ eliminate the
+expensive ``smp_mb()``:
+
+ +-----------------------------------+----------------------------------+
+ | before | after |
+ +===================================+==================================+
+ | :: | :: |
+ | | |
+ | | smp_mb_release(); |
+ | for (i = 0; i < 10; i++) { | for (i = 0; i < 10; i++) |
+ | atomic_mb_set(&a[i], false); | atomic_set(&a[i], false); |
+ | atomic_mb_set(&b[i], false); | smb_wmb(); |
+ | } | for (i = 0; i < 10; i++) |
+ | | atomic_set(&a[i], false); |
+ | | smp_mb(); |
+ +-----------------------------------+----------------------------------+
+
+
+Memory barrier pairing
+----------------------
+
+A useful rule of thumb is that memory barriers should always, or almost
+always, be paired with another barrier. In the case of QEMU, however,
+note that the other barrier may actually be in a driver that runs in
+the guest!
+
+For the purposes of pairing, ``smp_read_barrier_depends()`` and ``smp_rmb()``
+both count as read barriers. A read barrier shall pair with a write
+barrier or a full barrier; a write barrier shall pair with a read
+barrier or a full barrier. A full barrier can pair with anything.
+For example:
+
+ +--------------------+------------------------------+
+ | thread 1 | thread 2 |
+ +====================+==============================+
+ | :: | :: |
+ | | |
+ | a = 1; | |
+ | smp_wmb(); | |
+ | b = 2; | x = b; |
+ | | smp_rmb(); |
+ | | y = a; |
+ +--------------------+------------------------------+
+
+Note that the "writing" thread is accessing the variables in the
+opposite order as the "reading" thread. This is expected: stores
+before the write barrier will normally match the loads after the
+read barrier, and vice versa. The same is true for more than 2
+access and for data dependency barriers:
+
+ +----------------------+------------------------------+
+ | thread 1 | thread 2 |
+ +======================+==============================+
+ | :: | :: |
+ | | |
+ | b[2] = 1; | |
+ | smp_wmb(); | |
+ | x->i = 2; | |
+ | smp_wmb(); | |
+ | a = x; | x = a; |
+ | | smp_read_barrier_depends(); |
+ | | y = x->i; |
+ | | smp_read_barrier_depends(); |
+ | | z = b[y]; |
+ +----------------------+------------------------------+
+
+``smp_wmb()`` also pairs with ``atomic_mb_read()`` and ``smp_mb_acquire()``.
+and ``smp_rmb()`` also pairs with ``atomic_mb_set()`` and ``smp_mb_release()``.
+
+
+Comparison with Linux kernel memory barriers
+============================================
+
+Here is a list of differences between Linux kernel atomic operations
+and memory barriers, and the equivalents in QEMU:
+
+- atomic operations in Linux are always on a 32-bit int type and
+ use a boxed ``atomic_t`` type; atomic operations in QEMU are polymorphic
+ and use normal C types.
+
+- Originally, ``atomic_read`` and ``atomic_set`` in Linux gave no guarantee
+ at all. Linux 4.1 updated them to implement volatile
+ semantics via ``ACCESS_ONCE`` (or the more recent ``READ``/``WRITE_ONCE``).
+
+ QEMU's ``atomic_read`` and ``atomic_set`` implement C11 atomic relaxed
+ semantics if the compiler supports it, and volatile semantics otherwise.
+ Both semantics prevent the compiler from doing certain transformations;
+ the difference is that atomic accesses are guaranteed to be atomic,
+ while volatile accesses aren't. Thus, in the volatile case we just cross
+ our fingers hoping that the compiler will generate atomic accesses,
+ since we assume the variables passed are machine-word sized and
+ properly aligned.
+
+ No barriers are implied by ``atomic_read`` and ``atomic_set`` in either Linux
+ or QEMU.
+
+- atomic read-modify-write operations in Linux are of three kinds:
+
+ ===================== =========================================
+ ``atomic_OP`` returns void
+ ``atomic_OP_return`` returns new value of the variable
+ ``atomic_fetch_OP`` returns the old value of the variable
+ ``atomic_cmpxchg`` returns the old value of the variable
+ ===================== =========================================
+
+ In QEMU, the second kind does not exist. Currently Linux has
+ atomic_fetch_or only. QEMU provides and, or, inc, dec, add, sub.
+
+- different atomic read-modify-write operations in Linux imply
+ a different set of memory barriers; in QEMU, all of them enforce
+ sequential consistency, which means they imply full memory barriers
+ before and after the operation.
+
+- Linux does not have an equivalent of ``atomic_mb_set()``. In particular,
+ note that ``smp_store_mb()`` is a little weaker than ``atomic_mb_set()``.
+ ``atomic_mb_read()`` compiles to the same instructions as Linux's
+ ``smp_load_acquire()``, but this should be treated as an implementation
+ detail.
+
+Sources
+=======
+
+- ``Documentation/memory-barriers.txt`` from the Linux kernel