aboutsummaryrefslogtreecommitdiff
path: root/docs/devel/migration/vfio.rst
diff options
context:
space:
mode:
Diffstat (limited to 'docs/devel/migration/vfio.rst')
-rw-r--r--docs/devel/migration/vfio.rst208
1 files changed, 208 insertions, 0 deletions
diff --git a/docs/devel/migration/vfio.rst b/docs/devel/migration/vfio.rst
new file mode 100644
index 0000000000..605fe60e96
--- /dev/null
+++ b/docs/devel/migration/vfio.rst
@@ -0,0 +1,208 @@
+=====================
+VFIO device Migration
+=====================
+
+Migration of virtual machine involves saving the state for each device that
+the guest is running on source host and restoring this saved state on the
+destination host. This document details how saving and restoring of VFIO
+devices is done in QEMU.
+
+Migration of VFIO devices consists of two phases: the optional pre-copy phase,
+and the stop-and-copy phase. The pre-copy phase is iterative and allows to
+accommodate VFIO devices that have a large amount of data that needs to be
+transferred. The iterative pre-copy phase of migration allows for the guest to
+continue whilst the VFIO device state is transferred to the destination, this
+helps to reduce the total downtime of the VM. VFIO devices opt-in to pre-copy
+support by reporting the VFIO_MIGRATION_PRE_COPY flag in the
+VFIO_DEVICE_FEATURE_MIGRATION ioctl.
+
+When pre-copy is supported, it's possible to further reduce downtime by
+enabling "switchover-ack" migration capability.
+VFIO migration uAPI defines "initial bytes" as part of its pre-copy data stream
+and recommends that the initial bytes are sent and loaded in the destination
+before stopping the source VM. Enabling this migration capability will
+guarantee that and thus, can potentially reduce downtime even further.
+
+To support migration of multiple devices that might do P2P transactions between
+themselves, VFIO migration uAPI defines an intermediate P2P quiescent state.
+While in the P2P quiescent state, P2P DMA transactions cannot be initiated by
+the device, but the device can respond to incoming ones. Additionally, all
+outstanding P2P transactions are guaranteed to have been completed by the time
+the device enters this state.
+
+All the devices that support P2P migration are first transitioned to the P2P
+quiescent state and only then are they stopped or started. This makes migration
+safe P2P-wise, since starting and stopping the devices is not done atomically
+for all the devices together.
+
+Thus, multiple VFIO devices migration is allowed only if all the devices
+support P2P migration. Single VFIO device migration is allowed regardless of
+P2P migration support.
+
+A detailed description of the UAPI for VFIO device migration can be found in
+the comment for the ``vfio_device_mig_state`` structure in the header file
+linux-headers/linux/vfio.h.
+
+VFIO implements the device hooks for the iterative approach as follows:
+
+* A ``save_setup`` function that sets up migration on the source.
+
+* A ``load_setup`` function that sets the VFIO device on the destination in
+ _RESUMING state.
+
+* A ``state_pending_estimate`` function that reports an estimate of the
+ remaining pre-copy data that the vendor driver has yet to save for the VFIO
+ device.
+
+* A ``state_pending_exact`` function that reads pending_bytes from the vendor
+ driver, which indicates the amount of data that the vendor driver has yet to
+ save for the VFIO device.
+
+* An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
+ active only when the VFIO device is in pre-copy states.
+
+* A ``save_live_iterate`` function that reads the VFIO device's data from the
+ vendor driver during iterative pre-copy phase.
+
+* A ``switchover_ack_needed`` function that checks if the VFIO device uses
+ "switchover-ack" migration capability when this capability is enabled.
+
+* A ``save_state`` function to save the device config space if it is present.
+
+* A ``save_live_complete_precopy`` function that sets the VFIO device in
+ _STOP_COPY state and iteratively copies the data for the VFIO device until
+ the vendor driver indicates that no data remains.
+
+* A ``load_state`` function that loads the config section and the data
+ sections that are generated by the save functions above.
+
+* ``cleanup`` functions for both save and load that perform any migration
+ related cleanup.
+
+
+The VFIO migration code uses a VM state change handler to change the VFIO
+device state when the VM state changes from running to not-running, and
+vice versa.
+
+Similarly, a migration state change handler is used to trigger a transition of
+the VFIO device state when certain changes of the migration state occur. For
+example, the VFIO device state is transitioned back to _RUNNING in case a
+migration failed or was canceled.
+
+System memory dirty pages tracking
+----------------------------------
+
+A ``log_global_start`` and ``log_global_stop`` memory listener callback informs
+the VFIO dirty tracking module to start and stop dirty page tracking. A
+``log_sync`` memory listener callback queries the dirty page bitmap from the
+dirty tracking module and marks system memory pages which were DMA-ed by the
+VFIO device as dirty. The dirty page bitmap is queried per container.
+
+Currently there are two ways dirty page tracking can be done:
+(1) Device dirty tracking:
+In this method the device is responsible to log and report its DMAs. This
+method can be used only if the device is capable of tracking its DMAs.
+Discovering device capability, starting and stopping dirty tracking, and
+syncing the dirty bitmaps from the device are done using the DMA logging uAPI.
+More info about the uAPI can be found in the comments of the
+``vfio_device_feature_dma_logging_control`` and
+``vfio_device_feature_dma_logging_report`` structures in the header file
+linux-headers/linux/vfio.h.
+
+(2) VFIO IOMMU module:
+In this method dirty tracking is done by IOMMU. However, there is currently no
+IOMMU support for dirty page tracking. For this reason, all pages are
+perpetually marked dirty, unless the device driver pins pages through external
+APIs in which case only those pinned pages are perpetually marked dirty.
+
+If the above two methods are not supported, all pages are perpetually marked
+dirty by QEMU.
+
+By default, dirty pages are tracked during pre-copy as well as stop-and-copy
+phase. So, a page marked as dirty will be copied to the destination in both
+phases. Copying dirty pages in pre-copy phase helps QEMU to predict if it can
+achieve its downtime tolerances. If QEMU during pre-copy phase keeps finding
+dirty pages continuously, then it understands that even in stop-and-copy phase,
+it is likely to find dirty pages and can predict the downtime accordingly.
+
+QEMU also provides a per device opt-out option ``pre-copy-dirty-page-tracking``
+which disables querying the dirty bitmap during pre-copy phase. If it is set to
+off, all dirty pages will be copied to the destination in stop-and-copy phase
+only.
+
+System memory dirty pages tracking when vIOMMU is enabled
+---------------------------------------------------------
+
+With vIOMMU, an IO virtual address range can get unmapped while in pre-copy
+phase of migration. In that case, the unmap ioctl returns any dirty pages in
+that range and QEMU reports corresponding guest physical pages dirty. During
+stop-and-copy phase, an IOMMU notifier is used to get a callback for mapped
+pages and then dirty pages bitmap is fetched from VFIO IOMMU modules for those
+mapped ranges. If device dirty tracking is enabled with vIOMMU, live migration
+will be blocked.
+
+Flow of state changes during Live migration
+===========================================
+
+Below is the state change flow during live migration for a VFIO device that
+supports both precopy and P2P migration. The flow for devices that don't
+support it is similar, except that the relevant states for precopy and P2P are
+skipped.
+The values in the parentheses represent the VM state, the migration state, and
+the VFIO device state, respectively.
+
+Live migration save path
+------------------------
+
+::
+
+ QEMU normal running state
+ (RUNNING, _NONE, _RUNNING)
+ |
+ migrate_init spawns migration_thread
+ Migration thread then calls each device's .save_setup()
+ (RUNNING, _SETUP, _PRE_COPY)
+ |
+ (RUNNING, _ACTIVE, _PRE_COPY)
+ If device is active, get pending_bytes by .state_pending_{estimate,exact}()
+ If total pending_bytes >= threshold_size, call .save_live_iterate()
+ Data of VFIO device for pre-copy phase is copied
+ Iterate till total pending bytes converge and are less than threshold
+ |
+ On migration completion, the vCPUs and the VFIO device are stopped
+ The VFIO device is first put in P2P quiescent state
+ (FINISH_MIGRATE, _ACTIVE, _PRE_COPY_P2P)
+ |
+ Then the VFIO device is put in _STOP_COPY state
+ (FINISH_MIGRATE, _ACTIVE, _STOP_COPY)
+ .save_live_complete_precopy() is called for each active device
+ For the VFIO device, iterate in .save_live_complete_precopy() until
+ pending data is 0
+ |
+ (POSTMIGRATE, _COMPLETED, _STOP_COPY)
+ Migraton thread schedules cleanup bottom half and exits
+ |
+ .save_cleanup() is called
+ (POSTMIGRATE, _COMPLETED, _STOP)
+
+Live migration resume path
+--------------------------
+
+::
+
+ Incoming migration calls .load_setup() for each device
+ (RESTORE_VM, _ACTIVE, _STOP)
+ |
+ For each device, .load_state() is called for that device section data
+ (RESTORE_VM, _ACTIVE, _RESUMING)
+ |
+ At the end, .load_cleanup() is called for each device and vCPUs are started
+ The VFIO device is first put in P2P quiescent state
+ (RUNNING, _ACTIVE, _RUNNING_P2P)
+ |
+ (RUNNING, _NONE, _RUNNING)
+
+Postcopy
+========
+
+Postcopy migration is currently not supported for VFIO devices.