diff options
Diffstat (limited to 'docs')
-rw-r--r-- | docs/qapi-code-gen.txt | 15 | ||||
-rw-r--r-- | docs/specs/qcow2.txt | 221 | ||||
-rw-r--r-- | docs/throttle.txt | 252 |
3 files changed, 480 insertions, 8 deletions
diff --git a/docs/qapi-code-gen.txt b/docs/qapi-code-gen.txt index 128f074a2d..999f3b98f0 100644 --- a/docs/qapi-code-gen.txt +++ b/docs/qapi-code-gen.txt @@ -187,11 +187,11 @@ prevent incomplete include files. Usage: { 'struct': STRING, 'data': DICT, '*base': STRUCT-NAME } -A struct is a dictionary containing a single 'data' key whose -value is a dictionary. This corresponds to a struct in C or an Object -in JSON. Each value of the 'data' dictionary must be the name of a -type, or a one-element array containing a type name. An example of a -struct is: +A struct is a dictionary containing a single 'data' key whose value is +a dictionary; the dictionary may be empty. This corresponds to a +struct in C or an Object in JSON. Each value of the 'data' dictionary +must be the name of a type, or a one-element array containing a type +name. An example of a struct is: { 'struct': 'MyType', 'data': { 'member1': 'str', 'member2': 'int', '*member3': 'str' } } @@ -288,9 +288,10 @@ or: { 'union': STRING, 'data': DICT, 'base': STRUCT-NAME, Union types are used to let the user choose between several different variants for an object. There are two flavors: simple (no -discriminator or base), flat (both discriminator and base). A union +discriminator or base), and flat (both discriminator and base). A union type is defined using a data dictionary as explained in the following -paragraphs. +paragraphs. The data dictionary for either type of union must not +be empty. A simple union type defines a mapping from automatic discriminator values to data types like in this example: diff --git a/docs/specs/qcow2.txt b/docs/specs/qcow2.txt index f236d8c6d9..80cdfd0e91 100644 --- a/docs/specs/qcow2.txt +++ b/docs/specs/qcow2.txt @@ -103,7 +103,18 @@ in the description of a field. write to an image with unknown auto-clear features if it clears the respective bits from this field first. - Bits 0-63: Reserved (set to 0) + Bit 0: Bitmaps extension bit + This bit indicates consistency for the bitmaps + extension data. + + It is an error if this bit is set without the + bitmaps extension present. + + If the bitmaps extension is present but this + bit is unset, the bitmaps extension data must be + considered inconsistent. + + Bits 1-63: Reserved (set to 0) 96 - 99: refcount_order Describes the width of a reference count block entry (width @@ -123,6 +134,7 @@ be stored. Each extension has a structure like the following: 0x00000000 - End of the header extension area 0xE2792ACA - Backing file format name 0x6803f857 - Feature name table + 0x23852875 - Bitmaps extension other - Unknown header extension, can be safely ignored @@ -166,6 +178,36 @@ the header extension data. Each entry look like this: terminated if it has full length) +== Bitmaps extension == + +The bitmaps extension is an optional header extension. It provides the ability +to store bitmaps related to a virtual disk. For now, there is only one bitmap +type: the dirty tracking bitmap, which tracks virtual disk changes from some +point in time. + +The data of the extension should be considered consistent only if the +corresponding auto-clear feature bit is set, see autoclear_features above. + +The fields of the bitmaps extension are: + + Byte 0 - 3: nb_bitmaps + The number of bitmaps contained in the image. Must be + greater than or equal to 1. + + Note: Qemu currently only supports up to 65535 bitmaps per + image. + + 4 - 7: Reserved, must be zero. + + 8 - 15: bitmap_directory_size + Size of the bitmap directory in bytes. It is the cumulative + size of all (nb_bitmaps) bitmap headers. + + 16 - 23: bitmap_directory_offset + Offset into the image file at which the bitmap directory + starts. Must be aligned to a cluster boundary. + + == Host cluster management == qcow2 manages the allocation of host clusters by maintaining a reference count @@ -360,3 +402,180 @@ Snapshot table entry: variable: Padding to round up the snapshot table entry size to the next multiple of 8. + + +== Bitmaps == + +As mentioned above, the bitmaps extension provides the ability to store bitmaps +related to a virtual disk. This section describes how these bitmaps are stored. + +All stored bitmaps are related to the virtual disk stored in the same image, so +each bitmap size is equal to the virtual disk size. + +Each bit of the bitmap is responsible for strictly defined range of the virtual +disk. For bit number bit_nr the corresponding range (in bytes) will be: + + [bit_nr * bitmap_granularity .. (bit_nr + 1) * bitmap_granularity - 1] + +Granularity is a property of the concrete bitmap, see below. + + +=== Bitmap directory === + +Each bitmap saved in the image is described in a bitmap directory entry. The +bitmap directory is a contiguous area in the image file, whose starting offset +and length are given by the header extension fields bitmap_directory_offset and +bitmap_directory_size. The entries of the bitmap directory have variable +length, depending on the lengths of the bitmap name and extra data. These +entries are also called bitmap headers. + +Structure of a bitmap directory entry: + + Byte 0 - 7: bitmap_table_offset + Offset into the image file at which the bitmap table + (described below) for the bitmap starts. Must be aligned to + a cluster boundary. + + 8 - 11: bitmap_table_size + Number of entries in the bitmap table of the bitmap. + + 12 - 15: flags + Bit + 0: in_use + The bitmap was not saved correctly and may be + inconsistent. + + 1: auto + The bitmap must reflect all changes of the virtual + disk by any application that would write to this qcow2 + file (including writes, snapshot switching, etc.). The + type of this bitmap must be 'dirty tracking bitmap'. + + 2: extra_data_compatible + This flags is meaningful when the extra data is + unknown to the software (currently any extra data is + unknown to Qemu). + If it is set, the bitmap may be used as expected, extra + data must be left as is. + If it is not set, the bitmap must not be used, but + both it and its extra data be left as is. + + Bits 3 - 31 are reserved and must be 0. + + 16: type + This field describes the sort of the bitmap. + Values: + 1: Dirty tracking bitmap + + Values 0, 2 - 255 are reserved. + + 17: granularity_bits + Granularity bits. Valid values: 0 - 63. + + Note: Qemu currently doesn't support granularity_bits + greater than 31. + + Granularity is calculated as + granularity = 1 << granularity_bits + + A bitmap's granularity is how many bytes of the image + accounts for one bit of the bitmap. + + 18 - 19: name_size + Size of the bitmap name. Must be non-zero. + + Note: Qemu currently doesn't support values greater than + 1023. + + 20 - 23: extra_data_size + Size of type-specific extra data. + + For now, as no extra data is defined, extra_data_size is + reserved and should be zero. If it is non-zero the + behavior is defined by extra_data_compatible flag. + + variable: extra_data + Extra data for the bitmap, occupying extra_data_size bytes. + Extra data must never contain references to clusters or in + some other way allocate additional clusters. + + variable: name + The name of the bitmap (not null terminated), occupying + name_size bytes. Must be unique among all bitmap names + within the bitmaps extension. + + variable: Padding to round up the bitmap directory entry size to the + next multiple of 8. All bytes of the padding must be zero. + + +=== Bitmap table === + +Each bitmap is stored using a one-level structure (as opposed to two-level +structures like for refcounts and guest clusters mapping) for the mapping of +bitmap data to host clusters. This structure is called the bitmap table. + +Each bitmap table has a variable size (stored in the bitmap directory entry) +and may use multiple clusters, however, it must be contiguous in the image +file. + +Structure of a bitmap table entry: + + Bit 0: Reserved and must be zero if bits 9 - 55 are non-zero. + If bits 9 - 55 are zero: + 0: Cluster should be read as all zeros. + 1: Cluster should be read as all ones. + + 1 - 8: Reserved and must be zero. + + 9 - 55: Bits 9 - 55 of the host cluster offset. Must be aligned to + a cluster boundary. If the offset is 0, the cluster is + unallocated; in that case, bit 0 determines how this + cluster should be treated during reads. + + 56 - 63: Reserved and must be zero. + + +=== Bitmap data === + +As noted above, bitmap data is stored in separate clusters, described by the +bitmap table. Given an offset (in bytes) into the bitmap data, the offset into +the image file can be obtained as follows: + + image_offset(bitmap_data_offset) = + bitmap_table[bitmap_data_offset / cluster_size] + + (bitmap_data_offset % cluster_size) + +This offset is not defined if bits 9 - 55 of bitmap table entry are zero (see +above). + +Given an offset byte_nr into the virtual disk and the bitmap's granularity, the +bit offset into the image file to the corresponding bit of the bitmap can be +calculated like this: + + bit_offset(byte_nr) = + image_offset(byte_nr / granularity / 8) * 8 + + (byte_nr / granularity) % 8 + +If the size of the bitmap data is not a multiple of the cluster size then the +last cluster of the bitmap data contains some unused tail bits. These bits must +be zero. + + +=== Dirty tracking bitmaps === + +Bitmaps with 'type' field equal to one are dirty tracking bitmaps. + +When the virtual disk is in use dirty tracking bitmap may be 'enabled' or +'disabled'. While the bitmap is 'enabled', all writes to the virtual disk +should be reflected in the bitmap. A set bit in the bitmap means that the +corresponding range of the virtual disk (see above) was written to while the +bitmap was 'enabled'. An unset bit means that this range was not written to. + +The software doesn't have to sync the bitmap in the image file with its +representation in RAM after each write. Flag 'in_use' should be set while the +bitmap is not synced. + +In the image file the 'enabled' state is reflected by the 'auto' flag. If this +flag is set, the software must consider the bitmap as 'enabled' and start +tracking virtual disk changes to this bitmap from the first write to the +virtual disk. If this flag is not set then the bitmap is disabled. diff --git a/docs/throttle.txt b/docs/throttle.txt new file mode 100644 index 0000000000..28204e46ca --- /dev/null +++ b/docs/throttle.txt @@ -0,0 +1,252 @@ +The QEMU throttling infrastructure +================================== +Copyright (C) 2016 Igalia, S.L. +Author: Alberto Garcia <berto@igalia.com> + +This work is licensed under the terms of the GNU GPL, version 2 or +later. See the COPYING file in the top-level directory. + +Introduction +------------ +QEMU includes a throttling module that can be used to set limits to +I/O operations. The code itself is generic and independent of the I/O +units, but it is currenly used to limit the number of bytes per second +and operations per second (IOPS) when performing disk I/O. + +This document explains how to use the throttling code in QEMU, and how +it works internally. The implementation is in throttle.c. + + +Using throttling to limit disk I/O +---------------------------------- +Two aspects of the disk I/O can be limited: the number of bytes per +second and the number of operations per second (IOPS). For each one of +them the user can set a global limit or separate limits for read and +write operations. This gives us a total of six different parameters. + +I/O limits can be set using the throttling.* parameters of -drive, or +using the QMP 'block_set_io_throttle' command. These are the names of +the parameters for both cases: + +|-----------------------+-----------------------| +| -drive | block_set_io_throttle | +|-----------------------+-----------------------| +| throttling.iops-total | iops | +| throttling.iops-read | iops_rd | +| throttling.iops-write | iops_wr | +| throttling.bps-total | bps | +| throttling.bps-read | bps_rd | +| throttling.bps-write | bps_wr | +|-----------------------+-----------------------| + +It is possible to set limits for both IOPS and bps and the same time, +and for each case we can decide whether to have separate read and +write limits or not, but note that if iops-total is set then neither +iops-read nor iops-write can be set. The same applies to bps-total and +bps-read/write. + +The default value of these parameters is 0, and it means 'unlimited'. + +In its most basic usage, the user can add a drive to QEMU with a limit +of 100 IOPS with the following -drive line: + + -drive file=hd0.qcow2,throttling.iops-total=100 + +We can do the same using QMP. In this case all these parameters are +mandatory, so we must set to 0 the ones that we don't want to limit: + + { "execute": "block_set_io_throttle", + "arguments": { + "device": "virtio0", + "iops": 100, + "iops_rd": 0, + "iops_wr": 0, + "bps": 0, + "bps_rd": 0, + "bps_wr": 0 + } + } + + +I/O bursts +---------- +In addition to the basic limits we have just seen, QEMU allows the +user to do bursts of I/O for a configurable amount of time. A burst is +an amount of I/O that can exceed the basic limit. Bursts are useful to +allow better performance when there are peaks of activity (the OS +boots, a service needs to be restarted) while keeping the average +limits lower the rest of the time. + +Two parameters control bursts: their length and the maximum amount of +I/O they allow. These two can be configured separately for each one of +the six basic parameters described in the previous section, but in +this section we'll use 'iops-total' as an example. + +The I/O limit during bursts is set using 'iops-total-max', and the +maximum length (in seconds) is set with 'iops-total-max-length'. So if +we want to configure a drive with a basic limit of 100 IOPS and allow +bursts of 2000 IOPS for 60 seconds, we would do it like this (the line +is split for clarity): + + -drive file=hd0.qcow2, + throttling.iops-total=100, + throttling.iops-total-max=2000, + throttling.iops-total-max-length=60 + +Or, with QMP: + + { "execute": "block_set_io_throttle", + "arguments": { + "device": "virtio0", + "iops": 100, + "iops_rd": 0, + "iops_wr": 0, + "bps": 0, + "bps_rd": 0, + "bps_wr": 0, + "iops_max": 2000, + "iops_max_length": 60, + } + } + +With this, the user can perform I/O on hd0.qcow2 at a rate of 2000 +IOPS for 1 minute before it's throttled down to 100 IOPS. + +The user will be able to do bursts again if there's a sufficiently +long period of time with unused I/O (see below for details). + +The default value for 'iops-total-max' is 0 and it means that bursts +are not allowed. 'iops-total-max-length' can only be set if +'iops-total-max' is set as well, and its default value is 1 second. + +Here's the complete list of parameters for configuring bursts: + +|----------------------------------+-----------------------| +| -drive | block_set_io_throttle | +|----------------------------------+-----------------------| +| throttling.iops-total-max | iops_max | +| throttling.iops-total-max-length | iops_max_length | +| throttling.iops-read-max | iops_rd_max | +| throttling.iops-read-max-length | iops_rd_max_length | +| throttling.iops-write-max | iops_wr_max | +| throttling.iops-write-max-length | iops_wr_max_length | +| throttling.bps-total-max | bps_max | +| throttling.bps-total-max-length | bps_max_length | +| throttling.bps-read-max | bps_rd_max | +| throttling.bps-read-max-length | bps_rd_max_length | +| throttling.bps-write-max | bps_wr_max | +| throttling.bps-write-max-length | bps_wr_max_length | +|----------------------------------+-----------------------| + + +Controlling the size of I/O operations +-------------------------------------- +When applying IOPS limits all I/O operations are treated equally +regardless of their size. This means that the user can take advantage +of this in order to circumvent the limits and submit one huge I/O +request instead of several smaller ones. + +QEMU provides a setting called throttling.iops-size to prevent this +from happening. This setting specifies the size (in bytes) of an I/O +request for accounting purposes. Larger requests will be counted +proportionally to this size. + +For example, if iops-size is set to 4096 then an 8KB request will be +counted as two, and a 6KB request will be counted as one and a +half. This only applies to requests larger than iops-size: smaller +requests will be always counted as one, no matter their size. + +The default value of iops-size is 0 and it means that the size of the +requests is never taken into account when applying IOPS limits. + + +Applying I/O limits to groups of disks +-------------------------------------- +In all the examples so far we have seen how to apply limits to the I/O +performed on individual drives, but QEMU allows grouping drives so +they all share the same limits. + +The way it works is that each drive with I/O limits is assigned to a +group named using the throttling.group parameter. If this parameter is +not specified, then the device name (i.e. 'virtio0', 'ide0-hd0') will +be used as the group name. + +Limits set using the throttling.* parameters discussed earlier in this +document apply to the combined I/O of all members of a group. + +Consider this example: + + -drive file=hd1.qcow2,throttling.iops-total=6000,throttling.group=foo + -drive file=hd2.qcow2,throttling.iops-total=6000,throttling.group=foo + -drive file=hd3.qcow2,throttling.iops-total=3000,throttling.group=bar + -drive file=hd4.qcow2,throttling.iops-total=6000,throttling.group=foo + -drive file=hd5.qcow2,throttling.iops-total=3000,throttling.group=bar + -drive file=hd6.qcow2,throttling.iops-total=5000 + +Here hd1, hd2 and hd4 are all members of a group named 'foo' with a +combined IOPS limit of 6000, and hd3 and hd5 are members of 'bar'. hd6 +is left alone (technically it is part of a 1-member group). + +Limits are applied in a round-robin fashion so if there are concurrent +I/O requests on several drives of the same group they will be +distributed evenly. + +When I/O limits are applied to an existing drive using the QMP command +'block_set_io_throttle', the following things need to be taken into +account: + + - I/O limits are shared within the same group, so new values will + affect all members and overwrite the previous settings. In other + words: if different limits are applied to members of the same + group, the last one wins. + + - If 'group' is unset it is assumed to be the current group of that + drive. If the drive is not in a group yet, it will be added to a + group named after the device name. + + - If 'group' is set then the drive will be moved to that group if + it was member of a different one. In this case the limits + specified in the parameters will be applied to the new group + only. + + - I/O limits can be disabled by setting all of them to 0. In this + case the device will be removed from its group and the rest of + its members will not be affected. The 'group' parameter is + ignored. + + +The Leaky Bucket algorithm +-------------------------- +I/O limits in QEMU are implemented using the leaky bucket algorithm +(specifically the "Leaky bucket as a meter" variant). + +This algorithm uses the analogy of a bucket that leaks water +constantly. The water that gets into the bucket represents the I/O +that has been performed, and no more I/O is allowed once the bucket is +full. + +To see the way this corresponds to the throttling parameters in QEMU, +consider the following values: + + iops-total=100 + iops-total-max=2000 + iops-total-max-length=60 + + - Water leaks from the bucket at a rate of 100 IOPS. + - Water can be added to the bucket at a rate of 2000 IOPS. + - The size of the bucket is 2000 x 60 = 120000 + - If 'iops-total-max-length' is unset then the bucket size is 100. + +The bucket is initially empty, therefore water can be added until it's +full at a rate of 2000 IOPS (the burst rate). Once the bucket is full +we can only add as much water as it leaks, therefore the I/O rate is +reduced to 100 IOPS. If we add less water than it leaks then the +bucket will start to empty, allowing for bursts again. + +Note that since water is leaking from the bucket even during bursts, +it will take a bit more than 60 seconds at 2000 IOPS to fill it +up. After those 60 seconds the bucket will have leaked 60 x 100 = +6000, allowing for 3 more seconds of I/O at 2000 IOPS. + +Also, due to the way the algorithm works, longer burst can be done at +a lower I/O rate, e.g. 1000 IOPS during 120 seconds. |