diff options
author | Pieter Wuille <pieter.wuille@gmail.com> | 2017-06-09 19:24:30 -0700 |
---|---|---|
committer | Pieter Wuille <pieter.wuille@gmail.com> | 2017-06-09 19:24:30 -0700 |
commit | cf44e4ca7762742c6c3154447b40869ec9d041db (patch) | |
tree | d5a89851da0a8ab07c40966ac137f4e299c80927 /doc/log_format.md | |
parent | 634ad517037b319147816f1d112b066528e1724a (diff) | |
download | bitcoin-cf44e4ca7762742c6c3154447b40869ec9d041db.tar.xz |
Squashed 'src/leveldb/' changes from a31c8aa40..196962ff0
196962ff0 Add AcceleratedCRC32C to port_win.h
1bdf1c34c Merge upstream LevelDB v1.20
d31721eb0 Merge #17: Fixed file sharing errors
fecd44902 Fixed file sharing error in Win32Env::GetFileSize(), Win32SequentialFile::_Init(), Win32RandomAccessFile::_Init() Fixed error checking in Win32SequentialFile::_Init()
5b7510f1b Merge #14: Merge upstream LevelDB 1.19
0d969fd57 Merge #16: [LevelDB] Do no crash if filesystem can't fsync
c8c029b5b [LevelDB] Do no crash if filesystem can't fsync
a53934a3a Increase leveldb version to 1.20.
f3f139737 Separate Env tests from PosixEnv tests.
eb4f0972f leveldb: Fix compilation warnings in port_posix_sse.cc on x86 (32-bit).
d0883b600 Fixed path to doc file: index.md.
7fa20948d Convert documentation to markdown.
ea175e28f Implement support for Intel crc32 instruction (SSE 4.2)
95cd743e5 Including <limits> for std::numeric_limits.
646c3588d Limit the number of read-only files the POSIX Env will have open.
d40bc3fa5 Merge #13: Typo
ebbd772d3 Typo
a2fb086d0 Add option for max file size. The currend hard-coded value of 2M is inefficient in colossus.
git-subtree-dir: src/leveldb
git-subtree-split: 196962ff01c39b4705d8117df5c3f8c205349950
Diffstat (limited to 'doc/log_format.md')
-rw-r--r-- | doc/log_format.md | 75 |
1 files changed, 75 insertions, 0 deletions
diff --git a/doc/log_format.md b/doc/log_format.md new file mode 100644 index 0000000000..f32cb5d7da --- /dev/null +++ b/doc/log_format.md @@ -0,0 +1,75 @@ +leveldb Log format +================== +The log file contents are a sequence of 32KB blocks. The only exception is that +the tail of the file may contain a partial block. + +Each block consists of a sequence of records: + + block := record* trailer? + record := + checksum: uint32 // crc32c of type and data[] ; little-endian + length: uint16 // little-endian + type: uint8 // One of FULL, FIRST, MIDDLE, LAST + data: uint8[length] + +A record never starts within the last six bytes of a block (since it won't fit). +Any leftover bytes here form the trailer, which must consist entirely of zero +bytes and must be skipped by readers. + +Aside: if exactly seven bytes are left in the current block, and a new non-zero +length record is added, the writer must emit a FIRST record (which contains zero +bytes of user data) to fill up the trailing seven bytes of the block and then +emit all of the user data in subsequent blocks. + +More types may be added in the future. Some Readers may skip record types they +do not understand, others may report that some data was skipped. + + FULL == 1 + FIRST == 2 + MIDDLE == 3 + LAST == 4 + +The FULL record contains the contents of an entire user record. + +FIRST, MIDDLE, LAST are types used for user records that have been split into +multiple fragments (typically because of block boundaries). FIRST is the type +of the first fragment of a user record, LAST is the type of the last fragment of +a user record, and MIDDLE is the type of all interior fragments of a user +record. + +Example: consider a sequence of user records: + + A: length 1000 + B: length 97270 + C: length 8000 + +**A** will be stored as a FULL record in the first block. + +**B** will be split into three fragments: first fragment occupies the rest of +the first block, second fragment occupies the entirety of the second block, and +the third fragment occupies a prefix of the third block. This will leave six +bytes free in the third block, which will be left empty as the trailer. + +**C** will be stored as a FULL record in the fourth block. + +---- + +## Some benefits over the recordio format: + +1. We do not need any heuristics for resyncing - just go to next block boundary + and scan. If there is a corruption, skip to the next block. As a + side-benefit, we do not get confused when part of the contents of one log + file are embedded as a record inside another log file. + +2. Splitting at approximate boundaries (e.g., for mapreduce) is simple: find the + next block boundary and skip records until we hit a FULL or FIRST record. + +3. We do not need extra buffering for large records. + +## Some downsides compared to recordio format: + +1. No packing of tiny records. This could be fixed by adding a new record type, + so it is a shortcoming of the current implementation, not necessarily the + format. + +2. No compression. Again, this could be fixed by adding new record types. |