Squashed 'src/leveldb/' changes from a31c8aa40..196962ff0

196962ff0 Add AcceleratedCRC32C to port_win.h
1bdf1c34c Merge upstream LevelDB v1.20
d31721eb0 Merge #17: Fixed file sharing errors
fecd44902 Fixed file sharing error in Win32Env::GetFileSize(), Win32SequentialFile::_Init(), Win32RandomAccessFile::_Init() Fixed error checking in Win32SequentialFile::_Init()
5b7510f1b Merge #14: Merge upstream LevelDB 1.19
0d969fd57 Merge #16: [LevelDB] Do no crash if filesystem can't fsync
c8c029b5b [LevelDB] Do no crash if filesystem can't fsync
a53934a3a Increase leveldb version to 1.20.
f3f139737 Separate Env tests from PosixEnv tests.
eb4f0972f leveldb: Fix compilation warnings in port_posix_sse.cc on x86 (32-bit).
d0883b600 Fixed path to doc file: index.md.
7fa20948d Convert documentation to markdown.
ea175e28f Implement support for Intel crc32 instruction (SSE 4.2)
95cd743e5 Including <limits> for std::numeric_limits.
646c3588d Limit the number of read-only files the POSIX Env will have open.
d40bc3fa5 Merge #13: Typo
ebbd772d3 Typo
a2fb086d0 Add option for max file size. The currend hard-coded value of 2M is inefficient in colossus.

git-subtree-dir: src/leveldb
git-subtree-split: 196962ff01c39b4705d8117df5c3f8c205349950
author: Pieter Wuille <pieter.wuille@gmail.com> 2017-06-09 19:24:30 -0700
committer: Pieter Wuille <pieter.wuille@gmail.com> 2017-06-09 19:24:30 -0700
commit: cf44e4ca7762742c6c3154447b40869ec9d041db (patch)
tree: d5a89851da0a8ab07c40966ac137f4e299c80927 /doc/log_format.md
parent: 634ad517037b319147816f1d112b066528e1724a (diff)
download: bitcoin-cf44e4ca7762742c6c3154447b40869ec9d041db.tar.xz
1 files changed, 75 insertions, 0 deletions
diff --git a/doc/log_format.md b/doc/log_format.md
new file mode 100644
index 0000000000..f32cb5d7da
--- /dev/null
+++ b/doc/log_format.md
@@ -0,0 +1,75 @@
+leveldb Log format
+==================
+The log file contents are a sequence of 32KB blocks.  The only exception is that
+the tail of the file may contain a partial block.
+
+Each block consists of a sequence of records:
+
+    block := record* trailer?
+    record :=
+      checksum: uint32     // crc32c of type and data[] ; little-endian
+      length: uint16       // little-endian
+      type: uint8          // One of FULL, FIRST, MIDDLE, LAST
+      data: uint8[length]
+
+A record never starts within the last six bytes of a block (since it won't fit).
+Any leftover bytes here form the trailer, which must consist entirely of zero
+bytes and must be skipped by readers.
+
+Aside: if exactly seven bytes are left in the current block, and a new non-zero
+length record is added, the writer must emit a FIRST record (which contains zero
+bytes of user data) to fill up the trailing seven bytes of the block and then
+emit all of the user data in subsequent blocks.
+
+More types may be added in the future.  Some readers may skip record types they
+do not understand, others may report that some data was skipped.
+
+    FULL == 1
+    FIRST == 2
+    MIDDLE == 3
+    LAST == 4
+
+The FULL record contains the contents of an entire user record.
+
+FIRST, MIDDLE, LAST are types used for user records that have been split into
+multiple fragments (typically because of block boundaries).  FIRST is the type
+of the first fragment of a user record, LAST is the type of the last fragment of
+a user record, and MIDDLE is the type of all interior fragments of a user
+record.
+
+Example: consider a sequence of user records:
+
+    A: length 1000
+    B: length 97270
+    C: length 8000
+
+**A** will be stored as a FULL record in the first block.
+
+**B** will be split into three fragments: first fragment occupies the rest of
+the first block, second fragment occupies the entirety of the second block, and
+the third fragment occupies a prefix of the third block.  This will leave six
+bytes free in the third block, which will be left empty as the trailer.
+
+**C** will be stored as a FULL record in the fourth block.
+
+----
+
+## Some benefits over the recordio format:
+
+1. We do not need any heuristics for resyncing - just go to next block boundary
+   and scan.  If there is a corruption, skip to the next block.  As a
+   side-benefit, we do not get confused when part of the contents of one log
+   file are embedded as a record inside another log file.
+
+2. Splitting at approximate boundaries (e.g., for mapreduce) is simple: find the
+   next block boundary and skip records until we hit a FULL or FIRST record.
+
+3. We do not need extra buffering for large records.
+
+## Some downsides compared to recordio format:
+
+1. No packing of tiny records.  This could be fixed by adding a new record type,
+   so it is a shortcoming of the current implementation, not necessarily the
+   format.
+
+2. No compression.  Again, this could be fixed by adding new record types.
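
Reviewer note: the record layout and the A/B/C example in the added document can be checked with a small sketch. The Python below is not LevelDB's code; `append_record`, `read_records`, and `crc32c` are hypothetical names chosen for illustration, and the checksum is stored exactly as the document describes it (CRC32C of the type byte followed by the data, little-endian). The real LevelDB writer is known to apply an additional masking step to the stored CRC, which this sketch omits.

```python
import struct

BLOCK_SIZE = 32 * 1024          # "a sequence of 32KB blocks"
HEADER_SIZE = 4 + 2 + 1         # checksum + length + type
FULL, FIRST, MIDDLE, LAST = 1, 2, 3, 4


def _make_table():
    # Byte-wise lookup table for CRC32C (Castagnoli), reflected polynomial.
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c(data):
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def append_record(log, payload):
    """Append one user record to bytearray `log`; return the fragment types written."""
    types, offset, first = [], 0, True
    while True:
        left = BLOCK_SIZE - len(log) % BLOCK_SIZE
        if left < HEADER_SIZE:
            log += b"\x00" * left            # zero-filled trailer, skipped by readers
            left = BLOCK_SIZE
        chunk = payload[offset:offset + left - HEADER_SIZE]
        offset += len(chunk)
        last = offset == len(payload)
        if first and last:
            rtype = FULL
        elif first:
            rtype = FIRST                    # may carry zero bytes if only 7 bytes were left
        elif last:
            rtype = LAST
        else:
            rtype = MIDDLE
        checksum = crc32c(bytes([rtype]) + chunk)
        log += struct.pack("<IHB", checksum, len(chunk), rtype) + chunk
        types.append(rtype)
        first = False
        if last:
            return types


def read_records(log):
    """Reassemble user records from `log`, block by block."""
    records, parts = [], []
    for start in range(0, len(log), BLOCK_SIZE):
        block = bytes(log[start:start + BLOCK_SIZE])
        pos = 0
        while len(block) - pos >= HEADER_SIZE:   # fewer bytes left means trailer
            checksum, length, rtype = struct.unpack_from("<IHB", block, pos)
            data = block[pos + HEADER_SIZE:pos + HEADER_SIZE + length]
            if checksum != crc32c(bytes([rtype]) + data):
                raise ValueError("checksum mismatch")
            pos += HEADER_SIZE + length
            if rtype in (FULL, FIRST):
                parts = []
            parts.append(data)
            if rtype in (FULL, LAST):
                records.append(b"".join(parts))
    return records
```

Appending records of length 1000, 97270, and 8000 to an empty `bytearray` yields one FULL record, then FIRST/MIDDLE/LAST fragments that end six bytes short of the third block boundary, then a FULL record at the start of the fourth block, which matches the example in the document. The reader here raises on a bad checksum for brevity; the document's resync strategy would instead skip to the next block.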