diff options
author | Jim Posen <jimpo@coinbase.com> | 2017-06-01 11:59:19 -0700 |
---|---|---|
committer | Jim Posen <jimpo@coinbase.com> | 2018-01-22 13:52:54 -0800 |
commit | 996c7c8e882b3629d1a06a41536bdda6955c324b (patch) | |
tree | 06fdcc484f7a5ef4291ba0db624d7558c65467c8 /bip-0158.mediawiki | |
parent | 1d19de104e9d11f81cc210f4185c0b1c49f5a6d8 (diff) |
BIP 157 & 158: client-side block filtering.
Diffstat (limited to 'bip-0158.mediawiki')
-rw-r--r-- | bip-0158.mediawiki | 431 |
1 files changed, 431 insertions, 0 deletions
diff --git a/bip-0158.mediawiki b/bip-0158.mediawiki new file mode 100644 index 0000000..dc28154 --- /dev/null +++ b/bip-0158.mediawiki @@ -0,0 +1,431 @@ +<pre> + BIP: 158 + Layer: Peer Services + Title: Compact Block Filters for Light Clients + Author: Olaoluwa Osuntokun <laolu32@gmail.com> + Alex Akselrod <alex@akselrod.org> + Comments-Summary: None yet + Comments-URI: https://github.com/bitcoin/bips/wiki/Comments:BIP-0158 + Status: Draft + Type: Standards Track + Created: 2017-05-24 + License: CC0-1.0 +</pre> + + +== Abstract == + +This BIP describes a structure for compact filters on block data, for use in the +BIP 157 light client protocol<ref>bip-0157.mediawiki</ref>. The filter +construction proposed is an alternative to Bloom filters, as used in BIP 37, +that minimizes filter size by using Golomb-Rice coding for compression. This +document specifies two initial types of filters based on this construction that +enables basic wallets and applications with more advanced smart contracts. + +== Motivation == + +[[bip-0157.mediawiki|BIP 157]] defines a light client protocol based on +deterministic filters of block content. The filters are designed to +minimize the expected bandwidth consumed by light clients, downloading filters +and full blocks. This document defines two initial filter types, ''basic'' and +''extended'', to provide support for advanced applications while reducing the +filter size for regular wallets. + +== Definitions == + +<code>[]byte</code> represents a vector of bytes. + +<code>[N]byte</code> represents a fixed-size byte array with length N. + +''CompactSize'' is a compact encoding of unsigned integers used in the Bitcoin +P2P protocol. + +''Data pushes'' are byte vectors pushed to the stack according to the rules of +Bitcoin script. + +''Bit streams'' are readable and writable streams of individual bits. The +following functions are used in the pseudocode in this document: +* <code>new_bit_stream</code> instantiates a new writable bit stream +* <code>new_bit_stream(vector)</code> instantiates a new bit stream reading data from <code>vector</code> +* <code>write_bit(stream, b)</code> appends the bit <code>b</code> to the end of the stream +* <code>read_bit(stream)</code> reads the next available bit from the stream +* <code>write_bits_big_endian(stream, n, k)</code> appends the <code>k</code> least significant bits of integer <code>n</code> to the end of the stream in big-endian bit order +* <code>read_bits_big_endian(stream, k)</code> reads the next available +* <code>k</code> bits from the stream and interprets them as the least significant bits of a big-endian integer + +The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", +"SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be +interpreted as described in RFC 2119. + +== Specification == + +=== Golomb-Coded Sets === + +For each block, compact filters are derived containing sets of items associated +with the block (eg. addresses sent to, outpoints spent, etc.). A set of such +data objects is compressed into a probabilistic structure called a +''Golomb-coded set'' (GCS), which matches all items in the set with probability +1, and matches other items with probability <code>2^(-P)</code> for some integer +parameter <code>P</code>. + +At a high level, a GCS is constructed from a set of <code>N</code> items by: +# hashing all items to 64-bit integers in the range <code>[0, N * 2^P)</code> +# sorting the hashed values in ascending order +# computing the differences between each value and the previous one +# writing the differences sequentially, compressed with Golomb-Rice coding + +The following sections describe each step in greater detail. + +==== Hashing Data Objects ==== + +The first step in the filter construction is hashing the variable-sized raw +items in the set to the range <code>[0, F)</code>, where <code>F = N * +2^P</code>. Set membership queries against the hash outputs will have a false +positive rate of <code>2^(-P)</code>. To avoid integer overflow, the number of +items <code>N</code> MUST be <2^32 and <code>P</code> MUST be <=32. + +The items are first passed through the pseudorandom function ''SipHash'', which +takes a 128-bit key <code>k</code> and a variable-sized byte vector and produces +a uniformly random 64-bit output. Implementations of this BIP MUST use the +SipHash parameters <code>c = 2</code> and <code>d = 4</code>. + +The 64-bit SipHash outputs are then mapped uniformly over the desired range by +multiplying with F and taking the top 64 bits of the 128-bit result. This +algorithm is a faster alternative to modulo reduction, as it avoids the +expensive division +operation<ref>https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/</ref>. +Note that care must be taken when implementing this reduction to ensure the +upper 64 bits of the integer multiplication are not truncated; certain +architectures and high level languages may require code that decomposes the +64-bit multiplication into four 32-bit multiplications and recombines into the +result. + +<pre> +hash_to_range(item: []byte, F: uint64, k: [16]byte) -> uint64: + return (siphash(k, item) * F) >> 64 + +hashed_set_construct(raw_items: [][]byte, P: uint, k: [16]byte) -> []uint64: + let N = len(raw_items) + let F = N << P + + let set_items = [] + + for item in raw_items: + let set_value = hash_to_range(item, F, k) + set_items.append(set_value) + + return set_items +</pre> + +==== Golomb-Rice Coding ==== + +Instead of writing the items in the hashed set directly to the filter, greater +compression is achieved by only writing the differences between successive +items in sorted order. Since the items are distributed uniformly, it can be +shown that the differences resemble a geometric +distribution<ref>https://en.wikipedia.org/wiki/Geometric_distribution</ref>. +''Golomb-Rice'' +''coding''<ref>https://en.wikipedia.org/wiki/Golomb_coding#Rice_coding</ref> +is a technique that optimally compresses geometrically distributed values. + +With Golomb-Rice, a value is split into a quotient and remainder modulo +<code>2^P</code>, which are encoded separately. The quotient <code>q</code> is +encoded as ''unary'', with a string of <code>q</code> 1's followed by one 0. The +remainder <code>r</code> is represented in big-endian by P bits. For example, +this is a table of Golomb-Rice coded values using <code>P=2</code>: + +{| class="wikitable" +! n !! (q, r) !! c +|- +| 0 || (0, 0) || <code>0 00</code> +|- +| 1 || (0, 1) || <code>0 01</code> +|- +| 2 || (0, 2) || <code>0 10</code> +|- +| 3 || (0, 3) || <code>0 11</code> +|- +| 4 || (1, 0) || <code>10 00</code> +|- +| 5 || (1, 1) || <code>10 01</code> +|- +| 6 || (1, 2) || <code>10 10</code> +|- +| 7 || (1, 3) || <code>10 11</code> +|- +| 8 || (2, 0) || <code>110 00</code> +|- +| 9 || (2, 1) || <code>110 01</code> +|} + +<pre> +golomb_encode(stream, x: uint64, P: uint): + let q = x >> P + + while q > 0: + write_bit(stream, 1) + q-- + write_bit(stream, 0) + + write_bits_big_endian(stream, x, P) + +golomb_decode(stream, P: uint) -> uint64: + let q = 0 + while read_bit(stream) == 1: + q++ + + let r = read_bits_big_endian(stream, P) + + let x = (q << P) + r + return x +</pre> + +==== Set Construction ==== + +A GCS is constructed from three parameters: +* <code>L</code>, a vector of <code>N</code> raw items +* <code>P</code>, which determines the false positive rate +* <code>k</code>, the 128-bit key used to randomize the SipHash outputs + +The result is a byte vector with a minimum size of <code>N * (P + 1)</code> +bits. + +The raw items in <code>L</code> are first hashed to 64-bit unsigned integers as +specified above and sorted. The differences between consecutive values, +hereafter referred to as ''deltas'', are encoded sequentially to a bit stream +with Golomb-Rice coding. Finally, the bit stream is padded with 0's to the +nearest byte boundary and serialized to the output byte vector. + +<pre> +construct_gcs(L: [][]byte, P: uint, k: [16]byte) -> []byte: + let set_items = hashed_set_construct(L, P, k) + + set_items.sort() + + let output_stream = new_bit_stream() + + let last_value = 0 + for item in set_items: + let delta = item - last_value + golomb_encode(output_stream, delta, P) + last_value = item + + return output_stream.bytes() +</pre> + +==== Set Querying/Decompression ==== + +To check membership of an item in a compressed GCS, one must reconstruct the +hashed set members from the encoded deltas. The procedure to do so is the +reverse of the compression: deltas are decoded one by one and added to a +cumulative sum. Each intermediate sum represents a hashed value in the original +set. The queried item is hashed in the same way as the set members and compared +against the reconstructed values. Note that querying does not require the entire +decompressed set be held in memory at once. + +<pre> +gcs_match(key: [16]byte, compressed_set: []byte, target: []byte, P: uint, N: uint) -> bool: + let F = N << P + let target_hash = hash_to_range(target, F, k) + + stream = new_bit_stream(compressed_set) + + let last_value = 0 + + loop N times: + let delta = golomb_decode(stream, P) + let set_item = last_value + delta + + if set_item == target_hash: + return true + + // Since the values in the set are sorted, terminate the search once + // the decoded value exceeds the target. + if set_item > target_hash: + break + + last_value = set_item + + return false +</pre> + +Some applications may need to check for set intersection instead of membership +of a single item. This can be performed far more efficiently than checking each +item individually by leveraging the sorted structure of the compressed GCS. +First the query elements are all hashed and sorted, then compared in order +against the decompressed GCS contents. See +[[#golomb-coded-set-multi-match|Appendix B]] for pseudocode. + +=== Block Filters === + +This BIP defines two initial filter types: +* Basic (<code>0x00</code>) +* Extended (<code>0x01</code>) + +==== Contents ==== + +The basic filter is designed to contain everything that a light client needs to +sync a regular Bitcoin wallet. A basic filter MUST contain exactly the following +items for each transaction in a block: +* The outpoint of each input, except for the coinbase transaction +* Each data push in the scriptPubKey of each output, ''only if'' the scriptPubKey is parseable +* The <code>txid</code> of the transaction itself + +The extended filter contains extra data that is meant to enable applications +with more advanced smart contracts. An extended filter MUST contain exactly the +following items for each transaction in a block ''except the coinbase'': +* Each item within the witness stack of each input (if the input has a witness) +* Each data push in the scriptSig of each input + +Note that neither filter type interprets P2SH scripts or witness scripts to +extract data pushes from them. If necessary, future filter types may be designed +to do so. + +==== Construction ==== + +Both the basic and extended filter types are constructed as Golomb-coded sets +with the following parameters. + +The parameter <code>P</code> MUST be set to <code>20</code>. This value was +chosen as simulations show that it minimizes the bandwidth utilized, considering +both the expected number of blocks downloaded due to false positives and the +size of the filters themselves. The code along with a demo used for the +parameter tuning can be found +[https://github.com/Roasbeef/bips/blob/83b83c78e189be898573e0bfe936dd0c9b99ecb9/gcs_light_client/gentestvectors.go here]. + +The parameter <code>k</code> MUST be set to the first 16 bytes of the hash of +the block for which the filter is constructed. This ensures the key is +deterministic while still varying from block to block. + +Since the value <code>N</code> is required to decode a GCS, a serialized GCS +includes it as a prefix, written as a CompactSize. Thus, the complete +serialization of a filter is: +* <code>N</code>, encoded as a CompactSize +* The bytes of the compressed filter itself + +==== Signaling ==== + +This BIP allocates a new service bit: + +{| class="wikitable" +|- +| NODE_COMPACT_FILTERS +| style="white-space: nowrap;" | <code>1 << 6</code> +| If enabled, the node MUST respond to all BIP 157 messages for filter types <code>0x00</code> and <code>0x01</code> +|} + +== Compatibility == + +This block filter construction is not incompatible with existing software, +though it requires implementation of the new filters. + +== Acknowledgments == + +We would like to thank bfd (from the bitcoin-dev mailing list) for bringing the +basis of this BIP to our attention, Greg Maxwell for pointing us in the +direction of Golomb-Rice coding and fast range optimization, and Pedro +Martelletto for writing the initial indexing code for <code>btcd</code>. + +We would also like to thank Dave Collins, JJ Jeffrey, and Eric Lombrozo for +useful discussions. + +== Reference Implementation == + +Light client: [https://github.com/lightninglabs/neutrino] + +Full-node indexing: https://github.com/Roasbeef/btcd/tree/segwit-cbf + +Golomb-Rice Coded sets: https://github.com/Roasbeef/btcutil/tree/gcs/gcs + +== Appendix A: Alternatives == + +A number of alternative set encodings were considered before Golomb-coded +sets were settled upon. In this appendix section, we'll list a few of the +alternatives along with our rationale for not pursuing them. + +==== Bloom Filters ==== + +Bloom Filters are perhaps the best known probabilistic data structure for +testing set membership, and were introduced into the Bitcoin protocol with BIP +37. The size of a Bloom filter is larger than the expected size of a GCS with +the same false positive rate, which is the main reason the option was rejected. + +==== Cryptographic Accumulators ==== + +Cryptographic +accumulators<ref>https://en.wikipedia.org/wiki/Accumulator_(cryptography)</ref> +are a cryptographic data structures that enable (amongst other operations) a one +way membership test. One advantage of accumulators are that they are constant +size, independent of the number of elements inserted into the accumulator. +However, current constructions of cryptographic accumulators require an initial +trusted set up. Additionally, accumulators based on the Strong-RSA Assumption +require mapping set items to prime representatives in the associated group which +can be preemptively expensive. + +==== Matrix Based Probabilistic Set Data Structures ==== + +There exist data structures based on matrix solving which are even more space +efficient compared to Bloom +filters<ref>https://arxiv.org/pdf/0804.1845.pdf</ref>. We instead opted for our +GCS-based filters as they have a much lower implementation complexity and are +easier to understand. + +== Appendix B: Pseudocode == + +=== Golomb-Coded Set Multi-Match === + +<pre> +gcs_match_any(key: [16]byte, compressed_set: []byte, targets: [][]byte, P: uint, N: uint) -> bool: + let F = N << P + + // Map targets to the same range as the set hashes. + let target_hashes = [] + for target in targets: + let target_hash = hash_to_range(target, F, k) + target_hashes.append(target_hash) + + // Sort targets so matching can be checked in linear time. + target_hashes.sort() + + stream = new_bit_stream(compressed_set) + + let value = 0 + let target_idx = 0 + let target_val = target_hashes[target_idx] + + loop N times: + let delta = golomb_decode(stream, P) + value += delta + + inner loop: + if target_val == value: + return true + + // Move on to the next set value. + else if target_val > value: + break inner loop + + // Move on to the next target value. + else if target_val < value: + target_idx++ + + // If there are no targets left, then there are no matches. + if target_idx == len(targets): + break outer loop + + target_val = target_hashes[target_idx] + + return false +</pre> + +== Appendix C: Test Vectors == + +TODO: To be generated. + +== References == + +<references/> + +== Copyright == + +This document is licensed under the Creative Commons CC0 1.0 Universal lisence. |