Communication complexity of comparing sets, for Bitcoin

Question

In Bitcoin, when one node wants to tell another node about a block, it sends the block header, then all the transactions it contains. This is inefficient, because the receiving node might already have some or all of those transactions.

The sending node needs to tell the receiving node:

- the transactions that the receiving node does not have
- which transactions are in the block
- what order those transactions are in

What would be more efficient than sending all of the transactions? (Each transaction is about 160 bytes.)

You could send a list of hashes of the transactions. That would reduce the cost of each transaction the receiver already has from about 160 bytes to 20 bytes (if you used SHA-1). The receiving node could then ask for the transactions that it was missing. This is the most efficient method that I can think of, but I think there’s probably something better.
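To make the hash-list idea concrete, here is a minimal sketch. The function names (`tx_id`, `announce_block`, `missing_hashes`) and the sample byte strings are made up for illustration; this is not how any real Bitcoin client encodes its messages.

```python
import hashlib

def tx_id(tx: bytes) -> bytes:
    """20-byte SHA-1 digest used as a short identifier (hypothetical helper)."""
    return hashlib.sha1(tx).digest()

def announce_block(block_txs):
    """Sender side: the ordered list of transaction hashes in the block."""
    return [tx_id(tx) for tx in block_txs]

def missing_hashes(announced, known_txs):
    """Receiver side: announced hashes that are not held locally, in block order."""
    known = {tx_id(tx) for tx in known_txs}
    return [digest for digest in announced if digest not in known]

# Toy data standing in for real transactions (assumption).
block = [b"tx-a", b"tx-b", b"tx-c", b"tx-d"]
receiver_pool = [b"tx-a", b"tx-c", b"tx-z"]

announced = announce_block(block)
to_request = missing_hashes(announced, receiver_pool)
print(len(announced) * 20, "bytes of hashes sent;", len(to_request), "transactions to request")
```

The announcement carries both membership and order, so the receiver only needs to fetch the full bodies of the hashes in `to_request`.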

score 2 · Accepted Answer · answered Sep 03 '14 at 07:24

One possibly better approach would be to use a Merkle tree, or some other hierarchical data structure like that. If there's a large set of transactions that the sender can identify as probably already known to the recipient, then this might save a significant amount of data transfer.

Build a binary tree, where each leaf corresponds to a single transaction, and where each internal node has two children. Now each leaf node will be labelled with 20 bytes, namely, the hash of the transaction. Each internal node will be labelled with 20 bytes, namely, the hash of the concatenation of the labels on its two children. For example, for each $i$, we might have an internal node at height $i$ that covers the first $2^i$ leaves, with a complete binary tree hanging underneath it (along with a bunch of other internal nodes).
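As a rough sketch of that construction, the following builds the tree level by level. The names `h` and `build_merkle` are hypothetical, the input is assumed to be non-empty, and duplicating the last label on odd-sized levels is just one possible convention that I picked for the example.

```python
import hashlib

def h(data: bytes) -> bytes:
    """20-byte SHA-1 label."""
    return hashlib.sha1(data).digest()

def build_merkle(tx_list):
    """Return the tree as a list of levels for a non-empty tx_list.

    levels[0] holds the leaf labels and levels[-1] holds the single root label.
    """
    level = [h(tx) for tx in tx_list]
    levels = [level]
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), 2):
            left = level[i]
            # Assumption: when a level has an odd number of nodes,
            # pair the last node with itself.
            right = level[i + 1] if i + 1 < len(level) else left
            parents.append(h(left + right))
        level = parents
        levels.append(level)
    return levels

levels = build_merkle([b"a", b"b", b"c", b"d"])
root = levels[-1][0]
```

Each label is 20 bytes regardless of how many transactions sit beneath it, which is what makes comparing a whole subtree cheap.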

The two parties can then compare their Merkle trees via some interactive protocol. Note that if Alice and Bob establish that they both have the same label for some node of their trees, they don't need to examine anything under that node; they know the subtree rooted at that node is the same for both of them. If they have different labels for some node, then they can recurse and compare that node's children.
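The recursion can be simulated locally as follows. This is only a sketch under strong simplifying assumptions: both parties hold non-empty ordered lists of the same length, so their trees have the same shape, and labels are recomputed on every call instead of being cached. The names `label` and `diff` are hypothetical.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()

def label(items, lo, hi):
    """Merkle label of items[lo:hi], splitting at the midpoint (needs hi - lo >= 1)."""
    if hi - lo == 1:
        return h(items[lo])
    mid = (lo + hi) // 2
    return h(label(items, lo, mid) + label(items, mid, hi))

def diff(alice, bob, lo=0, hi=None, stats=None):
    """Return the leaf positions where alice and bob differ.

    stats["labels_compared"] counts how many node labels had to be exchanged.
    """
    if hi is None:
        hi = len(alice)
    if stats is not None:
        stats["labels_compared"] += 1
    if label(alice, lo, hi) == label(bob, lo, hi):
        return []            # identical subtree: nothing below needs examining
    if hi - lo == 1:
        return [lo]          # a differing leaf
    mid = (lo + hi) // 2
    return diff(alice, bob, lo, mid, stats) + diff(alice, bob, mid, hi, stats)

alice = [f"tx{i}".encode() for i in range(8)]
bob = list(alice)
bob[5] = b"something else"

stats = {"labels_compared": 0}
differing = diff(alice, bob, stats=stats)
print(differing, stats)
```

With a single difference the recursion only follows one path from the root to a leaf, so the number of labels compared grows with the depth of the tree and not with the number of leaves.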

You can probably choose the structure of Merkle tree to take advantage of the probability distribution on where differences are most likely to appear. You can also vary the branching factor. This gives you a space of possible trees, with a range of tradeoffs between the typical number of rounds of interaction vs the typical amount of data that must be communicated. You might be able to choose a parameter setting that is optimal in your setting.
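To get a feel for the branching-factor tradeoff, here is a back-of-the-envelope model. It assumes exactly one leaf differs, that each round sends all child labels of the one node known to differ, and that labels are 20 bytes; `cost_single_difference` is a hypothetical name, and real workloads will not match this model exactly.

```python
def cost_single_difference(n_leaves, branching, label_bytes=20):
    """Return (rounds, bytes) needed to locate one differing leaf in the simple model."""
    depth = 0
    span = 1
    # Smallest depth such that branching**depth covers all leaves.
    while span < n_leaves:
        span *= branching
        depth += 1
    return depth, branching * depth * label_bytes

for b in (2, 4, 32, 1024):
    rounds, nbytes = cost_single_difference(1024, b)
    print(f"branching={b:5d}  rounds={rounds:2d}  bytes={nbytes}")
```

A branching factor equal to the number of leaves degenerates into sending the flat list of hashes in one round, while a small branching factor trades extra rounds for fewer bytes.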

Also, as another optimization, Alice can start by sending Bob the labels of some fixed set of her nodes. Notice that your example of sending a list of hashes corresponds to Alice proactively sending the labels of all of the leaves. However, we might be able to do better by taking advantage of interaction: e.g., if Alice expects that one of her subtrees is probably shared by Bob, she can send just the label on the root of that subtree, and not send anything under that node (e.g., not the hashes of any of the transactions). If she is right and Bob does have that subtree, neither has to do any other work for it. If she is wrong, then some further interaction is needed and additional data must be exchanged.
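The payoff of a good guess can be sketched by counting bytes. The example assumes that Alice guesses Bob already shares her first `k` transactions (with `k >= 1`) in the same order, and that a wrong guess simply falls back to sending those `k` leaf hashes; the names `subtree_label` and `bytes_sent` are hypothetical, and a real protocol could recurse instead of falling back.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()

def subtree_label(items):
    """Merkle-style label of a non-empty list, splitting at the midpoint."""
    if len(items) == 1:
        return h(items[0])
    mid = len(items) // 2
    return h(subtree_label(items[:mid]) + subtree_label(items[mid:]))

def bytes_sent(alice_txs, bob_txs, k):
    """Bytes of hash data Alice sends if she guesses Bob shares her first k transactions."""
    guess = subtree_label(alice_txs[:k])       # one 20-byte label, sent proactively
    rest = 20 * (len(alice_txs) - k)           # leaf hashes for the part she does not guess
    if len(bob_txs) >= k and subtree_label(bob_txs[:k]) == guess:
        return 20 + rest                       # guess confirmed, nothing more needed
    return 20 + rest + 20 * k                  # wrong guess: send the k leaf hashes too

alice = [f"tx{i}".encode() for i in range(8)]
bob_same = alice[:6]
bob_other = [b"unrelated"] + alice[1:6]

print(bytes_sent(alice, bob_same, 6), bytes_sent(alice, bob_other, 6), 20 * len(alice))
```

A correct guess beats the flat hash list, and a wrong guess costs one extra label plus another round trip, which is why the choice of which nodes to send up front should follow how likely each subtree is to be shared.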

Many variations on this basic idea are probably possible.
