Tuesday, September 8, 2020

Thinking about different ways to do the tree-sync for LBARD

In the last few blog posts I've been fighting my way through fixing bugs with LBARD's synchronisation algorithms.  While both the tree-sync and the actual transfer of bundles have been problematic, the tree-sync side of things has been niggling at me for HF use, in particular. One of the reasons is that tree-sync has to start its synchronisation again from scratch every time a radio connection is made. This already takes several minutes, with only relatively small numbers of bundles.  Even if neither end has new bundles, the entire synchronisation process has to occur again.  Also the very high latency of the HF radios doesn't really agree with tree-sync, and can result in considerable delays in synchronisation.

Way back around 2014 we had started thinking about reworking the way that Rhizome stores data, by having it use a block-store to represent the bundles. That is, instead of keeping a manifest and payload, we would have a kind of virtual file-system in which the bundles would be represented.  Data blocks would be identified by the hash of their contents, thus providing automatic de-duplication, and making it much easier to cache data.

Now, reworking all of Rhizome to do that would require significant effort.  However, what *might* be doable in reasonable time, is to make the list of bundles an instance holds be represented in such a way.  This would allow each end of an HF link to cache this virtual file-system representation, and thus re-synchronise very rapidly if none or few new bundles have appeared at each end.

My current thinking is along the following lines:

1. Blocks should be small enough to fit in a single LBARD packet. This realistically means a limit of about 200 bytes.

2. We want to minimise the number of nodes in the tree, as each additional node requires an extra round-trip.

3. The small block size means that we need to use relatively small hashes for each block.  Even if they are too short for strong cryptographic security, they should still be strong enough to keep the chance of accidental collisions practically zero.

4. Each block can consist simply of leaf nodes or pointers to child nodes, or a mix of the two. The position of the node in the tree is implicit.

5. It would be nice to have one hex digit of identifier covered per block.  This means 16 pointers to sub-blocks have to be able to fit in a block.  200 bytes / 16 = 12.5 bytes per entry.  This probably means that 12 byte hashes = 96 bit hashes will be the longest possible.  That's ample to keep us safe from Birthday Paradox accidental node collisions (should be safe for up to 2^48 blocks).

6. If all the child nodes for a block can fit in a block, then that is where they are placed.  This avoids having unnecessary intermediate nodes in the tree that would otherwise have to be synchronised.

7. If all the child nodes can't fit in a single block, then we need to have a deterministic way to represent this, so that we don't end up with a proliferation of blocks as we fill nodes.

8. The simplest approach to (7), is to require that an overfull block child blocks for each of the 16 entries it can hold. This may mean that some child blocks point to the empty node (which because of the hashing, will always be the same).

9. The above formulation means that each bundle (or more correctly, version of bundle), will have a deterministic node, which each side can enter into their block cache, thus avoiding the need to ever transfer a block that corresponds to a bundle that the other party already has.

10. Each side can request blocks from the other, to begin exploring the tree structure. Any common sub-trees between the two sides will be implicitly available.

11. The block cache can persist between executions of LBARD, thus allowing the complete state of the other party to be remembered between radio sessions, thus greatly speeding up the initial synchronisation when a radio session commences.

12. Multiple blocks can be requested at a time, thus helping to mask high latency.  In fact, simple optimisations like "please send me block X, plus any children (and optionally further descendents) of it that you don't think that I have" can be used to save further round-trips.

13. When a block is received, its position in the tree is known. This means that in some cases, it will be possible for the receipient to work out the content of the child nodes, if the sender has a strict sub-set of the sender's bundles that are in that part of the tree.  At its simplest, a leaf node can be constructed for every combination of the bundles that it contains.  If any of those match the block ID of the child node indicated in the received block, then there is no need to receive those child-blocks.  Instead, not only can we avoid the need to send those blocks, we can actually immediately infer that the sender is missing those bundles not in the combination that resulted in the matching child node.  

14. The approach from (13) can be extended further up the tree if desired, to help save bandwidth, at the expense of storage space and computation.  At some point, however, the number of combinations that need to be tested becomes excessive, as each additional bundle in the portion of the tree being considered potentially doubles the number of combinations to be computed.  However, because each leaf node is restricted in its potential content, the growth in complexity is actually rather different (leaf nodes are only formed when the collective leaves don't fit in a single block).  The exact complexity of this will need to be modelled.

15. In theory, if the complexity of the above can be handled, then sending just the top of the tree will allow the recipient of any node that contains only a subset of the nodes the recipient possesses, without any further data exchange.  This is more of theoretical interest than practical effect, except for the leaf-node optimisation described in (13).

16. The bundles themselves can also be represented as tree structures, so that similar techniques can be used.  This also allows for implicit transfer of common data sections, including in journal bundles that have invariant starts.  Of course, this ability to transfer common data sections will only work for those that have such commonalities that are aligned with the block size, or better, aligned with a leaf-node of blocks.

17. This approach allows for implicit caching of all transfers, making resumption of communications with other nodes and of bundles much more efficient.

18. This approach also solves many of the authenticity issues with transfers in LBARD, where the transfer of a large bundle could be poisoned with a fake block, or even just by a bit error that escaped the error correcting code. This is a significant advantage of this approach.

19. This approach effectively trades-off storage space (which is cheap and plentiful) for algorithm complexity and RAM requirements (which is in limited supply on the kinds of small embedded devices we want to target, such as low-power meshing phones etc).  This is thus a very nice side-effect of this approach.

Anyway, I'll have a think about whether I implement this now or not, depending on how feasible it looks to get acceptable performance with the existing algorithm.

No comments:

Post a Comment