Monday, September 28, 2020

900MHz 802.11ah WiFi modules

Hmmm... So it looks like the first 900MHz WiFi modules are out, claiming 1 - 1.5km range outdoors, with data rates of megabits to tens of megabits in many circumstances.  The module I have found is this one:

https://www.silextechnology.com/connectivity-solutions/embedded-wireless/sx-newah

The only problem I see so far, is that there is no comment as to whether they can do ad-hoc mode, or only AP or station mode.

If they can do ad-hoc mode, then this is potentially very interesting for use in Serval Mesh Extenders.

More when I discover it... But if anyone else has experience working with these, I would love to hear from you.

UPDATE: It looks like it uses an NRC7292 chipset for the 802.11ah, for which source code is available:

https://github.com/newracom/nrc7292_sw_pkg

They already have monitor mode implemented in the chip, it seems, so it looks like it *should* be possible to implement ad-hoc mode on it.

Thursday, September 24, 2020

Chasing down the last LBARD HF TX queue bugs and Initial Performance Graphs

To recap, in the last blog post I found and fixed a lot of bugs in LBARD over HF data modem operation.  This has taken much longer than hoped, in part due to the 4 to 5 hours it takes to do a complete run.

The last fixes I discussed in the previous post have resolved the remaining TX queue shuffling bugs, so far as I can see. But in some runs I am still seeing that a few bundles are not transferred in one or both directions.  These are typically among the lowest priority bundles, but not strictly the lowest priority bundles.

For example, in the last test run, the following bundles are not transferred:

peer4random79

peer4random94

peer4random96

The files are named in increasing file size, from peer4random00 to peer4random99.  Thus the 94 and 96 files are among the lowest priority bundles, and the 79 file is reasonably close.

The first step to tracking down what happened with them is to get the bundle IDs (BIDs) that correspond to these files:

peer4random79 - 6CE2CB42* - bundle #21 on the sending side

peer4random94 - 469E2B18* - bundle #6 on the sending side

peer4random96 - 338FC557* - bundle #4 on the sending side

Armed with that information, I can see if those bundles get transferred at all.  Let's start with the 79 file.

At 17:53.27 the sender first realises that the recipient does not have the bundle. However, the TX queue is already full, so it doesn't make it onto the TX queue.

At 18:28.39 it gets queued again, and this time there is space in the TX queue, and it does get queued in position 3 in the TX queue.

At 18:28.39 bundle #66 gets queued, which has a higher priority, so our bundle is demoted to position 4 in the TX queue.

At 18:29.06 bundle #53 gets queued, which also has a higher priority, so our bundle is now demoted to position 5 in the TX queue.

At 18:29.06 bundles #27, #71 and #56 also get queued, and our bundle is now in position 8.

At 18:29.29 bundles #51 and #55 get queued, and our bundle is now in position 10. One more higher priority bundle, and it will get bumped from the TX queue altogether until the tree-sync process gets restarted.  This happens immediately, as bundle #59 gets queued.

At 19:06.57 our target bundle, #21, gets requeued, however the TX queue is at that time already full of higher-priority bundles, so it does not make it onto the TX queue.

At 19:40.05 it does finally make it back onto the TX queue in position 8.

At 19:40.14 bundle #41 gets queued, bumping our bundle to position 9.

Then at 19:40.49 bundle #27 gets queued, bumping our bundle to position 10, followed immediately by bundle #23, which causes our bundle to get bumped from the TX queue.

Our bundle is next attempted to be requeued at 20:13.42, but the TX queue is already full of bundles with equal or higher priority, so it doesn't get queued.

The next attempt is at 20:43.25. And this looks like it might be the problematic instance. The TX queue is not full, and our bundle has a higher priority than the TX queue's entries, and also compared with the bundle currently being sent.  As a result, it is immediately selected for transmission.  However, then bundle #28 also gets queued for sending, which bumps #21 from being the bundle currently being sent.  However, there is no code that then puts the bundle into the TX queue. 

As the TX queue had not overflowed since the last restart of the tree-sync process, no flag is raised to require the tree-sync to restart one last time, and so this bundle will never finish being sent.  And indeed this is what happens. Similar things happen for the other two bundles.

So the fix is relatively simple: When de-queuing a bundle that should still get sent, we need to insert it into the TX queue.  

I'm heartened that this fault seems to describe the last situation where this problem of bundles not being forwarded can occur.

The only immediate problem I can see, is that there is already code that is supposed to do this.  It turned out the problem was that I attempted to re-queue the bundle being sent before removing it as the bundle being currently sent. The queuing logic was noticing this, and not adding the bundle back into the queue, because as far as it was concerned, the bundle was already being sent.  
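To make the ordering concrete, here is a minimal sketch (in C, with hypothetical names rather than the actual LBARD identifiers) of clearing the "currently being sent" marker before re-queuing, so that the queue logic's "already being sent" guard doesn't silently drop the bundle:

```c
/* Hedged sketch of the re-queue ordering fix; names are illustrative,
   not the real LBARD identifiers. */
#include <stdio.h>

#define TXQ_SIZE 10

struct peer_state {
  int tx_queue[TXQ_SIZE];   /* bundle numbers queued for sending */
  int tx_queue_len;
  int current_tx_bundle;    /* bundle currently being sent, or -1 */
};

/* Queue a bundle unless it is already queued or already being sent. */
static void queue_bundle(struct peer_state *p, int bundle)
{
  if (p->current_tx_bundle == bundle) return;        /* already being sent */
  for (int i = 0; i < p->tx_queue_len; i++)
    if (p->tx_queue[i] == bundle) return;            /* already queued */
  if (p->tx_queue_len < TXQ_SIZE)
    p->tx_queue[p->tx_queue_len++] = bundle;
}

/* Pre-empt the current bundle with a higher-priority one.
   The bug: calling queue_bundle() while current_tx_bundle still points at
   the old bundle makes the guard above silently drop it.
   The fix: clear current_tx_bundle first, then re-queue the old bundle. */
static void preempt_current_bundle(struct peer_state *p, int new_bundle)
{
  int old = p->current_tx_bundle;
  p->current_tx_bundle = -1;              /* step 1: no longer "being sent" */
  if (old != -1) queue_bundle(p, old);    /* step 2: keep it for later */
  p->current_tx_bundle = new_bundle;
}

int main(void)
{
  struct peer_state p = { .tx_queue_len = 0, .current_tx_bundle = 21 };
  preempt_current_bundle(&p, 28);
  printf("now sending #%d, queued: #%d\n", p.current_tx_bundle, p.tx_queue[0]);
  return 0;
}
```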

By reordering this sequence of events, I was able to solve the problem, and it now reliably delivers every single bundle.  Together with all the performance fixes, we now have things at a state where they work reliably, which is great. So let's get some basic performance statistics from these runs:



We can see that there is considerable variation in the transmission times (and thus speed), with respect to the bundle sizes.  That is, for bundles of about the same size, the delivery time can vary by a factor of 5 or so.  

This is not unexpected, as the prioritisation of bundles during the tree-sync protocol can cause some bundles to be fully or partially transmitted before tree-sync realises that there is a higher priority bundle. When this occurs, the bundle that pre-empts the lower-priority bundle will appear to take longer to deliver, because parts of one or more other bundles will have been transmitted during the time accounted to the higher-priority bundle. Conversely, if a bundle has been partially transferred before being pre-empted by a higher priority one, the pre-empted bundle will appear to deliver more quickly, because only the time from when it resumes transmission will be counted.

Otherwise, we see that the average data throughput, excluding the approximately 400 bytes of Bundle Manifest, improves gradually with increasing bundle size. For larger bundles we see transfer speeds of around 40 bytes per second, which is about half of the channel capacity of ~93 bytes per second we discovered previously.  This is still not ideal, but given that it includes all the overheads for the transmissions to occur, to be prioritised, and for the synchronisation of bundle lists between the two parties, it's not too bad.  Note that these figures are for one direction only, with similar results for the opposite direction.

If we look at the overall bundle transfer times, we see that this varies between about 50 and 200 seconds, only somewhat influenced by the size of the bundle's payload.  This corresponds to message delivery rates between 2x (3600/50) = 144 and 2x (3600/200) = 36 bundles per hour over an HF link. That is, between ~850 and ~3,500 bundles per day.  Again, while higher figures would be welcome, being able to deliver one to a handful of text messages every few minutes would seem to be sufficient for delivering a useful level of communications between small isolated communities.  Round-trip message send and reply cycles would likely be of the order of 2 to 10 minutes, which again is sufficient to facilitate useful communications.

So at this point we have reliable and reasonably performant delivery of Serval Mesh traffic over Codan HF data modems.  The next step will be to try to setup a real HF link with two radios spaced many kilometres, or perhaps a couple of hundred kilometres apart, and see how it performs in practice with some real message traffic.

Otherwise, I'll now be turning my attention to implementing LoRa radio support for Mesh Extenders to help make lower-cost and more internationally portable Mesh Extenders, so that we can experiment with more interesting mixed-radio networks.  Because the Wi-Fi links between Mesh Extenders are self-organising, this will allow such mixed-media networks to be easily built. It will then be interesting to see how the system behaves with such networks.

After that, I will start looking at implementing "affinity tags" so that when we start connecting large numbers of such networks together, we can prevent traffic being replicated to places it doesn't need to go.  We should also finally implement the logic for detecting whether senders or recipients are likely to be located on the far end of an HF link, so that we can also factor that into the prioritisation.

Monday, September 21, 2020

Continuing working on the LBARD HF protocol efficiency problems

Ok, back from the last post where I was thinking about some longer-term future approaches, to trying to get the current code working as well as it can.

The problem I am seeing at the moment, is that the same blocks are being sent repeatedly, while other blocks are not being sent at all.  

One problem was that pieces of the manifest were being marked as sent, when they hadn't.  This was due to an out-by-one error in the comparison of the extent of a block being sent, compared with the actual data. The effect was when a block was sent, both it, and the following block were marked as having been sent.
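As an illustration of the kind of out-by-one involved (this is a hedged sketch, not the actual LBARD code), marking the 64-byte blocks covered by a sent byte range with the end boundary off by one also marks the following block:

```c
/* Hedged sketch of the out-by-one block-marking bug; 64-byte blocks,
   illustrative only. */
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 64
#define MAX_BLOCKS 32

/* Mark the 64-byte blocks covered by bytes [start, start+len) as sent. */
static void mark_sent(unsigned char *sent_bitmap, int start, int len, int buggy)
{
  int first = start / BLOCK_SIZE;
  int last  = (start + len - 1) / BLOCK_SIZE;   /* last block actually touched */
  if (buggy) last++;    /* the out-by-one: also marks the following block */
  for (int b = first; b <= last && b < MAX_BLOCKS; b++)
    sent_bitmap[b] = 1;
}

int main(void)
{
  unsigned char sent[MAX_BLOCKS];
  memset(sent, 0, sizeof sent);
  mark_sent(sent, 0, 64, 1 /* buggy */);
  /* Block 1 is now marked as sent even though only block 0 (bytes 0..63)
     went out, so block 1 never gets scheduled again. */
  printf("block 0 sent=%d, block 1 sent=%d\n", sent[0], sent[1]);
  return 0;
}
```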

This seems to be a common factor in many of the slow transfers.  It has only started showing up now, because we were previously testing with payload-less bundles, and so the manifest would keep being resent, which would then have a higher chance of re-sending the missing part through the normal cyclic re-transmission.

So let's set a test run going, and see if it solves the problem.

Well, it's helped, but close examination of the log files reveals that there are still pieces being counted against the wrong blocks, due to a duplicate code-path.  I've now de-duplicated that.  I also noticed that sometimes manifest or body pieces would be counted against the other as having been sent, which of course is also causing problems.  So I have fixed those as well. 

On the plus side, it's pretty clear how these sorts of bugs can be causing these problems: If we are marking things that have not been sent as having been sent, then we are not going to send those any time soon.  Instead, we will keep sending the same pieces over and over, because we think we haven't sent them.

I'm also seeing the situation that the sending of a bundle can get reset part way through, which then makes it forget the transfer bitmap information for that bundle, resulting in it re-sending pieces it has already sent. This might be because of an old feature that causes the tree-sync process to restart periodically, to avoid lock-ups in the protocol, which in turn can cause the selection of the bundle currently being sent to change.  However, the root causes for those lock-ups have all been fixed now, so that can probably be removed, if it is still there.  I'll investigate that while the new run with the previous bug fixes in place runs.

Yes, it now looks like the resetting of sending bundles is the most significant remaining problem. As mentioned, this causes the knowledge of the state of a transfer to be lost.  There are two approaches to solving this: 

1. Don't switch which bundle we are transferring. 

2. Remember the state of other bundles, so that we can continue from where we left off, if we do switch which bundle we are transferring.

Option 1 sounds great in theory, but the synchronisation process may well interrupt the transfer of a lower priority bundle with a higher priority one, for example.  So this is probably not possible to achieve in practice. Although I will instrument the changing of the current bundle being sent, to find out why it is changing.

Thus it probably makes sense to implement something to remember the progress of a bundle when switching.  This can be kept in a nice little cache somewhere.  Of course, this is one of several complications that would go away if/when I get to implementing the new approach I wrote about in the previous blog post.  But for now, the cache will have to do.
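A minimal sketch of what such a cache could look like, assuming a simple fixed-size slot array and 64-byte-block send bitmaps; all names and sizes are illustrative rather than LBARD's actual structures:

```c
/* Hedged sketch of a small per-bundle transfer-progress cache, so that a
   pre-empted bundle can resume where it left off; names are illustrative. */
#include <string.h>

#define CACHE_SLOTS 16
#define BITMAP_BYTES 32   /* enough for 32*8 = 256 x 64-byte blocks */

struct tx_progress {
  char bid_prefix[17];                 /* bundle ID prefix as hex text */
  unsigned char manifest_sent[BITMAP_BYTES];
  unsigned char body_sent[BITMAP_BYTES];
  int in_use;
};

static struct tx_progress cache[CACHE_SLOTS];

/* Save progress when we switch away from a bundle. A real implementation
   would evict the oldest slot; here we just reuse a matching or free slot. */
void progress_stash(const char *bid_prefix,
                    const unsigned char *manifest_sent,
                    const unsigned char *body_sent)
{
  for (int i = 0; i < CACHE_SLOTS; i++) {
    if (!cache[i].in_use || !strcmp(cache[i].bid_prefix, bid_prefix)) {
      strncpy(cache[i].bid_prefix, bid_prefix, 16);
      cache[i].bid_prefix[16] = 0;
      memcpy(cache[i].manifest_sent, manifest_sent, BITMAP_BYTES);
      memcpy(cache[i].body_sent, body_sent, BITMAP_BYTES);
      cache[i].in_use = 1;
      return;
    }
  }
}

/* Restore progress when we come back to a bundle; returns 1 if found. */
int progress_restore(const char *bid_prefix,
                     unsigned char *manifest_sent,
                     unsigned char *body_sent)
{
  for (int i = 0; i < CACHE_SLOTS; i++)
    if (cache[i].in_use && !strcmp(cache[i].bid_prefix, bid_prefix)) {
      memcpy(manifest_sent, cache[i].manifest_sent, BITMAP_BYTES);
      memcpy(body_sent, cache[i].body_sent, BITMAP_BYTES);
      return 1;
    }
  return 0;
}
```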

I've now implemented this caching of transfer progress.  However, the first run after implementing it showed a weird situation where bundles transferred quite regularly in one direction, but none in the other.   In the direction where traffic was flowing, I saw 6 bundles received fairly quickly, with all of them transferring in 50 to 80 seconds.  That's the good news.  However, there would then be a gap ranging from a couple of minutes to half an hour with nothing being transferred.  Then the same six bundles would be transferred again.

I think what happened is that one of the servald instances crashed or didn't start properly.  As a result one party had no bundles to share, and could not store them into its Rhizome database, and so the same transfers happened repeatedly.  That might have just been some freak event that caused one of the servald instances to not have started properly.  To find out, I'll run the test again, and see what happens.

Second run worked properly in that regard.  Some bundles transferred at a reasonable speed, but others took a VERY long time, up to an hour or so.  

One bug spotted: We don't mark the final body block as sent if it is a partial block. This looks to be able to cause the final block to be sent many times repeatedly, which is clearly not good.  I've tried to fix that now.

The other bigger problem, is that the bundle keeps getting de-queued and re-queued, which of course is causing a lot of delay in transfer.  It also seems that at least one code path can de-queue the current bundle being sent, without stashing its progress for later retrieval. 

Found at least one: The sending of a new generation ID by the sender, which causes us to restart the tree-sync process.  In that code, we stop sending the current bundle.  However, we should just keep on sending the same bundle, unless and until we find a better bundle to send.  This is likely the cause of considerable delays, because the synchronisation will run for potentially several minutes before it finds a new bundle to send. That's likely to be it.  We should probably reduce the generation ID rotation rate for HF, anyway, since it doesn't make sense to reset the tree-sync process so frequently on a high-latency link.

All these problems keep making me think about the new tree-based approach I wrote about recently, as there would be no need to do any of this, and resumption of transmission would happen in a much more natural and efficient manner -- even if the lbard process really did die and restart.

Anyway, back to the existing code, it now looks like transfers are happening reasonably sensibly.  In trying to make a script to show the sequence of transfers, I am however still seeing that the currently selected bundle for transmission tends to change more often than it seems it should.  This led me to look more deeply into the TX queue management, and I'm seeing some odd things there, where some bundles added to the queue seem to disappear, and others that were never reported as having been added to the queue turn up in there.  So I'll add some more debugging for that.

Yes: The TX queue is being completely purged from time to time. This is again the resetting of the generation ID, the unique identifier that is used to indicate when an LBARD instance restarts.  Apparently I was resetting the generation ID every four minutes.  Thus it is no wonder that the TX queues are getting purged and we are ending up stuck in a synchronisation hell.  Instead, the generation ID should only be changed when the TX queue at one end or the other has overflown, and thus been emptied, and so the sending side doesn't know which bundles it should have sent when the TX queue overflowed. 

Given that the transfers are now otherwise working fairly efficiently, I'm hopeful that this will get us to some kind of reasonably reliable transfer with steady transfer times for bundles of a given size.

Running more tests, I have found and fixed a few more problems:

1. If the carrier is lost in an HF modem call, we can end up in a situation where no more normal packets get sent, presumably because the number of unacknowledged packets is high, causing the packet rate to be practically zero.

2. Particularly with bundles that require many data packets, they end up slowing down in transfer over time, and eventually transfer halts.  The halting is probably due in part to (1), but the reducing transfer efficiency must have another cause.

3. Some received pure data packets are corrupted, which is a probable significant contributor to (2).

To fix (1), I have reset the TX sequence number and last acknowledged sequence number whenever an HF call starts.

For (2), I need to examine what is going on in the logs of runs, which will probably make more sense once I have fixed (3).

For (3), I suspect that the modem's TX buffer is being easily overwhelmed.  This is in spite of the fact that I have hardware flow control enabled.  It is possible the hardware flow control doesn't work properly with the USB serial adapters.  In any case, I have implemented a check before sending pure data packets, so that they will not be sent if no normal packet has been seen in the last 30 seconds.  This should stop the buffers from getting over-full.  But a fuller investigation will require that I log every byte sent to and received from each modem, and then compare the two streams to see exactly what kind of data errors we are experiencing. Is it lost bytes, as I am suspecting, or is it bit errors, for example?
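A minimal sketch of that hold-off check, assuming a 30-second window; the function and field names are illustrative, not LBARD's actual ones:

```c
/* Hedged sketch of the "don't flood the modem" check: only emit a pure data
   packet if a normal LBARD packet has been seen recently, which suggests the
   serial buffers are keeping up. Illustrative names. */
#include <time.h>

#define DATA_PACKET_HOLDOFF 30   /* seconds */

int ok_to_send_pure_data(time_t last_normal_packet_rx)
{
  time_t now = time(NULL);
  /* If nothing normal has arrived for 30s, assume the link (or the modem's
     TX buffer) is backed up, and hold off on extra data packets. */
  return (now - last_normal_packet_rx) < DATA_PACKET_HOLDOFF;
}
```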

So adding the logging of bytes sent/received, I can see that indeed something is going wrong with the pure data packets almost immediately.  This makes it much easier to see what is going on, since I don't have to wait an hour or more each time. Now to look at the TX and RX logs to see what is happening...

We are indeed missing large slabs of data.  Taking a look at the FTDI USB serial adapter documentation, it looks like you cannot safely write more than 32 bytes at a time if using hardware flow-control, as that is the size of the FTDI chip's internal buffer.  So I've modified the serial writing code to write 16 bytes at a time, to see if that fixes the missing bytes problem. However, the code to back off due to hardware flow control doesn't seem to get triggered at all, and the problem with corrupt data continues.  Found one problem with the TX and RX logs, where the pure data packets weren't being included.  Fixed that, now let's see how it looks...
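For reference, a hedged sketch of the chunked serial write, assuming a POSIX file descriptor; the chunk size and function name are illustrative choices, not the actual LBARD code:

```c
/* Hedged sketch of writing to the modem's serial port in small chunks, on
   the assumption that the FTDI adapter's internal buffer is easily overrun.
   Illustrative only; error handling is minimal. */
#include <unistd.h>
#include <errno.h>

#define SERIAL_CHUNK 16

/* Write buf[0..len) to fd at most SERIAL_CHUNK bytes at a time. */
ssize_t serial_write_chunked(int fd, const unsigned char *buf, size_t len)
{
  size_t written = 0;
  while (written < len) {
    size_t n = len - written;
    if (n > SERIAL_CHUNK) n = SERIAL_CHUNK;
    ssize_t r = write(fd, buf + written, n);
    if (r < 0) {
      if (errno == EAGAIN || errno == EINTR) continue;  /* flow control / retry */
      return -1;
    }
    written += (size_t)r;
  }
  return (ssize_t)written;
}
```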

Ok, so now I am not seeing any more lost bytes at the start, but the pure data packets were still being munged.  That problem turned out to be caused by the truncation of the data packets on the receive side. I also found and fixed a couple of segmentation fault bugs in the process.  Now re-running...

Still seeing corrupted packets. It turns out the pure data packets were not being properly escaped before sending, thus if they contained a "!" character in their data or header, it was messing things up.  Fixed that now, too.  This could well be causing the failure to transfer larger bundles efficiently, as it would effectively stuff up the transfer of larger pieces more often.  So it's satisfying to have found this problem.  Let's see what it does to the transfer rates...
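To illustrate the idea (the escape sequence used here is made up for the sketch, not LBARD's actual wire format), the point is simply that a literal "!" must never appear inside a packet body:

```c
/* Hedged sketch of escaping the '!' marker byte inside pure data packets.
   The actual escape scheme LBARD uses may differ; this just shows the idea
   of never letting a literal '!' appear inside a packet body. */
#include <stddef.h>

/* Escape '!' (0x21) as the two bytes '!' '.' ; returns escaped length,
   or 0 if the output buffer is too small. */
size_t escape_packet(const unsigned char *in, size_t in_len,
                     unsigned char *out, size_t out_max)
{
  size_t o = 0;
  for (size_t i = 0; i < in_len; i++) {
    if (in[i] == '!') {
      if (o + 2 > out_max) return 0;
      out[o++] = '!';
      out[o++] = '.';
    } else {
      if (o + 1 > out_max) return 0;
      out[o++] = in[i];
    }
  }
  return o;
}
```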

Yes, bundles now seem to transfer in between 30 and 200 seconds. However, now I am seeing that the same bundles get sent more than once.  I suspect that this is due to the synctree reset that happens when the TX queue is emptied, but overflow of the TX queue was previously recorded.  This tells LBARD that it needs to start the sync again. However, if the bundles were received, then they should not get transferred again.  Looking in the logs, I am seeing HTTP 400 errors when trying to insert the bundles into the rhizome database. Thus we are still having a problem with corruption during transfer. I'll have to look at the reassembly of bundles. My suspicion is that the problem will be with the pure data packets somewhere.

The corruption problems have now been solved. Part of it was that writing in 16-byte pieces helps the serial over USB transfers, but messes up the HTTP transactions.  So I have split the code to only do that for the serial transfers.  That fixed the problem with bundles not making it into the Rhizome database, and transfers now proceed progressively through more than just the same 10 bundles repeating. 

However, I am still seeing the HF call dropping out sometimes. Those tend to recover fairly quickly, and might be legitimate loss of carrier in the radio setup.

More concerning though, is that there can be a period of half an hour or more where no transfers in either direction occur.  These are quite mysterious, as the HF modem doesn't seem to drop out, but the logs of both sender and receiver simply show nothing at all for a period of 46 minutes in this particular case.  I guess I will need to run it again, and try to catch when this happens.  

This time we hit a segfault after almost an hour and transferring 80 bundles (between both directions) in about 64 minutes. The problem is that sometimes rubbish ends up in the TX queue.  More specifically, the priority of a bundle ends up being stored in the bundle ID field of a slot in the TX queue. 

So I've put more debug instrumentation in to find out how the TX queue list is getting corrupted. In the process, I also fixed the case where if a bundle is currently being sent, it could still end up being queued for later transmission as well, e.g., if the tree-sync process is restarted at some point.  

I've also added code that detects when the top of the TX queue ends up containing an illegal bundle number to reduce the impact. However, I would really like to fix the error at its cause.  What I believe is most likely the problem, is that the shuffling of the TX queue has an out-by-one error somewhere, as the bundle numbers occur immediately before the list of priorities. Thus if one too many entries were copied down, it would copy a bundle priority.
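A hedged sketch of the shuffle-down with the count correct; the memory layout here is illustrative, but it shows how copying one entry too many would drag a priority value into the bundle-number slots:

```c
/* Hedged sketch of the TX queue "shuffle down". The layout is illustrative:
   the bundle-number list sits immediately before the priority list, so
   copying one entry too many during the shuffle pulls a priority value into
   the bundle-number slots -- the "rubbish bundle number" seen in the logs. */
#include <string.h>

#define TXQ_SIZE 10

struct tx_queue {
  int bundle_numbers[TXQ_SIZE];   /* bundle numbers ...                     */
  int priorities[TXQ_SIZE];       /* ... immediately followed by priorities */
  int len;
};

/* Remove entry i by shuffling later entries down one place. The correct
   count is (len - i - 1); the bug was effectively (len - i), one entry too
   many, which reads past the end of the bundle-number array. */
static void dequeue_at(struct tx_queue *q, int i)
{
  int count = q->len - i - 1;
  memmove(&q->bundle_numbers[i], &q->bundle_numbers[i + 1], count * sizeof(int));
  memmove(&q->priorities[i],     &q->priorities[i + 1],     count * sizeof(int));
  q->len--;
}
```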

So time to start a new test run...

It's now running through a lot happier.  157 of the ~220 bundles have transferred, but it has taken about 4 hours to get that far. But it's not as bad as it sounds, as it seems that my laptop somehow stops USB communications while the display is locked.  This means that if I don't periodically prod the computer, long periods of no transfers result.  I'm pretty sure that this is the cause of this problem that I noticed previously.

That said, I am also seeing it able to get into a hang-up and redial loop, where no packets get sent in the meantime.  I assume that the time to next packet TX is somehow being set wrongly. Or more to the point, it can grow large, and wasn't being reset when a call was re-established.  I've examined all the code-paths that can get there, and added explicit resetting of the next packet TX time.

I tracked down the "nothing sent for an hour or more" problem: It's actually not with LBARD, but with the laptop I am using.  It automatically suspends after 2 hours of inaction, even when connected to mains power.  This doesn't really help when it happens mid-run.  So I've disabled that in the laptop, and should now be able to run transfers without this problem.

I've also looked through and found the problem that was causing the illegal bundle numbers ending up in the TX queue, and fixed that. The problem was indeed an out-by-one with the list shuffling code, as expected.  I'm expecting that this will likely fix the problem where only 199 or so of the 202 bundles were being transferred, as it could result in the situation where the TX queue loses an entry or two without actually overflowing to trigger the re-start of the sync process to re-discover the bundles that were lost from the queue.   Thus I'm hoping that runs now will correctly transfer all 202 bundles -- 101 in each direction.  It might also end up being marginally faster, because fewer sync restarts will be required.

In short, I'm now hoping that I am at the point where transfers run reliably and in reasonable time.  I'll do a few runs to get some data on the typical speed and variation, but then we should be good, I think.  What I would like to do next is to plan some actual long-distance tests using the HF radios, with one here at the house in Akaroola and another some distance away.  Perhaps in a neighbouring town.  However to do that, I will need to get some antennae and coordinate with folks in the amateur radio community to help with that.

But as this post has already gone on for ages, I'll stop here, and start a new post looking at the remaining TX issues.

Tuesday, September 8, 2020

Thinking about different ways to do the tree-sync for LBARD

In the last few blog posts I've been fighting my way through fixing bugs with LBARD's synchronisation algorithms.  While both the tree-sync and the actual transfer of bundles have been problematic, the tree-sync side of things has been niggling at me for HF use, in particular. One of the reasons is that tree-sync has to start its synchronisation again from scratch every time a radio connection is made. This already takes several minutes, with only relatively small numbers of bundles.  Even if neither end has new bundles, the entire synchronisation process has to occur again.  Also the very high latency of the HF radios doesn't really agree with tree-sync, and can result in considerable delays in synchronisation.

Way back around 2014 we had started thinking about reworking the way that Rhizome stores data, by having it use a block-store to represent the bundles. That is, instead of keeping a manifest and payload, we would have a kind of virtual file-system in which the bundles would be represented.  Data blocks would be identified by the hash of their contents, thus providing automatic de-duplication, and making it much easier to cache data.

Now, reworking all of Rhizome to do that would require significant effort.  However, what *might* be doable in reasonable time, is to make the list of bundles an instance holds be represented in such a way.  This would allow each end of an HF link to cache this virtual file-system representation, and thus re-synchronise very rapidly if none or few new bundles have appeared at each end.

My current thinking is along the following lines:

1. Blocks should be small enough to fit in a single LBARD packet. This realistically means a limit of about 200 bytes.

2. We want to minimise the number of nodes in the tree, as each additional node requires an extra round-trip.

3. The small block size means that we need to use relatively small hashes for each block.  Even if they are too short for strong cryptographic security, they should still be strong enough to keep the chance of accidental collisions practically zero.

4. Each block can consist simply of leaf nodes or pointers to child nodes, or a mix of the two. The position of the node in the tree is implicit.

5. It would be nice to have one hex digit of identifier covered per block.  This means 16 pointers to sub-blocks have to be able to fit in a block.  200 bytes / 16 = 12.5 bytes per entry.  This probably means that 12 byte hashes = 96 bit hashes will be the longest possible.  That's ample to keep us safe from Birthday Paradox accidental node collisions (should be safe for up to 2^48 blocks).  (A rough struct sketch of this layout appears after the list.)

6. If all the child nodes for a block can fit in a block, then that is where they are placed.  This avoids having unnecessary intermediate nodes in the tree that would otherwise have to be synchronised.

7. If all the child nodes can't fit in a single block, then we need to have a deterministic way to represent this, so that we don't end up with a proliferation of blocks as we fill nodes.

8. The simplest approach to (7) is to require that an overfull block be split into child blocks for each of the 16 entries it can hold.  This may mean that some child blocks point to the empty node (which, because of the hashing, will always be the same).

9. The above formulation means that each bundle (or more correctly, version of bundle), will have a deterministic node, which each side can enter into their block cache, thus avoiding the need to ever transfer a block that corresponds to a bundle that the other party already has.

10. Each side can request blocks from the other, to begin exploring the tree structure. Any common sub-trees between the two sides will be implicitly available.

11. The block cache can persist between executions of LBARD, thus allowing the complete state of the other party to be remembered between radio sessions, thus greatly speeding up the initial synchronisation when a radio session commences.

12. Multiple blocks can be requested at a time, thus helping to mask high latency.  In fact, simple optimisations like "please send me block X, plus any children (and optionally further descendents) of it that you don't think that I have" can be used to save further round-trips.

13. When a block is received, its position in the tree is known. This means that in some cases, it will be possible for the recipient to work out the content of the child nodes, if the sender has a strict sub-set of the recipient's bundles in that part of the tree.  At its simplest, a leaf node can be constructed for every combination of the bundles that it contains.  If any of those match the block ID of the child node indicated in the received block, then there is no need to receive those child-blocks.  Instead, not only can we avoid the need to send those blocks, we can actually immediately infer that the sender is missing those bundles not in the combination that resulted in the matching child node.  

14. The approach from (13) can be extended further up the tree if desired, to help save bandwidth, at the expense of storage space and computation.  At some point, however, the number of combinations that need to be tested becomes excessive, as each additional bundle in the portion of the tree being considered potentially doubles the number of combinations to be computed.  However, because each leaf node is restricted in its potential content, the growth in complexity is actually rather different (leaf nodes are only formed when the collective leaves don't fit in a single block).  The exact complexity of this will need to be modelled.

15. In theory, if the complexity of the above can be handled, then sending just the top of the tree will allow the recipient to infer the contents of any node that contains only a subset of the bundles the recipient possesses, without any further data exchange.  This is more of theoretical interest than practical effect, except for the leaf-node optimisation described in (13).

16. The bundles themselves can also be represented as tree structures, so that similar techniques can be used.  This also allows for implicit transfer of common data sections, including in journal bundles that have invariant starts.  Of course, this ability to transfer common data sections will only work for those that have such commonalities that are aligned with the block size, or better, aligned with a leaf-node of blocks.

17. This approach allows for implicit caching of all transfers, making resumption of communications with other nodes and of bundles much more efficient.

18. This approach also solves many of the authenticity issues with transfers in LBARD, where the transfer of a large bundle could be poisoned with a fake block, or even just by a bit error that escaped the error correcting code. This is a significant advantage of this approach.

19. This approach effectively trades off storage space (which is cheap and plentiful) for algorithm complexity and RAM requirements (which are in limited supply on the kinds of small embedded devices we want to target, such as low-power meshing phones etc).  This is thus a very nice side-effect of this approach.
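To make points 5 and 8 above concrete, here is a minimal sketch of what such a block layout might look like; the field names and exact sizes are illustrative assumptions, not an implemented format:

```c
/* Hedged sketch of the proposed block layout: one hex digit of bundle ID per
   tree level, up to 16 child pointers of 12-byte (96-bit) hashes per block,
   all fitting within a single ~200 byte LBARD packet. Illustrative only. */
#include <stdint.h>

#define BLOCK_MAX_BYTES    200
#define CHILD_HASH_BYTES   12    /* 96-bit truncated hash           */
#define CHILDREN_PER_BLOCK 16    /* one per hex digit of identifier */

/* An interior node: 16 child hashes, 16*12 = 192 bytes of payload. */
struct tree_block_interior {
  uint8_t child_hash[CHILDREN_PER_BLOCK][CHILD_HASH_BYTES];
};

/* A leaf node simply lists the bundle (version) hashes that live under this
   prefix, as many as fit; if they don't fit, the node is split into 16
   children instead (point 8 above). */
struct tree_block_leaf {
  uint8_t entry_count;
  uint8_t bundle_hash[CHILDREN_PER_BLOCK][CHILD_HASH_BYTES];
};

/* Compile-time check that an interior node fits in a single LBARD packet. */
_Static_assert(sizeof(struct tree_block_interior) <= BLOCK_MAX_BYTES,
               "interior node must fit in one packet");
```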

Anyway, I'll have a think about whether I implement this now or not, depending on how feasible it looks to get acceptable performance with the existing algorithm.

Monday, September 7, 2020

More work on LBARD over Codan HF Data Modem

In the last post, we got to the point where we had Rhizome bundles flowing over the modem, but not as fast as we would like.  Also, we were still seeing some problems where bundle sending would get stuck.  I've been away for a couple of weeks, but now am back at the bench with the HF radios, and can investigate further.

The first thing I realised that I should investigate, is what the actual throughput of the modems is, so that I can determine what fraction of the theoretical throughput we have available.

To do this, I instrumented LBARD to count the number of bytes, and divide that by the number of seconds since the first byte was received.  This yielded results of about 91 bytes per second in each direction.  Given the channel theoretical capacity of 2,400 bits / sec = 300 bytes / sec, this is noticeably lower than expected.  It means that we are only actually transmitting data (91+91)/300 = ~60% of the time. 

Now, if we were just transferring in one direction, instead of switching all the time, we could surely improve that, but we know that there is less than a factor of 2 available through such optimisation.  So I'm not too worried about that right now.

What is more interesting is that we should be able to get near to 90 bytes per second, but are still languishing down around the 20 bytes per second mark due to protocol issues. That is, we have a factor of 4 that we can improve, without having to increase the channel capacity or change our channel-use strategy at all.

Having spoken with one of my students who is working on a debug framework for LBARD, it is sounding to me like the packet bitmap that is used for acknowledgement and scheduling of parts of a bundle that the receiver doesn't already have is not working properly.  If this is the case, it would be causing the sending side to send parts of the bundle that the receiver already has, effectively wasting bandwidth. 

Looking through logs of previous runs, I can see one problem already: We schedule sending a bundle at the end of the file, even if we have already sent to the end of the file.  That is, we send 0 bytes of data, positioned at the end of the bundle payload.  This is clearly silly, and should be fixed. What we need to watch out for, though, is that we use end-of-payload packets to implicitly indicate the payload length in LBARD transfers. 

We are also seeing the 4 data bytes being sent quite a number of times, before the missing parts of the manifest are sent through.  That resending needs to be fixed. But it also occurred to me that if we randomly select a point in the last 64 bytes of the manifest or payload, then we will only send <64 bytes, when typically we have space in the packet for 128 or 192 bytes.  I suspect that this has the potential to greatly improve latency for small bundles, by basically eliminating such very low efficiency packets.  Together with preventing the re-transmissions mentioned in the last paragraph, this should be able to get the latency of a sub-1KB bundle down to perhaps 15 seconds, which would be a great improvement.

I'll try these improvements, together with some tweaking of debug output, so that I can more easily follow when particular bundle pieces are sent, and what the list of candidate blocks at a given packet transmission are.

This should have been a simple re-run, but then I hit a segfault caused by something unrelated, and now running it in gdb all sorts of strange things are happening, causing one of the instances to not properly switch to the online data mode when the modems connect.  Well, I found what caused the whole crazy problem to start: When I copy-pasted the arguments into gdb, I left the "1" off the front of "127.0.0.1", thus causing long delays while waiting for "27.0.0.1" to resolve, which caused the protocol to time out in various exciting ways.  So let's try that all again...

Ok. Now it is running again, after fixing that.  I also found and fixed some other bugs.  It is now transferring small bundles in typically 20 - 50 seconds.  This could be quicker, but it is a start.  However, I am seeing instances where a bundle is received, but then fails to insert into the Rhizome database with an error 422. This means that either the manifest is too big, or that the payload doesn't match the hash.  The bundle is then received a second time, and gets inserted. This means that there is some problem with the integrity of the transfer.

My first suspicion was that the pure data packets I implemented recently might be the problem.  But in this case, no such packets had been sent at all for the bundle in question.  As the packets are in any case protected by Reed-Solomon error correction, this would seem to suggest that the contents of the packets were being incorrectly assembled.

Actually, it turns out that this happens fairly often.  This is good, in that it probably explains a lot of the poor transfer performance: Some bundles are having to be sent two or more times, before actually being accepted. So now to find out why.

It looks like decompression of the bundle manifests fails, resulting in the end of the manifest being messed up, and not passing its cryptographic self-consistency checks.  It is the end of the manifest that gets messed up, so I am guessing that somehow or other either the length of the manifest, or the last partial 64-byte block of data for the manifest, doesn't get properly received.  It should hopefully not be too hard to find, as it is happening quite frequently, with probably 20% of bundle transfers being affected by it.

To track it down, I'll make both sender and recipient parties display the compressed manifest data, so that I can play spot-the-differences.

So looking at the compressed manifest on reception, it looks to be the right size, but has all zeros in the last data blocks. The problem was that when I made the optimisation to send not just the last fractional block, but as many of the 64-byte blocks that come before it as can fit, I hadn't adjusted the start point of the packet data being read, thus it was missing the extra bytes.  So time for another run, to see how it goes.
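A hedged sketch of the corrected start-point calculation, assuming 64-byte blocks; the function name and structure are illustrative rather than LBARD's actual code:

```c
/* Hedged sketch of the end-of-manifest packing fix: when there is room for
   more than just the final fractional 64-byte block, the read offset has to
   be wound back to cover the extra whole blocks as well. Illustrative names. */
#include <stddef.h>

#define BLOCK_SIZE 64

/* Given the total manifest length and the bytes of packet space available,
   return the offset to start reading from so the packet carries the final
   fractional block PLUS as many preceding whole blocks as fit. */
size_t tail_send_offset(size_t manifest_len, size_t space_in_packet)
{
  size_t tail = manifest_len % BLOCK_SIZE;        /* final fractional block */
  if (tail == 0) tail = BLOCK_SIZE;               /* or a whole final block */
  size_t send_len = tail;
  /* add preceding whole blocks while they fit in the packet */
  while (send_len + BLOCK_SIZE <= space_in_packet && send_len < manifest_len)
    send_len += BLOCK_SIZE;
  if (send_len > manifest_len) send_len = manifest_len;
  return manifest_len - send_len;                 /* the corrected start point */
}
```

The bug was, in effect, continuing to read from the old tail-only offset while advertising the longer length, so the earlier blocks in the packet arrived as zeros.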

Right. So now we don't seem to have the problem with the code 422s, which is good.  However, we have the problem that sometimes we keep receiving a bundle repeatedly after receiving and registering it.   This seems to be because we don't get a confirmation message from servald that the bundle is now in our rhizome database.  As a result, we don't cull the duplicate reception of the bundle. It also means that we can't immediately start sending the bundle to other peers. 

This isn't such an issue for the HF modem driver, because we don't have other peers to talk to. However, for Mesh Extenders with the UHF packet radios, this is a big problem, as bundles won't get forwarded on to other peers until this occurs.  Thus we need to find and fix this problem.

LBARD communicates with servald via servald's HTTP RESTful interface.  In particular, it listens for new bundles via an asynchronous API, which means that we should get indication from servald essentially immediately on inserting the bundle.  So I presume something is fishy in that.  To investigate this, I have instrumented the code that communicates with servald, to find out where the problem is.

While looking at the logs for these delayed bundle notifications, I am also seeing that some bundle transfers are slow, as one end is not sending any data packets at all to the other end for 10 -- 20 seconds at a time. This should never happen, especially once the initial synchronisation has occurred, that lets each end know what the other needs.  In the current case, I am seeing a delay of 24 seconds between when successive pieces of a bundle are sent.  So we know that the remote end thinks it has a bundle that is worth sending.

I'll start by investigating one of these intervals, from 17:51.02 to 17:51.26.
During that time, we do receive several packets from the sender at 17:51.10, 17:51.12, 17:51.15, and 17:51.21. That is, we have received four consecutive packets that seem to contain no pieces of bundles.  Okay, so it seems that the synchronisation process is continuing to run, and is sending packets of ~200 bytes of synchronisation information.

Thus these delays are a false alarm, caused by the natural synchronisation process.  What is a bit worrying though, is that the synchronisation procedure is taking a long time. With 100 unique bundles on each side, synchronisation is taking at least tens of minutes.  In this example, I am seeing of the order of 300 sync messages of close to 200 bytes each -- and it shows no sign of settling down.
This amounts to somewhere around 68 KB of data out of a total of 84KB sent.

Well, that's very interesting: It explains why our goodput is so low, and generally, why transfers are going so slowly.

Thus together with the Rhizome bundle notification delay, we have two bugs that are able to greatly slow things down.  I now have something quite definite to concentrate on.

I'll still work on the Rhizome problem first, because I am much more familiar with that code. 

The synctree code is going to be trickier to fix, because I didn't write that. What I suspect might be the issue with it, though, is that the sync tree code doesn't expect to be used in an environment where multiple packets get queued up.  This means that when it sees a response to an old packet of its own, it might be getting confused, and starting the tree traversal from higher up the tree again.

Thus the potential fix for this is to find a way to make sure we have only one packet in flight in each direction at a time. This would be a good idea in any case, because it would help to minimise the latency when a bundle is received, so that the sending party doesn't keep needlessly sending parts of that bundle after it has been received.
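A minimal sketch of a stop-and-wait gate of this kind, with an assumed (not actual) timeout value and illustrative names:

```c
/* Hedged sketch of a stop-and-wait gate for sync packets: only build a new
   packet once the previous one has been acknowledged, or a timeout expires,
   so the tree-sync code never sees replies to stale packets. The timeout
   value and names are illustrative assumptions. */
#include <time.h>

#define SYNC_PKT_TIMEOUT 20   /* seconds; assumed, not LBARD's actual value */

struct sync_tx_state {
  int awaiting_ack;           /* 1 while a sync packet is in flight */
  time_t sent_time;
};

int may_send_sync_packet(struct sync_tx_state *s)
{
  if (!s->awaiting_ack) return 1;
  if (time(NULL) - s->sent_time > SYNC_PKT_TIMEOUT) {
    s->awaiting_ack = 0;      /* give up waiting; allow a fresh packet */
    return 1;
  }
  return 0;
}

void note_sync_packet_sent(struct sync_tx_state *s)
{
  s->awaiting_ack = 1;
  s->sent_time = time(NULL);
}

void note_sync_ack_received(struct sync_tx_state *s)
{
  s->awaiting_ack = 0;
}
```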

Well, actually I am going to fix this one-packet-in-flight issue first.  But first, a brief explanation as to why we do this syncing this way in the first instance, rather than, say, just exchanging lists of bundles, and then sending the bundles that the other side needs to get -- because that would surely be more efficient.

The reason is that a simple "exchange lists" and then bundles method would be fine if no new bundles ever arrive during a communications session.  But we want to be able to deliver messages with reasonable latency, even if they arrive while a connection is active.  That is, we would like to have live HF links between Serval Mesh islands, and have those act as near-real-time data conduits.

Now, back to the need for the sync algorithm to not be several packets behind, I have implemented this, and it seems to be working somewhat: The sync traffic is flowing, and is no longer filling the packets.  However, the rate of packet exchange is VERY slow.  The average raw data rate in each direction is only about 8 bytes per second. It looks like each end thinks that it has 2 unacked packets nearly the whole time, which is causing it to throttle back the packet rate.  I am suspecting that the low data rate is interacting in a bad way with the modems trying to be efficient, and waiting until there is enough data to be bothered sending.

But back to thinking about having only one packet in flight in each direction at a time, this means we will always end up with some wasted time, because one end or the other will almost certainly be waiting most of the time.  Investigating further, the round trip time for a packet turns out to be around 15 seconds (!!).
Even allowing for two maximum size packets of 255 bytes, the propagation time would still only be (255x2 bytes)/(2400 bits per second) = 1.7 seconds.  This makes me all the more suspicious about the modems trying to batch data together for efficiency, which in other cases would work.

So we need a solution to this problem.  The previous work I did on inserting the occasional pure data packet might actually be a good solution here: Basically it means that we have a background level of sync exchange through the existing packets, and then we greatly accelerate the exchange of bundles via the pure data packets, with the added benefit that they should help to flush the modem buffers.
                    
I'm now reading through the manual for the modems, with a view to disabling compression, in case the compression is causing increased latency. In the process, on page 6-2 of the "HF Data Modem 3012 Reference Manual", I have found some helpful information: While the data channel can in theory carry up to 2,400 bits per second, the maximum effective throughput is only 1,475 bits per second, i.e., 184.375 bytes per second.  This explains why I was seeing data rates of about 92 bytes per second in each direction.  While the extra bandwidth would have been nice, it does at least give me a clearer idea of what the maximum realistic capacity is.

On page A-3, I also found *T, which allows setting the timeout before sending data packets.  This is important, because any non-zero value will increase the latency of packet transmission. This setting can be set from 0ms to 120000ms.  The default is apparently 0ms. I'll set it explicitly in any case, just to make sure... Except the modem complains about the command.  It is oddly listed as only *T, not AT*T, but *T also does not work. I tried *0, in case the T is supposed to be the value itself.

Also AT&M=5 is supposed to select the highest data rate and optimise for interactivity, i.e., reduced latency. That was already set.

Doing some more direct tests with the modems, I timed how long it takes after I paste a 62 byte string into one modem, before it appears at the other end.  This reliably takes 7 to 8 seconds, thus confirming this latency problem.

The issue we have is that our synchronisation protocol currently does not support having multiple packets in flight, as it was designed with radios in mind that can reliably indicate when they are ready to send a packet, and then send the packet immediately, thus allowing each packet to be built just-in-time, to ensure it has the most appropriate content in it.

In fact, just pressing a letter and waiting for it to appear at the end has the same latency of 7 - 10 seconds, so it isn't a bandwidth thing.  Rather, it seems that the modems make at least one full turn-around each, before the data gets sent.  Is this normal?

I did also find the command to disable data compression, AT%C0, and I am now testing that to see if it makes any noticeable difference.
(Another useful command I have found is AT&C=n,n,n,n,n, which allows providing the modem with a list of channels on which to listen for incoming calls, which I'll use later, and record here so that I don't forget about it ;).

But back to AT%C: Disabling data compression DOES make a difference: It drops minimum latency from around 7 - 10 seconds, to only about 4 seconds.  That's a big help, and should greatly improve our throughput.  With 4 seconds in each direction, and sending packets of up to 232 bytes, this should get us a data rate of up to 58 bytes per second of the theoretical 92 bytes per second available.

Thus, while it is a good step forward, it isn't enough to completely solve the problem.  We still will need to send some pure data packets to make full use of the bandwidth.  Part of me also wonders if disabling data compression might help with the buffer overrun problem we were having previously, that was limiting the size of the data packets we could send between normal LBARD packets. 

But before we make further changes, we should time how long it is taking to send small bundles (without payloads in this case), so that we have a base line to compare against.  Having ~4 second latency in each direction instead of ~8 seconds, means that we will detect the end of bundle transmission on average 8 seconds sooner, which should provide a noticeable improvement.

Sending only the normal LBARD packets, without any extra data packets, and it fails, because of the previously reported problem with payload-less bundles.  I've now put a fix in for that, and that helps.

I'm also seeing another problem, which is that when the receiver tries to send a Bundle Announce Response (BAR) to indicate to the sender that it has received the bundle, we hit a problem if the packet contains another piece of the same bundle.  That 2nd piece causes the receiver to emit an ack for that piece, which replaces the BAR in the transmission queue. As a result the sender never hears that the receiver has the bundle, and keeps sending pieces of it.  This is likely a key contributor to the highly variable receive times we have been seeing for bundles.

There are two fixes for this that should both be done (sketched below):
1. ACKs should never replace BARs in the progress report TX queue.
2. We should simply ignore pieces of bundles we have marked as recently received, and never generate the ACKs for them, but instead produce BARs for them, to indicate that we have already received that bundle.
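A minimal sketch of rule 1, assuming a single progress-report slot per bundle; the message types and names are illustrative, not LBARD's actual ones:

```c
/* Hedged sketch of the progress-report queue rule: a plain ACK must never
   replace a queued BAR for the same bundle, since the BAR is what tells the
   sender to stop transmitting it. Illustrative types and names. */
#include <string.h>

enum report_type { REPORT_NONE = 0, REPORT_ACK, REPORT_BAR };

struct progress_report {
  enum report_type type;
  char bid_prefix[17];
};

/* Offer a new report for a bundle's slot; returns 1 if it was stored. */
int offer_report(struct progress_report *slot, enum report_type type,
                 const char *bid_prefix)
{
  /* Never downgrade a BAR ("I already have this whole bundle") to an ACK
     ("I got that piece"). */
  if (slot->type == REPORT_BAR && type == REPORT_ACK)
    return 0;
  slot->type = type;
  strncpy(slot->bid_prefix, bid_prefix, 16);
  slot->bid_prefix[16] = 0;
  return 1;
}
```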

With that fixed, along with some other tidy-ups, we are now ready to do the comparison again.  We will count the number of zero-length payload bundles (which thus involve sending about 200 -- 220 bytes of compressed bundle manifest, and all the house-keeping) that we receive in 1,000 seconds, and the total bytes transferred in that time, so that we can calculate the average bundle transfer time, and the goodput. 

Again, we will do this with no extra data packets, then with a 128 byte data packet after each normal LBARD packet, and then again with sending a data packet every 2 seconds, independently from the LBARD packets. We will average for both directions over our test between the two radios.

With 0 byte payloads, i.e., just manifests:

No data packets:
A -> B : 16.66 bytes per second, 13 bundles transferred
B -> A : 16.93 bytes per second, 13 bundles transferred
Sending 1 x 128 byte data packet after each normal LBARD packet:
A -> B : 25.74 bytes per second, 19 bundles transferred
B -> A : 25.26 bytes per second,  21 bundles transferred
Sending 1 x 256 byte data packet after each normal LBARD packet:
A -> B: 33.22 bytes per second, 26 bundles transferred
B -> A: 33.06 bytes per second, 33 bundles transferred

Sending 1 x 256 byte (including headers) data packet every 2 seconds, as well as normal LBARD packets:
A -> B : Modems hang up after a while, possibly due to data overrun, as bytes per second climbs to ~90 bytes per second.
B -> A : Modems hang up after a while, possibly due to data overrun, as bytes per second climbs to ~90 bytes per second.
(This makes sense, as ~90 bytes per second is right at the ~92 bytes per second effective capacity of the channel in each direction, leaving the modem buffers no headroom.)
Sending 1 x 200 byte (including headers) data packet every 3 seconds, as well as normal LBARD packets:
A -> B: 48.13 bytes per second, 29 bundles transferred
B -> A: 48.18 bytes per second, 26 bundles transferred
Sending 1 x 256 byte (including headers) data packet every 3 seconds, as well as normal LBARD packets:
A -> B:  78.37 bytes per second,  23 bundles transferred
B -> A:  80.62 bytes per second,  21 bundles transferred
Sending 1 x 200 byte (including headers) data packet every 2 seconds, as well as normal LBARD packets:
A -> B: 68.59 bytes per second, 23 bundles transferred
B -> A: 68.86 bytes per second, 23 bundles transferred

Note that just because we allow a 256 byte data packet, doesn't mean that one will get sent: The actual packet sent might be quite a bit smaller, if there isn't enough data for the bundle currently being sent to stuff in.

We also see that in some cases, the bytes per second is higher, while the number of bundles transferred does not increase, or in fact goes down.  This will most likely be due to the re-transmission of the same data blocks over and over, while waiting for acknowledgement of reception of the whole bundle.  Indeed, switching from 256 byte to 200 byte data packets with the same 3 second interval, we see no decrease in bundles transferred, although the number of bytes sent per second reduces.

Once we move to bundles with non-zero-length payloads, this effect should reverse. Another effect that is likely at play, is that if we send lots of data blocks between the LBARD packets, then we will increase the round-trip time of the LBARD packets, which will slow acknowledgement of bundles, and thus reduce the number of bundles that can be received per unit time.

Another factor that must be taken into account here, is that the first bundle transfer does not typically take place until around 120 -- 180 seconds into the 1,000 second test. Thus once actively communicating, the actual time per bundle will be lower than suggested here.

So now to repeat those tests with non-zero payloads for the bundles... Which showed up a bug in payload handling, which was quick to fix. More critically, it showed up an issue where receivers sometimes don't recognise packets as being addressed to them.  This means that when the other party sends them pieces of a bundle that they have already received, they don't tell the sender that they already have it.

The mechanism that is playing up tags each message containing a piece of a bundle with the first two bytes of the SID of the intended receiver.  This works fine for some time, but after a while, the receiver finds garbage in this field.  What is not clear is whether it is a data handling bug that corrupts it on the receive side or during assembly on the transmit side, or whether the field that contains a peer's SID on the sender side is getting corrupted.  This takes of the order of 15 minutes to occur.

LBARD maintains both  ASCII-hex and binary representations of each peer.  If these end up with different values, then we can assume that memory corruption has occurred. I'll add instrumentation that will help me to see if that is what is going on.

Interestingly, the data that overwrites the binary representation seems to often begin with $47, suggesting that some piece of code might be putting stuff there by mistake.  The "Generation ID" message type does put a $47 followed by 4 random bytes.   This message type also contains no bounds checking -- it just assumes it has been given a buffer with enough space in it.  However, it is only called at the start of the packet assembly process, when it can be safely assumed that there is enough space in the buffer to fit the message.

A few more runs confirms that it is always $47 in the first byte, so almost certainly something screwing up with the Generation ID messages.  I'll add some more instrumentation to see what the new generation ID is each time one is received, and if it matches what I am seeing in the packet, then I will know that its to blame.

Ok, while waiting for the run to hit the problem again, I've taken a closer look at the code that parses the generation ID messages, and discovered that it does indeed put the generation ID into the SID prefix field. I.e., a straight-forward bug.
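A hedged sketch of the bug and the fix, with an illustrative peer structure rather than LBARD's actual one:

```c
/* Hedged sketch of the Generation ID parsing bug: the $47 message carries a
   4-byte random generation ID, and the parser was copying those bytes into
   the peer's SID-prefix field instead of the generation-ID field.
   Illustrative structure and names. */
#include <string.h>
#include <stdint.h>

struct peer {
  unsigned char sid_prefix[2];   /* first two bytes of the peer's SID      */
  uint8_t generation_id[4];      /* random ID, fresh on each LBARD restart */
};

/* msg[0] == 0x47 ('G'), msg[1..4] == generation ID. */
int parse_generation_id(struct peer *p, const unsigned char *msg, int len)
{
  if (len < 5 || msg[0] != 0x47) return -1;
  /* The bug was, in effect: memcpy(p->sid_prefix, &msg[1], 4);
     which overruns the 2-byte SID prefix and corrupts the peer record. */
  memcpy(p->generation_id, &msg[1], 4);   /* the fix: write the right field */
  return 0;
}
```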

This fixed the problem, and got everything going, but it still seems to be rather slower than it should be sometimes.  But investigating the remaining problems can go into the next blog post.