Thursday, September 24, 2020

Chasing down the last LBARD HF TX queue bugs and Initial Performance Graphs

To recap, in the last blog post I found and fixed a lot of bugs in LBARD over HF data modem operation.  This has taken much longer than hoped, in part due to the 4 to 5 hours it takes to do a complete run.

The last fixes I discussed in the previous post have resolved the remaining TX queue shuffling bugs, so far as I can see. But in some runs I am still seeing that a few bundles are not transferred in one or both directions.  These are typically among the lowest priority bundles, but not strictly the lowest priority bundles.

For example, in the last test run, the following bundles are not transferred:

peer4random79

peer4random94

peer4random96

The files are named in increasing file size, from peer4random00 to peer4random99.  Thus the 94 and 96 files are among the lowest priority bundles, and the 79 file is reasonably close.

The first step to tracking down what happened with them is to get the bundle IDs (BIDs) that correspond to these files:

peer4random79 - 6CE2CB42* - bundle #21 on the sending side

peer4random94 - 469E2B18* - bundle #6 on the sending side

peer4random96 - 338FC557* - bundle #4 on the sending side

Armed with that information, I can see if those bundles get transferred at all.  Let's start with the 79 file.

At 17:53.27 the sender first realises that the recipient does not have the bundle. However, the TX queue is already full, so it doesn't make it onto the TX queue.

At 18:28.39 it gets queued again, and this time there is space in the TX queue, and it does get queued in position 3 in the TX queue.

At 18:28.39 bundle #66 gets queued, which has a higher priority, so our bundle is demoted to position 4 in the TX queue.

At 18:29.06 bundle #53 gets queued, which also has a higher prioity, so our bundle is now demoted to position 5 in the TX queue.

At 18:29.06 bundles #27, #71 and #56 also get queued, and our bundle is now in position 8.

At 18:29.29 bundles #51 and #55 get queued, and our bundle is now in position 10. One more higher prioity bundle, and it will get bumped from the TX queue altogether until the tree-sync process gets restarted.  This happens immediately, as bundle #59 gets queued.

At 19:06.57 our target bundle, #21, gets requeued, however the TX queue is at that time already full of higher-priority bundles, so does not make it ont othe TX queue.

At 19:40.05 it does finally make it back onto the TX queue in position 8.

At 19:40.14 bundle #41 gets queued, bumping our bundle to position 9.

Then at 19:40.49 bundle #27 gets queued, bumping our bundle to position 10, followed immediately by bundle #23, which causes our bundle to get bumped from the TX queue.

Our bundle is next attempted to be requeued at 20:13.42, but the TX queue is already full of bundles with equal or higher priority, so it doesn't get queued.

The next attempt is at 20:43.25. And this looks like it might be the problematic instance. The TX queue is not full, and our bundle has a higher priority than the TX queue's entries, and also compared with the bundle currently being sent.  As a result, it is immediately selected for transmission.  However, then bundle #28 also gets queued for sending, which bumps #21 from being the bundle currently being sent.  However, there is no code that then puts the bundle into the TX queue. 

As the TX queue had not overflowed since the last restart of the tree-sync process, no flag is raised to require the tree-sync to restart one last time, and so this bundle will never finish being sent.  And indeed this is what happens. Similar things happen for the other two bundles.

So the fix is relatively simple: When de-queuing a bundle that should still get sent, we need to insert it into the TX queue.  

I'm heartened that this fault seems to describe the last situation where this problem of bundles not being forwarded can occur.

The only immediate problem I can see, is that there is already code that is supposed to do this.  It turned out the problem was that I attempted to re-queue the bundle being sent before removing it as the bundle being currently sent. The queuing logic was noticing this, and not adding the bundle back into the queue, because as far as it was concerned, the bundle was already being sent.  

By reordering this sequence of events, I was able to solve the problem, and it now reliably delivers every single bundle.  Together with all the performance fixes, we now have things at a state where they work reliably, which is great. So lets get some basic performance statistics from these runs:



We can see that there is considerable variation in the transmission times (and thus speed), with respect to the bundle sizes.  That is, for bundles of about the same size, the delivery time can vary by a factor of 5 or so.  

This is not unexpected, as the prioritisation of bundles during the tree-sync protocol can cause some bundles to be fully or partially transmitted, before tree-sync realises that there is a higher priority bundle. When this occurs, the bundle that pre-empts the lower-priority bundle will appear to take longer to deliver, because part of one or more other bundles will have occurred during the time accounted to the higher-priority bundle. Conversely, if a bundle has been partially transferred before being pre-empted by a higher priority one, the pre-empted bundle will appear to deliver more quickly, because only the time from when it resumes transmission will be accounted.

Otherwise, we see that the average data throughput, excluding the approximately 400 bytes of Bundle Manifest, improves gradually with increasing bundle size. For larger bundles we see transfer speeds of around 40 bytes per second, which is about half of the channel capacity of ~93 bytes per second we discovered previously.  This is still not ideal, but given that it includes all the overheads for the transmissions to occur, to be prioritised and for synchronisation of bundle lists between the two parties, its not too bad.  Note that these figures are for one direction only, with similar results for the opposite direction.

If we look at the overall bundle transfer times, we see that this varies between about 50 and 200 seconds, only somewhat influenced by the size of the bundle's payload.  This corresponds to message delivery rates between 2x (3600/50) = 144 and 2x (3600/200) = 36 bundles per hour over an HF link. That is, between ~850 and ~3,500 bundles per day.  Again, while higher figures would be welcome, being able to deliver one to a handful of text messages very few minutes would seem to be sufficient for delivering a useful level of communications between small isolated communities.  Round-trip message send and reply cycles would likely be of the order of 2 to 10 minutes, which again is sufficient to facilitate useful communications.

So at this point we have reliable and reasonably performant delivery of Serval Mesh traffic over Codan HF data modems.  The next step will be to try to setup a real HF link with two radios spaced many kilometres, or perhaps a couple of hundred kilometres apart, and see how it performs in practice with some real message traffic.

Otherwise, I'll now be turning my attention to implementing LoRa radio support for Mesh Extenders to help make lower-cost and more internationally portable Mesh Extenders, so that we can experiment with more interesting mixed-radio networks.  Because the Wi-Fi links between Mesh Extenders are self-organising, this will allow such mixed-media networks to be easily built. It will then be interesting to see how the system behaves with such networks.

After that, I will start looking at implementing "affinity tags" so that when we start connecting large numbers of such networks together, that we can prevent traffic being replicated places it doesn't need to go.  We should also finally implement the logic for detecting whether senders or recipients are likely to be located on the far end of an HF link, so that we can also factor that into the prioritisation.

1 comment: