Enabling Communications, Anywhere, Anytime: Fixing bugs and structure of LBARD

During our trips to Vanuatu last year we identified a number of bugs with LBARD, which have been sitting on the queue to be fixed for a while.

At the same time, we have had a couple of folks who have tried to add support for additional radio types to LBARD, but have struggled due to the lack of structure and documentation in the LBARD source code.

Together, these have greatly increased the priority of fixing a variety of things in LBARD. So over the past few days I have started pulling LBARD apart, taking a look at some preliminary refactoring by Lars from Germany, and trying to improve the understandability, maintainability and correctness of LBARD. This has focussed on a few separate areas:

Generally improving the structure of the source code

The most obvious change here is that the source files now live in a set of reasonably appropriately named sub-directories, to make everything easier to find. This is supported by the factoring out of a bunch of functionality into the radio drivers and message handlers described below.

Drivers for different packet radio types are now MUCH easier to write.

Previously you had to hunt through the source code to find the various places where the radio-specific code existed, hope you had found them all, and work out how they hung together. This also made maintenance of radio drivers more fragile, because multiple files had to be maintained, which could give rise to merge conflicts.

In contrast, the process now consists of creating a header file and a source file in src/drivers/ that implement just a few functions. The source file also carries a single specially formatted comment line that the Makefile uses to create the structures needed to make the driver usable.

The header file is the simplest: It just has to have prototypes for all of the functions you create in the source file.

The source file is not much more complex, excepting that you have to implement the actual functionality. There are a few assumptions, however, primarily that the radios will all be controlled via a UART. Using the RFD900 driver as an example, let's quickly go through what is involved, beginning with the magic comment at the start:

/*
The following specially formatted comments tell the LBARD build environment about this radio.
See radio_type for the meaning of each field.
See radios.h target in Makefile to see how this comment is used to register support for the radio.

RADIO TYPE: RFD900,"rfd900","RFDesign RFD900, RFD868 or compatible",rfd900_radio_detect,rfd900_serviceloop,rfd900_receive_bytes,rfd900_send_packet,always_ready,0

*/

For the build to detect the driver and make it available, the source file needs a line that begins with RADIO TYPE: and is followed by the elements of a struct radio_type record, the details of which are found in include/radio_type.h. You should provide here:

1. The unique radio type suffix (this will get RADIOTYPE_ prefixed to it, and #defined to a unique value by the build system).
2. The short name of the radio type to appear in various messages.
3. A longer description of the radio type.
4. An auto-detect routine for the radio.
5. The main service loop routine for the radio. LBARD will call this for you on a regular basis. You have to decide when the radio is ready to send a packet, as well as to manage any session control, e.g., link establishment for radio types that require it.
6. A routine that is called whenever bytes are ready to be received from the UART the radio is connected to.
7. A routine that when called transmits a given packet.
8. A routine that returns 1 when the radio is ready to transmit a packet.
9. The turn-around delay for HF and other similar radios that take a considerable time to switch which end is transmitting.
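To make the nine fields more concrete, here is a minimal sketch of what such a record and a table of registered radios might look like. The field names, the function signatures and the "demo" radio are all assumptions made for illustration; the real definition lives in include/radio_type.h and the real table is generated by the Makefile.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout: field names and function signatures are assumptions.
   The real definition is in include/radio_type.h. */
struct radio_type {
  int id;                   /* 1. value of RADIOTYPE_<suffix> */
  const char *name;         /* 2. short name used in messages */
  const char *description;  /* 3. longer description */
  int (*autodetect)(int serial_fd);                       /* 4. auto-detect */
  int (*serviceloop)(int serial_fd);                      /* 5. service loop */
  int (*receive_bytes)(unsigned char *bytes, int count);  /* 6. UART input */
  int (*send_packet)(int serial_fd, unsigned char *packet,
                     int length);                         /* 7. transmit */
  int (*is_ready)(void);    /* 8. returns 1 when ready to transmit */
  int turnaround_delay;     /* 9. turn-around delay (units not given here) */
};

/* Stand-in for the value the build system would #define. */
#define RADIOTYPE_DEMO 0

/* Stub driver functions for a hypothetical "demo" radio. */
static int demo_radio_detect(int serial_fd) { (void)serial_fd; return -1; }
static int demo_serviceloop(int serial_fd) { (void)serial_fd; return 0; }
static int demo_receive_bytes(unsigned char *bytes, int count)
{
  (void)bytes;
  return count;
}
static int demo_send_packet(int serial_fd, unsigned char *packet, int length)
{
  (void)serial_fd; (void)packet; (void)length;
  return 0;
}
static int demo_always_ready(void) { return 1; }

/* One row per driver, in the same field order as the RADIO TYPE: line. */
static const struct radio_type radio_types[] = {
  { RADIOTYPE_DEMO, "demo", "Hypothetical demo radio",
    demo_radio_detect, demo_serviceloop, demo_receive_bytes,
    demo_send_packet, demo_always_ready, 0 },
};

static const int radio_type_count =
  (int)(sizeof(radio_types) / sizeof(radio_types[0]));

/* Look up a radio type by its short name; returns NULL if it is unknown. */
const struct radio_type *radio_type_by_name(const char *name)
{
  assert(name != NULL);
  for (int i = 0; i < radio_type_count; i++)
    if (!strcmp(radio_types[i].name, name)) return &radio_types[i];
  return NULL;
}
```

With a table like this, the rest of the program only ever needs to go through the function pointers, which is what allows a driver to be confined to a single pair of files.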

These drivers are actually quite simple to write in the grand scheme of things. Even the RFD900 driver, which implements a congestion control scheme and an over-complicated transmit power selection scheme, is still less than 500 lines of C. The current proof-of-concept drivers for the Barrett and Codan HF radios are less than 300 lines each.

What hasn't been mentioned so far is that if you want your radio driver to get exercised by LBARD's test suite, you also need to add support for it to fakecsmaradio, LBARD's built-in radio simulator. This simulator implements a number of useful features, such as the ability to implement unicast and broadcast transmission, the ability to detect packet collisions and cause loss of packets, adjustable random packet loss and a flexible firewall language that makes it fairly easy to define even quite complex network topologies. The drivers for this are placed in src/drivers/fake_radiotype.h, and will be described in more detail in a future blog post.

LBARD Messages are now each defined in a separate source file

Whereas previously the code to produce messages for insertion into packets, and the code to parse and interpret them were found scattered all over the place, they now all live in a source file each in src/messages/. For further convenience, these files are automatically discovered, along with the message types they contain, and hooked into the packet parsing code. This makes creation of message parsing functions quite trivial: One just needs to create a function called message_parser_XX, where XX is the upper-case hexadecimal value of the message type, i.e., the first byte of the message when encoded in the packet. Where the same routine can be used to decode multiple message types, #define can be used to allow re-use of the function. Apart from performing their internal functions, these routines need only return the number of bytes consumed during parsing. For example, here is the routine that handles four related message types that are used to acknowledge progress during a transfer:

#define message_parser_46 message_parser_41
#define message_parser_66 message_parser_41
#define message_parser_61 message_parser_41

int message_parser_41(struct peer_state *sender,char *prefix,
                      char *servald_server, char *credential,
                      unsigned char *message,int length)
{
   sync_parse_ack(sender,message,prefix,servald_server,credential);
   return 17; // length of ACK message consumed from input
}
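As a rough illustration of how parsers like this could be hooked into packet parsing, the sketch below uses a 256-entry table indexed by the message type byte, and walks a packet by advancing over however many bytes each parser reports as consumed. This is an assumption about the general approach, not a copy of LBARD's generated code: the simplified parser signature, the demo_parser_41 stub and parse_packet are all hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical parser signature: LBARD's real parsers also take
   the sender, prefix, servald_server and credential arguments. */
typedef int (*message_parser_fn)(const unsigned char *message, int length);

/* Stand-in parser for a fixed-length 17-byte message, mirroring the ACK
   example. Returns the bytes consumed, or -1 if the message is truncated. */
static int demo_parser_41(const unsigned char *message, int length)
{
  (void)message;
  return (length >= 17) ? 17 : -1;
}

/* Table indexed by message type (the first byte of each message).
   Unlisted entries are NULL, meaning "no parser registered". */
static message_parser_fn parsers[256] = {
  [0x41] = demo_parser_41,
  [0x46] = demo_parser_41,
  [0x61] = demo_parser_41,
  [0x66] = demo_parser_41,
};

/* Walk a packet message by message. Returns the number of messages parsed,
   or -1 on an unknown message type or a parser error. */
int parse_packet(const unsigned char *packet, int length)
{
  assert(packet != NULL && length >= 0);
  int offset = 0;
  int count = 0;
  while (offset < length) {
    message_parser_fn parser = parsers[packet[offset]];
    if (!parser) return -1;
    int consumed = parser(&packet[offset], length - offset);
    if (consumed < 1) return -1;
    offset += consumed;
    count++;
  }
  return count;
}
```

The reason each parser returns the number of bytes consumed is visible in the loop: it is the only thing the caller needs in order to find the start of the next message.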

Fixing LBARD transfer bugs

In addition to the structural improvements described above (and in many regards facilitated by them), I have also found and fixed quite a few bugs with Rhizome bundle transfers. The automated tests now all pass, with the exception of the auto-detection of the HF radios, which doesn't work yet because we are still writing the fakecsmaradio drivers for them. Here are two of the more important bugs fixed:

1. The code for tracking and acknowledging transfers using acknowledgement bitmaps now actually works. This is important for real-world situations where packet loss can be very high. In contrast to Wi-Fi, which tries to hide packet loss through repeated low-latency retransmissions, we have to conserve our bandwidth, and so we keep track of what a recipient says that they have received, and only retransmit content when required.

This also plays an important role when there are multiple senders, as they listen to each other's transmissions, and use that information to try very hard to make sure that they don't send any data that anyone else has recently sent. This helps tremendously when there are multiple radios in range of one another.

This bug is likely responsible for some of the randomness in transfer time we were seeing in Vanuatu, where some bundles would transfer in a few seconds, while others would take minutes.

2. There were some edge-cases that could cause transmission to get stuck resending the last few bytes of a bundle over and over and over, even though they had already been received, and other parts of the bundle still needed sending.

This particular bug was tickled whenever the length of a bundle, when rounded up to the nearest multiple of 64 bytes, was an odd multiple of 64 bytes. In that case, it would mistakenly think that there were 128 bytes there to send, and in trying to be as efficient as possible, send that instead of any single 64 byte piece that was outstanding (since it results in reduced amortised header cost). However, the mishandling of the bundle length meant it would keep sending the last 64 bytes forever, until a higher priority bundle was encountered.

It is quite possible that this bug was responsible for some of the rest of the randomness in latencies we were seeing, and also the "every other MeshMS message never delivers" bug, where sending one MeshMS would work, sending a 2nd wouldn't, but sending a 3rd would cause both the 2nd and 3rd to be delivered -- precisely because each additional message had a high probability of flipping the parity of the length of the bundle in 64 byte blocks.
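The arithmetic behind this edge case is easy to check by hand. The sketch below only illustrates the block counting and the clamping that prevents claiming bytes past the end of the bundle; it is not the actual LBARD fix, and the function names are hypothetical.

```c
#include <assert.h>

#define BLOCK_SIZE 64

/* Number of 64-byte blocks needed to hold a bundle of the given length,
   i.e. the length rounded up to a multiple of 64, divided by 64. */
int bundle_block_count(int length)
{
  assert(length >= 0);
  return (length + BLOCK_SIZE - 1) / BLOCK_SIZE;
}

/* Returns 1 when the rounded-up length is an odd multiple of 64 bytes,
   which is the case that triggered the bug described above. */
int bundle_has_odd_block_count(int length)
{
  return bundle_block_count(length) & 1;
}

/* How many bytes may be sent starting at a given offset: at most two
   blocks (128 bytes), but never past the end of the rounded-up bundle. */
int sendable_bytes(int length, int offset)
{
  int rounded = bundle_block_count(length) * BLOCK_SIZE;
  if (offset < 0 || offset >= rounded) return 0;
  int remaining = rounded - offset;
  return (remaining < 2 * BLOCK_SIZE) ? remaining : 2 * BLOCK_SIZE;
}
```

For example, a 150-byte bundle rounds up to 192 bytes, which is three blocks. Starting at offset 128 there is only one 64-byte block left, so a sender that assumes 128 bytes are available there is working from a wrong length. A 200-byte bundle rounds up to four blocks, and the final pair at offset 128 really is 128 bytes.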

So the net result is that the various tests we had implemented in the test suite now all work just fine. These include delivering bundles over multiple (simulated) UHF hops, holding a MeshMS conversation in the face of 75% packet loss, transferring bundles to many recipients, and receiving a bundle efficiently from many nearby senders.

The proof of the pudding is in the real-world testing, so I am hoping that we will be able to get outside with a bunch of Mesh Extenders in the coming week, set up some nice multi-hop topologies, and confirm whether we can now reliably deliver MeshMS and MeshMB traffic over multiple UHF radio hops.

In parallel to this, we will work on the drivers for the HF radios, and also for the RFM96-based LoRa-compatible radios that are legal to use in Europe, and get that all merged in and tested. But meanwhile, the current work is all there to see at https://github.com/servalproject/lbard/.