Sunday, July 19, 2020

LBARD over Codan HF modem and debugging LBARD sync problems

This week I am finally getting to working on the remaining work for supporting the Codan 3012 HF data modem in LBARD.  This was started a long time ago, but has stalled at various points along the way.

The purpose of using this data modem is that it comes as a software option in the Codan Envoy 2210 HF radios, and can in theory transfer close to 100x more data than the horrible ALE text message method that we had demonstrated a few years ago.  That said, we are still talking about a data link that in theory can do 2,400 bits per second. I say in theory, because that relies on good atmospheric conditions for the HF communications, and because the link is actually half-duplex with quite a noticeable turn-around time.  That is, the modem at one end sends a blob of data, stops transmitting, and begins listening. The modem at the other end receives the blob of data, switches to transmit, transmits its blob of data, stops transmitting and begins listening.  Because this is all happening with radios that can transmit at 125W, the TX to RX turn-around involves relays and various safe-guards, which is a good thing for the hardware, but not so helpful for our data throughput: it takes perhaps a tenth of a second or more to do the turn-around.

The modems seem to send a data blob lasting 2 seconds. So in every ~(2 + 0.1 + 2 + 0.1) seconds each end gets to send one blob of data.  Thus while the channel can carry 2,400 bits per second, we are limited to something below 1,200 bits per second from each end. We'll estimate this to be about 1,000 bits per second for ease of calculation.  This should give us an effective data rate of about 125 bytes per second in each direction.

This is pretty modest by modern internet standards, but it does correlate to about one text message per second. Of course, by the time we have overhead for routing, authentication and other house-keeping elements, a text message might be closer to 1KB, which would require about 8 seconds to send under ideal conditions.  This means that a link that is active 24/7 could send and receive about 86,400 seconds/day / 8 seconds = ~10,000 messages per day.  Even with just an hour a day this means that we should be able to send and receive about 400 messages. Given our goal is to connect small isolated island communities in the Pacific, and to provide backup text messaging capability during disasters, this sounds like a pretty reasonable capacity -- and certainly better than the 10 to 15 minutes it was taking to send an authenticated text message via the ALE text messaging channel!
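For my own sanity, here is a tiny back-of-the-envelope sketch of those figures in C. The 2 second blobs, ~0.1 second turn-arounds, the rounded-down 1,000bps estimate and the ~1KB-per-message figure are all the rough estimates from above, not measured values:

#include <stdio.h>

int main(void)
{
  /* Rough figures from the discussion above -- all estimates. */
  double raw_bps   = 2400.0;                 /* channel rate in good conditions */
  double cycle     = 2.0 + 0.1 + 2.0 + 0.1;  /* one blob each way, plus turn-arounds */
  double duty      = 2.0 / cycle;            /* fraction of time each end is sending */
  double est_bps   = 1000.0;                 /* rounded-down estimate used above */
  double bytes_sec = est_bps / 8.0;          /* ~125 bytes per second each way */
  double msg_bytes = 1024.0;                 /* ~1KB per authenticated text message */
  double msg_secs  = msg_bytes / bytes_sec;  /* ~8 seconds per message */

  printf("raw share each way   : %.0f bps\n", duty * raw_bps);
  printf("messages per 24 hours: %.0f\n", 86400.0 / msg_secs);
  printf("messages per hour    : %.0f\n", 3600.0 / msg_secs);
  return 0;
}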

The radio setup I have here is a pair of Codan 2210s kindly loaned to us by Codan themselves.  I have these hooked up to a laptop here via USB serial adapters.

The first step is to enable the data modems on the Envoy radios. If you have obtained these radios without this feature enabled, you will need to give Codan a call to purchase an unlock code.  You shouldn't need any hardware changes, which is handy if you are already in the middle of nowhere and need to set up communications.

Next, use the handset to select "Menu", "User data", "Peripherals" and "RFU GP port", and select "2.4kbps data modem". It is important to NOT select "3012 mode", unless you really are using an external modem, rather than the built-in one.  The menus are shown in the following figures:





The modems themselves have an interesting feature, where they report themselves when you connect to them via the USB serial port.  However, if you power the Envoy on with the USB already connected, this doesn't always occur, and sometimes I had problems getting anything out of the modem in that context. The solution is to disconnect the USB port from the computer and reconnect it. 

I'll have to think about a robust solution for that, as it can cause LBARD's auto-detection of the modem to fail, which would be unfortunate for the automatic operation we want to achieve. It might be possible to do a software USB disconnect and reconnect, for example. Or there might be some magic option I need to set on the radio.  Anyway, we'll come back to that later.

The modems need only a few simple commands to control them -- assuming the HF radios are sitting on the same channel. I haven't yet used the commands that the modems offer for selecting channels, although it should be quite possible.

The first thing to do is to give the modem a station ID. This is a bit like a phone number. It can have up to 6 digits, but for reasons I don't immediately recall, the last two digits cannot be 00.  This is set using the AT&I=number command, e.g.:

AT&I=2

would set the station ID to 2.  Once that has been set up, you can tell a modem to try to dial it using the good old ATD command, e.g.,

ATD2

would call a modem that had been configured as previously described. This will cause the standard RING, NO ANSWER, NO CARRIER, and CONNECTED messages to be displayed based on whether the remote end answers with ATA or not, or later disconnects with ATH0.  In short, Codan have done a great job of making this modem masquerade as an old-fashioned 2,400bps data modem.  This helps it be compatible with various existing software, and helps me, because I have previously had many adventures programming such modems.
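To make that concrete, here is a rough sketch of the call setup from the calling side. It is only a sketch: serialfd is assumed to be the already-opened USB serial port, and read_modem_line() is a little helper of my own here -- LBARD's actual driver parses the responses byte-by-byte inside its state machine rather than blocking like this.

#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Helper for this sketch: read one CR/LF terminated response line from the
   modem.  Returns 0 on success, -1 on error or EOF. */
int read_modem_line(int serialfd, char *line, int maxlen)
{
  int n = 0;
  while (n < maxlen - 1) {
    char c;
    if (read(serialfd, &c, 1) != 1) return -1;
    if (c == '\n') break;
    if (c != '\r') line[n++] = c;
  }
  line[n] = 0;
  return 0;
}

int dial_station(int serialfd, const char *remoteid)
{
  char cmd[64], line[256];
  snprintf(cmd, sizeof(cmd), "ATD%s\r\n", remoteid);   /* e.g. ATD2 */
  write(serialfd, cmd, strlen(cmd));

  for (;;) {
    if (read_modem_line(serialfd, line, sizeof(line))) return -1;
    if (strstr(line, "CONNECT"))    return 0;    /* far end answered with ATA */
    if (strstr(line, "NO CARRIER")) return -1;   /* call failed, or hung up with ATH0 */
    if (strstr(line, "NO ANSWER"))  return -1;
    /* RING and other informational lines: keep waiting */
  }
}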

The modems default to having compression enabled, although I don't know what kind of compression they have, nor how effective it is.  For now, I am just ignoring it, and leaving it in the default setting of enabled. 

Armed with the information on how to control the modem, I set about creating an LBARD driver for it. This consists mostly of an auto-detection routine, the state machine for managing the call states, and routines for sending and receiving packets.  This is all in the src/drivers/drv_codan3012.c file.

Initialisation just consists of sending a few AT commands to set the modem's HF station ID, and enable hardware flow-control:

int hfcodan3012_initialise(int serialfd)
{
  char cmd[1024];
  fprintf(stderr,"Initialising Codan HF 3012 modem with id '%s'...\n",hfselfid?hfselfid:"<not set>");
  snprintf(cmd,1024,"at&i=%s\r\n",hfselfid?hfselfid:"1");
  write_all(serialfd,cmd,strlen(cmd));
  fprintf(stderr,"Set HF station ID in modem to '%s'\n",hfselfid?hfselfid:"1");

  snprintf(cmd,1024,"at&k=3\r\n");
  write_all(serialfd,cmd,strlen(cmd));
  fprintf(stderr,"Enabling hardware flow control.\n");

  // Slow message rate, so that we don't have overruns all the time,
  // and so that we don't end up with lots of missed packets which messes with the
  // sync algorithm
  message_update_interval = 1000;
 
  return 0;
}


Flow control is a bit interesting to manage. The big problem is that we don't want the buffer of the modem to get too full, as this makes the round-trip time for control messages too long, which slows things down. For example, if the far end doesn't quickly confirm that a given bundle has been received, the sender will keep sending pieces of it. Thus we need to keep well away from hitting the flow control-imposed limit. The packet sending and receiving routines keep track of this by having sequence numbers that are then returned to the sender, so that the sender can get an idea of the number of packets in flight.

LBARD modulates packet flow not by the number of packets in flight, but by the average interval between packets.  As a first-cut approach, I set the packet interval dynamically based on the number of outstanding packets in flight: basically I set the packet interval to some multiple of the number of outstanding packets.  250ms per outstanding packet seems to work ok, with reasonable throughput balanced against not having too many unacknowledged packets wasting outgoing bandwidth with needless retransmissions.  I can refine that a bit more later.
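A minimal sketch of that pacing rule, assuming 8-bit sequence numbers (the function and variable names here are illustrative, not the exact ones in drv_codan3012.c):

/* Sketch only: estimate the number of packets in flight from the last
   sequence number we sent and the most recent of our sequence numbers echoed
   back by the peer, then scale the inter-packet interval so that the modem's
   buffer stays shallow and acknowledgements come back quickly. */
int update_packet_interval(int last_tx_seq, int last_acked_seq)
{
  int outstanding = (last_tx_seq - last_acked_seq) & 0xff;  /* 8-bit wrap-around */
  int interval_ms = 250 * outstanding;      /* 250ms per unacknowledged packet */
  if (interval_ms < 250) interval_ms = 250; /* never send completely back-to-back */
  return interval_ms;                       /* caller assigns to message_update_interval */
}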

The state machine for the modem is also fairly simple:
 
  switch(hf_state) {
  case HF_DISCONNECTED:
    if (hfcallplan) {
      int n;
      char remoteid[1024];
      int f=sscanf(&hfcallplan[hfcallplan_pos],
                   "call %[^,]%n",
                   remoteid,&n);
      if (f==1) {
        hfcallplan_pos+=n;
        fprintf(stderr,"Calling station '%s'\n",remoteid);
        char cmd[1024];
        snprintf(cmd,1024,"atd%s\r\n",remoteid);
        write_all(serialfd,cmd,strlen(cmd));
        call_timeout=time(0)+300;
        hf_state=HF_CALLREQUESTED;
      } else {
        fprintf(stderr," remoteid='%s', n=%d, f=%d\n",remoteid,n,f);
      }
     
      while (hfcallplan[hfcallplan_pos]==',') hfcallplan_pos++;
      if (!hfcallplan[hfcallplan_pos]) hfcallplan_pos=0;
    }   
    break;
  case HF_CALLREQUESTED:
    if (time(0)>=call_timeout) hf_state=HF_DISCONNECTED;
    break;
  case HF_ANSWERCALL:
    write_all(serialfd,"ata\r\n",5);
    hf_state=HF_CONNECTING;
    call_timeout=time(0)+300;
    break;
  case HF_CONNECTING:
    // wait for CONNECT or NO CARRIER message
    break;
  case HF_DATALINK:
    // Modem is connected
    write_all(serialfd,"ato\r\n",5);
    hf_state=HF_DATAONLINE;
    call_timeout=time(0)+120;
    break;
  case HF_DATAONLINE:
    // Modem is online.  Do nothing but indicate that the modem is
    // ready to process packets
    if (time(0)>call_timeout) {
      // Nothing for too long, so hang up
      sleep(2);
      write_all(serialfd,"+++",3);
      sleep(2);
      write_all(serialfd,"ath0\r\n",6);
      hf_state=HF_DISCONNECTED;
    }
    break;
  case HF_DISCONNECTING:
    break;
  default:
    break;
  }
 
In short, we make a call if we have been instructed to and are not currently on a call.  Otherwise we stay on the call unless we get no data for too long, in which case we hang up and repeat the process.

With the above and the necessary glue, I was able to get packets flowing between the two modems fairly easily.  But when I started trying to transfer bundles, I started seeing the protocol get confused with both sides trying to send each other the same bundle endlessly.  This should never happen, as when one end realises that the other has a given bundle, it should stop trying to send it, because it has pretty clear evidence that the other end has it already. 

I'd seen this problem before in the field with UHF radios in the Mesh Extenders, but it had stubbornly refused to be reproduced reliably in the lab.  So while it is on the one hand annoying, it's actually a great relief to have found the problem, and to be able to progressively work on solving it.

It turns out to be caused by a number of things:

First, the TreeSync algorithm seems to not realise when the other end receives a bundle, even though it should.  That module is quite fancy and efficient at doing its job, and I don't understand it well enough to figure out what subtle problem it is suffering from.  The mechanism by which this causes the lockup took a little while to work out, but it made a lot of sense once I realised it:  Once both sides have, as their highest priority bundle for the other, a bundle that the other side actually already has, and that trips this TreeSync problem, then they will continue to try to send it forever.

It self-corrects eventually, but only briefly, because when it produces a new generation ID, the tree is re-built, which allows it to send the next few bundles before locking again.  To work around this, I now maintain a list of bundles that we have seen the other end try to send.  If they try to send a bundle, then they must by definition have it.  Therefore we should never try to send that bundle to them again.  This works to stop the TreeSync algorithm from re-scheduling the same bundle over and over.
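The bookkeeping for that is simple enough to be worth sketching. The structure and function names below are illustrative only -- in LBARD proper this hangs off the existing per-peer state -- but the idea is just a small list of bundle ID prefixes we have seen each peer transmit:

#include <string.h>
#include <strings.h>

/* Sketch only: remember which bundles we have seen a peer try to send us.
   If a peer sends any piece of a bundle, it must already have that bundle,
   so we should never queue that bundle for transmission to them again. */
#define MAX_SEEN_BUNDLES 64
#define BID_PREFIX_LEN 8

struct seen_bundles {
  char prefixes[MAX_SEEN_BUNDLES][BID_PREFIX_LEN + 1];
  int count;
};

int peer_already_has_bundle(const struct seen_bundles *s, const char *bid_prefix)
{
  for (int i = 0; i < s->count; i++)
    if (!strncasecmp(s->prefixes[i], bid_prefix, BID_PREFIX_LEN)) return 1;
  return 0;
}

void note_peer_sent_bundle(struct seen_bundles *s, const char *bid_prefix)
{
  if (peer_already_has_bundle(s, bid_prefix)) return;
  if (s->count < MAX_SEEN_BUNDLES) {
    strncpy(s->prefixes[s->count], bid_prefix, BID_PREFIX_LEN);
    s->prefixes[s->count][BID_PREFIX_LEN] = 0;
    s->count++;
  }
}

The transmit scheduler then simply skips any bundle for which peer_already_has_bundle() returns true, which is what stops TreeSync from re-scheduling the same bundle over and over.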

That fixed the protocol lock-up.  Then I was seeing that the scheduling of which data blocks to send was being completely random.  Now, it is supposed to be rather random, so that if multiple senders are available for a given bundle, a single receiver can receive pieces from all of them, and thus receive the bundle as a whole sooner. 

The problem was that the system for keeping track of which blocks we think the recipient might already have was not being activated, because it was getting confused about which bundle we thought we were sending them, or should be sending them.  As a result very small bundles consisting of only a few blocks would send in reasonable time, as the probability of all their blocks being sent would approach 1 very quickly. But larger bundles consisting of more than a dozen or so blocks would be extremely slow to send, because it was basically pot luck waiting for the last few holes to get filled in.

This required a bit of fiddling about in a few different places to pick up all the corner cases of when we should reset it, and making sure that we clear the bitmap of sent blocks and other minor details. I also fixed the situation where we see another party sending a bundle to someone, and we haven't yet worked out what to send to them.  In that case, we now join in sending the same bundle, so that the receiver can hopefully receive the whole bundle that bit quicker. 

These multi-way transfer optimisations are not really applicable to the HF radios, as the modems are designed for point-to-point, but it made sense to implement them, anyway. Also, down the track, it would be possible for us to look at ways to make a radio listen to multiple HF radio channels in parallel, and receive all the content being delivered on them.  This promiscuous mode of listening could greatly increase the effective capacity of the system, and basically is just a sensible use of the true broadcast nature of radio.

Back to the bitmap sending, even after I got it working, it was still not particularly efficient, as even when there were relatively few pieces left to send, it would still choose them at random, potentially resending the same block many times until the next bitmap update came through.  As the round-trip latency is typically 10 - 20 seconds by the time all the modem buffering is done, this can add significant delays through redundant sending. Also, it can get stuck sending manifest pieces for a long time, which again can be quite wasteful, as the manifest is typically less than 300 bytes, and will keep getting resent until the bitmap update comes through indicating that the manifest has been fully received.

What would be better would be to keep track of the set of pieces that are outstanding at any point in time, randomise their order, send them in that randomised order, and then start over.  That way we avoid redundant sending.  The order can be randomised afresh each time we get a bitmap update.
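A minimal sketch of that approach, assuming we keep a bitmap of the blocks the far end still needs (names are illustrative):

#include <stdlib.h>

/* Sketch only: collect the indices of the blocks still outstanding, then
   Fisher-Yates shuffle them.  One pass through the resulting order sends each
   outstanding block exactly once, and the order is rebuilt whenever a fresh
   bitmap update arrives from the receiver. */
int build_send_order(const unsigned char *needed_bitmap, int num_blocks, int *order)
{
  int n = 0;
  for (int i = 0; i < num_blocks; i++)
    if (needed_bitmap[i >> 3] & (1 << (i & 7)))
      order[n++] = i;
  for (int i = n - 1; i > 0; i--) {
    int j = rand() % (i + 1);
    int t = order[i]; order[i] = order[j]; order[j] = t;
  }
  return n;   /* number of outstanding blocks to send this pass */
}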

The first implementation of this seems to work fairly well: it sends the blocks in random order. It's still not as fast as I would like, averaging only about 20 bytes per second in each direction.  This will be partly because I am sending 64 byte blocks at random, but the packets can hold 128 or sometimes even 192 bytes. But if there are no contiguous outstanding blocks, then it will end up being very inefficient.  I'll likely fix that by allowing a message type that can contain more than one piece of the same bundle.  Actually, I figured out an easier quick fix: consider all odd-numbered blocks to have been sent once already, so that we try to send all the even-numbered ones at least once.


The bigger problem, though, is that the indication of when the other end has received a bundle is not working properly, and so we are back to our protocol lock-up situation.

So looking at a particular run when this is occurring, we have LBARD2 and LBARD8 talking to each other.

They begin by exchanging two small bundles with each other, before moving on to transfer bundles of several KB.

LBARD2 receives bundle 4B47* (4870 bytes long) from LBARD8. LBARD8 receives C884* (4850 bytes long) from LBARD2. 

So both now have C884*.  The problem is that both think that the other doesn't have it. As C884* is smaller than 4B47*, it has a higher TX priority, and thus both sides keep trying to send it to the other.

LBARD8 received C884* at 11:42:38.   By 11:42:47, LBARD8 has received confirmation that the bundle was inserted into the Rhizome database, and thus knows definitively that it has the bundle.

Even though it knows that it has this bundle, it continues to accept receiving the bundle afresh from LBARD2.

Part of the problem is that LBARD8 doesn't immediately note that LBARD2 has the bundle.  It should do this, whenever it sees LBARD2 sending a piece of the bundle. 

In fact, there is no indication whatsoever that LBARD8 is receiving the pieces of the bundle after it has received it the first time.  I'll need to re-run with more debugging enabled.  Perhaps part of LBARD is noticing that it already has the bundle, and is thus ignoring the messages, but without scheduling the indication to the other end that it already has the bundle.  But whether or not that is the case, the receiver shouldn't be willing to start receiving the same bundle again once it knows that it already has it itself.

Each run has randomised bundle IDs, which is a bit annoying.  So I have to trawl through the logs again to see what is going on.

So this time, LBARD2 receives bundle D386* from LBARD8 at 12:18:26.

Interestingly, the "recent senders" information is correctly noting that LBARD8 has sent a piece of this bundle.  So I can make use of that to update the other structures, if necessary.

At the point where LBARD2 receives D386*, it is sending bundle 3A7C* to LBARD8.
Ah, I have just spotted that at the point where we realise that the other party has the bundle, instead of dequeuing the bundle to them, we are actually queueing it to them.  That should be fixed. This is also the point at which we should mark that we know that the peer already has the bundle.  So we will try that in the next run.  Simultaneously I have added some instrumentation for the "prioritise even over odd" optimisation, as it doesn't seem to be doing exactly what it should.

Fixed that, and some other related bugs. It now does a fairly reasonable job of scheduling, although there is still room for improvement. The big problem is that it is still sitting at around 20 bytes per second of effective throughput. 

Now, this isn't as bad as it sounds, as that includes the overhead of negotiating what needs to be sent, and packetising it all, so that if the modems do lose packets, we don't lose anything. It also reflects the reality that we are only sending one packet in each direction every 3 seconds or so.  With an effective MTU of 220 bytes, we only end up fitting one or two 64 byte blocks of data in each.  Thus our potential throughput is limited to about 64 or 128 bytes every 3 seconds = ~20 -- 40 bytes per second. 

So at the current state of affairs, we have something that is at least 10x faster than the old ALE messaging based transport -- and that is bidirectional throughput. We are seeing message delivery latencies for short messages of about 30 to 50 seconds.  This compares very favourably to the ~3 minutes that sending a bundle was taking with the ALE transport.  But it still isn't as good as we would like or, I think, as good as we can achieve.  We know that we should have something like 125 bytes per second of bandwidth in each direction.  Even with the suboptimal scheduling of pieces to send, we should be able to achieve at least half of that figure.

The first port of call is to try to optimise the packet sending rate, as it seems that one packet of ~260 bytes in each direction every ~3 seconds or so is probably not quite filling the 2,400bps channel, even allowing for the change-over.  Well, that sounded like a great theory. Looking closer, the packet rate is closer to one every 2 seconds, not every 3 seconds.  2400bps / 2 = 1200bps = 150 bytes per second raw rate in each direction, before subtracting a bit for turn-around.  Thus sending a 255 byte packet every 2 seconds = ~128 bytes per second, which is pretty much spot on the channel capacity.

This means we will have to look at other options.  The most obvious is to send large slabs of bundles, without all the other protocol fluff.  This is only really possible in this context, because the modem link is point-to-point, so we don't have to worry about any trade-offs. In fact, the whole packet format and message types are really a bit of an over-kill for this configuration.  But short of re-writing the whole thing, we can instead get some substantial gain by just sending a slab of useful bundle data after every normal LBARD packet.  This will get us half of the channel use at close to 100% goodput, and thus bring the average up to >50%.   If we send large contiguous slabs, then that will also reduce the opportunity for the block scheduling to get too much wrong, so it may well be even more effective. 

The modem has no natural packet size limit, so we can even send more than 256 bytes at a time -- if we trust the modem's error correction to get the data to the other end intact. If errors creep in, and we don't have a means of detecting them, it will be really bad. The current error correcting code works on blocks of 232 bytes. We don't want to send huge amounts at once, though, as we want to maintain the agility to respond to changing bundle mixes.

The largest amount of data we can thus fit in whole 64 byte blocks would be 192 bytes.  256 would have been nicer, so that it would be an even number of blocks. That's a bit annoying, so I'll have to make a different type of packet that can carry up to 256 bytes of raw data. It will assume that the data is for the currently selected bundle, and thus only needs a 4 byte offset and 2 byte length indicator, plus the escaping of the packet termination character. Thus it should have an average overhead of less than 10 bytes, and thus be better than 90% efficient.
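Here is a sketch of roughly what the encoder for such a pure-data message looks like. The message type byte, terminator value and escaping scheme below are my assumptions for illustration, not necessarily the exact values that ended up in LBARD:

/* Sketch only: a "raw data for the currently selected bundle" message.
   The header is just a type byte, a 4 byte offset and a 2 byte length; every
   byte is escaped so that the packet terminator can never appear in the body. */
#define FRAME_END 0x7e   /* assumed packet terminator */
#define FRAME_ESC 0x7d   /* assumed escape byte */

static int append_escaped(unsigned char *out, int o, unsigned char b)
{
  if (b == FRAME_END || b == FRAME_ESC) { out[o++] = FRAME_ESC; b ^= 0x20; }
  out[o++] = b;
  return o;
}

int encode_raw_data_message(unsigned char *out, unsigned int offset,
                            const unsigned char *data, int len)
{
  int o = 0;
  o = append_escaped(out, o, 'R');   /* hypothetical message type */
  for (int i = 0; i < 4; i++) o = append_escaped(out, o, (offset >> (8 * i)) & 0xff);
  for (int i = 0; i < 2; i++) o = append_escaped(out, o, (len >> (8 * i)) & 0xff);
  for (int i = 0; i < len; i++) o = append_escaped(out, o, data[i]);
  out[o++] = FRAME_END;              /* the only unescaped FRAME_END in the frame */
  return o;                          /* typically only ~8-10 bytes more than len */
}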

Well, the next challenge was that some bytes were getting lost, as though buffer overflow was occurring.  However, as I have hardware flow-control enabled, I'm not sure of the cause here.  My work-around is to send only 192 bytes after all, and to make sure I throttle back the packet sending rate, to try to keep the data flowing at a reasonable rate without overwhelming the buffers.  This is a bit worrying, as in real life it is quite likely that we will often get data rates below 2,400bps on the HF links.

Reducing the data packet size to 192 bytes seems to have fixed the buffer overflow problem. Also, I am now seeing the larger bundles go through more quickly.  However, I am still seeing the problem where the same bundle is received more than once.  Presumably it starts being received a 2nd time before we get confirmation from the Rhizome database that it has been inserted.

The solution would be to filter the receive list whenever we notice a bundle has been added to the Rhizome database.  That way such duplicate transmission of the same bundle should get suppressed. Examining the log file of when this happened, it turns out that the problem was actually that the bundle was corrupted during reception.  This is possible if the bundle being transferred changes part way through, as our pure data packets lack bundle identifying information to save space. However, it looks like we will need that information after all, to avoid problems with inferring which bundle is being sent by the remote end.

There is also a pattern where some pieces of a bundle are failing to get sent in the data packets. I suspect that this is due to escaping of characters causing the encoded size of the data packet to become too large.  So I might need to trim the packet size again.  (This is all part of why I don't like it when systems try to hide packetisation behind a pretend stream interface.  You end up having to put framing back in the stream, which adds overhead, and there is no way to make sure your packets actually end on the underlying packet boundaries, which increases latency. But I understand why people implement them.)

So I guess I'll have to drop back to 128 bytes per data packet, as well as including the bundle information.

Ok, that has helped quite a lot.  It is now delivering the packets more reliably.  But the throughput is still pretty horrible, with goodput rates still somewhere around 25 bytes per second.  I suspect the modems are really not designed for this kind of two-way sustained communication, but rather are optimised for mostly uni-directional transfers.

If I can get hardware flow-control working, I should be able to improve things by using larger packet sizes.  But I think it's also about time I actually wrote a simple test programme that measures the actual bidirectional throughput that these modems can deliver.  That way I will know whether it is my protocol side of things that is the problem, or whether I really am hitting the limit of what these modems can achieve.

But that will all have to wait for the next blog post.

Thursday, January 9, 2020

It's 2020 and Australia is burning

It almost doesn't need saying, but Australia is burning.  Lives have been lost. Houses destroyed. Livestock, farms and livelihood all gone up in smoke, quite literally.  I can't even imagine the pain and distress that this is all causing for so many. But what I can do, is try to do something about the technology gaps that are making a bad situation even worse for those affected.

As readers of this blog will know, I have been developing resilient telecommunications solutions for the last decade or so.  The best known of those is the Serval Project, which is a combination of software for smart phones that can form mesh networks, and small low-cost communications repeaters that we call Serval Mesh Extenders. Here is a brief introduction to what, and why, we are making the Serval Mesh:


 And to go right back in history to almost the beginning, here is the original motivation of the project:


And for another blast from the past, here is the original field test call from back in 2010, at Arkaroola in the Outback, which was also covered by the ABC:



Here's another video about the Mesh Extenders that we made back in 2013, when we were first developing them:


Since then, we now have much more mature hardware, and have tested the hardware for long-term operation in the field in a coastal area of Vanuatu, in large part thanks to a grant from DFAT under the Pacific Humanitarian Challenge a few years ago.

While the Serval Mesh was originally developed with disaster situations abroad in mind, it was always also intended for more local situations, for example, following cyclones or in very remote areas lacking phone coverage.  For example, one variant of the technology that was worked on for a while was the concept of an emergency network that uses vehicles as the main component, because they are the most ubiquitous infrastructure once you get out of the big cities.  Here is a piece that we produced with Toyota on this concept:

That video is also a good general introduction to the potential for the technology.

Basically any situation where cellular or other normal communications infrastructure is missing, damaged or disabled for any reason, the Serval Mesh lets local communities form their own local-area digital communications network.

There are also some branches to this work, for example, creating mobile phones that include the Mesh Extender functionality internally, and are designed to be fully "self-sovereign," that is independent of all power, communications and other infrastructure -- and that can offer security and privacy at least as good as the present state of the art.

So where are we at now?

Basically we have proven all the various parts of the technology that we need to make this a reality. What we need to do now, and our plan for 2020 -- formed even before the fires started -- is to focus time and energy on shaking down the last wrinkles in the system, so that it is ready for deployment by communities in the field.

This includes revising the Mesh Extender circuit board, and fixing some known problems in the software, and then testing with larger networks of dozens to hundreds of units, that would more accurately reflect the real use of the Serval Mesh by communities in the field.

Flinders University has kindly granted me a "sabbatical" year to focus on this.  I'll be based at Arkaroola, where we did the original field testing, so that we can use the vast rugged landscape there, including over 600 square kilometres of mountainous desert, to deploy realistic test scenarios and work on this scaling.

Our goal is for the Serval Mesh to be ready, by the end of this year, for individuals and communities who want to use it.

Achieving this will depend on a lot of factors, including the ever present problem of having sufficient funds for the equipment that would allow us to work more quickly, and scale up our tests more meaningfully.  It also doesn't answer the question of how we support communities in their use of the technology once it is finished, or who will offer it for sale.  But those are issues that we can think further through during the coming year.

But one thing is crystal clear to me:  We each need to consider what we can do to mitigate the effects of the fires, and to adapt to a future where such events are more and more likely, and to do what we can to mitigate this threat. Whether we can see the whole solution or not, is to me secondary: We must simply make sure that we do what we can now.  And that is what I am going to do with 2020.

Thursday, November 15, 2018

Serval Chat iOS Port finally emerges from the lab

Goodness me, this is probably the single piece of the Serval Project that has taken the longest compared to what was expected.  We have had a number of misadventures along the way with false starts to make a working port of Serval for iOS, but today, we finally have it.

This adventure began back in about 2014, when NLnet provided us with a grant to achieve this.  Needless to say, we didn't make our original planned timelines.  We had some turn-over of engineers and students on the project, Apple changed from Objective C to Swift, and we hit some early dead ends and speed humps.  But we now have something that works quite nicely on iOS 12.

This port runs the core serval-dna source in a thread in the background, so that it is fully capable and fully compatible.  Getting serval-dna to compile for iOS, produce binaries that can run on all iPhone CPU types, be sensibly accessed from Swift, and behave properly was non-trivial to say the least.  We got to that point earlier in the year, and the remaining time has been spent making a proof-of-concept app and working through the remaining issues, like not having the thread die when the app loses focus, and working around a pile of funny little problems, including using the properly containerised paths in strange places where we had overlooked it, and discovering that named sockets just don't work on the iPhone.

So today we reached the happy point of having something working, just hours before our French exchange student was due to finish up.  His last great act for us was to help prepare the following short video to celebrate this milestone.


There is still some missing functionality to implement, and then the real fun begins: Trying to get the app approved for release on the Apple App Store.  But that will be an adventure for another day.

Monday, November 12, 2018

Slashing the cost of tsunami (and bushfire) early warning systems

We have just about finished off a project that we have had funded by the Humanitarian Innovation Fund (HIF), which is funded by the UK's foreign aid program.  This has been a bit of a complementary variation on our existing work on the Serval Mesh.  Rather than on-ground peer-to-peer communications, the goal is to get communications out into very remote locations, in particular, to get early warning of tsunamis, cyclones, bush fires and other hazards.

The tragic events in Sulawesi last month are a stark reminder of the need for tsunami early warning systems.  Following the earthquake, a tsunami warning was generated, but there was no way to get the message out to the impacted communities, in part because the earthquake had knocked out the cellular network, and in part, because many of the smaller villages likely lacked cellular coverage to begin with, or in the absence of grid electricity, many phones would have been turned off to conserve power.

These circumstances are not by any means unique to Indonesia -- they occur all around the Indo-Pacific in various forms and permutations.  The question is how to solve the problem.

The traditional approach is big tsunami early warning towers.  However, those are really expensive, especially to install in remote areas, often costing $10,000 or more per tower.  For isolated villages of a hundred or so people, the per-capita cost is simply infeasible.  Maintenance is also problematic, both financially and logistically.

We have tackled this problem from the perspective of making a solution that works best in these resource-constrained environments.  In the end, our solution is rather simple:  Use off-the-shelf satellite TV receiver parts combined with some clever signal processing work by Othernet.is, so that no dish is required -- just point it to within about 10 degrees of where the geostationary satellite is.  This gives us a low-bit-rate digital broadcast signal that we can receive over a wide area.  Add to this direct leasing of satellite capacity, so that the operating cost of the system is low and fixed, i.e., there is no monthly charge per user, so that it can be scaled up region-wide.  Then all that is missing is a cheap automotive air-horn, so that the alarm can be heard within a village.

The result is receiver hardware that costs maybe $200 in modest quantities, i.e., around 100x cheaper than the old approach, and that can be installed by communities themselves.  That is, we don't have to rely on big programs to roll these things out. Once people know about them, they are cheap enough for a village to buy, and easy enough for them to install themselves.

This just leaves the problem of disaster resilience/warning technologies not getting maintained during the year, and thus often having failed before they were needed. Our solution here is to include a low-power FM radio transmitter, so that news, weather, climate change mitigation and other information can be received in each village.  This means that when the unit stops working at some point (nothing lasts forever in the tropics), the community knows immediately (no more radio signal, nor the services it was providing them), and can organise themselves to get it replaced and reinstalled, again without having to rely on an NGO or aid program to do it.

In short, it is not only way cheaper, it is also way more appropriate to the needs of the Pacific.  It turns out that these characteristics also make it a great potential solution for bush-fire early-warning in Australia, where cell networks are often taken out by large bush fires (= wild fires, for those on the other side of The Pond), being cheap enough for house-holds to buy for themselves, and simple enough for them to self-install.

We are at the proof-of-concept stage and are looking for ways to fund the next stage.  Hopefully we will find a funder who can help us bring it to operational readiness in the next year or so.


Wednesday, April 4, 2018

Building an indoor test network and testing LBARD bug fixes

After the successful bug hunting last week, this week I have turned my attention to making it easy to verify whether the bugs have been fixed. This means setting up a multi-hop Mesh Extender network in a convenient location.

The building I am based in was purpose-built for the College of Science & Engineering, and we were able to have a number of specific provisions included during the design phase to make this kind of test network easier to deploy, and we are now finally able to start making fuller use of that.  More specifically, we have 13 points around the building where we have copper and fibre directly to our lab, and also have some funky mounting brackets, on which we can fit radio hardware, all nicely protected from the outside weather.  

So the obvious approach was to fit several Mesh Extenders to some of those points, and form a multi-hop UHF network.  Here are some photos from various angles showing the units:

First up, we have Mesh Extender serial #23 with SID BCA306B0*, which is at point NE1. Note the extra long 6dB antenna we fitted to this one to make sure it could talk to the one at N4 (1A15ED32*). Even with that antenna, the link is quite poor, often getting only one packet every 5 to 10 seconds.  While on the one hand annoying, this is actually quite helpful for testing the behaviour of Mesh Extenders in marginal link conditions.



Mesh Extender serial #34 is at point N4, and has SID 1A15ED32*:




Note that we have removed the Wi-Fi antenna from this one. This was necessary to make sure that it can't communicate via Wi-Fi with any of the other units, which it was otherwise able to do intermittently, due to being relatively near to the one in the lab, which is on the same floor, and the one on level 5, to which it almost has line of sight via the stair case.

Mesh Extender serial #50 is at NW5 and has SID 5F39F182*:





And finally, Mesh Extender serial #17 (6ABE0326*) is sitting on the rack in our lab:



Mesh Extender #34 (1A15ED32*) is the only node that can see all the others, which we confirmed by using the debug display on LBARD:

I was still fiddling around with things at this point, but you can see already that the RSSI for BCA306B0* is quite a lot lower than for the others.  Although here it looks like the link is healthy, it really does come and go a lot, and goes for seconds at a time without a single packet arriving.

The last of the four rows, the one showing 39F1821A*, is actually a known bug manifesting. This is a ghost connection, caused by receiving a packet from 5F39F182*, with the first byte having gone walkabout for some reason. We have yet to work out the cause of this problem.  It's on our bug list to find and fix.


Now, to give a better idea of how the network is laid out, here are a couple of shots from the outside of the building.  The first is looking along the northern facade to give a sense of the length of the building.  It is from memory about 50m long.


And here is the whole facade:

And, here is the facade showing, in red, the locations of the three units that are located on the mounting points just behind the three layers of metal-film coated glass that acts as a partial Faraday Cage. The black spot shows the approximate position of the lab, as projected onto the northern facade, although it is in reality on the southern side of the building, approximately 25 metres south of the northern facade. Green lines indicate strong UHF links, and the yellow line the poor link between the first and fourth floors.


The glass treatment means that very little signal can actually get out of the building, bounce off something, and find its way back in. We estimate the attenuation to be about 15dB at 923MHz, so a total of 30dB loss to pass out of the windows and back in again, not counting any free space losses etc.  Thus we suspect that the links between the units are due to internal reflections in the building, and perhaps a bit of common mode radiation along metal parts of the building interior.  It is almost certainly not due to transmission through the ~30 - 50 cm thick reinforced concrete floor decks of the building.

Given that we are in a building with such poor RF propagation properties, it is rather pleasing that we are able to get any link at all between the first and fourth floors, given the multiple concrete decks and other obstacles that the link has to face.

So, with the network in place, it was time to do a little initial testing with it.  For this, I had a phone running Serval Chat in the lab, and I went down to level 1 by BCA306B0 and sent text messages, using the automatic delivery confirmation mechanism to know that a message had got to the other end, and indeed that a bundle containing the acknowledgement had made its way back.

First, we tested using the old version of LBARD, and were able to see the failure of messages to deliver that we had seen in the field.  Then it was a visit to all the Mesh Extenders to update LBARD (we will in the near future supply them all with ethernet connections back to the lab, so that we can remotely upgrade them). The result after upgrading LBARD was that messages were reliably delivered, taking between about 1 and 4 minutes to get the delivery confirmation.  This was of course a very pleasing and comforting result, as the inability to reliably deliver messages was the most critical issue we were seeing in Vanuatu.

The large variation in time we expect to be simply due to the very high rates of packet loss that we see on the link between BCA306B0 and 1A15ED32 -- something that we will confirm as we continue testing. The main thing is that we now have a test environment that is both convenient, and realistic enough, for us to reproduce bugs, and confirm that they have been fixed.

Sunday, April 1, 2018

Fixing bugs and structure of LBARD

During our trips to Vanuatu last year we identified a number of bugs with LBARD, which have been sitting on the queue to be fixed for a while. 

At the same time, we have had a couple of folks who have tried to add support for additional radio types to LBARD, but have struggled due to the lack of structure and documentation in the LBARD source code.

Together, these have greatly increased the priority of fixing a variety of things in LBARD.  So over the past few days I have started pulling LBARD apart, taking a look at some preliminary refactoring by Lars from Germany, and tried to improve the understandability, maintainability and correctness of LBARD.  This has focussed on a few separate areas:

Generally improving the structure of the source code

The most obvious change here is that the source files now live in a set of reasonably appropriately named sub-directories, to make everything easier to find.  This is supported by the factoring out of a bunch of functionality into the radio drivers and message handlers described below.

Drivers for different packet radio types are now MUCH easier to write.  

Previously you had to hunt through the source code to find the various places where the radio-specific code existed, hoping you found them all, and work out how they hung together. This also made maintenance of radio drivers more fragile, because of the multiple files that had to be maintained and that could give rise to merge conflicts.

In contrast, the process now consists of creating a header and source file in src/drivers/, that implement just a few functions, and has a single special header line that is used by the Makefile to create the necessary structures to make the drivers usable. 

The header file is the simplest: It just has to have prototypes for all of the functions you create in the source file.

The source file is not much more complex, excepting that you have to implement the actual functionality.  There are a few assumptions, however, primarily that the radios will all be controlled via a UART.  Using the RFD900 driver as an example, let's quickly go through what is involved, beginning with the magic comment at the start:


/*
The following specially formatted comments tell the LBARD build environment about this radio.
See radio_type for the meaning of each field.
See radios.h target in Makefile to see how this comment is used to register support for the radio.

RADIO TYPE: RFD900,"rfd900","RFDesign RFD900, RFD868 or compatible",rfd900_radio_detect,rfd900_serviceloop,rfd900_receive_bytes,rfd900_send_packet,always_ready,0

*/


In order for the compilation to detect the driver and make it available, it requires a line that begins with RADIO TYPE:  and is followed by the elements for a struct radio_type record, the details of which are found in include/radio_type.h. You should provide here:

1. the unique radio type suffix (this will get RADIOTYPE_ prefixed to it, and #defined to a unique value by the build system).
2. The short name of the radio type to appear in various messages.
3. A longer description of the radio type.
4. An auto-detect routine for the radio.
5. The main service loop routine for the radio. LBARD will call this for you on a regular basis. You have to decide when the radio is ready to send a packet, as well as to manage any session control, e.g., link establishment for radio types that require it.
6. A routine that is called whenever bytes are ready to be received from the UART the radio is connected to.
7. A routine that when called transmits a given packet.
8. A routine that returns 1 when the radio is ready to transmit a packet.
9. The turn-around delay for HF and other similar radios that take a considerable time to switch which end is transmitting.

These drivers are actually quite simple to write in the grand scheme of things. Even the RFD900 driver, which implements a congestion control scheme and an over-complicated transmit power selection scheme, is still less than 500 lines of C.  The current proof-of-concept drivers for the Barrett and Codan HF radios are less than 300 lines each.
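To give a feel for it, a brand new driver skeleton looks something like the following. The function bodies and exact argument lists here are placeholders only -- check include/radio_type.h and the existing drivers for the real prototypes:

/*
  Sketch of a minimal src/drivers/drv_mymodem.c.  The specially formatted
  comment below is the only magic part: the Makefile scans for it to register
  the driver, exactly as for the RFD900 example above.  "mymodem" and the
  function names are placeholders.

RADIO TYPE: MYMODEM,"mymodem","My example packet radio",mymodem_radio_detect,mymodem_serviceloop,mymodem_receive_bytes,mymodem_send_packet,always_ready,0

*/

int mymodem_radio_detect(int fd)
{
  // Probe the UART (e.g., send a version or ID command) and return 1 if this
  // radio type is present, or 0 so that the next driver gets a turn.
  return 0;
}

int mymodem_serviceloop(int serialfd)
{
  // Called regularly by LBARD: manage link state, and decide when the radio
  // is ready for us to transmit.
  return 0;
}

int mymodem_receive_bytes(unsigned char *bytes, int count)
{
  // Called with freshly received UART bytes: reassemble and parse packets.
  return 0;
}

int mymodem_send_packet(int serialfd, unsigned char *out, int len)
{
  // Frame and write one LBARD packet to the radio.
  return 0;
}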

What hasn't been mentioned so far is that if you want your radio driver to get exercised by LBARD's test suite, you also need to add support for it to fakecsmaradio, LBARD's built-in radio simulator.  This simulator implements a number of useful features, such as the ability to implement unicast and broadcast transmission, the ability to detect packet collisions and cause loss of packets, adjustable random packet loss and a flexible firewall language that makes it fairly easy to define even quite complex network topologies.  The drivers for this are placed in src/drivers/fake_radiotype.h, and will be described in more detail in a future blog post.

LBARD Messages are now each defined in a separate source file

Whereas previously the code to produce messages for insertion into packets, and the code to parse and interpret them, were scattered all over the place, they now each live in their own source file in src/messages/. For further convenience, these files are automatically discovered, along with the message types they contain, and hooked into the packet parsing code.  This makes creation of message parsing functions quite trivial: One just needs to create a function called message_parser_XX, where XX is the upper-case hexadecimal value of the message type, i.e., the first byte of the message when encoded in the packet.  Where the same routine can be used to decode multiple message types, #define can be used to allow re-use of the function.  Apart from performing their internal functions, these routines need only return the number of bytes consumed during parsing. For example, here is the routine that handles four related message types that are used to acknowledge progress during a transfer:

#define message_parser_46 message_parser_41
#define message_parser_66 message_parser_41
#define message_parser_61 message_parser_41

int message_parser_41(struct peer_state *sender,char *prefix,
                      char *servald_server, char *credential,
                      unsigned char *message,int length)
{
  sync_parse_ack(sender,message,prefix,servald_server,credential);
  return 17;  // length of ACK message consumed from input
}

Fixing LBARD transfer bugs

In addition to the structural improvements described above (and in many regards facilitated by them), I have also found and fixed quite a few bugs with Rhizome bundle transfers.  The automated tests now all pass, with the exception of the auto-detection of the HF radios, which don't work yet because we are still writing the drivers for fakecsmaradio for them. Here are a pair of the more important bugs fixed:

1. The code for tracking and acknowledging transfers using acknowledgement bitmaps now actually works.  This is important for real-world situations where packet loss can be very high.  In contrast to Wi-Fi that tries to hide packet loss through repeated low-latency retransmissions, we have to conserve our bandwidth, and so we keep track of what a recipient says that they have received, and only retransmit that content when required. 

This also plays an important role when there are multiple senders, as they listen to each other's transmissions, and use that information to try very hard to make sure that they don't send any data that anyone else has recently sent.  This helps tremendously when there are multiple radios in range of one another.

This bug is likely responsible for some of the randomness in transfer time we were seeing in Vanuatu, where some bundles would transfer in a few seconds, while others would take minutes.

2. There were some edge-cases that could cause transmission to get stuck resending the last few bytes of a bundle over and over and over, even though they had already been received, and other parts of the bundle still needed sending.

This particular bug was tickled whenever the length of a bundle, when rounded up to the nearest multiple of 64 bytes, was an odd multiple of 64 bytes. In that case, it would mistakenly think that there were 128 bytes there to send, and in trying to be as efficient as possible, send that instead of any single 64 byte piece that was outstanding (since it results in reduced amortised header cost).  However, the mishandling of the bundle length meant it would keep sending the last 64 bytes forever, until a higher priority bundle was encountered.
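In other words, the fix is simply to clamp the size of each piece to the real end of the bundle, something like this sketch (names are illustrative, not the actual LBARD code):

/* Sketch only: when choosing a double-size (128 byte) piece for efficiency,
   the piece must still be trimmed to the actual bundle length, otherwise the
   last 64 bytes keep getting resent forever on bundles whose length rounds up
   to an odd number of 64 byte blocks. */
int clamp_piece_size(int bundle_length, int start_offset)
{
  int piece_bytes = 128;                         /* preferred double-block piece */
  int remaining = bundle_length - start_offset;  /* bytes left from this offset */
  if (piece_bytes > remaining) piece_bytes = remaining;
  if (piece_bytes < 0) piece_bytes = 0;          /* nothing left past the end */
  return piece_bytes;
}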

It is quite possible that this bug was responsible for some of the rest of the randomness in latencies we were seeing, and also the "every other MeshMS message never delivers" bug, where sending one MeshMS would work, sending a 2nd wouldn't, but sending a 3rd would cause both the 2nd and 3rd to be delivered -- precisely because each additional message had a high probability of flipping the parity of the length of the bundle in 64 byte blocks.

So the net result is that the various tests we had implemented in the test suite, including delivering bundles over multiple (simulated) UHF hops, holding a MeshMS conversation in the face of 75% packet loss, transferring bundles to many recipients or receiving a bundle efficiently from many nearby senders all work just fine. 

The proof of the pudding is in the real-world testing, so I am hoping that we will be able to get outside with a bunch of Mesh Extenders in the coming week and setup some nice multi-hop topologies, and confirm whether we can now reliably deliver MeshMS and MeshMB traffic over multiple UHF radio hops. 

In parallel to this, we will work on the drivers for the HF radios, and also for the RFM96-based Lora-compatible radios that are legal to use in Europe, and get that all merged in and tested. But meanwhile, the current work is all there to see at https://github.com/servalproject/lbard/.

Wednesday, March 21, 2018

Early access to Serval Mesh Extender hardware

A quick post in response to a few enquiries we have had recently, where folks have been asking about getting their hands on some prototype Serval Mesh Extender devices.

So far, we have only produced 50 of these units, which have our (hopefully) IP65/IP66 injection-moulded case, IP67 power/external radio cable, integrated solar panel controller and battery charger if you wish to operate them off-grid.   Most of those are currently deployed in Vanuatu, or are in use in the lab here or with a few other partner organisations.

Although we are still in the process of testing them, they have already been in some interesting places in Vanuatu and beyond, as the images at the end of the post show.

Thus, we would like to gauge the interest in us having another production run, so that folks who would like to work with these experimental units, including to help us identify, document and fix any problems with them, can do so.

A reminder and warning: The devices would be experimental, and have known software issues that need to be resolved before they can be sensibly deployed.  It is also possible that hardware issues will be found in the process, and that these units may well never be deployable by you in any useful way, i.e., cannot be considered as a finished product or fit for merchantability. Rather, you are being given the opportunity to purchase hardware that would allow you to participate in the ongoing development of these devices.

If you want to buy Mesh Extenders to use in a serious way, this is not the offer you are looking for. As soon as we are ready to make them commercially available, we will be sure to let everyone know, and will update this post as well.


Because of the small production size that we would anticipate for this, the per-unit price will likely be quite high -- potentially as high as AU$700 each (and you really need a pair for them to be useful).  If we were able to get 100 or more manufactured in a single run, then this cost would come down, quite likely a lot.  (But please note, this is not at all representative of the final price of these units when we hopefully bring them to market -- we intend that the end price to be a LOT lower than this.  Again, the purpose of this current offer now is to provide early access that would support the finalisation of these devices, and allow us to get a few extra devices made for our internal use at the same time.)

So, given the above, is there any one else out there who would also like to get their hands on one or more Mesh Extenders, in your choice of Wholly White, Rather Red or Generally Grey?

(We realise the price at this stage is rather high, and that this might preclude interest from some folks from participating. During this pre-production stage, we have a couple of options to reduce the cost a little: First, if you already own an RFD900 or RFD868 radio, or want a Wi-Fi-only Mesh Extender, you can subtract AU$100 from the price.  Second, if you want to make an offer contingent on a production run of at least 100 units, which should get the price down by AU$100 - AU$200 per unit (subject to confirming with our suppliers), you are welcome to indicate this.  We will in any case come back to the community before we proceed with any production run.)