Monday, September 21, 2020

Continuing work on the LBARD HF protocol efficiency problems

Ok, back from the last post, where I was thinking about some longer-term future approaches, and on to trying to get the current code working as well as it can.

The problem I am seeing at the moment is that the same blocks are being sent repeatedly, while other blocks are not being sent at all.

One problem was that pieces of the manifest were being marked as sent when they hadn't been.  This was due to an out-by-one error in comparing the extent of a block being sent with the actual data: the effect was that when a block was sent, both it and the following block were marked as having been sent.
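
To make the shape of the bug concrete, here is a minimal sketch of that class of off-by-one, using hypothetical names rather than the actual LBARD code: each fixed-size block is tracked by one bit in a "sent" bitmap, and dividing the offset one past the end of the piece, instead of the offset of its last byte, marks an extra block.

```c
#include <stdint.h>

#define BLOCK_SIZE 64   /* illustrative block size */

/* Mark the blocks covered by a piece [start_offset, start_offset+bytes_sent)
   as sent, one bit per block. */
void mark_sent(uint8_t *bitmap, int start_offset, int bytes_sent)
{
  int first_block = start_offset / BLOCK_SIZE;
  /* The buggy form divided (start_offset + bytes_sent), which, whenever a
     piece ends exactly on a block boundary, also marks the *following*
     block as sent.  The last byte actually sent is one earlier. */
  int last_block = (start_offset + bytes_sent - 1) / BLOCK_SIZE;
  for (int b = first_block; b <= last_block; b++)
    bitmap[b / 8] |= (uint8_t)(1 << (b & 7));
}
```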

This seems to be a common factor in many of the slow transfers.  It has only started showing up now, because we were previously testing with payload-less bundles, and so the manifest would keep being resent, which would then have a higher chance of re-sending the missing part through the normal cyclic re-transmission.

So let's set a test run going, and see if it solves the problem.

Well, it's helped, but close examination of the log files reveals that there are still pieces being counted against the wrong blocks, due to a duplicate code path.  I've now de-duplicated that.  I also noticed that sometimes manifest or body pieces would be counted against the other as having been sent, which of course is also causing problems.  So I have fixed those as well.

On the plus side, it's pretty clear how these sorts of bugs can be causing these problems: if we are marking things that have not been sent as having been sent, then we are not going to send those any time soon.  Instead, we will keep sending the same pieces over and over, because we think we haven't sent them.

I'm also seeing the situation where the sending of a bundle can get reset part way through, which makes it forget the transfer bitmap information for that bundle, resulting in it re-sending pieces it has already sent. This might be because of an old feature that causes the tree-sync process to restart periodically, to avoid lock-ups in the protocol, which in turn can change which bundle is currently selected for sending.  However, the root causes of those lock-ups have all been fixed now, so that feature can probably be removed, if it is still there.  I'll investigate that while a new run with the previous bug fixes in place is going.

Yes, it now looks like the resetting of sending bundles is the most significant remaining problem. As mentioned, this causes the knowledge of the state of a transfer to be lost.  There are two approaches to solving this: 

1. Don't switch which bundle we are transferring. 

2. Remember the state of other bundles, so that we can continue from where we left off, if we do switch which bundle we are transferring.

Option 1 sounds great in theory, but the synchronisation process may well interrupt the transfer of a lower priority bundle with a higher priority one, for example.  So this is probably not possible to achieve in practice, although I will instrument the changing of the current bundle being sent, to find out why it is changing.

Thus it probably makes sense to implement something to remember the progress of a bundle when switching.  This can be kept in a nice little cache somewhere.  Of course, this is one of several complications that would go away if/when I get to implementing the new approach I wrote about in the previous blog post.  But for now, the cache will have to do.
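
As a rough sketch of what I have in mind (the names, sizes, and structures here are illustrative assumptions, not the actual LBARD code): a small fixed-size cache keyed by bundle ID that stashes the "sent" bitmaps when we switch away from a bundle, and restores them if we come back to it.

```c
#include <stdint.h>
#include <string.h>

#define CACHE_SLOTS        16
#define MAX_BITMAP_BYTES   256

struct progress_cache_entry {
  char    bid[65];                        /* bundle ID as a hex string */
  uint8_t manifest_sent[MAX_BITMAP_BYTES];
  uint8_t body_sent[MAX_BITMAP_BYTES];
  int     in_use;
};

static struct progress_cache_entry cache[CACHE_SLOTS];
static int next_victim = 0;

/* Remember the transfer progress of a bundle when we switch away from it. */
void progress_stash(const char *bid,
                    const uint8_t *manifest_sent, const uint8_t *body_sent)
{
  int slot = -1;
  for (int i = 0; i < CACHE_SLOTS; i++)
    if (cache[i].in_use && !strcmp(cache[i].bid, bid)) { slot = i; break; }
  if (slot < 0) {                          /* evict slots round-robin */
    slot = next_victim;
    next_victim = (next_victim + 1) % CACHE_SLOTS;
  }
  cache[slot].in_use = 1;
  strncpy(cache[slot].bid, bid, sizeof cache[slot].bid - 1);
  cache[slot].bid[sizeof cache[slot].bid - 1] = 0;
  memcpy(cache[slot].manifest_sent, manifest_sent, MAX_BITMAP_BYTES);
  memcpy(cache[slot].body_sent, body_sent, MAX_BITMAP_BYTES);
}

/* Restore progress if we have seen this bundle before; returns 1 on hit. */
int progress_restore(const char *bid,
                     uint8_t *manifest_sent, uint8_t *body_sent)
{
  for (int i = 0; i < CACHE_SLOTS; i++)
    if (cache[i].in_use && !strcmp(cache[i].bid, bid)) {
      memcpy(manifest_sent, cache[i].manifest_sent, MAX_BITMAP_BYTES);
      memcpy(body_sent, cache[i].body_sent, MAX_BITMAP_BYTES);
      return 1;   /* resume from where we left off */
    }
  return 0;       /* unknown bundle: start from scratch */
}
```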

I've now implemented this caching of transfer progress.  However, the first run after implementing it showed a weird situation where bundles transferred quite regularly in one direction, but none in the other.   In the direction where traffic was flowing, I saw 6 bundles received fairly quickly, with all of them transferring in 50 to 80 seconds.  That's the good news.  However, there would then be a gap, ranging from a couple of minutes to half an hour, with nothing being transferred.  Then the same six bundles would be transferred again.

I think what happened is that one of the servald instances crashed or didn't start properly.  As a result, one party had no bundles to share and couldn't store received bundles into its Rhizome database, and so the same transfers happened repeatedly.  That might have just been some freak event that caused one of the servald instances not to start properly.  To find out, I'll run the test again and see what happens.

The second run worked properly in that regard: some bundles transferred at a reasonable speed, but others took a VERY long time, up to an hour or so.

One bug spotted: we don't mark the final body block as sent if it is a partial block. This looks like it can cause the final block to be sent over and over, which is clearly not good.  I've tried to fix that now.
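
The essence of the fix, as I understand it, is just ceiling division: a trailing partial block still counts as a block, both when marking it sent and when checking whether everything has been sent.  A hedged sketch of the latter check, with assumed names:

```c
#include <stdint.h>

/* Has every block of the body, including a trailing partial block,
   been marked as sent? */
int body_fully_sent(const uint8_t *bitmap, int body_length, int block_size)
{
  int num_blocks = (body_length + block_size - 1) / block_size; /* round up */
  for (int b = 0; b < num_blocks; b++)
    if (!(bitmap[b / 8] & (1 << (b & 7))))
      return 0;   /* at least one block (possibly the partial one) unsent */
  return 1;
}
```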

The other, bigger problem is that the bundle keeps getting de-queued and re-queued, which of course is causing a lot of delay in transfer.  It also seems that at least one code path can de-queue the current bundle being sent without stashing its progress for later retrieval.

Found at least one: the sending of a new generation ID by the sender, which causes us to restart the tree-sync process.  In that code, we stop sending the current bundle.  However, we should just keep on sending the same bundle, unless and until we find a better bundle to send.  This is likely the cause of considerable delays, because the synchronisation will run for potentially several minutes before it finds a new bundle to send. That's likely to be it.  We should probably reduce the generation ID rotation rate for HF anyway, since it doesn't make sense to reset the tree-sync process so frequently on a high-latency link.

All these problems keep making me think about the new tree-based approach I wrote about recently, as there would be no need to do any of this, and resumption of transmission would happen in a much more natural and efficient manner -- even if the lbard process really did die and restart.

Anyway, back to the existing code: it now looks like transfers are happening reasonably sensibly.  In trying to make a script to show the sequence of transfers, I am, however, still seeing that the currently selected bundle for transmission tends to change more often than it seems it should.  This led me to look more deeply into the TX queue management, and I'm seeing some odd things there, where some bundles added to the queue seem to disappear, and others that were never reported as having been added to the queue turn up in there.  So I'll add some more debugging for that.

Yes: the TX queue is being completely purged from time to time. This is again the resetting of the generation ID, the unique identifier that is used to indicate when an LBARD instance restarts.  Apparently I was resetting the generation ID every four minutes.  Thus it is no wonder that the TX queues are getting purged and we are ending up stuck in synchronisation hell.  Instead, the generation ID should only be changed when the TX queue at one end or the other has overflowed, and thus been emptied, leaving the sending side unsure which bundles it should have sent when the queue overflowed.
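
The intended policy, sketched with made-up names, is to make the rotation event-driven rather than timer-driven:

```c
#include <stdint.h>
#include <stdlib.h>

static uint32_t generation_id;          /* advertised to peers */
static int tx_queue_overflowed = 0;     /* set when the TX queue spills */

/* Only pick a new generation ID when the TX queue has actually overflowed
   and been emptied, since that is the only time our peer's picture of what
   we are going to send becomes invalid. */
void maybe_rotate_generation_id(void)
{
  if (!tx_queue_overflowed) return;       /* nothing lost: keep the same ID */
  generation_id = (uint32_t)random();     /* peers will restart tree-sync */
  tx_queue_overflowed = 0;
}
```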

Given that the transfers are now otherwise working fairly efficiently, I'm hopeful that this will get us to some kind of reasonably reliable transfer with steady transfer times for bundles of a given size.

Running more tests, I have found and fixed a few more problems:

1. If the carrier is lost in an HF modem call, we can end up in a situation where no more normal packets get sent, presumably because the number of unacknowledged packets is high, causing the packet rate to drop to practically zero.

2. Particularly with bundles that require many data packets, transfers slow down over time, and eventually halt.  The halting is probably due in part to (1), but the reducing transfer efficiency must have another cause.

3. Some received pure data packets are corrupted, which is a probable significant contributor to (2).

To fix (1), I have reset the TX sequence number and last acknowledged sequence number whenever an HF call starts.
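
Roughly like this, with assumed field names:

```c
/* Per-link sequence state (hypothetical struct, for illustration). */
struct radio_link_state {
  int tx_seq;          /* next sequence number to send */
  int last_acked_seq;  /* highest sequence number acknowledged by the peer */
};

/* Called when a new HF call is established, so that a large backlog of
   unacknowledged packets from the previous call cannot throttle this one. */
void on_hf_call_established(struct radio_link_state *rs)
{
  rs->tx_seq = 0;
  rs->last_acked_seq = -1;   /* nothing acknowledged yet in this call */
}
```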

For (2), I need to examine what is going on in the logs of runs, which will probably make more sense once I have fixed (3).

For (3), I suspect that the modem's TX buffer is being easily overwhelmed.  This is in spite of the fact that I have hardware flow control enabled.  It is possible the hardware flow control doesn't work properly with the USB serial adapters.  In any case, I have implemented a check before sending pure data packets, so that they will not be sent if no normal packet has been seen in the last 30 seconds.  This should stop the buffers from getting too full.  But a fuller investigation will require that I log every byte sent to and received from each modem, and then compare the two streams to see exactly what kind of data errors we are experiencing. Is it lost bytes, as I suspect, or bit errors, for example?
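
The check itself is simple; a sketch with assumed names, where the 30-second hold-off is the value mentioned above:

```c
#include <time.h>

#define PURE_DATA_HOLDOFF_SECS 30

static time_t last_normal_packet_seen;   /* updated whenever a normal packet arrives */

/* Only allow a pure data packet out if a normal packet has been seen
   recently; a crude guard against over-filling the modem's TX buffer. */
int ok_to_send_pure_data(void)
{
  return (time(NULL) - last_normal_packet_seen) < PURE_DATA_HOLDOFF_SECS;
}
```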

So adding the logging of bytes sent/received, I can see that indeed something is going wrong with the pure data packets almost immediately.  This makes it much easier to see what is going on, since I don't have to wait an hour or more each time. Now to look at the TX and RX logs to see what is happening...

We are indeed missing large slabs of data.  Taking a look at the FTDI USB serial adapter documentation, it looks like you cannot safely write more than 32 bytes at a time if using hardware flow control, as that is the size of the FTDI chip's internal buffer.  So I've modified the serial writing code to write 16 bytes at a time, to see if that fixes the missing bytes problem. However, the code to back off due to hardware flow control doesn't seem to get triggered at all, and the problem with corrupt data continues.  Found one problem with the TX and RX logs, where the pure data packets weren't being included.  Fixed that; now let's see how it looks...
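
The chunked-write workaround looks roughly like this (a hypothetical helper, ignoring the hardware flow-control back-off that sits around it in the real code):

```c
#include <unistd.h>
#include <errno.h>

#define SERIAL_CHUNK 16   /* well under the FTDI chip's 32-byte buffer */

/* Write len bytes to the serial port in small pieces, retrying transient
   errors, so we never hand the adapter more than it can buffer at once. */
int write_all_chunked(int fd, const unsigned char *buf, int len)
{
  int offset = 0;
  while (offset < len) {
    int n = len - offset;
    if (n > SERIAL_CHUNK) n = SERIAL_CHUNK;
    ssize_t w = write(fd, buf + offset, n);
    if (w < 0) {
      if (errno == EAGAIN || errno == EINTR) continue;  /* try again */
      return -1;                                        /* hard error */
    }
    offset += (int)w;
  }
  return len;
}
```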

Ok, so now I am not seeing any more lost bytes at the start, but the pure data packets were still being munged.  That problem turned out to be caused by the truncation of the data packets on the receive side. I also found and fixed a couple of segmentation fault bugs in the process.  Now re-running...

Still seeing corrupted packets. It turns out the pure data packets were not being properly escaped before sending, so if they contained a "!" character in their data or header, it messed things up.  Fixed that now, too.  This could well be causing the failure to transfer larger bundles efficiently, as it would effectively stuff up the transfer of larger pieces more often.  So it's satisfying to have found this problem.  Let's see what it does to the transfer rates...
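
For illustration, escaping a reserved byte looks something like the sketch below. The actual escape sequence LBARD uses is not shown here; the "!." substitution is just an assumption to show the shape of the fix.

```c
/* Copy in[] to out[], escaping any literal '!' so it cannot be mistaken
   for packet framing.  Returns the escaped length, or -1 if out[] is too
   small.  The receive side applies the reverse substitution. */
int escape_payload(const unsigned char *in, int in_len,
                   unsigned char *out, int out_max)
{
  int o = 0;
  for (int i = 0; i < in_len; i++) {
    if (in[i] == '!') {
      if (o + 2 > out_max) return -1;
      out[o++] = '!';
      out[o++] = '.';        /* assumed escape for a literal '!' */
    } else {
      if (o + 1 > out_max) return -1;
      out[o++] = in[i];
    }
  }
  return o;
}
```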

Yes, bundles now seem to transfer in between 30 and 200 seconds. However, I am now seeing the same bundles get sent more than once.  I suspect that this is due to the sync-tree reset that happens when the TX queue is emptied after an overflow of the TX queue was previously recorded.  That tells LBARD that it needs to start the sync again. However, if the bundles were received, then they should not get transferred again.  Looking in the logs, I am seeing HTTP 400 errors when trying to insert the bundles into the Rhizome database. Thus we are still having a problem with corruption during transfer. I'll have to look at the reassembly of bundles. My suspicion is that the problem will be with the pure data packets somewhere.

The corruption problems have now been solved. Part of it was that writing in 16-byte pieces helps the serial over USB transfers, but messes up the HTTP transactions.  So I have split the code to only do that for the serial transfers.  That fixed the problem with bundles not making it into the Rhizome database, and transfers now proceed progressively through more than just the same 10 bundles repeating. 

However, I am still seeing the HF call dropping out sometimes. Those tend to recover fairly quickly, and might be legitimate loss of carrier in the radio setup.

More concerning though, is that there can be a period of half an hour or more where no transfers in either direction occur.  These are quite mysterious, as the HF modem doesn't seem to drop out, but the logs of both sender and receiver simply show nothing at all for a period of 46 minutes in this particular case.  I guess I will need to run it again, and try to catch when this happens.  

This time we hit a segfault after transferring 80 bundles (counting both directions) in about 64 minutes. The problem is that sometimes rubbish ends up in the TX queue.  More specifically, the priority of a bundle ends up being stored in the bundle ID field of a slot in the TX queue.

So I've put more debug instrumentation in to find out how the TX queue list is getting corrupted. In the process, I also fixed the case where if a bundle is currently being sent, it could still end up being queued for later transmission as well, e.g., if the tree-sync process is restarted at some point.  

I've also added code that detects when the top of the TX queue ends up containing an illegal bundle number, to reduce the impact. However, I would really like to fix the error at its cause.  What I believe is most likely the problem is that the shuffling of the TX queue has an out-by-one error somewhere: the bundle numbers sit immediately before the list of priorities in memory, so if one too many entries were copied down, a bundle priority would end up in a bundle-number slot.
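
To show what I mean, here is a sketch of that suspicion with a hypothetical layout matching the description (not the actual LBARD structures): the bundle numbers sit immediately before the priorities, so copying one entry too many during removal drags a priority into the bundle-number slots.

```c
#define MAX_TXQUEUE_LEN 10   /* illustrative queue size */

struct bundle_tx_queue {
  int bundle_numbers[MAX_TXQUEUE_LEN];   /* sits immediately before ... */
  int priorities[MAX_TXQUEUE_LEN];       /* ... the priorities */
  int len;
};

/* Remove entry `index`, shuffling later entries down.  The loop bound must
   be len - 1: iterating up to len instead would copy bundle_numbers[len],
   which on a full queue is the memory holding priorities[0] -- exactly the
   corruption being seen. */
void tx_queue_remove(struct bundle_tx_queue *q, int index)
{
  for (int i = index; i < q->len - 1; i++) {
    q->bundle_numbers[i] = q->bundle_numbers[i + 1];
    q->priorities[i]     = q->priorities[i + 1];
  }
  q->len--;
}
```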

So time to start a new test run...

It's now running a lot more happily.  157 of the ~220 bundles have transferred, but it has taken about 4 hours to get that far. It's not as bad as it sounds, though, as it seems that my laptop somehow stops USB communications while the display is locked.  This means that if I don't periodically prod the computer, long periods with no transfers result.  I'm pretty sure that this is the cause of the problem I noticed previously.

That said, I am also seeing it get into a hang-up and redial loop, during which no packets get sent.  I assume that the time to the next packet TX is somehow being set wrongly. Or more to the point, it can grow large, and wasn't being reset when a call was re-established.  I've examined all the code paths that can get there, and added explicit resetting of the next packet TX time.
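
A tiny sketch of that reset, with an assumed variable name:

```c
#include <time.h>

static time_t next_packet_tx_time;   /* earliest time the next packet may go out */

/* When a call is (re-)established, pull the next-TX time back to "now",
   so an inflated back-off from the previous call can't leave us silent. */
void on_call_reestablished(void)
{
  time_t now = time(NULL);
  if (next_packet_tx_time > now)
    next_packet_tx_time = now;
}
```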

I tracked down the "nothing sent for an hour or more" problem: It's actually not with LBARD, but with the laptop I am using.  It automatically suspends after 2 hours of inaction, even when connected to mains power.  This doesn't really help when it happens mid-run.  So I've disabled that in the laptop, and should now be able to run transfers without this problem.

I've also looked through and found the problem that was causing the illegal bundle numbers to end up in the TX queue, and fixed it. The problem was indeed an out-by-one in the list shuffling code, as expected.  I'm expecting that this will likely fix the problem where only 199 or so of the 202 bundles were being transferred, as the bug could cause the TX queue to lose an entry or two without actually overflowing, and thus without triggering the restart of the sync process that would re-discover the bundles lost from the queue.   Thus I'm hoping that runs will now correctly transfer all 202 bundles -- 101 in each direction.  It might also end up being marginally faster, because fewer sync restarts will be required.

In short, I'm now hoping that I am at the point where transfers run reliably and in reasonable time.  I'll do a few runs to get some data on the typical speed and variation, but then we should be good, I think.  What I would like to do next is to plan some actual long-distance tests using the HF radios, with one here at the house in Akaroola and another some distance away, perhaps in a neighbouring town.  However, to do that, I will need to get some antennae and coordinate with folks in the amateur radio community to help with that.

But as this post has already gone on for ages, I'll stop here, and start a new post looking at the remaining TX issues.
