MIT's 6.824 Distributed Systems, Lab 2: Raft
Tue Mar 16 2021
Went out on
See my notes on the paper here
Lab 2A: Leader election (moderate)
Finished on Sunday 21st Feb 2021.
Lab 2B: log (hard)
Started Saturday 27th Feb 2021. Went to NUS and worked from 3pm to 3am with a lot of breaks in between.
The log is 1-indexed, so I added a dummy element at index 0.
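A minimal sketch of the dummy-entry trick, assuming a hypothetical `LogEntry` type and helper names (`newLog`, `lastLogIndex`, `lastLogTerm`) that are not from my actual lab code. The point is only that with a placeholder in slot 0, slice indices line up with Raft's 1-based log indices and "last entry" lookups never hit an empty slice:

```go
package main

import "fmt"

// LogEntry is a hypothetical entry type; the field names are assumptions.
type LogEntry struct {
	Term    int
	Command interface{}
}

// newLog returns a log whose slot 0 holds a dummy entry with term 0,
// so the first real entry lives at index 1.
func newLog() []LogEntry {
	return []LogEntry{{Term: 0}}
}

// lastLogIndex returns the index of the last real entry (0 if there are none).
func lastLogIndex(log []LogEntry) int {
	return len(log) - 1
}

// lastLogTerm returns the term of the last entry. The dummy entry makes this
// safe even when no real entries exist.
func lastLogTerm(log []LogEntry) int {
	return log[len(log)-1].Term
}

func main() {
	log := newLog()
	fmt.Println(lastLogIndex(log), lastLogTerm(log)) // 0 0

	log = append(log, LogEntry{Term: 1, Command: "x"})
	fmt.Println(lastLogIndex(log), lastLogTerm(log)) // 1 1
}
```

With this layout, `prevLogIndex = 0` and `prevLogTerm = 0` in the first AppendEntries naturally match the dummy entry, so there is no special case for an empty log.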
28th Feb 2021: 12:30pm: Passed TestBasicAgree2B and TestRPCBytes2B, failed the rest.
Tuesday 2nd March 2021: Consistently passes all tests. 2B is now complete.
Some of the bugs I inadvertently introduced:
In TestFailAgree2B, one of the servers is disconnected. While disconnected, it repeatedly times out, becomes a candidate and increments its own term. The other two servers happily stay at term 1 and replicate log entries. When it is reconnected, it rejects the leader's AppendEntries because its own term is higher. The leader then sees the higher term and immediately becomes a follower, so a leader election should occur. Since the disconnected server does not have any of its logs
This is all correct behaviour so far, but for
In my helper function
logAtLeastUpToDate I wrote
when I meant to write
So the issue is this. Say node 1 is a follower that has some log entries and happened to randomly initialise with a high election timeout, and node 2 has an out-of-date log but initialised with a low election timeout. Node 1 keeps rejecting node 2's vote requests, but node 2 keeps timing out, incrementing its term and re-randomising its timeout. If every request from node 2 also refreshes node 1's election timer, node 1 never times out to start its own election, so node 2 keeps asking and we're unlikely to reach consensus. This isn't a deterministic bug: it only happens sometimes, when node 1 draws a high election timeout.
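The rejections in this scenario come from Raft's election restriction (section 5.4.1 of the paper): a voter only grants its vote if the candidate's log is at least as up-to-date as its own, comparing the term of the last entry first and the log length second. Below is a sketch of that rule for reference. The function name matches my helper, but the signature and parameter names are assumptions, and this is the rule as stated in the paper rather than a copy of my lab code:

```go
package main

import "fmt"

// logAtLeastUpToDate reports whether the candidate's log is at least as
// up-to-date as the voter's. The log whose last entry has the later term
// wins; if the last terms are equal, the longer (or equal-length) log wins.
// Parameter names are assumptions made for this sketch.
func logAtLeastUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex int) bool {
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm
	}
	return candLastIndex >= myLastIndex
}

func main() {
	// Candidate has an empty log; voter has 3 entries from term 1.
	fmt.Println(logAtLeastUpToDate(0, 0, 1, 3)) // false

	// Same last term, candidate's log is longer.
	fmt.Println(logAtLeastUpToDate(1, 4, 1, 3)) // true

	// Candidate's last entry is from a later term, even though its log is shorter.
	fmt.Println(logAtLeastUpToDate(2, 1, 1, 3)) // true
}
```

Note that the comparison uses the term of the last log entry, not the candidate's current term, so a node that has inflated its term while disconnected still loses this check.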
Solution: Don't reset lastHeardFrom if I receive a RequestVote RPC and do not grant the vote. Only refresh lastHeardFrom if:
i) I grant a vote to a candidate in a RequestVote RPC call, or
ii) I receive an AppendEntries RPC call with the same or greater term.
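The two conditions above can be sketched roughly as follows. Only `lastHeardFrom` is a name from my code; the `follower` struct, the method names and the explicit `now` argument are assumptions made to keep the example small and deterministic, and real handlers would also hold a lock and do the rest of the RPC logic:

```go
package main

import (
	"fmt"
	"time"
)

// follower holds only the state needed to illustrate the timer rule.
type follower struct {
	currentTerm   int
	lastHeardFrom time.Time
}

// onRequestVote refreshes lastHeardFrom only when the vote was granted.
// A rejected candidate must not be able to keep postponing our own election.
func (f *follower) onRequestVote(voteGranted bool, now time.Time) {
	if voteGranted {
		f.lastHeardFrom = now
	}
}

// onAppendEntries refreshes lastHeardFrom only when the sender's term is the
// same as or greater than ours; stale leaders are ignored.
func (f *follower) onAppendEntries(leaderTerm int, now time.Time) {
	if leaderTerm >= f.currentTerm {
		f.currentTerm = leaderTerm
		f.lastHeardFrom = now
	}
}

func main() {
	start := time.Unix(0, 0)
	f := &follower{currentTerm: 3, lastHeardFrom: start}

	// Rejected vote request: timer is left alone.
	f.onRequestVote(false, start.Add(100*time.Millisecond))
	fmt.Println(f.lastHeardFrom.Equal(start)) // true

	// AppendEntries from a leader in the current term: timer is refreshed.
	f.onAppendEntries(3, start.Add(200*time.Millisecond))
	fmt.Println(f.lastHeardFrom.Equal(start)) // false
}
```

The election ticker then compares the time since `lastHeardFrom` against the randomised timeout, so a node with an out-of-date log can no longer starve a better-placed node of its chance to stand for election.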
Lab 2C: persistence (hard)
Before I implement Lab 3