MIT's 6.824 Distributed Systems, Lab 2: Raft

Tue Mar 16 2021

tags: draft programming computer science self study notes public 6.824 MIT lab distributed systems


Went out on

See my notes on the paper here

Lab 2A: Leader election (moderate)

Finished on Sunday 21st Feb 2021.

Moderate difficulty.

Lab 2B: log (hard)

Started Saturday 27th Feb 2021. Went to NUS and worked from 3pm to 3am with a lot of breaks in between.

The log is 1-indexed, so I add a dummy element at index 0 of the log.
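A minimal Go sketch of the dummy-element idea. The names `LogEntry`, `newLog`, `lastLogIndex` and `lastLogTerm` are hypothetical and chosen for illustration; the lab's actual types may look different.

```go
package main

// LogEntry is a hypothetical entry type; the real struct in the lab may differ.
type LogEntry struct {
	Term    int
	Command interface{}
}

// newLog returns a log whose slot 0 holds a dummy entry with term 0,
// so the first real entry lands at index 1.
func newLog() []LogEntry {
	return []LogEntry{{Term: 0}}
}

// lastLogIndex returns the index of the last entry; 0 means no real entries yet.
func lastLogIndex(entries []LogEntry) int {
	return len(entries) - 1
}

// lastLogTerm returns the term of the last entry. Thanks to the dummy,
// this is 0 for an empty log and never indexes out of range.
func lastLogTerm(entries []LogEntry) int {
	return entries[len(entries)-1].Term
}
```

With the dummy in place, slice indices and Raft log indices line up, and lookups such as the previous entry's term need no special case for an empty log.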

28th Feb 2021, 12:30pm: Passed TestBasicAgree2B and TestRPCBytes2B, failed the rest.

Tuesday 2nd March 2021: Consistently passes all tests. 2B is now complete.

Some of the bugs I inadvertently introduced:

In TestFailAgree2B one of the servers is disconnected. While disconnected, it becomes a candidate and keeps incrementing its own term, election after election. The other two servers happily stay at term 1 and replicate log entries. When it is reconnected, it rejects the AppendEntries because its term is higher. The leader sees the higher term in the reply and immediately becomes a follower, so a leader election should occur. Since the disconnected server does not have any of its logs

This is all correct behaviour so far, but for

In my helper function logAtLeastUpToDate I wrote < when I meant to write >. This mea
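For reference, the comparison such a helper needs to make, following the "up-to-date" rule in section 5.4.1 of the Raft paper, can be sketched in Go like this. The signature is an assumption made for illustration, not the lab's actual code.

```go
package main

// logAtLeastUpToDate reports whether the candidate's log is at least as
// up to date as the voter's. The parameter list is assumed for illustration.
// Rule from the Raft paper (section 5.4.1): the log whose last entry has the
// later term wins; if the last terms are equal, the longer log wins.
func logAtLeastUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex int) bool {
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm
	}
	return candLastIndex >= myLastIndex
}
```

Flipping either comparison makes a voter grant votes to candidates with stale logs and refuse candidates with newer ones.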

The issue is this. Say node 1 is a follower that was randomly initialised with a high election timeout and has some logs, and node 2 does not have up-to-date logs but was initialised with a low election timeout. Node 1 keeps rejecting node 2's vote requests, but node 2 keeps restarting its election, incrementing its term and randomly re-initialising its state, and then asks again. Because every RequestVote from node 2 also resets node 1's lastHeardFrom, and node 1's election timeout is already quite high, node 1 never times out to run for leader itself, so we are unlikely to reach a consensus. This isn't a deterministic bug: it only happens sometimes, when node 1 happens to draw a high election timeout.

Solution: don't reset lastHeardFrom when I receive a RequestVote RPC but do not grant the vote. Only refresh lastHeardFrom if i) I grant a vote to a candidate in a RequestVote RPC call, or ii) I receive an AppendEntries RPC call with the same or greater term.
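The rule above can be sketched in Go roughly as follows. Everything except `lastHeardFrom` is a hypothetical name: the `Raft` fields, `onRequestVote`, `onAppendEntries`, and the `candidateLogOK` flag (standing in for the up-to-date check) are assumptions, and locking, persistence and role changes are left out.

```go
package main

import "time"

// Raft holds only the fields this sketch needs; the field set is assumed.
type Raft struct {
	currentTerm   int
	votedFor      int // -1 means no vote cast in currentTerm
	lastHeardFrom time.Time
}

// onRequestVote decides whether to grant a vote. The election timer
// (lastHeardFrom) is refreshed only when the vote is actually granted.
func (rf *Raft) onRequestVote(term, candidateID int, candidateLogOK bool, now time.Time) bool {
	if term > rf.currentTerm {
		// A higher term always updates our term, but does NOT reset the timer.
		rf.currentTerm = term
		rf.votedFor = -1
	}
	if term < rf.currentTerm || !candidateLogOK {
		return false // rejected: lastHeardFrom is left untouched
	}
	if rf.votedFor != -1 && rf.votedFor != candidateID {
		return false // already voted for someone else this term
	}
	rf.votedFor = candidateID
	rf.lastHeardFrom = now // case i): vote granted
	return true
}

// onAppendEntries refreshes the timer only for a sender whose term is the
// same as or greater than ours. Log consistency checks are omitted.
func (rf *Raft) onAppendEntries(term int, now time.Time) bool {
	if term < rf.currentTerm {
		return false // stale leader: lastHeardFrom is left untouched
	}
	if term > rf.currentTerm {
		rf.currentTerm = term
		rf.votedFor = -1
	}
	rf.lastHeardFrom = now // case ii): AppendEntries with same or greater term
	return true
}
```

With this shape, a candidate with a stale log can still push the follower's term up, but it can no longer keep the follower's election timer from expiring.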

Lab 2C: persistence (hard)

Before I implement Lab 3

Lab 2D: log compaction (hard)