Surviving CS7210 Distributed Systems – Paxos Is Tough

This is the most difficult Georgia Tech course I’ve taken, and that includes CSE6220 Intro To High Performance Computing. I’ll share my strategies for a couple of the labs, along with some general class advice, to help you make it through.

Overview

This class is built around Ellis Michael’s dslabs framework, which he created as a graduate student at the University of Washington. Many universities are now adopting dslabs as a way to teach distributed systems – in particular, Multi-Paxos with leader election and distributed transactions.

dslabs is one of those projects best described as harder than it looks. Even if you’re a Java engineer with a decade of experience, I think you will struggle with it.

The novel thing about dslabs is its use of search-based tests, which explore the state space of your implementation and check every little corner case. The tests are brutal and will find any bug you have, including performance issues. The class’s Gradescope environment makes the tests even harder by putting smaller timeouts on them. Half the challenge is understanding how the test framework works and learning to debug within it.

I was one of only a few people in a class of 100 to score 100% on all of the projects – they are very tough.

Take the projects seriously

The best overall advice I can give is to start the projects as early as possible. Prior to this class, I usually only worked on projects in the last weekend before they were due. Even in iHPC I rarely started the projects right away. This class is not like that. If you’re like me, change your mindset and be prepared to start working on each project the day it is released. The first three projects (corresponding to dslabs labs 1 and 2) will also lull you into a false sense of security. The difficulty ramps up significantly with projects 4 and 5.

I’m only going to cover projects 4 and 5 in detail here, since the earlier projects are a lot more straightforward. Most of this is taken from notes I made when I did the projects six months ago, so some of the details may be slightly murky for me at this point – I apologize.

Project 4 – Multi-Paxos with Leader Election

This lab was the most difficult for me out of all the labs, and the one that the fewest people finished successfully. If you can get all the tests to pass, you will be in a better place than everyone else for lab 5, which relies on your Multi-Paxos implementation. If you don’t finish lab 4, you’ll be given a black-box reference implementation that is a lot harder to debug. Lab 4 took me about 80 hours to complete.

The first thing to do is study the Paxos Made Moderately Complex (PMMC) paper and start by trying to implement its pseudocode. You should understand this paper very well. The key elements to understand from PMMC are:

  • How does the P1/P2 message protocol provide the Paxos guarantee?
  • What do all the roles do and how do they work together?
  • When are events committed to the logs and what makes them immutable?

Conceptually there is a single immutable log which records all events and never changes once an event is committed (agreed by the leader and a quorum). The PMMC paper splits state between the acceptors’ accepted values and the replicas’ decided log, but you may find, as I did, that this complicates the implementation – it’s a lot simpler to keep a single log. Similarly, once you understand the PMMC paper very well, you’ll probably find it makes sense to merge most of its roles (replica, acceptor, leader) into a single server. This simplifies your implementation and makes it more efficient.
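
To make the single-log idea concrete, here’s a minimal sketch of the structure I mean. The names (PaxosLog, LogEntry, accept, commit) are my own for illustration, not dslabs types, and your real entries will carry your own ballot and command classes:

    import java.util.HashMap;
    import java.util.Map;

    // Illustration only: one log keyed by slot number. An entry can be overwritten
    // while it is merely accepted, but once it is committed (agreed by the leader
    // and a quorum) it must never change again.
    class PaxosLog {
        enum Status { ACCEPTED, COMMITTED }

        static class LogEntry {
            final int ballot;      // ballot the value was accepted under
            final Object command;  // the replicated command
            Status status;

            LogEntry(int ballot, Object command, Status status) {
                this.ballot = ballot;
                this.command = command;
                this.status = status;
            }
        }

        private final Map<Integer, LogEntry> slots = new HashMap<>();

        void accept(int slot, int ballot, Object command) {
            LogEntry existing = slots.get(slot);
            if (existing != null && existing.status == Status.COMMITTED) {
                return;  // committed slots are immutable
            }
            if (existing == null || ballot >= existing.ballot) {
                slots.put(slot, new LogEntry(ballot, command, Status.ACCEPTED));
            }
        }

        void commit(int slot, int ballot, Object command) {
            // Decisions are final: they override whatever was locally accepted.
            slots.put(slot, new LogEntry(ballot, command, Status.COMMITTED));
        }

        LogEntry get(int slot) {
            return slots.get(slot);
        }
    }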

I relied heavily on the visual debugger for my initial implementation. Test 20 is good for this since it works with the debugger. This will get you a lot of the way towards a basic working implementation.

Your stable-leader algorithm is critical and also one of the most difficult things to get right in this lab. Think carefully about how it will work and about all the edge cases, while also optimizing for efficiency. A common issue is leaders ping-ponging and never converging – you may need some sort of back-off strategy to give a leader a chance to get elected under certain conditions.
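
To make the back-off idea concrete, here’s a rough sketch of one possible policy. dslabs drives everything through its own timers, so treat this as pseudocode for the policy rather than framework code; the names and timeout values are all made up:

    import java.util.Random;

    // Illustration only: a heartbeat-driven election policy with exponential
    // back-off and jitter, so two would-be leaders stop preempting each other
    // and one of them eventually wins. Values and names are hypothetical.
    class ElectionPolicy {
        private static final long BASE_TIMEOUT_MS = 100;
        private static final long MAX_TIMEOUT_MS = 2000;

        private final Random random = new Random();
        private long currentTimeoutMs = BASE_TIMEOUT_MS;
        private long lastHeartbeatMs;

        // Call when a heartbeat arrives from the current leader.
        void onHeartbeat(long nowMs) {
            lastHeartbeatMs = nowMs;
            currentTimeoutMs = BASE_TIMEOUT_MS;  // leader is alive, reset back-off
        }

        // Call when our own phase-1 attempt is preempted by a higher ballot.
        void onPreempted() {
            long jitter = (long) (random.nextDouble() * BASE_TIMEOUT_MS);
            currentTimeoutMs = Math.min(currentTimeoutMs * 2, MAX_TIMEOUT_MS) + jitter;
        }

        // Call on a periodic check timer: should we try to become leader now?
        boolean shouldStartElection(long nowMs) {
            return nowMs - lastHeartbeatMs >= currentTimeoutMs;
        }
    }

The key property is that a proposer that keeps getting preempted waits progressively longer, so two competing nodes stop trading the leadership back and forth.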

Beyond that, here are some more general tips for this project:

  • Make liberal use of Java’s assert statement to check your invariants throughout the code (see the sketch after this list). This makes it much easier to catch bugs as the code evolves.
  • You really need to minimize the number of messages sent to get all the search tests to pass. Batching is absolutely critical (e.g. sending a batch of P1/P2/decision messages when you can, rather than one at a time), and you need to be very careful not to send retries unless they are actually necessary. Heartbeats and catch-up messages are other areas to optimize heavily, and avoid setting too many timers.
  • Really understand how the tests work. Using IntelliJ’s debugger can be very helpful here too – you can also modify the test framework code to print information out. This came in handy for me a few times when I got stumped on something really odd.
  • Iterate, iterate, iterate. Start with working, inefficient code, then work on getting the harder tests to pass. Getting the basic tests passing is around 10% of the effort; the rest is spent optimizing the crap out of your implementation so the search tests pass.
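
As an example of the assertion habit from the first bullet, here’s the kind of invariant check I mean. The class and field names are hypothetical, not dslabs APIs, and make sure assertions are actually enabled (java -ea) when you run locally:

    import java.util.HashMap;
    import java.util.Map;

    // Illustration only: asserting invariants after every state change. The field
    // and method names are made up, and the specific invariants will depend on
    // your design; the habit of checking them everywhere is the point.
    class Replica {
        private final Map<Integer, String> decisions = new HashMap<>(); // slot -> command
        private int slotOut = 1; // next slot to execute; everything below is done

        void onDecision(int slot, String command) {
            String previous = decisions.put(slot, command);
            // Paxos safety: a slot can never be decided with two different values.
            assert previous == null || previous.equals(command)
                    : "slot " + slot + " decided twice with different commands";
            executeReady();
        }

        private void executeReady() {
            while (decisions.containsKey(slotOut)) {
                // apply(decisions.get(slotOut)) would go here
                slotOut++;
            }
            // Sanity check: we never leave a decided slot unexecuted.
            assert !decisions.containsKey(slotOut);
        }
    }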

Project 5 – Sharded Key/Value Store

This project is essentially a miniature version of Google’s Spanner. It provides key/value store semantics on top of sharded Paxos replica groups, using your project 4 implementation as the basis for those groups. This lab is around 1.5x-2x the work of project 4 – about 140 hours all in for me. Start working on it as soon as you can.

If you have a working project 4 implementation, you will be in a really good place, although you may find you have to make some slight alterations for this project. Even if you didn’t get all the tests to pass in project 4, it may still be usable for this project which has more forgiving Paxos tests.

Here are my tips for this project:

  • ShardMaster was fairly straightforward – I suggest creating a generic rebalance method which everything feeds into.
  • readOnly() queries should not be stale – they are supposed to go through the Paxos log like every other query. The project 4 tests won’t catch stale reads, but the project 5 tests will. I found I could keep them stale and still get the tests to pass with a small trick, but that was only necessary because the class’s reference Paxos implementation had the same bug and the Gradescope test timeouts were tuned to it (things go much faster when reads are allowed to be stale). If you control your own Paxos implementation, this is a trick you might be able to use to squeeze out more performance.
  • I followed the README suggestions for the most part, including a generic process method that dispatches to an individual method for each command, passing a “replicated” boolean parameter. Each command handler performs AMO (at-most-once) checks and passes the command to Paxos to be replicated if needed.
  • AMO checking of reconfiguration and transactions was by far the most difficult thing to tackle in this project; there are so many subtle edge cases. If I were to do this project again, I’d try to come up with a more generic way to handle AMO across all commands. I didn’t do that here – I just brute-forced every single command, carefully considering how to detect duplicates or stale values in each case so I could reject them when necessary or send back duplicate replies.
  • The fundamental concept that I finally grasped is that no state should ever be modified until things are written to the Paxos log. The log is the only way you can order events in the distributed system and ensure an entire replica group behaves the same (a small sketch of this pattern follows the list). State is somewhat subtle here: it also includes the messages you send (if they are not duplicates).
  • I queue commands during reconfiguration and transactions. I thought queuing was required because you never know the ordering until things are committed to the Paxos log. If I didn’t queue, I hit edge cases where something got committed to the Paxos log and my duplicate detection would then reject it forever afterwards, so I couldn’t just throw things away. It may depend on your implementation – some students managed without queuing.
  • For reconfiguration, I had each node keep track of all the shards it expects to send and all the shards it expects to receive, and only exit reconfiguration once both of these tracking sets are empty (sketched after this list). To ensure each reconfiguration is processed in order, I only query the next expected configuration number (so don’t use -1).
  • As stated above, I queue all commands: pending ShardMoves (before reconfiguration starts) and KVStoreCommands (during reconfiguration). When reconfiguration starts I can process any pending ShardMoves, and when it ends I can process any pending KVStoreCommands.
  • Transaction logic is about twice as hard as reconfiguration logic, I think. I spent over half of my time on this project on transactions alone. It is quite tricky to get right.
  • For transactions, the way you handle locking is incredibly important to prevent deadlock or livelock. After many experiments, I settled on Dijkstra’s classic resource-ordering solution to circular deadlock – the one typically taught in undergrad via the dining philosophers problem. Basically, you assign a global ordering to the resources, in this case the shards, and then acquire the locks one at a time in that order. So if a transaction needs shards [2, 5, 1], you always acquire them in the order 1, then 2, then 5. This guarantees the system can never deadlock (see the lock-ordering sketch after this list). The approach is not very efficient in practice, but it was good enough for this project.
  • The other tricky thing about transactions is how to manage the transaction state in an AMO way. There are different approaches here; I took what I thought was the simplest. I maintain the overall transaction state in the coordinator and ensure clients deterministically go to the same coordinator for a given Transaction command. This latter point is important in my implementation: since only coordinators manage the transaction history, if a client were to resend a transaction to a different coordinator, we could end up applying the transaction twice, leading to inconsistent state. A drawback of my approach is that if the client gets partitioned from its coordinator during a transaction, there is no way to make progress, but solving that wasn’t a requirement of the project.
  • In addition, I simplified things by making coordinators only handle a single transaction at a time, with an ordered sequence number, very similar to how the clients work. This means most of the AMO logic can be taken care of by simply checking the sequence number to see if there are old messages.
  • Coordinators do it all in my approach. They gather the readSet() from all participants in the prepare phase, execute the transaction locally, then send the writeSet() to all participants in the commit phase. Participants update their local sharded stores with the writeSet() so that further commands are consistent. The coordinator updates its local transaction store with the completed transaction and replies to the client. It can then use this local transaction store to detect duplicate transactions from that client and reply to them immediately, just like we do with SingleKeyCommands.
  • One thing I did rather naively at first was allow recursive calls in my Paxos implementation, such that sending a PaxosDecision could then call back into the PaxosServer. I believe neither the single PaxosServer nor the reference one allows you to recursively call into it like this, so you have to be careful to process all state changes in a single PaxosDecision rather than calling back into yourself with a new Command. My transaction logic originally did this; I later removed it because it made debugging very tricky due to massive stack traces (even though it was working).
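
Here is the “nothing changes until it’s in the log” pattern from above as a small sketch. The class, handler, and command names are all hypothetical; the point is that the request handler only proposes, while state mutation and replies happen only when the decision comes back out of Paxos:

    import java.util.HashMap;
    import java.util.Map;

    // Illustration only: message handlers never touch application state directly.
    // They propose a command to Paxos; state changes and replies happen only when
    // that command comes back as a decision, so every node in the replica group
    // applies exactly the same sequence of changes. Names are hypothetical.
    class ShardStoreServer {
        private final Map<String, String> kvStore = new HashMap<>();

        // Step 1: a client request arrives. Do AMO/ownership checks, then propose.
        void onClientRequest(String key, String value, long clientSeq) {
            // ... reject if we don't own the shard, or re-reply if it's a duplicate ...
            proposeToPaxos(new PutCommand(key, value, clientSeq));
            // Note: no kvStore mutation and no "success" reply here.
        }

        // Step 2: the command has been decided. Now (and only now) mutate and reply.
        void onPaxosDecision(PutCommand cmd) {
            kvStore.put(cmd.key, cmd.value);
            // ... record the result for AMO, send the reply to the client ...
        }

        private void proposeToPaxos(PutCommand cmd) { /* hand off to the Paxos layer */ }

        static class PutCommand {
            final String key, value;
            final long clientSeq;
            PutCommand(String key, String value, long clientSeq) {
                this.key = key; this.value = value; this.clientSeq = clientSeq;
            }
        }
    }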
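
Next, the reconfiguration bookkeeping described above, again with made-up names. In the real lab all of these transitions should themselves be driven by commands that have gone through the Paxos log, but the shape is just two pending sets that must drain before the group leaves the reconfiguring state:

    import java.util.HashSet;
    import java.util.Set;

    // Illustration only: per-group bookkeeping for a reconfiguration.
    class ReconfigState {
        private int configNum = 0;                     // last config we finished moving to
        private boolean reconfiguring = false;
        private final Set<Integer> shardsToSend = new HashSet<>();
        private final Set<Integer> shardsToReceive = new HashSet<>();

        // Begin moving to the next expected configuration (always configNum + 1).
        void begin(int newConfigNum, Set<Integer> toSend, Set<Integer> toReceive) {
            assert newConfigNum == configNum + 1 : "reconfigurations must be applied in order";
            reconfiguring = true;
            shardsToSend.addAll(toSend);
            shardsToReceive.addAll(toReceive);
            maybeFinish(newConfigNum);
        }

        void shardHandedOff(int shard, int newConfigNum) {
            shardsToSend.remove(shard);
            maybeFinish(newConfigNum);
        }

        void shardInstalled(int shard, int newConfigNum) {
            shardsToReceive.remove(shard);
            maybeFinish(newConfigNum);
        }

        boolean isReconfiguring() { return reconfiguring; }

        private void maybeFinish(int newConfigNum) {
            // Only leave the reconfiguring state once both pending sets have drained;
            // at that point any queued KVStoreCommands can be processed again.
            if (reconfiguring && shardsToSend.isEmpty() && shardsToReceive.isEmpty()) {
                reconfiguring = false;
                configNum = newConfigNum;
            }
        }
    }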
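
Finally, the lock-ordering idea as a sketch. dslabs nodes are single-threaded and event-driven, so in the lab this ends up as a logical lock table rather than real Java locks (the names here are mine), but the rule is the same: sort the shards a transaction needs and acquire them in that global order:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;
    import java.util.TreeSet;

    // Illustration only: Dijkstra's resource-ordering fix for circular deadlock,
    // applied to shard locks. Transactions always request shards in ascending
    // shard-id order, so a cycle of waiters can never form.
    class ShardLockTable {
        private final Map<Integer, String> holders = new HashMap<>(); // shard -> txn id

        // Try to take the next shard this transaction needs, in global order.
        // Returns true once the transaction holds every shard it asked for.
        boolean tryAcquireNext(String txnId, Set<Integer> neededShards) {
            for (int shard : new TreeSet<>(neededShards)) {  // e.g. [2, 5, 1] -> 1, 2, 5
                String holder = holders.get(shard);
                if (holder == null) {
                    holders.put(shard, txnId);       // grab it and keep going
                } else if (!holder.equals(txnId)) {
                    return false;                    // blocked; retry when it frees up
                }
            }
            return true;  // all shards held by txnId
        }

        void releaseAll(String txnId) {
            holders.values().removeIf(txnId::equals);
        }
    }

A transaction that can’t take its next shard simply waits and retries; the ordering guarantees some transaction can always make progress, so the system never deadlocks.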

Conclusion

If you manage to survive through all the projects, congratulations. It will probably hurt a lot, but you will end up with a good grasp of the fundamentals of Paxos and distributed transactions. Good luck!
