I worked on implementing the Bully algorithm for leader election in a distributed system. I worked with a partner on this project as part of a course on Distributed Computing, and our implementation most heavily relied on Hector Garcia-Molina's original conception of the algorithm.
Leader election is a widely worked on problem in distributed computing that involves optimizing the method by which a particular process (or other entity) is assigned
specific powers such as task management and delegation in order to best maximize the efficiency of the system as a whole. Introducing leaders can be greatly beneficial
to the overall distributed system but can also potentially reduce fault tolerance and other vulnerabilities, and finding an optimal solution to this problem
remains a question.
The Bully algorithm is one such proposed method of electing a leader amongst several processes. The algorithm assumes that all processes are aware of each other's IDs, and so any process can detect another's failure. When a process detects failure, it will send messages to all processes with higher IDs than its own. This should continue until the highest process is a candidate for election, at which time it should send out an election message to all processes. Every process compares its own ID to the candidate, rejecting and sending out its own election message if its own ID is larger. Eventually, the process with the highest ID should be elected leader.
We chose to work on this project for several reasons. As mentioned, leader election is a prominent problem in distributed systems. But besides that, it is also one of the most theoretically understandable problems — and the Bully algorithm is similarly simple to understand on a conceptual level.
The algorithm is also highly efficient, and optimal for fault tolerance in a distributed system as any process can be alerted to a failure. It is synchronous, which makes it slightly more challenging to implement in real systems, but we wanted to try our hand at it given its theoretical merit.
We decided to use RPCs to implement the bulk of the messaging in this algorithm, for heartbeat messages as well as candidate and election messages which also helps in our quest to increase efficiency due to the practical use of RPCs in a distributed environment.
For the Bully Algorithm to work properly, all processes in the distributed system must run synchronously to ensure that timeouts occur at the proper time in relation to the other processes. For this to happen, all processes need to be running on synchronized time. We decided to use the Time package in Golang to implement synchronized clocks, which allows our code to work on a truly distributed system (across multiple devices) as well as in local testing. We determined when a process had timed out by using the shared global current time added to the timeout value, which we predefined as a constant value.
We were able to successfully implement the Bully algorithm using RPCs in the end, and tested our implementation using a simulated distributed system we created with AWS EC2. We published our finished implementation in a GitHub repo.