Flexible Paxos and grid quorums

Mark Mc Keown, Chief Architect, Cirata and Dr. Yeturu Aahlad, Chief Scientist, Inventor, and Co-Founder, Cirata

Flexible Paxos is pretty cool! Before talking about Flexible Paxos let's review Paxos and before talking about Paxos let's talk about majorities, these are the simplest examples of the quorum intersection principle. Once we understand the properties of majorities we can think about these properties in a more general sense of quorums. If we have a fixed set of nodes then there are different sets of these nodes that will form majorities. If we have 5 nodes then any 3 or more nodes are a majority. The interesting thing about these different majorities is that each of them will have 1 or more nodes in common with every other majority.

If we have a policy of only making a decision when there is a majority of nodes then we know that there will be nodes in that majority that were part of every decision that was previously made. So when making that decision we would have the history of every decision already made.

Paxos is a consensus algorithm. A consensus algorithm is used by nodes to agree something, a value. Consensus can be used as a basis of replication, if we can get nodes to agree then we can get them to be replicas of each other. For this to be useful we have to handle the case of nodes failing and being unavailable for periods without interfering with operation.

Paxos is based on the ideas outlined above on majorities. In its simplest form Paxos uses majorities as a form of memory to remember what happened in the past.

An instance of Paxos is used to agree something, there can be multiple instances of Paxos agreeing different things - what is the next change in state of the replicas. Each instance of Paxos may have multiple rounds and in each round there are two phases, in each phase the nodes communicate with each other.

In the first phase a node (the proposer) that wants to get something agreed effectively asks the other nodes what is the state of the Paxos instance. The proposer waits for a majority of nodes to respond, if a majority nodes respond then it can know if something might have been agreed or not for this instance of Paxos. If it determines some value might have been agreed already then it takes on the job of making sure that value is agreed in the next phase, if it determines that it does not look like a value has been agreed then it can use its own value in the next phase.

In the second phase the proposer sends the value to the nodes and tells them to accept it. It waits for a majority of responses and if these are acknowledgements then it knows the value has been agreed and will never change.

What happens if two nodes decide to propose something for a Paxos instance? If both start phase 1 and the rest of the nodes receive the phase 1 messages in a jumbled order? Or one node completes phase 1 but before it completes phase two another node starts phase 1? Paxos guarantees nothing bad will happen in this scenario - no split brain - the proposers will have to do more rounds of phase 1 and phase 2 until one is lucky and wins the race or they will race forever doing more and more rounds of Paxos.

Racing forever is not a good thing so we want to engineer a solution to avoid this and there are a number of ways of doing this. One approach is to have the nodes that propose to have randomized backoffs to increase the luck for one node to complete. Another approach is to try and make sure there is only one node proposing (known as the weak leader), there is no perfect approach to this but we know nothing bad can happen if we get it wrong (somehow end up with two leaders for a period).

Next let’s optimize Paxos. To get a value agreed in a Paxos instance without any interference requires two rounds of communication, phase 1 and phase 2. We know that as a proposer if we get stale results from phase 1, round 1, it does not matter as Paxos will still be safe - we will get rejected in the second phase. We can also use a weak leader to avoid excessive conflict. Combining these two we can have the weak leader skip phase 1 for multiple instances of Paxos by effectively doing a single phase 1 for those multiple Paxos instances. This means that in general agreements would only need a single round of communication with nodes, phase 2 – this optimization is known as Multi-Paxos.

This is Paxos from a high level and lacking detail. Lets focus on one of those details, the use of the majority. Paxos relies on the “quorum intersection principle”. A quorum is a set of sets, such that each member set has non-empty intersections with each of the others, the member sets are called "quorum set"s and the parent set is called "quorum". This is obvious for a majority, if we write out all the possible majorities for a small set of nodes we can see they intersect and each majority is a quorum set. So what other ways can we construct quorums? A grid.

If we have 20 nodes then a simple majority would need 11 nodes. If we lay the nodes out in a 5x4 grid. 5 columns and 4 rows. Each column has 4 nodes and each row has 5 nodes. We can define a quorum set as any complete row plus any complete column; the shared node is where the column and row intersect on the grid. With 20 nodes we only need 8 nodes to reach agreement, less than a majority!

What about Flexible Paxos? Flexible Paxos makes the observation that we can use different quorums for phase 1 than for phase 2, but the quorum sets used for phase 1 and phase 2 must intersect. The set of nodes used within a phase do not need to intersect with each other, but there must be shared nodes between phases. Quorums are important because they carry knowledge forward through the shared nodes from the quorum sets. We want to carry knowledge between the phases.

What does Flexible Paxos look like if we use different quorums in different phases? If we have 9 nodes, we could decide that in phase 1 we need 7 nodes and in phase 2 we only need 3 nodes. Any 7 node quorum set used in phase 1 will intersect with any phase 2 quorum set with 3 nodes - so knowledge can be carried between phases. For phase 2 we need 3 nodes available while for phase 1 we need 7 nodes. For Multi-Paxos we would do phase 2 much more often than phase 1.

What are the tradeoffs of this approach? Without Flexible Paxos and using a simple majority we need 5 nodes in a 9 node system to be available to make progress. With Flexible Paxos and the configuration above we only need 3 nodes to make agreements but we now need 7 nodes for a new leader to take over and complete phase 1 for Multi-Paxos.

What happens when we combine Flexible Paxos and Grid Qouroms. For Flexible Paxos we can have different quorum sets for phase 1 and phase 2 but those quorum sets must intersect. If we look at a grid we know the columns and rows intersect. We can use the rows for phase 1 and the columns for phase 2. For a leader to complete phase 1 it would need a complete row of nodes to respond, for phase 2 it would need a complete column of nodes to respond.

Applying this to the 20 node grid with 4 rows and 5 columns. A new leader would need a complete row of 5 nodes for phase 1, and then a complete column of 4 nodes for phase 2. If the system used Multi-Paxos there would be many more phase 2’s than phase 1’s. If the leader had 5 instances of Paxos to complete then he could complete the phase 2’s for each of these Paxos instances with the 5 different columns of nodes - this would spread the load across the nodes on the system. In this configuration we need a column of 4 nodes available to make agreements and a row of 5 nodes available to change the leader - we can also complete phase 2 with different subsets of nodes reducing the load.

There is in general no such thing as a free lunch, so what is the trade off between using Flexible Paxos with grid quorums compared to a simple majority quorum across the two phases? The answer is availability - the system is not as robust to failures and that is something we can explore in another blog.

Subscribe for updates

Check out our other recent blogs

Cirata unlocks GenAI with Databricks

To succeed in the development of generative AI applications, businesses must use their unique data assets more efficiently and effectively. Cirata and Databricks are partnering to help users more quickly build and deploy...

Cirata and Apache Iceberg

What is Apache Iceberg Originally a Netflix project, Iceberg was made open source in 2018 and graduated from incubation under Apache Software Foundation governance in 2020. It is now a leading standard for what is called...

Scale AI in IBM watsonx.data with Cirata Data Migrator

Cirata announces the general availability of Data Migrator 3.0, with full support for IBM watsonx.data. Cirata and IBM continue to collaborate on their partnership with a focus of driving data and simplifying integration...

Cookies and Privacy

We use technology on our website to collect information that helps us enhance your experience and understand what information is most useful to visitors.
By clicking “I ACCEPT,” you agree to the terms of our privacy policy.

Cookie Setting