Multi-Paxos

Last week we’ve seen how basic Paxos works. Today we’re going to extend it in order to run a distributed state machine  – a state machine with the same state on all the nodes.

The idea is to use a distributed log to run the state machine. Each entry in the log is an operation to apply to the state machine. If the log is the same on every node then the same operations are applied in the same order on all the nodes and therefore the state machines are all in the same state.

The question is :

How to make sure the log is the same on all the nodes?

The solution is to run basic-Paxos for every log entry (plus add some tweaks to solve some issues and improve performance)

First we need to identify for which log entry we’re choosing a value. Our first tweak is therefore to add a log-entry number to every propose and accept messages.

And that leads directly to our first problem:

How to figure out which log entry to use?

Simply each server keeps an index of smallest available log entry number (entry with no chosen value) and try to propose the new value for this slot. It keeps proposing the same value on increasing entry number until the value is chosen.

Servers can handle multiple client requests simultaneously (for different log entries). However updating the state machine is sequential (because all the preceding commands must be applied before applying a new one).

Now that the biggest problem is solved in theory, it may not work well in practice because of contention – several servers propose different values for the same log entry. Choosing a value requires at least 2-phases and possibly more in case multiple values are proposed.

The next tweak aims at improving the performance by resorbing the contention.

Let’s assume that we pick a leader and that this leader is the only one who can send “propose” requests.

It means that the leader accepts client requests and acts both as a  proposer and as an acceptor.

The other servers can’t propose values so they redirect clients to the leader and act as acceptors only.

One simple strategy is to use the server ids to determine who the leader is. Simply the leader is the server with highest id.

That means that each server needs to know the ids of the other servers. For this purpose each server sends heartbeat messages every T ms.

If  a server doesn’t receive any message with a higher id than its own for a long enough period (2 x T ms) then it becomes the leader.

This is probably the most simple strategy but other strategies are more efficient for that matter (e.g. leased-based approach).

With this strategy there might be 2 leaders in the system at the same time. This is still ok as Paxos supports it (with a little lack of efficiency).

Improve performances by reducing the number of exchanged messages

If only the leader can run the prepare phase, then we can run it once for the whole log (not for every single log entry) that means we can then eliminate all prepare phases but the first one. So when a message comes in the leaser just run the “accept” phase (reducing by half the number of exchanged messages).

Remember that to goal of the “prepare” phase is twofold:

• block older proposals
• find out possible chosen values (i.e. accepted values)

If we get rid of the “prepare” phase we must solve this two problems. First one is trivial: use a unique proposal number for the entire log. To solve the second one we add a new flag into the propose response : “no more accepted value”. This flag indicates wether or not a server has accepted values for subsequent log entries.

This flag allows the leader to skip sending propose message to servers having “no more accepted value” and moreover once it has received “no more accepted value” from the majority, the leader can omit all the “propose” messages.

The leader skips the “prepare” phase until one “accept” request is rejected (because another server has become a leader).

That’s good progress but we’re not quite there yet. We need to make sure that the state machine is fully replicated on all the servers in case the leader crashes.

Replication of chosen values on all the servers

Remember that with basic-Paxos only the proposer (in this case the leader) knows about chosen values.

To improve replication the leader simply keeps sending “accepts” requests in the background until all servers respond. As the request are sent in the background it doesn’t influence the response-time to the clients. However it doesn’t ensure full replication either. (e.g. if leader crashes before full replication).

For that each server needs to track “chosen” entries. Each server marks the entries known to be chosen with an “acceptedProposal” value equals infinity. (This kind of makes sense because Paxos always keeps “acceptedProposal” with the greatest value – basically it just a way to say that this value won’t be overwritten).

Each server also maintains a “firstUnchosenIndex” to track the lowest log entry which value is not known to be chosen (i.e. a log entry with “acceptedProposal” != infinity).

The proposer (i.e. the leader) tells the other servers about its “firstUnchoosenIndex” in every “accept” request. The proposer is the only one to know when a value is chosen. Embedding its “firstUnchosenIndex” allows the acceptors to know which entries have already been chosen.

On the acceptor side when it receives an accept message, it checks if it has any past log entries older than the received “firstUnchosenIndex” with a proposal number matching the proposal number in the request. In this case it can mark these entries as accepted (i.e. set accepted proposal to infinity).

Still might be some not chosen values from previous leader in the acceptor log entries.

To solve this problem acceptors include their firstUnchosenIndex in the accept replies. When the proposer receives a reply with firstUnchosenIndex older than its own firstUnchosenIndex it sends a “Success” message containing the log entry and the chosen value.

It allows the acceptor to record the chosen value for this log entry. So it marks this log entry as chosen (acceptProposal = infinity) and update its firstUnchosenValue and it includes it in the response. (It allows the leader to send another Success message if needed).

Now our log is fully replicated (and so is the state machine). So let’s focus on the interaction with the client.

Client protocol requirements to support server crashes

The client doesn’t know who the leader is so it contact any server which – if not the leader – will sends back a redirect message pointing to the current leader.

Clients stick with the same leader until they can’t contact it anymore (e.g. request timeout). In which case the client contacts any other server who will point him to the actual leader.

There is still a flaw in the following case:

The client request is chosen for a given log-entry. Then the leader crashes just before sending its reply to the client. The client thinks the request failed and retry the same request with the new leader. In this case the command might be applied twice.

To prevent it the client must embed a unique id inside its request. The server saves this id in the log entry alongside with the command. It allows the leader to find out if a client command has already been chosen.

Finally we’ve reach an exactly once semantic (as long as the client doesn’t crash – if it crashes then it becomes an “at most once” semantic as a client command might be lost).

Cluster management (nodes joining or leaving the cluster)

This last point is really tricky as any changes in the cluster configuration also changes the quorum (i.e. the number of nodes required to form a majority).

Therefore we must ensure there is never 2 overlapping majority that may end up choosing 2 different commands for the same log entry.

The solution is to use the log itself to record the cluster configuration.

It means that all log-entries are a result of previous log-entries holding the cluster configuration.

We introduce a parameter $$\alpha$$ to control the delay between a configuration change and its application.

• if $$\alpha = 1$$ then the configuration changes are applied immediately. It takes effect on the very next entry.
• if $$\alpha = 100$$ then the configuration changes take effect only after 100 more entries have been chosen.

It’s interesting to not that $$\alpha$$ influences how much concurrency is available in the system. If $$\alpha$$ is small (e.g. 1) then there is no parallelism in the system. Every log entry is chosen sequentially (otherwise we might miss a configuration change). If $$\alpha$$ is very large when can choose many entries simultaneously but it will take a very long time for a configuration change to take effect. A workaround is to fill the log with “no-op” entries so that the new configuration takes effect faster.

Basic Paxos is simple enough and well-understood. Mutli-Paxos (although based on the same algorithm) can become quite tricky to implement. However it’s one of the corner-stone of consensus and is used (or its variants) in many distributed system today.