This is a guide to potential cluster failure scenarios and their resolutions.
Leaf Failures in a Redundancy-1 Cluster
When a leaf dies in a redundancy-1 cluster, all partitions hosted on that leaf will be offline to reads and writes. There are three potential resolutions: reintroduce the leaf, introduce a replacement leaf, or recreate the lost partitions on the remaining leaves.
If the leaf machine is recoverable, MemSQL will automatically reattach the leaf as soon as it is back online.
If the leaf machine is unrecoverable but you can still access its data, then you can introduce a replacement following the guide on how to Replace a Dead Leaf in a Redundancy-1 Cluster.
If the leaf machine and its data are unrecoverable and you wish to introduce a replacement, follow steps 1 to 3 of the guide on how to Replace a Dead Leaf in a Redundancy-1 Cluster, then run REBALANCE PARTITIONS … FORCE. MemSQL will create new (empty) partitions to replace those that were lost.
Finally, if you wish to recreate the lost partitions on the remaining leaves, run REBALANCE PARTITIONS … FORCE. MemSQL will distribute the replacement partitions across the remaining leaves.
REBALANCE PARTITIONS … FORCE locks in data loss. After running it, you can no longer reintroduce the data from the lost machine’s partitions.
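The forced rebalance can be scripted across databases. The sketch below is illustrative only and assumes a bash shell on the Master Aggregator: it reads database names from stdin and prints the statements so you can review them before piping them into the mysql client yourself, since FORCE locks in data loss.

```shell
# Sketch: emit a REBALANCE PARTITIONS ... FORCE statement for each user
# database named on stdin. FORCE locks in data loss, so the statements are
# printed for review rather than executed against the cluster directly.
rebalance_force_statements() {
  while read -r DB; do
    case "$DB" in
      memsql|information_schema|sharding) continue ;;  # skip system databases
    esac
    echo "REBALANCE PARTITIONS ON \`$DB\` FORCE;"
  done
}

# Example with a literal database list; in practice, pipe in the output of
# SHOW DATABASES from the Master Aggregator instead.
printf 'memsql_demo\nmemsql\n' | rebalance_force_statements
```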
Replace a Dead Leaf in a Redundancy-1 Cluster
This guide shows how to replace a dead leaf in a redundancy-1 cluster. It assumes that data can be recovered from the dead leaf and will be restored in Step 5.
Step 1. Remove the dead leaf from MemSQL Ops
Unmonitor the dead MemSQL Leaf and uninstall the dead MemSQL Ops Agent by executing the following commands:
$ memsql-ops memsql-unmonitor <DEAD_LEAF_ID>
$ memsql-ops agent-uninstall --force --agent-id <DEAD_AGENT_ID>
You can get Leaf IDs and Agent IDs by executing the memsql-ops memsql-list and memsql-ops agent-list commands.
Step 2. Remove the dead leaf from the MemSQL cluster
On the Master Aggregator, run REMOVE LEAF against the dead leaf to remove it from SHOW LEAVES and free up its pairing slot.
memsql> REMOVE LEAF "<DEAD_LEAF>"[:<PORT>];
Step 3. Deploy a new MemSQL leaf
You can deploy a new MemSQL Ops Agent and MemSQL leaf via the web UI, or via CLI. Once the agent is deployed, you can replace the settings.conf file from the dead agent:
$ memsql-ops agent-deploy --host <HOST_IP> [--user <USER> --identity-file /path/to/id_rsa]
$ memsql-ops agent-stop <NEW_AGENT_ID>
# edit your settings.conf file at /var/lib/memsql-ops/settings.conf
$ memsql-ops agent-start <NEW_AGENT_ID>
$ memsql-ops memsql-deploy --agent-id <NEW_AGENT_ID> --role leaf
Step 4. Stop the new MemSQL leaf
Make sure the new MemSQL Leaf is NOT running before copying the recovered data:
$ memsql-ops memsql-stop <NEW_MEMSQL_LEAF_ID>
Step 5. Copy the recovered data into the new MemSQL leaf
Copy the recovered data into the new MemSQL data directory, making sure to preserve the new leaf’s memsql_id file.
$ sudo mv /var/lib/memsql/leaf-3306/data/memsql_id /tmp
$ sudo cp -r /path/to/recovered/leaf-3306/data/* /var/lib/memsql/leaf-3306/data/
$ sudo mv /tmp/memsql_id /var/lib/memsql/leaf-3306/data/
$ sudo chown -R memsql:memsql /var/lib/memsql/leaf-3306/data/
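The memsql_id dance in Step 5 is easy to get wrong, so here is the same sequence as a reusable function. This is a sketch under the path assumptions above (the leaf-3306 port directory is specific to your deployment), and the demo runs against throwaway directories, so it needs neither sudo, chown, nor a running cluster.

```shell
# Sketch: copy recovered data into the new leaf's data directory while
# preserving the new leaf's own memsql_id file (same sequence as Step 5).
set -eu

restore_leaf_data() {
  local recovered="$1" datadir="$2" tmp
  tmp=$(mktemp -d)
  mv "$datadir/memsql_id" "$tmp/"    # save the new leaf's identity
  cp -r "$recovered"/. "$datadir/"   # copy the recovered data in
  mv "$tmp/memsql_id" "$datadir/"    # put the saved memsql_id back
  rmdir "$tmp"
}

# Demo against throwaway directories instead of /var/lib/memsql.
demo=$(mktemp -d)
mkdir -p "$demo/recovered" "$demo/data"
echo "old-leaf-id" > "$demo/recovered/memsql_id"   # dead leaf's identity
echo "partition-data" > "$demo/recovered/db_0"     # hypothetical partition file
echo "new-leaf-id" > "$demo/data/memsql_id"        # new leaf's identity
restore_leaf_data "$demo/recovered" "$demo/data"
cat "$demo/data/memsql_id"   # prints: new-leaf-id
```

The recovered directory may contain the dead leaf’s own memsql_id; it is overwritten when the saved file is moved back, which is exactly what the manual steps achieve.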
Step 6. Restart the new MemSQL leaf
Restart the new MemSQL leaf.
$ memsql-ops memsql-start <NEW_MEMSQL_LEAF_ID>
Step 7. Reattach partitions
The partitions are now present on the new MemSQL leaf, but the MemSQL distributed system is not yet aware of them.
The simplest way to trigger partition detection is to temporarily remove the new leaf from the cluster. MemSQL Ops will automatically reattach the leaf, and the reattachment will trigger detection (and attachment) of all the partitions.
On the Master Aggregator, run:
memsql> REMOVE LEAF "<NEW_LEAF>";
In the examples below, 10.0.0.101 is the new MemSQL Leaf, and 10.0.2.128 is an existing leaf.
Before reattaching partitions:
memsql> SHOW PARTITIONS ON `memsql_demo`;
+---------+------------+------+--------+--------+
| Ordinal | Host | Port | Role | Locked |
+---------+------------+------+--------+--------+
| 0 | 10.0.2.128 | 3306 | Master | 0 |
| 1 | 10.0.2.128 | 3306 | Master | 0 |
| 2 | NULL | NULL | NULL | 0 |
| 3 | NULL | NULL | NULL | 0 |
...
After reattaching partitions:
memsql> SHOW PARTITIONS ON `memsql_demo`;
+---------+------------+------+--------+--------+
| Ordinal | Host | Port | Role | Locked |
+---------+------------+------+--------+--------+
| 0 | 10.0.2.128 | 3306 | Master | 0 |
| 1 | 10.0.2.128 | 3306 | Master | 0 |
| 2 | 10.0.0.101 | 3306 | Master | 0 |
| 3 | 10.0.0.101 | 3306 | Master | 0 |
...
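If you want to check progress without eyeballing the table, offline partitions (NULL host) can be counted mechanically. This sketch assumes the tabular output format shown above; with the mysql client’s --batch mode the rows are tab-separated instead, so the field handling would need adjusting.

```shell
# Sketch: count partitions whose Host is NULL in SHOW PARTITIONS output.
count_offline_partitions() {
  # data rows start with '|'; field 3 is the Host column ('|' is the
  # separator, so field 1 is empty)
  awk -F'|' '/^\|/ && $3 ~ /NULL/ { n++ } END { print n+0 }'
}

# Example on the "before" output above: prints 2, the number of offline
# partitions still awaiting reattachment.
count_offline_partitions <<'EOF'
+---------+------------+------+--------+--------+
| Ordinal | Host       | Port | Role   | Locked |
+---------+------------+------+--------+--------+
|       0 | 10.0.2.128 | 3306 | Master |      0 |
|       1 | 10.0.2.128 | 3306 | Master |      0 |
|       2 | NULL       | NULL | NULL   |      0 |
|       3 | NULL       | NULL | NULL   |      0 |
+---------+------------+------+--------+--------+
EOF
```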
Leaf Failures in a Redundancy-2 Cluster
One Leaf Dies
Any partitions for which the dead leaf was the partition master will have their replicas on the dead leaf’s pair promoted to master.
You can reintroduce the dead leaf, or add a new leaf to replace it.
Reintroducing the leaf is the simplest solution – MemSQL will automatically reattach the leaf as soon as it is back online.
If you decide to add a replacement leaf on a different host, you can follow this guide on how to replace a dead leaf in a redundancy-2 Cluster.
Step 1. Remove the dead leaf from MemSQL Ops
Unmonitor the dead MemSQL Leaf and uninstall the dead MemSQL Ops Agent:
If the machine for the dead agent is still accessible, ensure all memsqld processes are killed and their data directories are emptied before attempting to uninstall the agent.
$ memsql-ops memsql-unmonitor <DEAD_LEAF_ID>
$ memsql-ops agent-uninstall --force --agent-id <DEAD_AGENT_ID>
Note that leaf and agent ids can be retrieved with memsql-ops memsql-list and memsql-ops agent-list.
Step 2. Figure out which availability group the failed leaf was in
Run SHOW LEAVES on the master aggregator to identify the dead leaf and its availability group:
memsql> show leaves;
+----------------+------+--------------------+----------------+-----------+---------+--------------------+------------------------------+
| Host | Port | Availability_Group | Pair_Host | Pair_Port | State | Opened_Connections | Average_Roundtrip_Latency_ms |
+----------------+------+--------------------+----------------+-----------+---------+--------------------+------------------------------+
| 54.242.219.243 | 3306 | 1 | 54.196.216.103 | 3306 | online | 1 | 0.640 |
| 54.160.224.3 | 3306 | 1 | 54.234.29.206 | 3306 | online | 1 | 0.623 |
| 54.196.216.103 | 3306 | 2 | 54.242.219.243 | 3306 | online | 1 | 0.583 |
| 54.234.29.206 | 3306 | 2 | 54.160.224.3 | 3306 | offline | 0 | NULL |
+----------------+------+--------------------+----------------+-----------+---------+--------------------+------------------------------+
In this example, the offline leaf was in availability group 2.
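The availability group can also be pulled out mechanically. A sketch, assuming the tabular SHOW LEAVES format shown above (with '|' as the separator, Availability_Group is awk field 4, since field 1 is empty):

```shell
# Sketch: print the Availability_Group of every offline leaf in SHOW LEAVES
# output. The word "offline" only appears in the State column of data rows.
offline_leaf_groups() {
  awk -F'|' '/offline/ { gsub(/ /, "", $4); print $4 }'
}

# Example on the output above: prints 2, the dead leaf's availability group.
offline_leaf_groups <<'EOF'
| 54.242.219.243 | 3306 | 1 | 54.196.216.103 | 3306 | online  | 1 | 0.640 |
| 54.234.29.206  | 3306 | 2 | 54.160.224.3   | 3306 | offline | 0 | NULL  |
EOF
```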
Step 3. Remove the dead leaf from the MemSQL cluster
Run REMOVE LEAF on the master aggregator to remove the dead leaf from SHOW LEAVES and free up its pairing slot.
memsql> REMOVE LEAF "<DEAD_LEAF>"[:<PORT>] FORCE;
Step 4. Deploy a new MemSQL leaf
You can deploy a new MemSQL Ops Agent and MemSQL leaf via the web UI, or via CLI. Once the agent is deployed, you can replace the settings.conf file from the dead agent:
$ memsql-ops agent-deploy --host <HOST_IP> [--user <USER> --identity-file /path/to/id_rsa]
$ memsql-ops agent-stop <NEW_AGENT_ID>
# edit your settings.conf file at /var/lib/memsql-ops/settings.conf
$ memsql-ops agent-start <NEW_AGENT_ID>
$ memsql-ops memsql-deploy --agent-id <NEW_AGENT_ID> --role leaf --availability-group <GROUP>
After this has completed, SHOW LEAVES on the master aggregator should indicate that all leaves are online and paired:
+----------------+------+--------------------+----------------+-----------+--------+--------------------+------------------------------+
| Host | Port | Availability_Group | Pair_Host | Pair_Port | State | Opened_Connections | Average_Roundtrip_Latency_ms |
+----------------+------+--------------------+----------------+-----------+--------+--------------------+------------------------------+
| 54.242.219.243 | 3306 | 1 | 54.196.216.103 | 3306 | online | 2 | 0.578 |
| 54.160.224.3 | 3306 | 1 | 54.145.52.142 | 3306 | online | 2 | 0.624 |
| 54.196.216.103 | 3306 | 2 | 54.242.219.243 | 3306 | online | 2 | 0.612 |
| 54.145.52.142 | 3306 | 2 | 54.160.224.3 | 3306 | online | 1 | 0.568 |
+----------------+------+--------------------+----------------+-----------+--------+--------------------+------------------------------+
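Before moving on to the rebalance step, it can be handy to assert that every leaf really is online. A sketch over the same tabular format (data rows start with '|' followed by an IP):

```shell
# Sketch: succeed only if every data row in SHOW LEAVES output reports the
# "online" state; any other state (offline, detached, ...) fails the check.
leaves_all_online() {
  ! grep -E '^\| *[0-9]' | grep -vq ' online '
}

if leaves_all_online <<'EOF'
| 54.242.219.243 | 3306 | 1 | 54.196.216.103 | 3306 | online | 2 | 0.578 |
| 54.145.52.142  | 3306 | 2 | 54.160.224.3   | 3306 | online | 1 | 0.568 |
EOF
then
  echo "all leaves online; safe to rebalance"
fi
```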
Step 5. Rebalance partitions on all databases
Finally, you need to rebalance the cluster by running REBALANCE PARTITIONS on each of your databases. On the Master Aggregator, you can run the following script:
#!/bin/bash
for DB in `mysql -u root -h 127.0.0.1 --batch --skip-pager --skip-column-names --execute="SHOW DATABASES" | grep -vE "^(memsql|information_schema|sharding)$"`
do
    echo "Rebalancing partitions on DB $DB"
    mysql -u root -h 127.0.0.1 --batch --execute "REBALANCE PARTITIONS ON $DB"
done
A Pair of Leaves Die
In this case, the partitions that were hosted on the dead leaves have no remaining instances. This means that these partitions are offline to reads and writes.
If either of the leaf machines (or at least its data) is recoverable, you can reintroduce it and reattach its partitions, as in the Leaf Failures in a Redundancy-1 Cluster section. At that point, the partitions are back online for reads and writes, and the failure scenario reduces to the One Leaf Dies scenario in a redundancy-2 cluster.
If neither leaf machine is recoverable, then data loss has occurred. You can now add replacement leaves, after which you may run REBALANCE PARTITIONS … FORCE to create new (empty) replacement partitions.
Many Unpaired Leaves Die
As long as no two paired leaves have both died, all partitions are still available for reads and writes.
As a special case of this scenario, all leaves in one availability group can be down. No data loss is incurred as long as redundancy is restored before an additional leaf from the remaining availability group dies.
Many Leaves Die, Some of Them Paired
Every partition for which both leaves hosting it died is now offline to reads and writes. Partitions for which only one leaf in the relevant pair died remain online.
Offline partitions should be handled as in the A Pair of Leaves Die scenario above. However, RESTORE REDUNDANCY or REBALANCE PARTITIONS should be run only after every partition has either been recovered or given up as lost (because both leaves that were hosting its data are unrecoverable).
Aggregator Failures
Master Aggregator Dies
If the master aggregator dies and is recoverable, all that is necessary is to restart the MemSQL process.
If it is not recoverable, then a child aggregator must be promoted to master with AGGREGATOR SET AS MASTER.
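For example, on the child aggregator you want to promote, run:

memsql> AGGREGATOR SET AS MASTER;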
One Child Aggregator Dies
If the aggregator is recoverable, all that is necessary is to restart the MemSQL process. Otherwise, simply deploy a new MemSQL instance as an aggregator.
Many Aggregators Die
If more than one aggregator dies, they can be reintroduced or replaced one by one. An important caveat is that if half or more of the aggregators die at once, the others can shut down as well. This behavior is necessary to avoid inconsistency in the event of network partitions, as discussed in the next section.
Network Partitions
In the event of a network partition - a network failure which causes the network to be split with some nodes losing contact with other nodes - nodes on separate sides of the network partition cannot continue operating normally after losing contact with each other. If they did, this would risk a potential loss of consistency in the cluster, since cluster metadata changes made by nodes on one side could not be propagated to the other side.
To prevent loss of consistency, aggregators will shut themselves down if they lose contact with a majority of all the aggregators; the specifics are detailed below. In the event of a network partition that splits the network in two, the side with more aggregators continues operating while the nodes on the other side shut themselves down. For example, if 5 total aggregators are partitioned into sides of 3 and 2, the side with 3 aggregators will continue operating, while the 2 aggregators on the other side will shut themselves down.
This aggregator self-shutdown behavior applies to all cases when an aggregator loses contact with a majority of the aggregators, not just to network partitions, since it is not possible in general to distinguish a network partition from other types of node failures.
This aggregator self-shutdown behavior only applies to clusters with three or more total aggregators. A cluster with exactly two aggregators (1 master and 1 child aggregator) is excepted from this behavior so that you can shut down one aggregator node without shutting down the other, but this makes a cluster with exactly two aggregators vulnerable to loss of consistency in the event of a network partition.
In a cluster with three or more total aggregators, an aggregator must either be able to contact a strict majority of all aggregators (including itself), or (if the total number of aggregators is even) be able to contact exactly half of the aggregators, one of which is the master aggregator. If it is unable to contact sufficient aggregators, it will shut itself down.
For another example, if 6 total aggregators are partitioned into groups of sizes 3, 2, and 1 (with the master aggregator in the group of size 2), all of them will shut down: no aggregator is in a group of size at least 4 (a strict majority), and no aggregator is in a group of exactly 3 that contains the master aggregator (exactly half, including the master).
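The survival rule above can be sketched as a small decision function. The inputs are the total aggregator count, the size of one group, and whether that group contains the master; remember the rule only applies to clusters with three or more aggregators.

```shell
# Sketch of the aggregator self-shutdown rule: a group keeps running if it
# holds a strict majority of all aggregators, or exactly half of an even
# total provided the master aggregator is in the group.
group_survives() {
  local total=$1 group=$2 has_master=$3
  if [ $(( group * 2 )) -gt "$total" ]; then
    echo "survives"      # strict majority
  elif [ $(( group * 2 )) -eq "$total" ] && [ "$has_master" = "yes" ]; then
    echo "survives"      # exactly half, including the master
  else
    echo "shuts down"
  fi
}

# The 5-aggregator example: a 3-vs-2 split.
group_survives 5 3 no    # prints: survives
group_survives 5 2 yes   # prints: shuts down
# The 6-aggregator example: groups of 3, 2 (with the master), and 1.
group_survives 6 3 no    # prints: shuts down
group_survives 6 2 yes   # prints: shuts down
group_survives 6 1 no    # prints: shuts down
```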