Outdated Version

You are viewing an older version of this section. View current production version.

Distributed Architecture

MemSQL distributed architecture is designed to be straightforward, simple, and fast. This section provides an overview of MemSQL clusters, including how the various components interact, and what happens in the MemSQL environment when you perform query or administrative operations.

Design Principles

MemSQL’s distributed system is designed around a few key principles:

  • Performance first. MemSQL is designed to enable high throughput on concurrent workloads, and is a highly scalable distributed system. The cluster can be scaled out at any time to provide increased storage capacity and processing power. MemSQL can process billions of rows per second on clusters of commodity hardware.
  • Separate the cluster into two tiers (aggregators and leaves) and push as much work as possible to the leaves. This enables you to scale cluster capacity and query performance online by adding additional leaves to a cluster. MemSQL cluster operations are dynamic - you will never have to restart any aggregators or leaves while managing the cluster.
  • No single point of failure. By running the cluster with redundancy, you ensure that every partition of data is backed up to a hot standby. In the event of any leaf failure, MemSQL will automatically fail over replica partitions. The aggregator tier is also resilient to failure.
  • Powerful but simple cluster management. The distributed system exposes REBALANCE PARTITIONS, which lets you scale up and down the cluster, restore data in the event of failure, and recreate the cluster from scratch in the event of a complete system failure.
  • No implicit data movement. In the event of a failure, MemSQL promotes just enough replica partitions to put the cluster back online. MemSQL will never rebalance or move data around implicitly.

This topic explains the high level concepts behind how MemSQL accomplishes these goals.

Cluster Components

A MemSQL cluster consists of two tiers:

  • Aggregator nodes handle the metadata of the distributed system, route queries, and aggregate results. Depending on query volume, a cluster may have one or many aggregators. A specialized aggregator, the master aggregator, handles only metadata and is responsible for cluster monitoring and failover.
  • Leaf nodes store data and execute SQL queries issued by the aggregator tier. A leaf is a MemSQL server instance consisting of multiple partitions. Each partition is just a database on that server.

Aggregators and leaves share the same MemSQL binary, so you can deploy the same build to every machine in the cluster.

The minimal setup for a MemSQL cluster is just one aggregator (the master aggregator) and one leaf. You can add more aggregators, which will read metadata from the master aggregator, and can run DML commands on the leaves.


MemSQL throws an error if you deploy a cluster with an odd number of leaf nodes, except the minimal deployment cluster configuration.

The number of deployed aggregator and leaf nodes determines the storage size and performance of a cluster. Typical deployments have a 5:1 ratio of leaf:aggregator nodes. In a well-designed cluster:

  • applications that require higher connection capabilities from application servers have a higher aggregator-to-leaf node ratio
  • applications with larger storage requirements have a higher leaf-to-aggregator node ratio

You can list all the aggregators and leaves in cluster using the SHOW AGGREGATORS and SHOW LEAVES commands.

| Host          | Port  | State  | Opened_Connections | Average_Roundtrip_Latency | Master_Aggregator |
|     |  3306 | online |                  0 |                      NULL |                 1 |
| |  3306 | online |                  1 |                     0.200 |                 0 |
2 rows in set (0.00 sec)

memsql> SHOW LEAVES;
| Host          | Port  | Availability_Group | Pair_Host     | Pair_Port | State  | Opened_Connections | Average_Roundtrip_Latency |
| |  3306 |                  1 | |      3306 | online |                  5 |                     0.299 |
| |  3306 |                  1 | |      3306 | online |                  5 |                     0.293 |
| |  3306 |                  2 | |      3306 | online |                  1 |                     0.271 |
| |  3306 |                  2 | |      3306 | online |                  1 |                     0.264 |
4 rows in set (0.00 sec)

Cluster Communication

Except for the transfer of data for replication and heartbeats, most of the communication between MemSQL nodes is implemented with SQL commands.

Querying MemSQL

The first time a query is executed against MemSQL, it is compiled and cached in memory (see Code Generation).

User queries are always directed to an aggregator. Except for DDL operations and writes to reference tables, which must go through the master aggregator, any query can be run against any aggregator.

Queries that involve only reference tables are executed directly on the aggregator. The aggregator does not send these queries to the leaves because reference tables are fully replicated to every aggregator and leaf.

Queries that involve sharded tables are more involved.

  • In the simplest case, the query involves only data on one partition, so that the query can be forwarded to the correct leaf unchanged except for rewriting the database name to reflect the partition. An example is INSERT INTO db.table VALUES (15). If the row maps to partition 3 of db, the query can be rewritten as INSERT INTO db_3.table VALUES (15) and forwarded to the correct leaf.

  • If the query involves data that spans more than one partition, however, then the aggregator combines results from many leaves. For example, SELECT COUNT(*) from t will send a COUNT(*) to each partition, sum up the responses, and return the final result as one row to the user.

Some queries involve significantly more query transformation and aggregation logic, but they follow the same general flow. Note that running EXPLAIN on a query will show the planned division of labor between aggregator and leaves, including the rewritten queries that will be sent to the leaves.

Data Distribution

MemSQL automatically shards data on distributed tables by hashing each row’s primary key (hash partitioning). Because each primary key is unique and the hash function is roughly uniform, the cluster is able to enforce a relatively even data distribution and minimize data skew.

At CREATE DATABASE time, MemSQL splits the database into a number of partitions. Each partition owns an equal portion of the hash range. You can specify the number of partitions explicitly with the PARTITIONS=X option to CREATE DATABASE. By default the total number of partitions is the value of the default_partitions_per_leaf engine variable times the number of leaves. MemSQL distributes partitions evenly among available leaves.

Each partition is implemented as a database on a leaf. When a sharded table is created, it is split according to the number of partitions of its encapsulating database. This table holds the partition’s slice of data. Secondary indexes are managed within each partition and currently MemSQL enforces that unique indexes are prefixed by the primary key. If you run a query that seeks on a secondary index, the aggregator will fan out the query across the cluster and, in parallel, each partition will employ its local secondary index.

Queries that match the shard key exactly (INSERT queries and selective UPDATE , DELETE , and SELECT queries) are routed to a single leaf. Otherwise, the aggregator sends the query across the cluster and aggregates results. You can use EXPLAIN on the aggregator to examine the generated query for the leaves and the query distribution strategy.

Availability Groups

An availability group is set of leaves which store data redundantly to ensure high availability. Each availability group contains a copy of every partition in the system - some as masters and some as replicas. Currently, MemSQL supports up to two availability groups. You can set the number of availability groups via the redundancy_level variable on the master aggregator. From this point forward, we’ll discuss the redundancy-2 case.

Each leaf in an availability group has a corresponding pair node in the other availability group. A leaf and its pair share the same set of partitions, but the masters are split evenly between them. In the event of a failure, MemSQL will automatically promote replica partitions on a leaf’s pair.

By default, the ADD LEAF command will add a leaf into the smaller of the two groups. However, if you know your cluster’s topology in advance, you can specify the group explicitly with the INTO GROUP N suffix. By grouping together machines that share resources like a network switch or power supply, you can isolate common hardware failures to a single group and dramatically improve the cluster’s uptime.

MemSQL automatically displays which availability group a leaf belongs to in the SHOW LEAVES command.