SingleStore DB is a distributed system. Because of this, all machines running SingleStore DB should be monitored to ensure smooth operation. This section demonstrates how to programmatically monitor a cluster.
Monitor all SingleStore nodes
Similar to the heartbeats used for intra-cluster communication, every SingleStore node should be pinged with:
SELECT 1;
Doing this once per minute is recommended.
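A minimal sketch of automating this check, assuming Python is available on the monitoring host. The `tcp_ping` helper below only verifies TCP reachability; a complete check should issue `SELECT 1` through a real SQL client. The throwaway local listener exists only to make the demo self-contained.

```python
import socket

def tcp_ping(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Demo against a throwaway local listener; in production, point this at each
# SingleStore node once per minute and follow up with a real SELECT 1.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))   # the OS picks a free port
listener.listen(1)
host, port = listener.getsockname()
print(tcp_ping(host, port))       # → True
listener.close()
```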
Monitor OS resources
If you are using third-party monitoring tools, make sure to monitor the following resources on each machine in the cluster:
- CPU Usage
- CPU Load Average
- Memory Usage
- Memory Paging (page ins, page outs)
- Disk Utilization
- Disk Queue Time
- Network Usage
- Dropped packets / TCP retransmits
Paging is a technique that Linux and other operating systems use to deal with high memory usage. If your system is consistently paging, you should add more memory capacity or you will experience severe performance degradation.
When the operating system predicts that it will require more memory than it has physically available, it will move infrequently accessed pages of memory out of RAM and onto the disk to make room for more frequently accessed memory. When this memory is used later by a process, the process must wait for the page to be read off disk and into RAM. If memory used by SingleStore DB is moved to disk, the latency of queries that access that memory will be substantially increased.
You can measure paging on the command line with the Linux tool vmstat. Run vmstat 1 and look at the swap section: the si column reports memory paged in from disk to RAM, and the so column reports memory paged out of RAM to disk.
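A sketch of extracting the si/so columns programmatically; the `swap_activity` helper and the sample vmstat output below are illustrative (real output varies by kernel and vmstat version):

```python
def swap_activity(vmstat_output: str) -> list[tuple[int, int]]:
    """Return (si, so) pairs from `vmstat 1` output: KiB/s paged in and out."""
    lines = vmstat_output.strip().splitlines()
    header = lines[1].split()              # second line names the columns
    si_col, so_col = header.index("si"), header.index("so")
    pairs = []
    for row in lines[2:]:                  # data rows follow the two header lines
        fields = row.split()
        pairs.append((int(fields[si_col]), int(fields[so_col])))
    return pairs

# Illustrative sample: first row idle, second row actively swapping.
sample = """procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 803228 218472 229512    0    0     5    11  120  240  3  1 95  1  0
 0  0  51200 102400 218472 229512  812  430    90   440  300  500  5  2 80 13  0"""

print(swap_activity(sample))  # → [(0, 0), (812, 430)]
```

Consistently nonzero si/so values are the signal that memory used by SingleStore DB may be getting moved to disk.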
Monitor cluster status through MV_CLUSTER_STATUS table
To see the status of the databases on your cluster, as well as information about the nodes in your cluster, query the information_schema.MV_CLUSTER_STATUS table from an aggregator. The same information is available through SHOW CLUSTER STATUS; however, querying the table has the advantage that you can filter it and join against it.
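For example, because it is a regular table, you can filter it instead of scanning SHOW CLUSTER STATUS output by eye. A sketch, using only columns from the table description below, that flags replica databases not currently in the replicating state:

```sql
-- Flag any replica database that is not in the "replicating" state.
SELECT NODE_ID, HOST, PORT, DATABASE_NAME, ROLE, STATE
FROM information_schema.MV_CLUSTER_STATUS
WHERE ROLE LIKE '%slave%'
  AND STATE <> 'replicating';
```

An empty result set from this query means all replicas are healthy.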
Table description
Field | Data Type (Size) | Description | Example Value |
---|---|---|---|
NODE_ID | bigint(10) | ID of node | 1 |
HOST | varchar(512) | Host of the node | 127.0.0.1 |
PORT | bigint(10) | The port of the node | 10000 |
DATABASE_NAME | varchar(512) | Name of database | vigilantia_0_AUTO_SLAVE |
ROLE | varchar(512) | Database’s role (e.g. orphan, master, slave, reference) | master |
STATE | varchar(256) | Database state | replicating |
POSITION | varchar(256) | Position in transaction log | 0:8832 |
MASTER_HOST | varchar(256) | Host of this node’s aggregator | 127.0.0.1 |
MASTER_PORT | bigint(10) | Port of this node’s aggregator | 10000 |
METADATA_MASTER_NODE_ID | bigint(10) | Master’s node ID expected by metadata | 1 |
METADATA_MASTER_HOST | varchar(256) | Master’s host expected by metadata | 127.0.0.1 |
METADATA_MASTER_PORT | bigint(10) | Master’s port expected by metadata | 3306 |
METADATA_ROLE | varchar(512) | Database’s role based on metadata | Orphan |
DETAILS | varchar(512) | Extra details | stage: packet wait, state: x_streaming, err: no |
When querying from the master aggregator, the value for any “Master” fields (e.g. MASTER_PORT, METADATA_MASTER_HOST, etc.) will be NULL.
Sample output
SELECT * FROM information_schema.MV_CLUSTER_STATUS;
+---------+-----------+-------+-------------------------+-------------+-------------+----------+-------------+-------------+-------------------------+----------------------+----------------------+---------------+-------------------------------------------------+
| NODE_ID | HOST | PORT | DATABASE_NAME | ROLE | STATE | POSITION | MASTER_HOST | MASTER_PORT | METADATA_MASTER_NODE_ID | METADATA_MASTER_HOST | METADATA_MASTER_PORT | METADATA_ROLE | DETAILS |
+---------+-----------+-------+-------------------------+-------------+-------------+----------+-------------+-------------+-------------------------+----------------------+----------------------+---------------+-------------------------------------------------+
| 1 | 127.0.0.1 | 10000 | cluster | master | online | 0:46 | NULL | NULL | NULL | NULL | NULL | Reference | |
| 1 | 127.0.0.1 | 10000 | monitoring | master | online | 0:8832 | NULL | NULL | NULL | NULL | NULL | Reference | |
| 1 | 127.0.0.1 | 10000 | vigilantia | master | online | 0:24616 | NULL | NULL | NULL | NULL | NULL | Reference | |
| 3 | 127.0.0.1 | 10001 | cluster | async slave | replicating | 0:45 | 127.0.0.1 | 10000 | 1 | 127.0.0.1 | 10000 | Reference | stage: packet wait, state: x_streaming, err: no |
| 3 | 127.0.0.1 | 10001 | monitoring | sync slave | replicating | 0:8832 | 127.0.0.1 | 10000 | 1 | 127.0.0.1 | 10000 | Reference | |
| 3 | 127.0.0.1 | 10001 | monitoring_0 | master | online | 0:58893 | NULL | NULL | NULL | NULL | NULL | Master | |
| 3 | 127.0.0.1 | 10001 | monitoring_0_AUTO_SLAVE | async slave | replicating | 0:58893 | 127.0.0.1 | 10001 | NULL | NULL | NULL | Orphan | |
| 3 | 127.0.0.1 | 10001 | monitoring_1 | master | online | 0:57439 | NULL | NULL | NULL | NULL | NULL | Master | |
| 3 | 127.0.0.1 | 10001 | monitoring_1_AUTO_SLAVE | async slave | replicating | 0:57439 | 127.0.0.1 | 10001 | NULL | NULL | NULL | Orphan | |
| 3 | 127.0.0.1 | 10001 | monitoring_2 | master | online | 0:49952 | NULL | NULL | NULL | NULL | NULL | Master | |
| 3 | 127.0.0.1 | 10001 | monitoring_2_AUTO_SLAVE | async slave | replicating | 0:49952 | 127.0.0.1 | 10001 | NULL | NULL | NULL | Orphan | |
| 3 | 127.0.0.1 | 10001 | vigilantia | sync slave | replicating | 0:24616 | 127.0.0.1 | 10000 | 1 | 127.0.0.1 | 10000 | Reference | |
| 3 | 127.0.0.1 | 10001 | vigilantia_0 | master | online | 0:25874 | NULL | NULL | NULL | NULL | NULL | Master | |
| 3 | 127.0.0.1 | 10001 | vigilantia_0_AUTO_SLAVE | async slave | replicating | 0:25874 | 127.0.0.1 | 10001 | NULL | NULL | NULL | Orphan | |
| 3 | 127.0.0.1 | 10001 | vigilantia_1 | master | online | 0:25874 | NULL | NULL | NULL | NULL | NULL | Master | |
| 3 | 127.0.0.1 | 10001 | vigilantia_1_AUTO_SLAVE | async slave | replicating | 0:25874 | 127.0.0.1 | 10001 | NULL | NULL | NULL | Orphan | |
| 3 | 127.0.0.1 | 10001 | vigilantia_2 | master | online | 0:25874 | NULL | NULL | NULL | NULL | NULL | Master | |
| 3 | 127.0.0.1 | 10001 | vigilantia_2_AUTO_SLAVE | async slave | replicating | 0:25874 | 127.0.0.1 | 10001 | NULL | NULL | NULL | Orphan | |
+---------+-----------+-------+-------------------------+-------------+-------------+----------+-------------+-------------+-------------------------+----------------------+----------------------+---------------+-------------------------------------------------+
Monitor cluster events through MV_EVENTS table
As another facet of monitoring the health of your cluster, the information_schema.MV_EVENTS table provides cluster-level event reporting that you can query against. Querying information_schema.MV_EVENTS returns events from the entire cluster and can only be done from an aggregator. To monitor events on an individual leaf, connect to that leaf and query the information_schema.LMV_EVENTS table, which has exactly the same structure.
Table description
Field | Data Type (Size) | Description | Example Value |
---|---|---|---|
ORIGIN_NODE_ID | bigint(4) | ID of node where the event happened. | 3 |
EVENT_TIME | timestamp | Timestamp when event occurred. | 2018-04-25 18:08:13 |
SEVERITY | varchar(512) | Severity of the event. Can be one of: NOTICE, WARNING, or ERROR. | NOTICE |
EVENT_TYPE | varchar(512) | Type of event that occurred. See the section below for more details. | NODE_ONLINE |
DETAILS | varchar(512) | Additional information about the event, in JSON format. | {"node":"172.18.0.2:3306"} |
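Since DETAILS is JSON, rows fetched from MV_EVENTS are straightforward to post-process in a monitoring script. A minimal sketch, assuming the rows have already been fetched into Python as (SEVERITY, EVENT_TYPE, DETAILS) tuples; the `alerts` helper and the sample rows are illustrative:

```python
import json

# Rows as they might be fetched from information_schema.MV_EVENTS:
# (SEVERITY, EVENT_TYPE, DETAILS). The values below are illustrative.
rows = [
    ("NOTICE",  "NODE_ONLINE",   '{"node":"127.0.0.1:10001"}'),
    ("WARNING", "NODE_DETACHED", '{"node":"127.0.0.1:10002"}'),
    ("ERROR",   "PARTITION_UNRECOVERABLE",
     '{"database":"db1_0", "reason":"Database couldn\'t commit transaction"}'),
]

def alerts(rows):
    """Yield (event_type, parsed_details) for WARNING and ERROR events only."""
    for severity, event_type, details in rows:
        if severity in ("WARNING", "ERROR"):
            yield event_type, json.loads(details)

for event_type, details in alerts(rows):
    print(event_type, details)
```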
Event type definitions
Node events
Event type | Description |
---|---|
NODE_ONLINE | A node has come online |
NODE_OFFLINE | A node has gone offline |
NODE_ATTACHING | A node is in the process of attaching |
NODE_DETACHED | A node has become detached |
Details output
Variable | Value | Description |
---|---|---|
node | "Hostname:port" | Address of node |
Rebalance events
Event type | Description |
---|---|
REBALANCE_STARTED | A partition rebalance has started |
REBALANCE_FINISHED | A partition rebalance has ended |
Details output
Variable | Value | Description |
---|---|---|
database | "database_name or (null)" | Database being rebalanced (truncated to 80 characters) |
user_initiated | "true/false" | Whether the rebalance was initiated by the user or by the cluster |
Replication events
Event type | Description |
---|---|
DATABASE_REPLICATION_START | A database has started replication |
DATABASE_REPLICATION_STOP | A database has stopped or paused replication |
Details output
Variable | Value | Description |
---|---|---|
local_database | "local_database_name" | The name of the database being replicated to |
remote_database | "remote_database_name" | The name of the database being replicated from |
Network status events
Event type | Description |
---|---|
NODE_UNREACHABLE | A node is unreachable from the master aggregator, either starting the grace period or going offline |
NODE_REACHABLE | A node is now reachable from the master aggregator, recovering within the grace period |
Details output
Variable | Value | Description |
---|---|---|
node | "Hostname:port" | Address of node |
message | "message about event" | For NODE_UNREACHABLE: describes which stage of unreachability the node is in |
grace_period_in_secs | "int" | The number of seconds the grace period is set to |
Backup/Restore events
Event type | Description |
---|---|
BACKUP_DB | A database has completed a BACKUP DATABASE command |
RESTORE_DB | A database has completed a RESTORE DATABASE command |
Details output
Variable | Value | Description |
---|---|---|
db | "database_name" | Name of the database being backed up |
type | "S3 or FS" | Backup destination: S3 or the filesystem |
backup_id | "unsigned int" | ID of the backup (backups only) |
Out of Memory Events
Event type | Description |
---|---|
MAX_MEMORY | Maximum server memory has been hit |
MAX_TABLE_MEMORY | A table has hit the maximum table memory value |
Details output
Variable | Value | Description |
---|---|---|
actual_memory_mb | "memory use in MB" | Current memory usage in MB |
maximum_memory_mb | "maximum memory in MB" | Value of the maximum_memory variable |
actual_table_memory_mb | "memory use in MB" | Memory use of the offending table |
maximum_table_memory_mb | "maximum table memory variable value" | Value of the maximum_table_memory variable |
memory_needed_for_redundancy | "memory needed in MB" | Memory needed to allow the requested redundancy to fit in memory |
Miscellaneous events
Event type | Description |
---|---|
NOTIFY_AGGREGATOR_PROMOTED | An aggregator has been promoted to master |
SYSTEM_VAR_CHANGED | A sensitive engine variable has been changed |
PARTITION_UNRECOVERABLE | A partition has been lost due to failure and can no longer be recovered |
Sensitive variables
- auto_attach
- leaf_failure_detection
- enable_background_merger
- enable_background_flusher
- columnstore_window_size
- internal_columnstore_window_minimum_blob_size
- sync_permissions
- max_connection_threads
See the list of engine variables for more information on these variables.
Details output
For NOTIFY_AGGREGATOR_PROMOTED: "{}"
For SYSTEM_VAR_CHANGED:
Variable | Value | Description |
---|---|---|
variable | "variable_name" (such as "max_connection_threads") | The name of the engine variable that was changed |
new_value | "new_value" (such as "256") | The new value of the engine variable |
For PARTITION_UNRECOVERABLE:
Variable | Value | Description |
---|---|---|
database | "db_name" | Name of the partition that is unrecoverable |
reason | "Database couldn't commit transaction" | Reason the partition became unrecoverable |
Examples
SELECT * FROM information_schema.MV_EVENTS;
+----------------+---------------------+----------+----------------------------+------------------------------------------------------------+
| ORIGIN_NODE_ID | EVENT_TIME | SEVERITY | EVENT_TYPE | DETAILS |
+----------------+---------------------+----------+----------------------------+------------------------------------------------------------+
| 2 | 2018-05-15 13:21:03 | NOTICE | NODE_ONLINE | {"node":"127.0.0.1:10001"} |
| 3 | 2018-05-15 13:21:05 | NOTICE | NODE_ONLINE | {"node":"127.0.0.1:10002"} |
| 1 | 2018-05-15 13:21:12 | NOTICE | REBALANCE_STARTED | {"database":"db1", "user_initiated":"true"} |
| 1 | 2018-05-15 13:21:12 | NOTICE | REBALANCE_FINISHED | {"database":"db1", "user_initiated":"true"} |
| 3 | 2018-05-15 13:21:15 | WARNING | NODE_DETACHED | {"node":"127.0.0.1:10002"} |
| 3 | 2018-05-15 13:21:16 | NOTICE | NODE_ATTACHING | {"node":"127.0.0.1:10002"} |
| 3 | 2018-05-15 13:21:22 | NOTICE | NODE_ONLINE | {"node":"127.0.0.1:10002"} |
| 2 | 2018-05-15 13:21:25 | WARNING | NODE_OFFLINE | {"node":"127.0.0.1:10001"} |
| 2 | 2018-05-15 13:21:29 | NOTICE | NODE_ATTACHING | {"node":"127.0.0.1:10001"} |
| 2 | 2018-05-15 13:21:30 | NOTICE | NODE_ONLINE | {"node":"127.0.0.1:10001"} |
| 1 | 2018-05-15 13:21:35 | NOTICE | DATABASE_REPLICATION_START | {"local_database":"db2", "remote_database":"db1"} |
| 1 | 2018-05-15 13:21:40 | NOTICE | DATABASE_REPLICATION_STOP | {"database":"db2"} |
| 2 | 2018-05-15 13:21:42 | WARNING | NODE_OFFLINE | {"node":"127.0.0.1:10001"} |
| 2 | 2018-05-15 13:21:47 | NOTICE | NODE_ATTACHING | {"node":"127.0.0.1:10001"} |
| 2 | 2018-05-15 13:21:48 | NOTICE | NODE_ONLINE | {"node":"127.0.0.1:10001"} |
| 3 | 2018-05-15 13:23:48 | NOTICE | REBALANCE_STARTED | {"database":"(null)", "user_initiated":"false"} |
| 3 | 2018-05-15 13:23:57 | NOTICE | REBALANCE_FINISHED | {"database":"(null)", "user_initiated":"false"} |
| 1 | 2018-05-15 13:23:57 | NOTICE | SYSTEM_VAR_CHANGED | {"variable": "leaf_failure_detection", "new_value": "off"} |
+----------------+---------------------+----------+----------------------------+------------------------------------------------------------+
SELECT * FROM information_schema.LMV_EVENTS;
+----------------+---------------------+----------+--------------------+------------------------------------------------------------+
| ORIGIN_NODE_ID | EVENT_TIME | SEVERITY | EVENT_TYPE | DETAILS |
+----------------+---------------------+----------+--------------------+------------------------------------------------------------+
| 1 | 2018-06-28 11:56:09 | NOTICE | SYSTEM_VAR_CHANGED | {"variable": "max_connection_threads", "new_value": "256"} |
| 1 | 2018-06-28 11:56:11 | NOTICE | NODE_STARTING | {} |
| 1 | 2018-06-28 11:56:47 | NOTICE | NODE_ONLINE | {"node":"127.0.0.1:10001"} |
| 1 | 2018-06-28 11:56:47 | NOTICE | LEAF_ADD | {"node":"127.0.0.1:10001"} |
| 1 | 2018-06-28 17:42:28 | NOTICE | LEAF_REMOVE | {"node":"127.0.0.1:10001"} |
| 1 | 2018-06-28 17:42:37 | NOTICE | NODE_ONLINE | {"node":"127.0.0.1:10001"} |
| 1 | 2018-06-28 17:42:37 | NOTICE | LEAF_ADD | {"node":"127.0.0.1:10001"} |
+----------------+---------------------+----------+--------------------+------------------------------------------------------------+
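For periodic alerting, a query like the following sketch surfaces recent WARNING and ERROR events; the one-hour window is an arbitrary choice to adjust to your polling interval:

```sql
-- Recent non-NOTICE events, newest first.
SELECT EVENT_TIME, ORIGIN_NODE_ID, EVENT_TYPE, DETAILS
FROM information_schema.MV_EVENTS
WHERE SEVERITY IN ('WARNING', 'ERROR')
  AND EVENT_TIME > NOW() - INTERVAL 1 HOUR
ORDER BY EVENT_TIME DESC;
```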