Next Steps


View the Dashboards

When all cluster monitoring components are installed, configured, and running, the Grafana dashboards can be used to monitor cluster health over time.

Identify Trends

Each dashboard provides insights that can be used to identify trends that may require intervention, including:

Active Session History

| Chart Name | What it shows | When to use it |
| --- | --- | --- |
| Active Session History | The activities running on the cluster and their respective resource usage (including CPU, memory, and network) | To view queries that are currently running, and that have been run, on the cluster. To view current and past session wait events to identify databases and activities that consume considerable resources, including why queries are running slowly and where a cluster is resource-constrained. |

Activity History

| Chart Name | What it shows | When to use it |
| --- | --- | --- |
| Execution History | The number of times a given query shape was executed over time | To view query statistics over time. To determine the number of times a given query has been run. To identify if the performance and resource usage of a single activity has regressed over time. |
| Average Time | The average time spent waiting for resources for a given query over time (milliseconds) | To compare the time a query spends waiting for resources against its history, and to understand whether it is performing similarly to, or differently from, previous executions |
| Average Resource Use | The average resource use for a given query, including disk bytes, memory bytes, memory bytes/second, and network bytes/second | To compare a query’s resource usage against its history, and to understand whether it is using resources similarly to, or differently from, previous executions |

Detailed Cluster View

| Chart Name | What it shows | When to use it |
| --- | --- | --- |
| Database CPU Breakdown | The CPU cycles spent by each activity, grouped by database | To identify which databases incur the most CPU usage. Note: A blank database indicates system activity, which is not related to a user database. |
| Query Rate | The number of reads/writes per second of the queries running on the system | To understand typical (“normal”) cluster activity in order to benchmark workloads and their query rates, and to identify anomalies in the read/write workload |
| Rows Read or Written | The number of rows read/written | To understand typical (“normal”) cluster activity in order to benchmark workload read/write row counts, and to identify anomalies in the number of rows read/written |
| SysInfo CPU | The percentage of the host’s CPU that is being used | To understand CPU usage and host resource usage in general, or for a given workload. To identify if any non-SingleStore DB activity is affecting a host’s CPU. |
| SysInfo Memory | The percentage of the host’s memory that is being used | To understand host memory usage for a given workload over time. To identify if any non-SingleStore DB activity is affecting a host’s memory. |
| SysInfo Network | The network bytes sent and received | To understand network usage for a given workload and identify bottlenecks. To identify if any non-SingleStore DB activity is affecting a host’s network. |
| memsqld Memory | A summary of host memory usage | To view cross-sections of memory usage within the cluster and identify anomalous memory use |
| Workload Management Queries | The queries running and their states if affected by workload management | To understand the current cluster load and identify high-workload issues that are affecting the cluster’s ability to process queries |
| Cluster Events | The count and output of the warnings and errors on the cluster (as per mv_events) | To recognize cluster health issues by reviewing the number of events. To drill into cluster events to identify and understand what the issues are. |

Memory Usage

| Chart Name | What it shows | When to use it |
| --- | --- | --- |
| Used vs. Total Limit | The memory in use compared to the total memory available (megabytes) | To perform capacity planning for memory. To identify if the cluster is not performing optimally due to a shortage of memory. |
| Query Memory vs. Total Limit | The query memory in use compared to the total memory available (megabytes) | To perform capacity planning for workloads. To identify if workloads in general, or workload spikes in particular, are putting the cluster at risk of running out of memory. |
| Data Memory Used vs. Total Limit | The data memory in use versus the total memory available (megabytes) | To perform capacity planning for data memory. To identify if given write workloads are putting the cluster at risk of running out of memory. |
| Internal Memory Allocators vs. Limit | The memory used by SingleStore DB memory allocators (megabytes) | To identify why memory allocations have increased, or are anomalously large, when there are no other indicators of increased memory use, such as workload or data. To discover where memory is allocated (table, query, etc.). |
| Detailed Breakout of Memory Allocators vs. Limit | The memory used by extended SingleStore DB memory allocators (megabytes) | To identify if any memory allocations have increased, or are anomalously large, when there are no other indicators of increased memory use. To discover where memory is allocated (table, query, etc.). |

SingleStore DB Status & Change

| Chart Name | What it shows | When to use it |
| --- | --- | --- |
| Show Status and Change (two charts) | The values and changes in SHOW STATUS (sizing units depend on the variable) | To identify if any anomalous changes have occurred to SingleStore DB status variables |

SingleStore DB Variables & Change

| Chart Name | What it shows | When to use it |
| --- | --- | --- |
| Show Variables and Change (two charts) | The values and changes to SingleStore DB variables (sizing units depend on the variable) | To view changes to SingleStore DB engine variables over time to identify if any anomalous changes have occurred |

Information Schema View

| Chart Name | What it shows | When to use it |
| --- | --- | --- |
| Table Statistics | The row counts for tables across schemas | To identify anomalies in table sizes in general, and workloads in particular |

Node Metrics Breakout

| Chart Name | What it shows | When to use it |
| --- | --- | --- |
| CPU Utilization | The System Info CPU utilization (percent) | To view a host’s CPU utilization and hardware health, and to identify if processes outside of SingleStore DB could be affecting them |
| Filesystem | The filesystem usage (bytes) | To view host-level filesystem usage and identify if processes outside of SingleStore DB could be affecting it |
| Network Rate | The System Info network rate (bytes) | To view host-level network usage and identify if processes outside of SingleStore DB could be affecting it |
| Memory Bytes | The System Info memory usage (bytes) | To view host-level and cgroup memory usage, and to identify if processes outside of SingleStore DB could be affecting them |

Node Metrics Drilldown

| Chart Name | What it shows | When to use it |
| --- | --- | --- |
| Node Metrics Drilldown | The memsql_exporter execution details | To determine if the exporter process is running efficiently and/or to identify lags in data collection |

Related Resources

Troubleshoot Your Monitoring Setup

Pipelines

Check the Monitoring Tables for Data

  1. Connect to the database.

  2. Run the following SQL. The default database name is metrics. If your database name is different from the default name, replace metrics with your database name.

    use metrics;
    select * from metrics limit 10;
    

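    Optional: List the other monitoring tables and run the same query against each one, replacing <table-name> with the table’s name.

    show tables;
    select * from <table-name> limit 10;
    

    If these queries return an empty set, review the monitoring pipelines and the pipelines errors table using the following steps.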

  3. Review the monitoring pipelines.

    show pipelines;
    
  4. If a monitoring pipeline (with a name resembling *_metrics or *_blobs) is in a state other than running, start the pipeline.

    START PIPELINE <pipeline-name>;
    
  5. Check the information_schema.pipelines_errors table for errors.

    select * from information_schema.pipelines_errors;
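
The checks in the steps above can be collected into one script so they are easy to rerun. The following is a minimal sketch, not part of the monitoring tooling: the function name monitoring_checks_sql is hypothetical, and it assumes the monitoring tables live in the database passed as the first argument (defaulting to metrics). It only prints SQL, so nothing is run against the cluster until the output is piped to a client.

```shell
#!/bin/sh
# Hypothetical helper: print the diagnostic SQL from the steps above.
# Usage: monitoring_checks_sql [database-name]   (defaults to "metrics")
monitoring_checks_sql() {
    db="${1:-metrics}"
    # One statement per line: sample the metrics table, list per-table row
    # counts as reported by information_schema, show pipeline states, and
    # show pipeline errors recorded for this database.
    cat <<EOF
use ${db};
select * from metrics limit 10;
select table_name, table_rows from information_schema.tables where table_schema = '${db}';
show pipelines;
select * from information_schema.pipelines_errors where database_name = '${db}';
EOF
}
```

One way to use it, assuming a MySQL-compatible command-line client is installed and that <host> and <user> are replaced with your own values, is `monitoring_checks_sql metrics | mysql -h <host> -P 3306 -u <user> -p`. An empty result from the first two queries together with rows in the last one points at a pipeline problem rather than an empty cluster.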
    

Resolve Pipeline Errors

If you receive a "Cannot extract data for the pipeline" error in the information_schema.pipelines_errors table, perform the following steps.

  1. Confirm that port 9104 is accessible from all hosts in the cluster. This is the default port used for monitoring. To test this, run the following command at the Linux command line and review the output.

    curl http://<endpoint>:9104/cluster-metrics
    

    For example:

    curl http://192.168.1.100:9104/cluster-metrics
    
  2. If the hostname of the Master Aggregator is localhost, and a pipeline was created using localhost, recreate the pipeline using the Master Aggregator host’s IP address. For example:

    metrics pipeline:

    create or replace pipeline `metrics` as load data prometheus_exporter 
    "http://<host-ip-address>:9104/cluster-metrics" 
    config '{"is_memsql_internal":true}' 
    into procedure `load_metrics` format json;
    
    start pipeline if not running metrics;
    

    blobs pipeline:

    create or replace pipeline `blobs` as load data prometheus_exporter 
    "http://<host-ip-address>:9104/samples" 
    config '{"is_memsql_internal":true, "download_type":"samples"}' 
    into procedure `load_blobs` format json;
    
    start pipeline if not running blobs;
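
The reachability test in step 1 has to be repeated on every host in the cluster, so it can help to wrap it in a small script. This is a sketch under stated assumptions: the function names exporter_url, check_exporter, and check_endpoints are hypothetical, the 5-second timeout is an arbitrary choice, and the two paths checked (cluster-metrics and samples) are the ones used by the pipeline definitions above.

```shell
#!/bin/sh
# Hypothetical helper: build the exporter URL for a host.
# Port and path default to the values used in this section.
exporter_url() {
    host="$1"
    port="${2:-9104}"
    path="${3:-cluster-metrics}"
    printf 'http://%s:%s/%s\n' "$host" "$port" "$path"
}

# Return 0 if the endpoint answers with an HTTP success status within
# 5 seconds (the timeout is an assumption, adjust as needed).
check_exporter() {
    curl --silent --fail --max-time 5 --output /dev/null "$(exporter_url "$@")"
}

# Check both endpoints that the monitoring pipelines read from.
check_endpoints() {
    for p in cluster-metrics samples; do
        if check_exporter "$1" 9104 "$p"; then
            echo "$p: reachable"
        else
            echo "$p: NOT reachable"
        fi
    done
}
```

Run `check_endpoints <host-ip-address>` on each host in the cluster, using the same address that the pipelines are created with. If any host reports an endpoint as not reachable, resolve the network or firewall issue before recreating the pipelines.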