Outdated Version

You are viewing an older version of this section. View current production version.

Connecting to Dremio

Dremio is an open source project that enables business analysts and data scientists to explore and analyze any data at any time, regardless of its location, size, or structure. Dremio combines a scale-out architecture with columnar execution and acceleration to achieve interactive performance on any data volume.

This topic describes how to connect Dremio to MemSQL using the MemSQL Connector.

The certification matrix below shows the recommended versions for each product or component:

Certification Matrix Versions
Dremio 4.0
MemSQL 6.7
MariaDB JDBC Driver 2.3.0

Prerequisites

Following prerequisites need to be satisfied before making the connection:

  1. Prior to configuring Dremio to connect to MemSQL, download and install Dremio.

  2. Download the MariaDB JDBC driver: MariaDB Java Connector 2.3.0

  3. Make sure the MemSQL Cluster is running on the server.

  4. Download the MemSQL connector for Dremio.

Configuring JARS and running Dremio

  1. Log in to the linux machine. Place the MariaDB 2.3.0 JDBC driver in the following directory:

    /opt/dremio/jars/3rdparty/mariadb-java-client-2.3.0.jar

  2. Place the MemSQL connector in the jars directory:

    /opt/dremio/jars/dremio-memsqlarp-plugin.jar

  3. Start the dremio service:

    sudo/opt/dremio/bin/dremio start

  4. Open a browser and connect to Dremio running on the linux server:

    http://<IP Address of the server running the Dremio service>:9047/

Connecting Dremio to MemSQL

  1. Open the following URL: http://<IP Address of the server running the Dremio service>:9047/

  2. Enter a username and password to log in.

  3. Click Sources on the left hand side of the page.

  4. Select memSQL in the Select New Source Type pop-up.

  5. In the New Source window provide the following information:

    • Name of the connection
    • Host: IP address of the MemSQL server
    • Database: Name of the database to connect with
    • Username and Password: User ID and password of the MemSQL database

    Click Save.

  6. The new MemSQL connection should be available under Sources.

Data Reflection and Query Acceleration

Data Reflection

Dremio maintains physically optimized representations of source data known as Data Reflections. The query optimizer can accelerate a query by utilizing one or more Data Reflections to partially or entirely satisfy that query, rather than processing the raw data in the underlying data source.

Dremio supports two fundamental types of Data Reflections: Raw Reflections and Aggregation Reflections. Many of the options for configuring and managing both types of Data Reflections are the same, but they each optimize different types of query patterns.

Raw Reflections

Raw Reflections preserve row-level fidelity of the anchor dataset. A Raw Reflection includes one or more fields from the anchor dataset, and is sorted and partitioned by specific fields in the dataset retrieved from the database on Dremio. You can use Raw Reflections to perform a number of optimizations.

Aggregation Reflections

Aggregation Reflections maintain summary data about the anchor dataset. An Aggregation Reflection includes one or more dimensions and measures fields from the anchor dataset, sorted, partitioned and distributed by specified columns. Some of these columns are configured as dimensions that will be used in GROUP BY (or DISTINCT) statements, and other columns are configured as measures that will be used in calculations such as MAX, MIN, AVG, SUM, and COUNT.

Query Acceleration

Dremio uses Data Reflection for query acceleration.

When Dremio receives a user query, it first determines whether any Data Reflections have at least one physical dataset in common with the query after both have undergone dataset expansion. All Data Reflections that pass this step are then evaluated to determine if they cover the query.

For Data Reflections that cover the query, Dremio will determine the cost of using the Data Reflection to execute the query. These costs are then compared to the cost of executing the query against the physical datasets, and the lowest cost query plan is selected for physical plan generation. Typically using one or more Data Reflections will be less expensive than executing the query against the raw physical data.

Best practices guide

The following sections represent best practices for administering and operating a Dremio cluster.

Think in Terms of Several Discrete Data Reflections

Data Reflections allow administrators to be iterative in their approach to performance optimization. Because Data Reflections do not require any change in the behavior of the Data Consumer, administrators can add and refine Data Reflections on an ongoing basis with little to no impact to ongoing workloads.

Optimize data reflection

To determine the optimal set of Data Reflections, Administrators should isolate known query patterns into groups that do not interact with one another. Having more discrete query groups means:

  • Smaller reflections on disk.
  • More efficient Data Reflection maintenance can be performed.
  • Queries can be executed more efficiently.

Keep in mind that a single query can use multiple Data Reflections and a single Data Reflection can serve many queries.

Accelerate a query pattern

Dremio supports two types of Data Reflections: Raw Reflections and Aggregation Reflections. If a known query pattern returns row-level information, Raw Reflections are appropriate. If the query returns summarized data based on GROUP BY expressions or aggregations (e.g., SUM, AVG, COUNT, MIN, MAX), then an Aggregation Reflection is appropriate.

Aggregation

Dremio can pre-aggregate data at multiple levels of granularity. Then, at query time, Dremio can determine how to further aggregate the data as appropriate. Administrators can create Aggregation Reflections that include the lowest level granularity as well as the most coarse granularity, and Dremio will automatically aggregate at the appropriate level at query time.

Calculated Fields

For calculated fields that are frequently used by Data Consumers, administrators have a few different options for accelerating these calculations:

  • Add the calculated field to a virtual dataset - The administrator can add a new column that provides the calculation. Depending on the expression, Dremio may be able to match the new column without making the Data Consumers explicitly use the new column. Otherwise, they will need to include the new column in their queries.
  • Use a Supporting Anchor Dataset - The administrator can create a Supporting Anchor Dataset that includes the calculated field along with other fields from the dataset, and Dremio will automatically use the associated Data Reflection to accelerate the query.