Outdated Version

You are viewing an older version of this section. View current production version.

SELECT

Retrieves data from a table.

Syntax

[with_clause]
SELECT
    [ALL | DISTINCT | DISTINCTROW]
    select_expr [[AS] alias], ...
    [ FROM table_references
          [WHERE expr]
          [GROUP BY {{col_name | expr | position}, ...
            | extended_grouping_expr}]
          [HAVING expr]
          [ORDER BY {col_name | expr | position} [ASC | DESC], ...]
          [LIMIT {[offset,] row_count | row_count OFFSET offset}]
          [FOR UPDATE]
    ]
    [{  INTO OUTFILE 'file_name' |
        INTO S3 bucket/target CONFIG configuration_json CREDENTIALS credentials_json
     }
        [format_options]
    ]

format_options:
        [{FIELDS | COLUMNS}
          [TERMINATED BY 'string']
          [[OPTIONALLY] ENCLOSED BY 'char']
          [ESCAPED BY 'char']
        ]
        [LINES
          [STARTING BY 'string']
          [TERMINATED BY 'string']
        ]

table_references:
         table_factor | join_table

table_factor:
           tbl_name [[AS] alias] [sample_ratio_clause]
         | (subquery) [AS] alias

Remarks

  • The join_table clause is documented in the JOIN and Subqueries section below.
  • format_options clauses are documented in the Format Options section below.
  • sample_ratio_clause is documented in the SELECT … WITH (SAMPLE_RATIO = <value>) section below.
  • with_clause is documented on the WITH Common Table Expressions page.
  • If you are querying against a columnstore table with a FULLTEXT index defined, see MATCH and HIGHLIGHT for syntax and examples on how to do full text search queries.
  • Non-aggregated data can also be transformed into a pivot table output format. See PIVOT for syntax and examples.
  • extended_grouping_expr clauses include CUBE and ROLLUP. See CUBE and ROLLUP for syntax and examples.

Aggregations for Expression Syntax

MemSQL supports the following aggregations for expression syntax in SELECT statements:

COUNT(*)                        -- This returns the count of rows matching the query
COUNT(expr)                     -- This returns the count of rows matching the query where the expression is non-NULL
COUNT(DISTINCT expr)            -- This returns the count over distinct non-NULL values of the expression
APPROX_COUNT_DISTINCT(expr)     -- This returns an approximation of the number of distinct values of the expression
SUM(expr)                       -- This returns the sum value of the expression
SUM(DISTINCT expr)              -- This returns the sum of the distinct values of the expression
AVG(expr)                       -- This returns the average value of the expression
AVG(DISTINCT expr)              -- This returns the average of the distinct values of the expression
MAX(expr)                       -- This returns the maximum value of the expression
MIN(expr)                       -- This returns the minimum value of the expression
STDDEV_POP(expr)                -- This is the population standard deviation
STD(expr)                       -- This is a synonym for STDDEV_POP
STDDEV(expr)                    -- This is a synonym for STDDEV_POP
STDDEV_SAMP(expr)               -- This is the sample standard deviation
VAR_POP(expr)                   -- This is the population variance
VARIANCE(expr)                  -- This is a synonym for VAR_POP
VAR_SAMP(expr)                  -- This is the sample variance
GROUP_CONCAT(expr)              -- This concatenates the non-NULL values of the expression into a string
GROUP_CONCAT(DISTINCT expr)     -- This concatenates the distinct non-NULL values of the expression into a string

All aggregate functions exclude NULLs from their computations. For example, SELECT COUNT(c) FROM t returns the number of rows in t where c is not NULL. SELECT AVG(c) FROM t returns the average of the non-NULL c’s in t.

COUNT(*) is equivalent to COUNT(1). It counts every row matching the query, even rows whose column data contain nothing but NULLs.

SELECT … LIMIT

Syntax

LIMIT {[offset,] row_count | row_count OFFSET offset}

Remarks

  • The LIMIT clause constrains the number of rows returned by the SELECT statement.
  • Both the arguments must be non-negative integer constants.
  • The row_count specifies the number of rows to return from the beginning of the result set, and the offset specifies the offset of the first row to return.
  • The offset of the first row in a table is 0 (not 1).

Examples

SELECT * FROM hrRec;
****
+-----------+-----------+----------+--------+
| FirstName | LastName  | City     | Tenure |
+-----------+-----------+----------+--------+
| Adam      | Gale      | Brooklyn |     40 |
| Samantha  | Beck      | New York |     44 |
| Clara     | Wakefield | DC       |     24 |
| Todd      | Bridges   | DC       |     30 |
| Ron       | Fletcher  | New York |     23 |
+-----------+-----------+----------+--------+
SELECT * FROM hrRec LIMIT 2;
****
+-----------+----------+----------+--------+
| FirstName | LastName | City     | Tenure |
+-----------+----------+----------+--------+
| Adam      | Gale     | Brooklyn |     40 |
| Samantha  | Beck     | New York |     44 |
+-----------+----------+----------+--------+
SELECT * FROM hrRec LIMIT 1,2;
****
+-----------+-----------+----------+--------+
| FirstName | LastName  | City     | Tenure |
+-----------+-----------+----------+--------+
| Samantha  | Beck      | New York |     44 |
| Clara     | Wakefield | DC       |     24 |
+-----------+-----------+----------+--------+
SELECT * FROM hrRec LIMIT 2 OFFSET 1;
****
+-----------+-----------+----------+--------+
| FirstName | LastName  | City     | Tenure |
+-----------+-----------+----------+--------+
| Samantha  | Beck      | New York |     44 |
| Clara     | Wakefield | DC       |     24 |
+-----------+-----------+----------+--------+

SELECT … FOR UPDATE

The SELECT ... FOR UPDATE command is intended to be used inside of a transaction. This command takes write locks on the rows returned from the SELECT query and the locks are held until the end of the transaction. Other transactions are blocked and cannot write to the locked rows until the locks are released.

SELECT ... FOR UPDATE can only be used in the outermost SELECT block; it cannot be used in sub-select queries.

We recommend the following when using SELECT ... FOR UPDATE:

  • It is recommended to commit or abort the transaction immediately so that the locks are released. Too many locks in a transaction can result in a huge queue of transactions waiting on the locked rows.
  • To limit the number of rows that are locked, it is recommended to use a column with unique values in the WHERE clause of the SELECT statement, for example the PRIMARY KEY column.

Example

The following example uses the Orders table:

DESCRIBE Orders;
****
+--------------+-------------+------+------+---------+-------+
| Field        | Type        | Null | Key  | Default | Extra |
+--------------+-------------+------+------+---------+-------+
| OrderNumber  | int(11)     | NO   | PRI  | NULL    |       |
| Status       | varchar(20) | NO   |      | NULL    |       |
| CustomerName | char(20)    | YES  |      | NULL    |       |
+--------------+-------------+------+------+---------+-------+
SELECT * FROM Orders;
****
+-------------+------------+--------------+
| OrderNumber | Status     | CustomerName |
+-------------+------------+--------------+
|           1 | Delivered  | John         |
|           3 | In Transit | Bon          |
|           2 | Delivered  | Kerry        |
|           4 | Delivered  | Tom          |
+-------------+------------+--------------+

The following transaction locks the row where OrderNumber is 3, using the FOR UPDATE clause. The row will remain locked as long as the transaction is open.

BEGIN;
SELECT * FROM Orders WHERE OrderNumber = 3 FOR UPDATE;
****
+-------------+------------+--------------+
| OrderNumber | Status     | CustomerName |
+-------------+------------+--------------+
|           3 | In Transit | Bon          |
+-------------+------------+--------------+

Now, execute the following query in a different connection:

UPDATE Orders SET Status = "Delivered" WHERE OrderNumber=3;
****
ERROR 1205 (HY000): Leaf Error (127.0.0.1:3307): Lock wait timeout exceeded; try restarting transaction.  Lock owned by connection id 77, query `open idle transaction`

The above query returns an error since the rows are locked by the previous transaction. To remove the lock on the rows, commit or rollback the open transaction. See the Query Errors topic for more information on locked rows.

JOIN and Subqueries

Syntax

MemSQL supports the following JOIN and subquery syntax for the table_reference part of SELECT statements:

join_table:
           table_reference {LEFT | RIGHT | FULL} [OUTER] JOIN table_factor join_condition
         | table_reference [INNER | CROSS] JOIN table_factor [join_condition]
         | table_reference NATURAL {LEFT | RIGHT} [OUTER] JOIN table_factor
         | table_reference STRAIGHT_JOIN table_factor [join_condition]
         | table_reference, table_factor

join_condition:
         ON conditional_expr

Remarks

  • This command must be run on the master aggregator or a child aggregator node (see Node Requirements for MemSQL Commands)

  • STRAIGHT_JOIN forces tables to be joined in the order in which they are listed in the FROM clause

  • FULL OUTER JOIN requires the join condition to be an equality. Also, the right argument to a full outer join cannot be a reference table.

  • MemSQL doesn’t yet support nested joins. However, in many cases you can rewrite queries that seem to require a nested join without one. For example, the following query is not supported:

Nested join: not allowed

... FROM tableA LEFT JOIN (tableB LEFT JOIN tableC on tableB.col = tableC.col) ON tableA.col = tableB.col

The previous query can be rewritten without a nested join in the following way:

Rewritten as two left joins

... FROM tableA LEFT JOIN tableB ON tableA.col = tableB.col LEFT JOIN tableC ON tableB.col = tableC.col

Examples

memsql> SELECT * FROM my_MemSQL_table WHERE col = 1;


memsql> SELECT COUNT(*), user_name, page_url from clicks, users, pages
 ->     WHERE clicks.user_id = users.user_id AND pages.page_id = clicks.page_id
 ->     GROUP BY users.user_id, pages.page_id
 ->     ORDER BY COUNT(*) DESC;
 +- ---------+- ----------+- -------------------------------+
 | COUNT(*) | user_name | page_url                       |
 +- ---------+- ----------+- -------------------------------+
 |        5 | jake      | memsql.com                     |
 |        2 | john      | http://www.memsql.com/download |
 |        1 | jake      | docs.memsql.com                |
 |        1 | jake      | memsql.com                     |
 |        1 | jake      | http://www.memsql.com/download |
 +- ---------+- ----------+- -------------------------------+
 5 rows in set (0.00 sec)
SELECT t1.*, t2.* FROM t1 FULL OUTER JOIN t2 ON t1.a = t2.a;

SELECT … INTO OUTFILE

SELECT ... INTO OUTFILE formats and writes the results of a SELECT query to a text file. The format_options are very similar to the parsing options used with LOAD DATA. See the Format Options section below for more information.

format_options:
    [{FIELDS | COLUMNS}
        [TERMINATED BY 'string']
        [[OPTIONALLY] ENCLOSED BY 'char']
        [ESCAPED BY 'char']
    ]
    [LINES
        [STARTING BY 'string']
        [TERMINATED BY 'string']
    ]

Remarks

  • If a relative path that that does not start with / is specified, MemSQL writes the file to the directory specified in the global variable tmpdir (the same as datadir by default). To specify another location, enter the absolute path to the file as the file_name parameter.
  • The default text formatting, used when the user omits the FIELDS and LINES blocks, is to separate values with tabs (\t) and rows with newlines (\n).

Example

The following query writes the contents of table_name to a file in the home directory of username. The file is formatted as a comma-separated values (CSV) file.

memsql> SELECT * FROM table_name INTO OUTFILE '/home/username/file_name.csv'
        FIELDS TERMINATED BY ','
        LINES TERMINATED BY '\n'

SELECT … INTO S3

SELECT ... INTO S3 performs a distributed select into a S3 bucket.

configuration_json:
'{"region":"your_region"}'

credentials_json:
'{"aws_access_key_id": "replace_with_your_access_key_id",
  "aws_secret_access_key": "replace_with_your_secret_access_key",
  ["aws_session_token": "replace_with_your_temp_session_token",]
  ["role_arn":"replace_with_your_role_arn"]
}'

The format_options clause, if specified, must be included after the CONFIG and CREDENTIALS clauses. The format_options are similar to the parsing options used with LOAD DATA. See the Format Options section below for more information.

format_options:
    [{FIELDS | COLUMNS}
        [TERMINATED BY 'string']
        [[OPTIONALLY] ENCLOSED BY 'char']
        [ESCAPED BY 'char']
    ]
    [LINES
        [STARTING BY 'string']
        [TERMINATED BY 'string']
    ]

Remarks

Warning

The S3 bucket needs to be created before running this command.

The output of SELECT ... INTO S3 is stored with the content type binary/octet-stream in the S3 bucket.

If the insert select logic determines that the SELECT ... INTO S3 query can be run as distributed, the query will be pushed down to each leaf and executed on each partition. The name of each object will be:

<bucket/target>_<partition ID>

If the insert select logic determines that the SELECT ... INTO S3 query can only be on the aggregator because it contains aggregated, ORDER BY, GROUP BY, etc. then the query will be run on each leaf but the result will be collected on the aggregator and then output to S3. The object name will just be:

<bucket/target>

The SELECT query will validate if the <bucket/target> or <bucket/target>_ already exists on the S3 bucket first and fail if any of the object(s) already exist.

Examples

The following simple example shows how to save all rows in table t1 to an S3 bucket using an AWS access key.

SELECT *
FROM t1
INTO S3 'testing/output'
CONFIG '{"region":"us-east-1"}'
CREDENTIALS '{"aws_access_key_id":"your_access_key_id","aws_secret_access_key":"your_secret_access_key"}'

The following example saves the result set of a SELECT query with an GROUP BY clause to an S3 bucket using an Amazon Resource Name (ARN) for AWS Identity and Access Management (IAM).

SELECT t1.a, t2.a
FROM t1, t2
WHERE t1.a = t2.a
GROUP BY t1.a
INTO S3 'bucket_name/file_name'
CONFIG '{"region":"us-east-1"}'
CREDENTIALS '{"role_arn": "arn:aws:iam::<AccountID>:role/EC2AmazonS3FullAccess"}'

The following example uses the format options to output the data in CSV format.

SELECT *
FROM t
INTO S3 'tmp/a'
CONFIG '{"region":"us-east-1"}'
CREDENTIALS '{"aws_access_key_id":"your_access_key_id","aws_secret_access_key":"your_secret_access_key"}'
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'

Format Options

[{FIELDS | COLUMNS}
    [TERMINATED BY 'string']
    [[OPTIONALLY] ENCLOSED BY 'char']
    [ESCAPED BY 'char']
]
[LINES
    [TERMINATED BY 'string']
    [STARTING BY 'string']
]
  • FIELDS TERMINATED BY 'string': The delimiter to use between each column value in the message. If not specified, defaults to \t.
  • FIELDS ENCLOSED BY 'char': Prefixes and suffixes each column value of the message with the ‘char’ value. If not specified, defaults to nothing. The OPTIONALLY keyword does not affect the behavior of this option; it exists to maintain compatibility with MySQL.
  • FIELDS ESCAPED BY 'char': Translates the escape sequence in the ‘char’ value, instead of using the literal ‘char’ value. Translation is done for all occurences of the ‘char’ value in the message. If not specified, defaults to \\.
  • LINES TERMINATED BY 'string': A string to insert at the end of the message. If not specified, defaults to \n.
  • LINES STARTING BY 'string': A string to insert at the beginning of the message. If not specified, defaults to nothing.

SELECT … WITH (SAMPLE_RATIO = <value>)

SELECT ... WITH (SAMPLE_RATIO = <value>) retrieves a random sample of data from a table (specifically, a Bernoulli sample).

sample_ratio_clause: WITH (SAMPLE_RATIO = ratio)

Argument

ratio

A decimal between 0.0 and 1.0 specifying the probability that each row will be retrieved from the table. The probability is applied to each row independently.

Remarks

  • When you run multiple SELECT statements against the same table using WITH (SAMPLE_RATIO = <value>), you will receive similar results. This is the case even when your sample ratio is different in your SELECT statements. For example, the sample returned for a sample ratio of 0.25 is often a subset of the sample ratio returned for 0.5.
  • Using a sample ratio that is one divided by a power of two allows WITH (SAMPLE_RATIO = <value>) to run more efficiently.
  • You can use WITH (SAMPLE_RATIO = <value>) in a SELECT subquery.
Warning

You can use WITH (SAMPLE_RATIO = <value>) with any table. It operates most efficiently when used with a rowstore table whose primary key is not a hash key. Otherwise, it will scan all rows, or all rows in a range, to do sampling.

Examples

Simple SELECT statements

The following example populates a table with the integers 1 to 10 and selects two random samples.

DROP DATABASE IF EXISTS memsql_docs_example;
CREATE DATABASE memsql_docs_example;
USE memsql_docs_example;
CREATE TABLE example_table(num INT);

INSERT INTO example_table VALUES (5), (6), (10), (11), (12),
  (14), (20), (24), (33), (40);

The query below returns a random sample of the num column. Each row in the sample has a seventy percent probability of being returned.

SELECT num FROM example_table WITH (SAMPLE_RATIO = 0.7) ORDER BY num;

Example output follows. You will likely see different results, as the query returns a random sample.

+------+
| num  |
+------+
|    6 |
|   11 |
|   14 |
|   20 |
|   24 |
|   33 |
+------+

The query below returns the average of a random sample of the num column. Each row in the sample has an eighty percent probability being included in the sample and therefore in the AVG computation.

SELECT AVG(num) FROM example_table WITH (SAMPLE_RATIO = 0.8);

Example output follows. You will likely see different results, as the query returns a random sample.

+----------+
| AVG(num) |
+----------+
|  18.1111 |
+----------+

Using WITH (SAMPLE_RATIO = <value>) with SELECT Subqueries

You may use WITH (SAMPLE_RATIO = <value>)as part of a SELECT subquery, as shown in the following example.

CREATE TABLE example_table_2 AS SELECT * FROM example_table
  WITH (SAMPLE_RATIO = 0.6);
SELECT num FROM example_table2 ORDER BY num;

Example output follows. You will likely see different results, as the query returns a random sample.

+------+
| num  |
+------+
|   10 |
|   12 |
|   14 |
|   24 |
|   33 |
|   40 |
+------+

Using WITH (SAMPLE_RATIO = <value>) with Joins

You may use WITH (SAMPLE_RATIO = <value>) in a SELECT statement that contains a join. However, you should do so carefully to ensure the join yields the results you intend. For example, consider joining a CUSTOMER table to an ORDER table using the customer_id field in the ORDER table. The placement of the WITH (SAMPLE_RATIO = <value>) clause within the join will greatly impact the result, as shown below.

The following query selects a sample of customers and returns the orders for those customers only.

SELECT c.name, o.order_id, o.order_total
FROM customer WITH (SAMPLE_RATIO = 0.4) c
JOIN order o
ON c.customer_id = o.customer_id

The following query selects a sample of all orders and returns only those orders along with their associated customers.

SELECT c.name, o.order_id, o.order_total
FROM customer c
JOIN order o WITH (SAMPLE_RATIO = 0.4)
ON c.customer_id = o.customer_id