You are viewing an older version of this section. View current production version.
JSON Guide
About JSON in MemSQL
MemSQL exposes a Javascript Object Notation (JSON) column-type that implements the JSON standard. You can define columns in MemSQL tables as the JSON data type in order to efficiently store and query JSON values. A JSON value is a JavaScript object, which means it is represented as a series of name/value pairs that are written according to a structured format and sequence. For more information, see http://www.ietf.org/rfc/rfc4627.txt.
Managing Collections of Metadata
JSON is useful for managing a collection of diverse data, represented as name-value pairs, that might otherwise be cumbersome to refactor into a formalized key-value table, or that might be stored in a table that is sparsely populated. For example, suppose an organization had an asset management application using MemSQL to track all the information about its physical assets. The asset data is diverse—what’s relevant for a desk differs from what’s important for a server machine or a company car. All assets might have common attributes, such as asset tag ID, asset type, asset name, and description. Each type of asset might have unique attributes, such as size and weight dimensions, hostname and IP address, or gas mileage.
Instead of creating a highly granular table to manage all data as key-value pairs, this organization could simply create a MemSQL table using a JSON column to efficiently manage the unique attribute data. With this design:
- Each asset gets a row in the table.
- Attributes that are common to all assets have their own column in the table. These columns allow you to query on common features and quickly narrow down the final result set as much as possible (for example, filtering by asset type).
- The various, remaining attributes associated with each asset are stored in a JSON column (which might be named something like property_bag). For example, the JSON column for an office desk asset could include JSON data such as size, weight, and number of drawers. The JSON data for a server machine could include rack location, number of cores, and MAC address.
DDL: Defining Tables with JSON Columns
Any MemSQL table can contain one or more columns of data type JSON
. A JSON column can optionally be suffixed with NOT NULL
. Using the NOT NULL suffix is recommended because it causes MemSQL to give an error if the user attempts to insert invalid JSON.
Comparing JSON and LONGTEXT Columns
A JSON column is analogous to a LONGTEXT
column in the following ways:
- JSON columns can store arbitrarily large JSON values in a normalized text representation.
- JSON columns have the same storage requirement, as if the JSON value were stored in a text column.
The key difference is that JSON data is stored in a normalized format, which makes many operations faster than if the data were stored manually in a text column. The following is an example of non-normalized data, which is valid JSON but is relatively difficult to parse:
'{ "b\u000a": 1,"a": 2 ,"a":3 } '
Normalized data, on the other hand, is easier to parse because duplicate keys are merged, keys are sorted by value, and extraneous whitespace is removed.
'{"a":3,"b\n":1}'
Defining JSON Columns
Defining a JSON column in a MemSQL table is as simple as specifying the JSON data type in the CREATE TABLE
command:
CREATE TABLE assets (
tag_id BIGINT PRIMARY KEY,
name TEXT NOT NULL,
description TEXT,
properties JSON NOT NULL);
It is recommended that users define JSON columns with the NOT NULL suffix. This way MemSQL gives an error if the user attempts to insert invalid JSON.
Indexing Data in JSON Columns
JSON columns are not indexed directly - they are indexed using computed columns. For the fastest performance, you should not use JSON built-ins or ::
notation in your filters. Instead, create a computed column that includes the JSON column in the computation, and then use the computed column for the index. In this way, the index gets updated only when the relevant JSON data is updated in a row.
CREATE TABLE assets (
tag_id BIGINT PRIMARY KEY,
name TEXT NOT NULL,
description TEXT,
properties JSON NOT NULL,
weight AS properties::%weight PERSISTED DOUBLE,
license_plate AS properties::$license_plate PERSISTED TEXT,
KEY(license_plate), KEY(weight));
In this example, the following queries will be strictly more efficient:
SELECT * FROM assets ORDER BY weight;
SELECT * FROM assets WHERE license_plate = "VGB116";
than these equivalent queries:
SELECT * FROM assets ORDER BY properties::%weight;
SELECT * FROM assets WHERE properties::$license_plate = "VGB116";
DML: Accessing Data in JSON Columns
This section describes how to insert and update data in a MemSQL table with one or more JSON columns.
Adding Data to a JSON Column Upon INSERT
When inserting a row in a table, adding data to a JSON column is straightforward.
INSERT INTO posts (contents, usercomments) VALUES
('My first post!', '[]'),
('Followup to first post', '[]');
Columnstore Tables Having JSON Columns With Null Values or Empty Arrays
By default, MemSQL does not preserve columnstore JSON NULL
values and empty arrays. To change this behavior and preserve original columnstore JSON, set the preserve_original_colstore_json
engine variable to ON
. This flag preserves the original columnstore JSON for any new data that is loaded; it does not update the existing data. By default, this variable is set to AUTO
(same as OFF
).
Behavior When preserve_original_colstore_json
is Set to OFF
When you store a JSON column in a columnstore table (and preserve_original_colstore_json
is set to OFF
), then null values and empty arrays in the JSON object are handled as follows when the object is written to the columnstore:
- Name/value pairs with the value
NULL
are normally removed from the JSON object. - Name/value pairs containing empty arrays are normally removed from the JSON object.
- If the JSON object has only the value
NULL
or[]
, the value is replaced withNULL
.
Example commands you can use to store data are INSERT
, UPDATE
, and LOAD DATA
.
An example INSERT
scenario follows. Consider a table that is defined as:
CREATE TABLE json_empty_values_table(a INT, b JSON, KEY (a) USING CLUSTERED COLUMNSTORE);
Insert five rows into the table:
INSERT INTO json_empty_values_table VALUES (1, '{"v":null}');
INSERT INTO json_empty_values_table VALUES (2, '{"w":[]}');
INSERT INTO json_empty_values_table VALUES (3, '{"x":"foo","y":null,"z":[]}');
INSERT INTO json_empty_values_table VALUES (4, 'null');
INSERT INTO json_empty_values_table VALUES (5, '[]');
Manually flush the inserted data to the columnstore:
OPTIMIZE TABLE json_empty_values_table FLUSH;
Query the table:
SELECT * FROM json_empty_values_table ORDER BY a;
****
+------+-------------+
| a | b |
+------+-------------+
| 1 | {} |
| 2 | {} |
| 3 | {"x":"foo"} |
| 4 | NULL |
| 5 | NULL |
+------+-------------+
Accessing Fields in a JSON Object
To access a field of a JSON object stored in a column, use the name of the column suffixed with ::keyName
. For example, if column data contains {"bits":[true,false]}
, then the expression data::bits
evaluates to the JSON value [true,false]
.
Converting a JSON String Value into a SQL String
To transparently convert a JSON string value (enclosed in quotation marks and possibly containing escape sequences) into a SQL string, use the name of the column suffixed with ::$keyname
. For example, if the SQL string '{"magic":"hello\\nworld"}'
is inserted into column data, then:
- The expression data::magic evaluates to
'"hello\\nworld"'
. This a valid JSON entity, properly escaped and quoted, that can safely be reinserted into a JSON column. - The expression
data::$magic
evaluates to'hello\nworld'
. This is a SQL string with an embedded newline.
SELECT uid, CONCAT(userdata::name::$first, ' ', userdata::name::$last) FROM users;
Converting a JSON Number or Boolean Value into a SQL DOUBLE
To transparently convert a JSON number or Boolean value into a SQL DOUBLE
, use the name of the column suffixed with ::%keyname
. For example, if column data contains {"valid":true,"value":3.14}
, then data::%valid
is 1 and data::%value
is 3.14.
Using the ::$ and ::% Operators
If the ::$
or ::%
operator is used on a JSON field that is not of the appropriate type, the JSON result is quietly converted to string or double, following the rules described in JSON_EXTRACT_<type>. In fact, these access operators are simply convenient aliases for the JSON_EXTRACT_<type>
built-in function, and they follow all the same rules.
Note that the ::
syntax only produces the expected output when keys are given as literals like “a”, or 5. The ::
syntax returns NULL
for variables or expressions, hence use JSON_EXTRACT_<type>
instead.
For example, the following SELECT
statement returns the expected output if the value 2
is provided in the JSON field along with the ::
operator:
SELECT json, json::a::`2` FROM (SELECT '{"a":[1,2,3,4]}' AS json) sub;
****
+-----------------+--------------+
| json | json::a::`2` |
+-----------------+--------------+
| {"a":[1,2,3,4]} | 3 |
+-----------------+--------------+
While for an expression like 1+1
, use the JSON_EXTRACT_<type>
function:
SELECT json, json_extract_json(json, 'a', 1+1) FROM (SELECT '{"a":[1,2,3,4]}' AS json) sub;
****
+-----------------+-----------------------------------+
| json | json_extract_json(json, 'a', 1+1) |
+-----------------+-----------------------------------+
| {"a":[1,2,3,4]} | 3 |
+-----------------+-----------------------------------+
Accessing Nested JSON Objects
To access nested JSON objects, chain the colon-colon operator to form a keypath
. For example, data::address::street
means the street
field of the address
field of the data
column.
If one of the keys in the keypath is not present in the nested object, then the entire
colon-colon expression yields SQL NULL
.
Using Colon-Colon Notation in UPDATE Queries
You can use the colon-colon notation for JSON keypaths in UPDATE queries.
For example, the following two UPDATE
queries perform exactly the same operation:
UPDATE users SET userdata::name::$first = 'Alex';
UPDATE users SET userdata = JSON_SET_STRING(userdata, 'name', 'first', 'Alex');
In fact, these access operators are simply convenient aliases for the JSON_SET_<type>
built-in function (see JSON_SET_<type>), and they follow all the same rules.
Field Name Syntax for JSON Access and UPDATE Queries
In both the JSON access and update contexts, each field name must either:
-
follow the syntax for a valid SQL identifier, or
-
be escaped with backticks in the same manner as a SQL identifier
For example, the following two SELECT queries perform exactly the same operation:
SELECT ticker_symbol FROM stocks WHERE statistics::%`P/E` > 1.5;
SELECT ticker_symbol FROM stocks WHERE JSON_EXTRACT_DOUBLE(statistics, 'P/E') > 1.5;
When using the JSON_EXTRACT_<type>
form of the query (see JSON_EXTRACT_<type> ), there is no constraint on the contents of the keystring. A JSON object can contain any string as a key, even ""
, or a string with "embedded\u0000nulls"
. For more information on extracting JSON data, see JSON LOAD DATA.
Implementation Considerations
This section describes some unique behaviors in MemSQL’s implementation of the JSON standard (RFC 4627).
Infinities and NaNs
MemSQL does not attempt to support entities such as the nan
(“not a number”) entity, as in {"value":nan}
. Although the JavaScript language supports nan
and positive and negative infinities, neither the JSON standard nor SQL’s DOUBLE
type provide any way to represent such non-finite values. If your application requires these special values, you might try using null
, 0.0
, or strings (such as "nan"
and "inf"
) in their place.
Unicode Support
JSON strings can contain any of the 65536 Unicode characters in the Basic Multilingual Plane,
either directly as themselves or escaped with JSON’s \uABCD
notation. JSON normalization
converts all such ASCII escape sequences into UTF-8 byte sequences, except for those characters that must remain \u
-encoded in order to keep the string valid JSON.
MemSQL (like MySQL) does not support characters outside the Basic Multilingual Plane (characters whose codepoints are in the range 0xFFFF to 0x10FFFF). This includes some uncommon Chinese characters, and symbols such as emoji. Attempts to insert or use JSON entities containing those characters are rejected, which, in MemSQL, causes those entities to become SQL NULL
instead. Such characters can be unambiguously encoded using JSON’s surrogate pair notation:
-- Here, ":)" stands in for the four-byte Unicode character U+1F602
UPDATE posts SET usercomments = JSON_ARRAY_PUSH_JSON(usercomments, '"Nice post! :)"'); -- pushes JSON 'null' instead
UPDATE posts SET usercomments = JSON_ARRAY_PUSH_JSON(usercomments, '"Nice post! \\uD83D\\uDE02"'); -- succeeds
SELECT usercomments FROM posts;
However, if your front-end encodes high Unicode characters in this way, it must also be prepared to deal with the UTF-8-encoded surrogate pairs returned in queries. For example, the SELECT
query above will return a string of the form [null,"Nice post! HhhLll"]
, where HhhLll
represents a high surrogate and low surrogate encoded as invalid UTF-8.
Character Encodings and Collation
Collation
provides a set of rules to use when comparing characters in a character set. By default, JSON columns in MemSQL tables use the utf8_bin
collation, which is a case-sensitive collation that sorts by Unicode codepoint value. The collation of a JSON column affects the following:
- Output of
SELECT DISTINCT
,ORDER BY
, and other queries that compare entire values. - Identification of duplicate keys inside a single JSON object during normalization. Example: Whether the string
{"Schlüssel":1,"Schluessel":2}
is normalized to{"Schluessel":2}
. - Sorting of keys inside a single JSON object. Example: Whether the string
{"Schlüssel":1,"Schluff":2}
is normalized to{"Schluff":2,"Schlüssel":1}
.
The utf8_bin
collation generally provides the desired behavior. However, you can override this default at the table or column level.
MemSQL recommends that you use the utf8_bin
collation for all JSON columns - either implicitly by default, or explicitly at the column level.
Table-level Override
In the following example, for the users
table, both userdata
and lastname
use the table collation (which has been specified as utf8_unicode_ci
).
CREATE TABLE users (
uid INT AUTO_INCREMENT PRIMARY KEY,
userdata JSON,
lastname AS data::name::$last PERSISTED TEXT) COLLATE=utf8_unicode_ci;
Column-level Override
In the following example, for the orders
table:
- The
data
column usesutf8_unicode_ci
. - The
productdetails
column usesutf8_bin
. - Both the
productname
column and thecomments
column useutf8_general_ci
, which is MemSQL’s default database collation
CREATE TABLE orders (
oid INT AUTO_INCREMENT PRIMARY KEY,
data JSON COLLATE utf8_unicode_ci,
productname AS data::product::$name PERSISTED TEXT,
productdetails AS data::product::$details PERSISTED TEXT COLLATE utf8_bin,
comments VARCHAR(256));
Maximum JSON Value Size
Under the hood, JSON data is stored in LONGBLOB
columns. As a practical matter, the size limit of a JSON value is bounded by the max-allowed-packet
variable.