Outdated Version

You are viewing an older version of this section. View current production version.

JSON Guide

About JSON in MemSQL

MemSQL exposes a Javascript Object Notation (JSON) column-type that implements the JSON standard. You can define columns in MemSQL tables as the JSON data type in order to efficiently store and query JSON values. A JSON value is a JavaScript object, which means it is represented as a series of name/value pairs that are written according to a structured format and sequence. For more information, see http://www.ietf.org/rfc/rfc4627.txt.

Managing Collections of Metadata

JSON is useful for managing a collection of diverse data, represented as name-value pairs, that might otherwise be cumbersome to refactor into a formalized key-value table, or that might be stored in a table that is sparsely populated. For example, suppose an organization had an asset management application using MemSQL to track all the information about its physical assets. The asset data is diverse—what’s relevant for a desk differs from what’s important for a server machine or a company car. All assets might have common attributes, such as asset tag ID, asset type, asset name, and description. Each type of asset might have unique attributes, such as size and weight dimensions, hostname and IP address, or gas mileage.

Instead of creating a highly granular table to manage all data as key-value pairs, this organization could simply create a MemSQL table using a JSON column to efficiently manage the unique attribute data. With this design:

  • Each asset gets a row in the table.
  • Attributes that are common to all assets have their own column in the table. These columns allow you to query on common features and quickly narrow down the final result set as much as possible (for example, filtering by asset type).
  • The various, remaining attributes associated with each asset are stored in a JSON column (which might be named something like property_bag). For example, the JSON column for an office desk asset could include JSON data such as size, weight, and number of drawers. The JSON data for a server machine could include rack location, number of cores, and MAC address.

DDL: Defining Tables with JSON Columns

Any MemSQL table can contain one or more columns of data type JSON. A JSON column can optionally be suffixed with NOT NULL. Using the NOT NULL suffix is recommended because it causes MemSQL to give an error if the user attempts to insert invalid JSON.

Comparing JSON and LONGTEXT Columns

A JSON column is analogous to a LONGTEXT column in the following ways:

  • JSON columns can store arbitrarily large JSON values in a normalized text representation.
  • JSON columns have the same storage requirement, as if the JSON value were stored in a text column.

The key difference is that JSON data is stored in a normalized format, which makes many operations faster than if the data were stored manually in a text column. The following is an example of non-normalized data, which is valid JSON but is relatively difficult to parse:

'{ "b\u000a": 1,"a": 2 ,"a":3 } '

Normalized data, on the other hand, is easier to parse because duplicate keys are merged, keys are sorted by value, and extraneous whitespace is removed.

'{"a":3,"b\n":1}'

Defining JSON Columns

Defining a JSON column in a MemSQL table is as simple as specifying the JSON data type in the CREATE TABLE command:

CREATE TABLE assets (
   tag_id BIGINT PRIMARY KEY,
   name TEXT NOT NULL,
   description TEXT,
    properties JSON NOT NULL);
Info

It is recommended that users define JSON columns with the NOT NULL suffix. This way MemSQL gives an error if the user attempts to insert invalid JSON.

Indexing Data in JSON Columns

JSON columns are not indexed directly - they are indexed using computed columns. For the fastest performance, you should not use JSON built-ins or :: notation in your filters. Instead, create a computed column that includes the JSON column in the computation, and then use the computed column for the index. In this way, the index gets updated only when the relevant JSON data is updated in a row.

CREATE TABLE assets (
   tag_id BIGINT PRIMARY KEY,
   name TEXT NOT NULL,
   description TEXT,
   properties JSON NOT NULL,
   weight AS properties::%weight PERSISTED DOUBLE,
   license_plate AS properties::$license_plate PERSISTED TEXT,
   KEY(license_plate), KEY(weight));

In this example, the following queries will be strictly more efficient:

SELECT * FROM assets ORDER BY weight;
SELECT * FROM assets WHERE license_plate = "VGB116";

than these equivalent queries:

SELECT * FROM assets ORDER BY properties::%weight;
SELECT * FROM assets WHERE properties::$license_plate = "VGB116";

DML: Accessing Data in JSON Columns

This section describes how to insert and update data in a MemSQL table with one or more JSON columns.

Adding Data to a JSON Column Upon INSERT

When inserting a row in a table, adding data to a JSON column is straightforward.

INSERT INTO posts (contents, usercomments) VALUES
    ('My first post!', '[]'),
    ('Followup to first post', '[]');

Columnstore Tables Having JSON Columns With Null Values or Empty Arrays

When you store a JSON column in a columnstore table, null values and empty arrays in the JSON object are handled as follows when the object is written to the columnstore.

  • Name/value pairs with the value null are normally removed from the JSON object.
  • Name/value pairs containing empty arrays are normally removed from the JSON object.
  • If the JSON object has only the value null or [], the value is replaced with NULL.

Example commands you can use to store data are INSERT, UPDATE, and LOAD DATA.

An example INSERT scenario follows. Consider a table that is defined as:

CREATE TABLE json_empty_values_table(a INT, b JSON, KEY (a) USING CLUSTERED COLUMNSTORE);

Insert five rows into the table:

INSERT INTO json_empty_values_table VALUES (1, '{"v":null}');
INSERT INTO json_empty_values_table VALUES (2, '{"w":[]}');
INSERT INTO json_empty_values_table VALUES (3, '{"x":"foo","y":null,"z":[]}');
INSERT INTO json_empty_values_table VALUES (4, 'null');
INSERT INTO json_empty_values_table VALUES (5, '[]');

Manually flush the inserted data to the columnstore:

OPTIMIZE TABLE json_empty_values_table FLUSH;

Query the table:

SELECT * FROM json_empty_values_table ORDER BY a;

Output:

+------+-------------+
| a    | b           |
+------+-------------+
|    1 | {}          |
|    2 | {}          |
|    3 | {"x":"foo"} |
|    4 | NULL        |
|    5 | NULL        |
+------+-------------+

Accessing Fields in a JSON Object

To access a field of a JSON object stored in a column, use the name of the column suffixed with ::keyName. For example, if column data contains {"bits":[true,false]}, then the expression data::bits evaluates to the JSON value [true,false].

Converting a JSON String Value into a SQL String

To transparently convert a JSON string value (enclosed in quotation marks and possibly containing escape sequences) into a SQL string, use the name of the column suffixed with ::$keyname. For example, if the SQL string '{"magic":"hello\\nworld"}' is inserted into column data, then:

  • The expression data::magic evaluates to '"hello\\nworld"'. This a valid JSON entity, properly escaped and quoted, that can safely be reinserted into a JSON column.
  • The expression data::$magic evaluates to 'hello\nworld'. This is a SQL string with an embedded newline.
SELECT uid, CONCAT(userdata::name::$first, ' ', userdata::name::$last) FROM users;

Converting a JSON Number or Boolean Value into a SQL DOUBLE

To transparently convert a JSON number or Boolean value into a SQL DOUBLE, use the name of the column suffixed with ::%keyname. For example, if column data contains {"valid":true,"value":3.14}, then data::%valid is 1 and data::%value is 3.14.

Using the ::$ and ::% Operators

If the ::$ or ::% operator is used on a JSON field that is not of the appropriate type, the JSON result is quietly converted to string or double, following the rules described in JSON_EXTRACT_<type>. In fact, these access operators are simply convenient aliases for the JSON_EXTRACT_<type> built-in function, and they follow all the same rules.

Note that the :: syntax only accepts a constant key path while the JSON_EXTRACT_<type> can take any values:

select json, json::a::`2` from (select '{"a":[1,2,3,4]}' as json) sub;
select json, json_extract_json(json, 'a', 1+1) from (select '{"a":[1,2,3,4]}' as json) sub;

Accessing Nested JSON Objects

To access nested JSON objects, chain the colon-colon operator to form a keypath. For example, data::address::street means the street field of the address field of the data column.

Info

If one of the keys in the keypath is not present in the nested object, then the entire colon-colon expression yields SQL NULL.

Using Colon-Colon Notation in UPDATE Queries

You can use the colon-colon notation for JSON keypaths in UPDATE queries. For example, the following two UPDATE queries perform exactly the same operation:

UPDATE users SET userdata::name::$first = 'Alex';
UPDATE users SET userdata = JSON_SET_STRING(userdata, 'name', 'first', 'Alex');

In fact, these access operators are simply convenient aliases for the JSON_SET_<type> built-in function (see JSON_SET_<type>), and they follow all the same rules.

Field Name Syntax for JSON Access and UPDATE Queries

In both the JSON access and update contexts, each field name must either:

  • follow the syntax for a valid SQL identifier, or

  • be escaped with backticks in the same manner as a SQL identifier

For example, the following two SELECT queries perform exactly the same operation:

SELECT ticker_symbol FROM stocks WHERE statistics::%`P/E` > 1.5;
SELECT ticker_symbol FROM stocks WHERE JSON_EXTRACT_DOUBLE(statistics, 'P/E') > 1.5;

When using the JSON_EXTRACT_<type> form of the query (see JSON_EXTRACT_<type> ), there is no constraint on the contents of the keystring. A JSON object can contain any string as a key, even "", or a string with "embedded\u0000nulls".

Implementation Considerations

This section describes some unique behaviors in MemSQL’s implementation of the JSON standard (RFC 4627).

Infinities and NaNs

MemSQL does not attempt to support entities such as the nan (“not a number”) entity, as in {"value":nan}. Although the JavaScript language supports nan and positive and negative infinities, neither the JSON standard nor SQL’s DOUBLE type provide any way to represent such non-finite values. If your application requires these special values, you might try using null, 0.0, or strings (such as "nan" and "inf") in their place.

Unicode Support

JSON strings can contain any of the 65536 Unicode characters in the Basic Multilingual Plane, either directly as themselves or escaped with JSON’s \uABCD notation. JSON normalization converts all such ASCII escape sequences into UTF-8 byte sequences, except for those characters that must remain \u-encoded in order to keep the string valid JSON.

MemSQL (like MySQL) does not support characters outside the Basic Multilingual Plane (characters whose codepoints are in the range 0xFFFF to 0x10FFFF). This includes some uncommon Chinese characters, and symbols such as emoji. Attempts to insert or use JSON entities containing those characters are rejected, which, in MemSQL, causes those entities to become SQL NULL instead. Such characters can be unambiguously encoded using JSON’s surrogate pair notation:

-- Here, ":)" stands in for the four-byte Unicode character U+1F602
UPDATE posts SET usercomments = JSON_ARRAY_PUSH_JSON(usercomments, '"Nice post! :)"');  -- pushes JSON 'null' instead
UPDATE posts SET usercomments = JSON_ARRAY_PUSH_JSON(usercomments, '"Nice post! \\uD83D\\uDE02"');  -- succeeds
SELECT usercomments FROM posts;

However, if your front-end encodes high Unicode characters in this way, it must also be prepared to deal with the UTF-8-encoded surrogate pairs returned in queries. For example, the SELECT query above will return a string of the form [null,"Nice post! HhhLll"], where HhhLll represents a high surrogate and low surrogate encoded as invalid UTF-8.

Character Encodings and Collation

Collation provides a set of rules to use when comparing characters in a character set. By default, JSON columns in MemSQL tables use the utf8_bin collation, which is a case-sensitive collation that sorts by Unicode codepoint value. The collation of a JSON column affects the following:

  • Output of SELECT DISTINCT, ORDER BY, and other queries that compare entire values.
  • Identification of duplicate keys inside a single JSON object during normalization. Example: Whether the string {"Schlüssel":1,"Schluessel":2} is normalized to {"Schluessel":2}.
  • Sorting of keys inside a single JSON object. Example: Whether the string {"Schlüssel":1,"Schluff":2} is normalized to {"Schluff":2,"Schlüssel":1}.

The utf8_bin collation generally provides the desired behavior. However, you can override this default at the table or column level.

Info

MemSQL recommends that you use the utf8_bin collation for all JSON columns - either implicitly by default, or explicitly at the column level.

Table-level Override

In the following example, for the users table, both userdata and lastname use the table collation (which has been specified as utf8_unicode_ci).

CREATE TABLE users (
   uid INT AUTO_INCREMENT PRIMARY KEY,
   userdata JSON,
   lastname AS data::name::$last PERSISTED TEXT) COLLATE=utf8_unicode_ci;

Column-level Override

In the following example, for the orders table:

  • The data column uses utf8_unicode_ci.
  • The productdetails column uses utf8_bin.
  • Both the productname column and the comments column use utf8_general_ci, which is MemSQL’s default database collation
CREATE TABLE orders (
    oid INT AUTO_INCREMENT PRIMARY KEY,
    data JSON COLLATE utf8_unicode_ci,
    productname AS data::product::$name PERSISTED TEXT,
    productdetails AS data::product::$details PERSISTED TEXT COLLATE utf8_bin,
    comments VARCHAR(256));

Maximum JSON Value Size

Under the hood, JSON data is stored in LONGBLOB columns. As a practical matter, the size limit of a JSON value is bounded by the max-allowed-packet variable.