LOAD DATA
See more syntax options for LOAD DATA in CREATE PIPELINE.
SingleStore Managed Service does not support the INFILE syntax option for this command.
Insert data stored in a CSV, JSON, or Avro file into a table.
CSV LOAD DATA
Syntax
LOAD DATA [LOCAL] INFILE 'file_name'
[REPLACE | IGNORE | SKIP { ALL | CONSTRAINT | DUPLICATE KEY } ERRORS]
INTO TABLE tbl_name
[CHARACTER SET charset_name]
[{FIELDS | COLUMNS}
[TERMINATED BY 'string']
[[OPTIONALLY] ENCLOSED BY 'char']
[ESCAPED BY 'char']
]
[LINES
[STARTING BY 'string']
[TERMINATED BY 'string']
]
[TRAILING NULLCOLS]
[NULL DEFINED BY string_to_insert_as_null]
[IGNORE number LINES]
[ ({col_name | @variable_name}, ...) ]
[SET col_name = expr,...]
[WHERE expr,...]
[MAX_ERRORS number]
Remarks
To understand LOAD DATA's error handling behavior, see the section below.
If no FIELDS or LINES clause is specified, then MemSQL uses the following defaults:
FIELDS TERMINATED BY '\t'
ENCLOSED BY ''
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
STARTING BY ''
The SET clause allows you to set columns using specific values or expressions with user-defined variables. For example, if your input file has 9 columns but the table has a 10th column called foo, you can add SET foo=0 or SET foo=@myVariable. Note that column names may only be used on the left side of SET expressions.
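For instance, a minimal sketch of this pattern (the file data.csv, the table t, and its column names c1 through c9 are hypothetical):
LOAD DATA INFILE 'data.csv'
INTO TABLE t
FIELDS TERMINATED BY ','
(c1, c2, c3, c4, c5, c6, c7, c8, c9)
SET foo = 0;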
The WHERE clause allows you to do filtering on incoming data. Only rows that satisfy the expression in the WHERE clause will be loaded into MemSQL. For an example of how to use the WHERE clause, see the examples section below.
The TRAILING NULLCOLS clause allows the input file to contain rows having fewer than the number of columns in the table. These missing fields must be trailing columns in the row; they are inserted as NULL values in the table.
The NULL DEFINED BY clause inserts NULL field values in the table for fields in the input file having the value string_to_insert_as_null.
The behavior of MemSQL's LOAD DATA command has several functional differences from MySQL's command:
- LOAD DATA will load the data into MemSQL in parallel to maximize performance. This makes LOAD DATA in MemSQL much faster on machines with a larger number of processors.
- LOAD DATA supports loading compressed .gz files.
- The only supported charset_name is utf8.
- While it is possible to execute LOAD DATA on leaf nodes, it should only be run on master aggregator or child aggregator node types. See Node Requirements for MemSQL Commands for more information. Note that when running this command on reference tables you must connect to the master aggregator.
The mysqlimport utility can also be used to import data into MemSQL. mysqlimport uses LOAD DATA internally.
MemSQL stores information about errors encountered during each LOAD DATA operation, but the number of errors is limited to 1000 by default. When this limit is reached, the load fails. This prevents out-of-memory issues when unintentionally loading large files with incorrect format or an incorrect LOAD DATA statement.
Use MAX_ERRORS at the end of the statement to change this limit.
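For example, a sketch of raising the limit for a single load (the file and table names are hypothetical; SKIP ALL ERRORS is included so offending rows are skipped rather than failing the statement outright):
LOAD DATA INFILE 'big_import.csv'
SKIP ALL ERRORS
INTO TABLE t
FIELDS TERMINATED BY ','
MAX_ERRORS 10000;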
Writing to multiple databases in a transaction is not supported.
Examples
Example 1
If the order of columns in the table is different from the order in the source file, you can name them explicitly. In this example, the columns are loaded in reverse order:
LOAD DATA INFILE 'foo.tsv'
INTO TABLE foo (fourth, third, second, first);
Example 2
You can also skip columns in the source file using the “@” sign. In this example only the first and fourth columns are imported into table foo:
LOAD DATA INFILE 'foo.tsv'
INTO TABLE foo (bar, @, @, baz);
Example 3
The default column delimiter is the tab ("\t") character, ASCII code 09. You can specify a different delimiter, even multi-character delimiters, with the COLUMNS TERMINATED BY clause:
LOAD DATA INFILE 'foo.csv'
INTO TABLE foo
COLUMNS TERMINATED BY ',';
Example 4
The following example demonstrates loading a file that has unusual column separators (|||):
LOAD DATA INFILE 'foo.oddformat'
INTO TABLE foo
COLUMNS TERMINATED BY '|||';
Example 5
You can also filter out unwanted rows using the WHERE clause. In this example, only rows where bar is equal to 5 will be loaded. All other rows will be discarded:
LOAD DATA INFILE 'foo.oddformat'
INTO TABLE foo (bar, baz)
WHERE bar = 5;
Example 6
Complex transformations can be performed in both the SET and WHERE clauses. For example, if you have an input file with an EventDate field and an EventId field:
10-1-2016,1
4-15-2016,2
1-10-2017,3
4-10-2017,4
Suppose you want to load only the rows with a date within three months of a certain date, for instance 10/15/2016. This can be accomplished as follows:
CREATE TABLE foo (EventDate date, EventId int);
LOAD DATA INFILE 'date_event.csv'
INTO TABLE foo
FIELDS TERMINATED BY ','
(@EventDate, EventId)
SET EventDate = STR_TO_DATE(@EventDate, '%m-%d-%Y')
WHERE ABS(MONTHS_BETWEEN(EventDate, date('2016-10-15'))) < 3;
SELECT * FROM foo;
****
+------------+---------+
| EventDate | EventId |
+------------+---------+
| 2016-10-01 | 1 |
| 2017-01-10 | 3 |
+------------+---------+
While both column names and variables can be referenced in the WHERE clause, column names can only be assigned to in the SET clause. The scope of these clauses is restricted to the current row, and therefore SELECT statements cannot be evaluated.
Example 7
The following example demonstrates how to use the TRAILING NULLCOLS clause using the file numbers.csv, whose contents are shown below.
1,2,3
4,5
6
Run the following commands:
CREATE TABLE foo(a INT, b INT, c INT);
LOAD DATA INFILE 'numbers.csv' INTO TABLE foo COLUMNS TERMINATED BY ',' TRAILING NULLCOLS;
SELECT * FROM foo;
****
+------+------+------+
| a | b | c |
+------+------+------+
| 1 | 2 | 3 |
| 4 | 5 | NULL |
| 6 | NULL | NULL |
+------+------+------+
Example 8
The following example demonstrates how to use the NULL DEFINED BY clause using the file numbers2.csv, whose contents are shown below.
1,2,3
,10,
50,,
Run the following commands:
CREATE TABLE foo(a INT, b INT, c INT);
LOAD DATA INFILE 'numbers2.csv' INTO TABLE foo COLUMNS TERMINATED BY ',' NULL DEFINED BY '';
SELECT * FROM foo;
****
+------+------+------+
| a | b | c |
+------+------+------+
| 1 | 2 | 3 |
| NULL | 10 | NULL |
| 50 | NULL | NULL |
+------+------+------+
JSON LOAD DATA
Syntax
LOAD DATA [LOCAL] INFILE 'file_name'
[REPLACE | SKIP { CONSTRAINT | DUPLICATE KEY } ERRORS]
INTO TABLE tbl_name
FORMAT JSON
subvalue_mapping
[SET col_name = expr,...]
[WHERE expr,...]
[MAX_ERRORS number]
subvalue_mapping:
( {col_name | @variable_name} <- subvalue_path [DEFAULT literal_expr], ...)
subvalue_path:
{% | [%::]ident [::ident ...]}
Semantics
Extract specified subvalues from each JSON value in file_name. Assign them to specified columns of a new row in tbl_name, or to variables used for a column assignment in a SET clause. If a specified subvalue can't be found in an input JSON, assign the DEFAULT clause literal instead. Discard rows which don't match the WHERE clause.
The file named by file_name must consist of concatenated UTF-8 encoded JSON values, optionally separated by whitespace. Newline-delimited JSON is accepted, for example.
Non-standard JSON values like NaN, Infinity, and -Infinity must not occur in file_name.
If file_name ends in .gz or .lz4, it will be decompressed.
JSON LOAD DATA supports a subset of the error recovery options allowed by CSV LOAD DATA. Their behavior is as described under CSV LOAD DATA.
Writing to multiple databases in a transaction is not supported.
Extracting JSON Values
subvalue_mapping specifies which subvalues are extracted and the column or variable to which each one is assigned.
LOAD DATA uses the ::-separated list of keys in a subvalue_path to perform successive key lookups in nested JSON objects, as if applying the :: SQL operator. Unlike the :: operator, subvalue_path may not be used to extract an element of a JSON array. The path % refers to the entire JSON value being processed. Leading %:: may be omitted from paths which are otherwise non-empty.
If a path can't be found in an input JSON value, then if the containing element of subvalue_mapping has a DEFAULT clause, its literal_expr will be assigned; otherwise, LOAD DATA will terminate with an error.
Path components containing whitespace or punctuation must be surrounded by backticks. For example, the paths %::`a.a`::b and `a.a`::b will both extract 1 from the input object {"a.a":{"b":1},"c":2}.
Array elements may be indirectly extracted by applying JSON_EXTRACT_<type> in a SET clause.
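As a minimal sketch of this (the file and table names are hypothetical), loading a file of top-level JSON arrays and pulling out the first element through an intermediate variable:
CREATE TABLE first_elems(first_elem INT);
LOAD DATA LOCAL INFILE 'arrays.json'
INTO TABLE first_elems
FORMAT JSON
(@arr <- %)
SET first_elem = JSON_EXTRACT_DOUBLE(@arr, 0);
Example 2 below shows the same technique in a larger statement.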
Converting JSON Values
Before assignment or SET clause evaluation, the JSON value extracted according to a subvalue_path is converted to a binary collation SQL string whose value depends on the extracted JSON type as follows:
JSON Type | Converted Value
---|---
null | SQL NULL
true / false | "1" / "0"
number | Verbatim, from extracted string.
string | All JSON string escape sequences, including \u-escape sequences, are converted to UTF-8. Verbatim otherwise.
array | Verbatim, from extracted string. For example, '[1,2]'
object | Verbatim, from extracted string. For example, '{"k":true}'
Conversion isn't recursive. So, for example, true isn't converted to "1" when it's a subvalue of an object which is being extracted whole.
Examples
Example 1
If example.json
consists of:
{"a":{"b":1}, "c":null}
{"a":{"b":2}, "d":null}
Then it can be loaded as follows:
CREATE TABLE t(a INT);
LOAD DATA LOCAL INFILE "example.json" INTO TABLE t(a <- a::b) FORMAT JSON;
SELECT * FROM t;
****
+------+
| a |
+------+
| 1 |
| 2 |
+------+
Example 2
If example2.json
consists of:
{"b":true, "s":"A\u00AE\u0022A", "n":-1.4820790816978637e-25, "a":[1,2], "o":{"subobject":1}}
{"b":false}
"hello"
Then we can perform a more complicated LOAD DATA:
CREATE TABLE t(b bool NOT NULL, s TEXT, n DOUBLE, a INT, o JSON NOT NULL, whole longblob);
LOAD DATA LOCAL INFILE "example2.json" INTO TABLE t FORMAT JSON
-> (b <- b default true,
-> s <- s default NULL,
-> n <- n default NULL,
-> @avar <- a default NULL,
-> o <- o default '{"subobject":"replaced"}',
-> whole <- %)
-> SET a = json_extract_double(@avar, 1)
-> WHERE b = true;
SELECT * FROM t;
****
+---+-------+-------------------------+------+--------------------------+-----------------------------------------------------------------------------------------------+
| b | s | n | a | o | whole |
+---+-------+-------------------------+------+--------------------------+-----------------------------------------------------------------------------------------------+
| 1 | A®"A | -1.4820790816978637e-25 | 2 | {"subobject":1} | {"b":true, "s":"A\u00AE\u0022A", "n":-1.4820790816978637e-25, "a":[1,2], "o":{"subobject":1}} |
| 1 | NULL | NULL | NULL | {"subobject":"replaced"} | hello |
+---+-------+-------------------------+------+--------------------------+-----------------------------------------------------------------------------------------------+
There are several things to note in the example above:
- true was converted to "1" for column b, but not for column whole. "1" was further converted to the BOOL value 1.
- The \u-escapes "\u00AE" and "\u0022" were converted to UTF-8 for column s, but not for column whole. Note that whole would have become invalid JSON if we had translated "\u0022".
- The second row was discarded because it failed to match the WHERE clause.
- None of the paths in subvalue_mapping could be found in the third row, so DEFAULT literals like '{"subobject":"replaced"}' were assigned instead.
- We assigned a to an intermediate variable so that we could extract an array element in the SET clause.
- The "top-level" JSON values in example2.json weren't all JSON objects. "hello" is a valid "top-level" JSON value.
Avro LOAD DATA
Syntax
LOAD DATA [LOCAL] INFILE 'file_name'
[REPLACE | SKIP { CONSTRAINT | DUPLICATE KEY } ERRORS]
INTO TABLE tbl_name
FORMAT AVRO
subvalue_mapping
[SET col_name = expr,...]
[WHERE expr,...]
[MAX_ERRORS number]
[SCHEMA 'avro_schema']
subvalue_mapping:
( {col_name | @variable_name} <- subvalue_path, ...)
subvalue_path:
{% | [%::]ident [::ident ...]}
Semantics
Extract specified subvalues from each Avro value in file_name. Assign them to specified columns of a new row in tbl_name, or to variables used for a column assignment in a SET clause. Discard rows which don't match the WHERE clause.
Avro LOAD DATA expects Avro data in one of two "sub-formats", depending on the SCHEMA clause.
If no SCHEMA clause is provided, file_name must name an Avro Object Container File as described in version 1.8.2 of the Avro specification. In addition, the following restrictions hold:
- The compression codec of the file must be null.
- Array and map values must not have more than 16384 elements.
- The type name of a record must not be used in a symbolic "reference to previously defined name" in any of its fields. It may still be used in a symbolic reference outside the record definition, however. For example, self-referential schemas like the following are rejected by LOAD DATA:
{ "type": "record",
  "name": "PseudoLinkedList",
  "fields" : [{"name": "value", "type": "long"},
              {"name": "next", "type": ["null", "PseudoLinkedList"]}] }
If a SCHEMA clause is provided, the file must be a "raw stream" consisting of only the concatenated binary encodings of instances of avro_schema. avro_schema must be a SQL string containing a JSON Avro schema. The restrictions on Object Container Files also apply to "raw stream" files.
It's an error to provide a SCHEMA clause when loading an Object Container File because it contains metadata alongside the encoded values.
All optional Avro schema attributes except the namespace attribute are ignored. Notably, logicalType attributes are ignored.
If file_name ends in .gz or .lz4, it will be decompressed.
Avro LOAD DATA supports a subset of the error recovery options allowed by CSV LOAD DATA. Their behavior is as described under CSV LOAD DATA.
Writing to multiple databases in a transaction is not supported.
Extracting Avro Values
subvalue_mapping specifies which subvalues are extracted and the column or variable to which each one is assigned.
LOAD DATA uses the ::-separated list of names in a subvalue_path to perform successive field name or union branch type name lookups in nested Avro records or unions. subvalue_path may not be used to extract elements of Avro arrays or maps. The path % refers to the entire Avro value being processed. Leading %:: may be omitted from paths which are otherwise non-empty.
If a path can't be found in an input Avro value, then:
- If a prefix of the path matches a record whose schema has no field matching the next name in the path, then LOAD DATA will terminate with an error.
- If a prefix matches a union whose schema has no branch matching the next name, then LOAD DATA will terminate with an error.
- If a prefix matches a union whose schema has a branch matching the next name, but that branch isn't the selected branch in that instance of the union schema, then Avro null will be extracted instead and LOAD DATA will continue.
Path components naming union branches must use the two-part fullname of the branch’s type if that type is in a namespace.
Path components containing whitespace or punctuation must be surrounded by backticks.
Array and map elements may be indirectly extracted by applying JSON_EXTRACT_<type> in a SET clause.
For example, consider two Avro records with the union schema:
[
"int",
{ "type" : "record",
"name" : "a",
"namespace" : "n",
"fields" : [{ "name" : "f1",
"type" : "int" }]
}
]
The paths %::`n.a`::f1 and `n.a`::f1 will both extract 1 from an instance of this schema whose JSON encoding is {"n.a":{"f1":1}}.
They will extract null from an instance whose encoding is {"int":2}.
The paths %::int and int will extract 2 from the second instance and null from the first.
Converting Avro Values
Before assignment or SET clause evaluation, the Avro value extracted according to a subvalue_path is converted to an unspecified SQL type which may be further explicitly or implicitly converted, as if from a SQL string whose value is as follows:
Avro Type | Converted Value
---|---
null | SQL NULL
boolean | "1" / "0"
int | The string representation of the value
long | The string representation of the value
float | SQL NULL if not finite. Otherwise, a string convertible without loss of precision to FLOAT
double | SQL NULL if not finite. Otherwise, a string convertible without loss of precision to DOUBLE
enum | The string representation of the enum.
bytes | Verbatim, from input bytes
string | Verbatim, from input bytes
fixed | Verbatim, from input bytes
record | The JSON encoding of the value
map | The JSON encoding of the value
array | The JSON encoding of the value
union | The JSON encoding of the value
logicalType attributes are ignored and have no effect on conversion.
Examples
Example 1
Consider an Avro Object Container File example.avro with the following schema:
{
"type": "record",
"name": "data",
"fields": [{ "name": "id", "type": "long"},
{ "name": "payload", "type": [ "null",
"string" ]}]
}
example.avro contains three Avro values whose JSON encodings are:
{"id":1,"payload":{"string":"first"}}
{"id":1,"payload":{"string":"second"}}
{"id":1,"payload":null}
example.avro can be loaded as follows:
CREATE TABLE t(payload TEXT, input_record JSON);
LOAD DATA LOCAL INFILE "example.avro"
-> INTO TABLE t
-> FORMAT AVRO
-> ( payload <- %::payload::string,
-> input_record <- % );
SELECT * FROM t;
****
+---------+----------------------------------------+
| payload | input_record |
+---------+----------------------------------------+
| first | {"id":1,"payload":{"string":"first"}} |
| second | {"id":1,"payload":{"string":"second"}} |
| NULL | {"id":1,"payload":null} |
+---------+----------------------------------------+
LOAD DATA was able to parse example.avro because Avro Object Container Files have a header which contains their schema.
Example 2
Consider a file named example.raw_avro, with the same values as example.avro from Example 1 but in the "raw stream" format. That is, example.raw_avro consists of the binary encoded values and nothing else. We add a SCHEMA clause to tell LOAD DATA to expect a "raw stream" with the provided schema:
CREATE TABLE t(payload TEXT, input_record JSON);
LOAD DATA LOCAL INFILE "example.raw_avro"
-> INTO TABLE t
-> FORMAT AVRO
-> ( payload <- %::payload::string,
-> input_record <- % )
-> schema
-> '{
'> "type": "record",
'> "name": "data",
'> "fields": [{ "name": "id", "type": "long"},
'> { "name": "payload", "type": [ "null",
'> "string" ]}]
'> }';
SELECT * FROM t;
****
+---------+----------------------------------------+
| payload | input_record |
+---------+----------------------------------------+
| first | {"id":1,"payload":{"string":"first"}} |
| second | {"id":1,"payload":{"string":"second"}} |
| NULL | {"id":1,"payload":null} |
+---------+----------------------------------------+
Example 3
Consider an Object Container File example3.avro with a more complicated payload than Example 1. We illustrate extracting values from nested unions and records, and also indirectly extracting elements of nested maps and arrays. The file's schema is:
{ "type": "record",
"namespace": "ns",
"name": "data",
"fields": [
{ "name": "id", "type": "long" },
{ "name": "payload", "type":
[ "null",
{ "type": "record",
"name": "payload_record",
"namespace": "ns",
"fields": [
{ "name": "f_bytes", "type": "bytes"},
{ "name": "f_string", "type": "string"},
{ "name": "f_map", "type":
{ "type": "map",
"values": { "type": "array",
"items": "int" }}}
]
}
]
}
]
}
The raw JSON encoding of the contents of this file can be seen in column c_whole_raw after the following LOAD DATA:
CREATE TABLE t (
-> c_id bigint,
-> c_bytes longblob,
-> c_string longblob,
-> c_array_second int,
-> c_whole_raw longblob,
-> c_whole_json json
-> );
LOAD DATA INFILE "example3.avro"
-> INTO TABLE t
-> FORMAT AVRO
-> ( c_id <- %::id,
-> c_bytes <- %::payload::`ns.payload_record`::f_bytes,
-> c_string <- %::payload::`ns.payload_record`::f_string,
-> @v_map <- %::payload::`ns.payload_record`::f_map,
-> c_whole_raw <- %,
-> c_whole_json <- %)
-> SET c_array_second = JSON_EXTRACT_JSON(@v_map, "a", 1);
SELECT * FROM t \G
****
*************************** 1. row ***************************
c_id: 1
c_bytes: NULL
c_string: NULL
c_array_second: NULL
c_whole_raw: {"id":1,"payload":null}
c_whole_json: {"id":1,"payload":null}
*************************** 2. row ***************************
c_id: 2
c_bytes: "A
c_string: "A
c_array_second: 2
c_whole_raw: {"id":2,"payload":{"ns.payload_record":{"f_bytes":"\u0022\u0041","f_string":"\"A","f_map":{"a":[1,2]}}}}
c_whole_json: {"id":2,"payload":{"ns.payload_record":{"f_bytes":"\"A","f_map":{"a":[1,2]},"f_string":"\"A"}}}
There are several things to note:
- We attempted to extract subvalues of the payload_record branch of the union-type payload field. Since that wasn't the selected member of the union in record 1, LOAD DATA assigned NULL to c_bytes and @v_map.
- We assigned the JSON encoding of f_map to @v_map and then performed JSON map and array lookups in the SET clause to ultimately extract 2.
- f_string and f_bytes had the same contents, but we can see how their different Avro types affected their JSON encodings and interacted with the SQL JSON type:
  - The JSON encoding of the Avro string value f_string, as seen in c_whole_raw, encodes special characters like " as the escape sequence \".
  - The JSON encoding of the Avro bytes value f_bytes, as seen in c_whole_raw, encodes every byte with a JSON \u-escape.
  - When converting the JSON encoding of record 2 to the SQL JSON type while assigning to c_whole_json, LOAD DATA normalized both representations of the byte sequence "A to \"A.
Differences Between SKIP ERRORS and IGNORE
If you attempt to insert invalid data into a table, MemSQL's default behavior is to return a client error. As a result, no data will be inserted into the table. This behavior applies to any LOAD DATA, INSERT, INSERT ... SELECT, or CREATE PIPELINE statement.
However, if you want MemSQL to automatically handle errors during insertion, the following options exist:
LOAD DATA [LOCAL] INFILE 'file_name' SKIP { ALL | CONSTRAINT | DUPLICATE KEY } ERRORS
LOAD DATA [LOCAL] INFILE 'file_name' IGNORE
The error handling behavior of SKIP ... ERRORS and IGNORE is very different, and may produce unexpected results. It's important to understand how each option affects the way data is inserted into the destination table.
For most scenarios, SKIP ... ERRORS should be used instead of IGNORE. If you use IGNORE and are uncertain how some error cases may be handled, unintended and unexpected data corruption in the destination table may occur. See the IGNORE Behavior section for more information.
SKIP ERRORS Behavior
MemSQL's SKIP ... ERRORS behavior allows you to specify an error scenario that, when encountered, discards an offending row. Three kinds of error scenarios can be skipped:
- SKIP DUPLICATE KEY ERRORS: Any row in the source data that contains a duplicate unique or primary key will be discarded. If the row contains invalid data other than a duplicate key, an error will be generated. See SKIP DUPLICATE KEY ERRORS below.
- SKIP CONSTRAINT ERRORS: Inclusive of SKIP DUPLICATE KEY ERRORS. If a row violates a column's NOT NULL constraint, or the row contains invalid JSON or Geospatial values, the row will be discarded. If the row contains invalid data outside the scope of constraint or invalid value errors, an error will be generated. See SKIP CONSTRAINT ERRORS below.
- SKIP ALL ERRORS: Inclusive of SKIP DUPLICATE KEY ERRORS and SKIP CONSTRAINT ERRORS. Also includes any parsing errors in the row caused by issues such as an invalid number of fields. See SKIP ALL ERRORS below.
If the specified error type is encountered when inserting a row, it will generate a warning instead of an error, which can be displayed by executing a SHOW LOAD WARNINGS statement.
SKIP DUPLICATE KEY ERRORS
When SKIP DUPLICATE KEY ERRORS is specified, source files that contain duplicate unique or primary key values will be handled in the following way:
- If the destination table’s schema specifies a unique or primary key column, and
- The source file contains one or more rows with a duplicate key value that already exists in the destination table or exists elsewhere in the source file, then
- Every duplicate row in the source file will be discarded and will not be inserted into the destination table.
SKIP DUPLICATE KEY ERRORS cannot be combined with REPLACE.
Example
Create a new table with a PRIMARY KEY column:
CREATE TABLE orders
-> (
-> id BIGINT PRIMARY KEY,
-> customer_id INT,
-> item_description VARCHAR(255),
-> order_time TIMESTAMP NOT NULL
-> );
****
Query OK, 0 rows affected (0.00 sec)
The following CSV file will be loaded into this table as orders.csv. Note the duplicate primary key value of 2 in line 3:
1,372,Apples,2016-05-09
2,138,Pears,2016-07-14
2,236,Bananas,2016-06-23
3,307,Oranges,2016-07-31
Load the data into the table:
LOAD DATA INFILE 'orders.csv'
-> SKIP DUPLICATE KEY ERRORS
-> INTO TABLE orders
-> FIELDS TERMINATED BY ',';
****
Query OK, 3 rows affected (0.11 sec)
Note that only 3 rows were inserted even though 4 rows were present in the source file. Line 3 in the source file contained a duplicate primary key, and you can verify that it was not inserted by displaying load warnings:
SHOW LOAD WARNINGS\G
****
*************************** 1. row ***************************
Unix Timestamp: 1485157780436644
Level: Warning
Code: 1062
Message: Duplicate entry for unique key
Line Contents: 2,236,Bananas,2016-06-23
Line Number: 3
Host: 127.0.0.1
Port: 3307
1 row in set (0.00 sec)
SKIP CONSTRAINT ERRORS
SKIP CONSTRAINT ERRORS is inclusive of SKIP DUPLICATE KEY ERRORS if REPLACE is not specified. It also applies to rows that violate a column's NOT NULL constraint and fields that contain invalid JSON or Geospatial values, and handles the offending rows in the following ways:
NOT NULL Constraint
- If a column in the destination table specifies a NOT NULL constraint, and
- The source file contains one or more rows with a null value for the constraint column, then
- The offending row(s) will be discarded and will not be inserted into the destination table.
Invalid JSON or Geospatial Data
- If a column in the destination table specifies a JSON, GEOGRAPHYPOINT, or GEOGRAPHY data type, and
- The source file contains one or more rows with invalid values for fields of these types, then
- The offending row(s) will be discarded and will not be inserted into the destination table.
SKIP CONSTRAINT ERRORS can also be combined with the REPLACE clause.
Example
Create a new table with a JSON column type that also has a NOT NULL constraint:
CREATE TABLE orders
-> (
-> id BIGINT PRIMARY KEY,
-> customer_id INT,
-> item_description VARCHAR(255),
-> order_properties JSON NOT NULL
-> );
****
Query OK, 0 rows affected (0.02 sec)
The following CSV file will be loaded into this table as orders.csv. Note the malformed JSON in line 2, as well as a null value (\N) for JSON in line 4:
1,372,Apples,{"order-date":"2016-05-09"}
2,138,Pears,{"order-date"}
3,236,Bananas,{"order-date":"2016-06-23"}
4,307,Oranges,\N
Load the data into the table:
LOAD DATA INFILE 'orders.csv'
-> SKIP CONSTRAINT ERRORS
-> INTO TABLE orders
-> FIELDS TERMINATED BY ',';
****
Query OK, 2 rows affected (0.04 sec)
Note that only 2 rows were inserted even though 4 rows were present in the source file. Line 2 contained malformed JSON, and line 4 contained an invalid null value. You can verify that both of these offending rows were not inserted by executing SHOW LOAD WARNINGS:
SHOW LOAD WARNINGS\G
****
*************************** 1. row ***************************
Unix Timestamp: 1485158775521220
Level: Warning
Code: 1844
Message: Invalid JSON value for column 'order_properties'
Line Contents: 2,138,Pears,{"order-date"}
Line Number: 2
Host: 127.0.0.1
Port: 3307
*************************** 2. row ***************************
Unix Timestamp: 1485158775521227
Level: Warning
Code: 1263
Message: NULL supplied to NOT NULL column 'order_properties' at row 4
Line Contents: 4,307,Oranges,\N
Line Number: 4
Host: 127.0.0.1
Port: 3307
2 rows in set (0.00 sec)
SKIP ALL ERRORS
SKIP ALL ERRORS is inclusive of SKIP DUPLICATE KEY ERRORS and SKIP CONSTRAINT ERRORS, in addition to any parsing error. Offending rows are handled in the following way:
- If one or more rows in the source file cause ... DUPLICATE KEY ... or ... CONSTRAINT ... errors, or
- If one or more rows in the source file cause parsing errors such as invalid delimiters or an invalid number of fields, then
- The offending row(s) will be discarded and will not be inserted into the destination table.
SKIP ALL ERRORS can also be combined with REPLACE.
Example
Create a new table with a JSON column type that also has a NOT NULL constraint:
CREATE TABLE orders
-> (
-> id BIGINT PRIMARY KEY,
-> customer_id INT,
-> item_description VARCHAR(255),
-> order_properties JSON NOT NULL
-> );
****
Query OK, 0 rows affected (0.02 sec)
The following CSV file will be loaded into this table as orders.csv. There are three things wrong with this file:
- Line 2 contains only 3 fields
- Line 3 has a duplicate primary key
- Line 4 has a null value for a NOT NULL constraint
1,372,Apples,{"order-date":"2016-05-09"}
2,138,Pears
1,236,Bananas,{"order-date":"2016-06-23"}
4,307,Oranges,\N
Load the data into the table:
LOAD DATA INFILE 'orders.csv'
-> SKIP ALL ERRORS
-> INTO TABLE orders
-> FIELDS TERMINATED BY ',';
****
Query OK, 1 row affected (0.04 sec)
Only 1 row was written, despite the source file containing 4 rows. Line 2 was dropped because it contained an invalid number of fields, line 3 was dropped because it contained a duplicate primary key, and line 4 was dropped because it contained a null value for a NOT NULL constraint.
You can verify which lines were dropped by executing SHOW LOAD WARNINGS:
SHOW LOAD WARNINGS\G
****
*************************** 1. row ***************************
Unix Timestamp: 1485160945860979
Level: Warning
Code: 1261
Message: Row 2 doesn't contain data for all columns
Line Contents: 2,138,Pears
Line Number: 2
Host: 127.0.0.1
Port: 3307
*************************** 2. row ***************************
Unix Timestamp: 1485160945860986
Level: Warning
Code: 1263
Message: NULL supplied to NOT NULL column 'order_properties' at row 4
Line Contents: 4,307,Oranges,\N
Line Number: 4
Host: 127.0.0.1
Port: 3307
*************************** 3. row ***************************
Unix Timestamp: 1485160945860990
Level: Warning
Code: 1062
Message: Duplicate entry for unique key
Line Contents: 1,236,Bananas,{"order-date":"2016-06-23"}
Line Number: 3
Host: 127.0.0.1
Port: 3307
3 rows in set (0.00 sec)
REPLACE Error Handling
MemSQL's REPLACE behavior allows you to replace existing rows with new rows; only those rows that have the same value for a primary key or unique index as the input rows are replaced. In the case of an error that arises due to a duplicate key value, it first deletes the conflicting row in the destination table that has the duplicate key value and then inserts the new row from the source file.
LOAD DATA inserts source file rows into the destination table in the order in which the rows appear in the source file. When REPLACE is specified, source files that contain duplicate unique or primary key values will be handled in the following way:
- If the destination table's schema specifies a unique or primary key column, and
- The source file contains a row with the same primary or unique key value as the destination table, then
- The row in the destination table that has the same unique or primary key value as the row in the source file will be deleted, and the new row from the source file that matches the primary key value will be inserted into the destination table.
Note: If the source file contains multiple rows with the same primary or unique key value, only the last such row in the source file replaces the existing row in the destination table.
Note: REPLACE cannot be combined with SKIP DUPLICATE KEY ERRORS. The default behavior of MemSQL is to throw an error for duplicate keys. However, neither REPLACE nor SKIP DUPLICATE KEY ERRORS throws a duplicate key error; REPLACE replaces the old row with the new row, while SKIP DUPLICATE KEY ERRORS discards the new row and retains the old row.
Example
Create a new table with a PRIMARY KEY column:
CREATE TABLE orders(
id BIGINT PRIMARY KEY,
customer_id INT,
item_description VARCHAR(255),
order_time DATETIME NOT NULL
);
A row with a primary key 4 is inserted as follows:
INSERT INTO orders VALUES(4,236,"Bananas","2016-06-23");
The following CSV file is loaded into the table as orders.csv. Note the duplicate primary key value of 4 in line 2:
1,372,Apples,2016-05-09
4,138,Pears,2016-07-14
3,307,Oranges,2016-07-31
Load the data into the table:
LOAD DATA INFILE 'orders.csv'
REPLACE
INTO TABLE orders
FIELDS TERMINATED BY ','
ERRORS HANDLE 'orders_errors';
Line 2 in the source file contained a duplicate primary key 4. The REPLACE error handler deletes the row 4,236,"Bananas",2016-06-23 in the destination table and replaces it with the value 4,138,Pears,2016-07-14 from the source file.
IGNORE Behavior
MemSQL's IGNORE behavior is identical to MySQL's IGNORE behavior, and exists only to support backwards compatibility with applications written for MySQL. IGNORE either discards malformed rows, discards extra fields, or replaces invalid values with default data type values. In addition, if an inserted row would have produced an error if IGNORE was not specified, it will be converted to a warning instead.
Consequences of Using IGNORE Instead of SKIP ERRORS
Unlike SKIP ... ERRORS, which discards offending rows, IGNORE may change the inserted row's data to ensure that it adheres to the table schema. This behavior can have serious repercussions for the data integrity of the destination table.
In a best case scenario where a malformed row uses the proper delimiters and contains the correct number of fields, the row can be partially salvaged. Any invalid values are updated with default values, and the modified row is written to the destination table. The result is that at least some of the source data was written to the destination table.
However, the worst case scenario can be severe. For example, if a row’s values are separated by an invalid delimiter, each field is updated with meaningless default values and the modified row is written to the destination table. For the sake of the table’s data integrity, it would have been better if the offending row was discarded. But a row with meaningless data was inserted instead.
Due to the potential consequences of using IGNORE, in most cases SKIP ... ERRORS is a better option. To understand IGNORE's behavior for each error scenario, continue reading the sections below:
- Duplicate unique or primary key values
- Values with invalid types according to the destination table’s schema
- Rows that contain an invalid number of fields
Duplicate Unique or Primary Key Values
When IGNORE is specified, source files that contain duplicate unique or primary key values will be handled in the following way:
- If the destination table’s schema specifies a unique or primary key column, and
- The source file contains one or more rows with a duplicate key value that already exists in the destination table or exists elsewhere in the source file, then
- Every duplicate row in the source file will be discarded (ignored) and will not be inserted into the destination table.
Example
Create a new table with a PRIMARY KEY column:
CREATE TABLE orders
-> (
-> id BIGINT PRIMARY KEY,
-> customer_id INT,
-> item_description VARCHAR(255),
-> order_time DATETIME NOT NULL
-> );
****
Query OK, 0 rows affected (0.00 sec)
The following CSV file will be loaded into this table as orders.csv. Note the duplicate primary key value of 2 in line 3:
1,372,Apples,2016-05-09
2,138,Pears,2016-07-14
2,236,Bananas,2016-06-23
3,307,Oranges,2016-07-31
Load the data into the table:
LOAD DATA INFILE 'orders.csv'
-> IGNORE
-> INTO TABLE orders
-> FIELDS TERMINATED BY ',';
****
Query OK, 3 rows affected (0.11 sec)
Note that only 3 rows were inserted even though 4 rows were present in the source file. Line 3 in the source file contained a duplicate primary key, and you can verify that it was not inserted by displaying load warnings:
SHOW LOAD WARNINGS\G
****
*************************** 1. row ***************************
Unix Timestamp: 1484789864957010
Level: Warning
Code: 1062
Message: Duplicate entry for unique key
Line Contents: 2,236,Bananas,2016-06-23
Line Number: 3
Host: 127.0.0.1
Port: 3307
1 row in set (0.00 sec)
Line 3 in the source file contained a duplicate primary key and was discarded because line 2 was inserted first. You can verify by checking the contents of orders:
SELECT * FROM orders ORDER BY 1;
****
+----+-------------+------------------+---------------------+
| id | customer_id | item_description | order_time |
+----+-------------+------------------+---------------------+
| 1 | 372 | Apples | 2016-05-09 00:00:00 |
| 2 | 138 | Pears | 2016-07-14 00:00:00 |
| 3 | 307 | Oranges | 2016-07-31 00:00:00 |
+----+-------------+------------------+---------------------+
3 rows in set (0.12 sec)
Values with Invalid Types According to the Destination Table’s Schema
When IGNORE is specified, source files that contain rows with invalid types that violate the destination table's schema will be handled in the following way:
- If the source file contains one or more rows with values that do not adhere to the destination table’s schema,
- Each value of an invalid type in a row will be replaced with the default value of the appropriate type, and
- The modified row(s) will be inserted into the destination table.
IGNORE behaves in a potentially unexpected way for columns that have a DEFAULT value specified. When an invalid value in the inserted row is replaced with the default value of the column's type, the column's DEFAULT value is ignored. Instead, the default value for the column's data type is used.
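A hedged sketch of that behavior (the table, file, and values are hypothetical):
CREATE TABLE widgets(id BIGINT PRIMARY KEY, qty INT DEFAULT 999);
-- Suppose widgets.csv contains:
--   1,10
--   2,oops
LOAD DATA INFILE 'widgets.csv'
IGNORE
INTO TABLE widgets
FIELDS TERMINATED BY ',';
-- Row 2 would be inserted as (2, 0): the INT type default, not the declared DEFAULT 999.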
Example
Create a new table with a PRIMARY KEY column:
CREATE TABLE orders
-> (
-> id BIGINT PRIMARY KEY,
-> customer_id INT,
-> item_description VARCHAR(255),
-> order_time DATETIME NOT NULL
-> );
****
Query OK, 0 rows affected (0.01 sec)
The following CSV file will be loaded into this table as orders.csv. Line 4 contains a NULL value for order_time, whereas the table schema does not allow NULL values for this field.
1,372,Apples,2016-05-09
2,138,Pears,2016-07-14
3,236,Bananas,2016-06-23
4,307,Oranges,\N
Load the data into the table:
LOAD DATA INFILE 'orders.csv'
-> IGNORE
-> INTO TABLE orders
-> FIELDS TERMINATED BY ',';
****
Query OK, 4 rows affected (0.01 sec)
Note that 4 rows were inserted despite the fact that line 4 in the source file contained a null value for a NOT NULL column. To see how this error was handled, execute SHOW LOAD WARNINGS:
SHOW LOAD WARNINGS\G
****
*************************** 1. row ***************************
Unix Timestamp: 1485204211720050
Level: Warning
Code: 1263
Message: NULL supplied to NOT NULL column 'order_time' at row 4
Line Contents: 4,307,Oranges,\N
Line Number: 4
Host: 127.0.0.1
Port: 3307
1 row in set (0.00 sec)
To see what was inserted by replacing the invalid DATETIME value with a default value, query the table:
SELECT * FROM orders ORDER BY 1;
****
+----+-------------+------------------+---------------------+
| id | customer_id | item_description | order_time |
+----+-------------+------------------+---------------------+
| 1 | 372 | Apples | 2016-05-09 00:00:00 |
| 2 | 138 | Pears | 2016-07-14 00:00:00 |
| 3 | 236 | Bananas | 2016-06-23 00:00:00 |
| 4 | 307 | Oranges | 0000-00-00 00:00:00 |
+----+-------------+------------------+---------------------+
4 rows in set (0.09 sec)
In this example, the invalid null DATETIME value was replaced with its default value: 0000-00-00 00:00:00.
Rows That Contain an Invalid Number of Fields
When IGNORE is specified, source files that contain rows with an invalid number of fields will be handled in one of two ways:
Too Few Fields
- If the source file contains one or more rows with too few fields according to the destination table’s schema,
- Each row’s empty field(s) will be updated with default values, and
- The row will be inserted into the destination table.
Too Many Fields
- If the source file contains one or more rows with too many fields according to the destination table’s schema,
- Each extra field in the row(s) will be discarded (ignored), and
- The row will be inserted into the destination table.
Example
Create a new table with a PRIMARY KEY column:
CREATE TABLE orders
-> (
-> id BIGINT PRIMARY KEY,
-> customer_id INT,
-> item_description VARCHAR(255),
-> order_time DATETIME NOT NULL
-> );
****
Query OK, 0 rows affected (0.00 sec)
The following CSV file will be loaded into this table as orders.csv. There are two things wrong with this file:
- Line 2 contains only 3 fields instead of 4 and does not have a DATETIME value
- Line 4 contains an extra field, for a total of 5
1,372,Apples,2016-05-09
2,138,Pears
3,236,Bananas,2016-06-23
4,307,Oranges,2016-07-31,Berries
Load the data into the table:
LOAD DATA INFILE 'orders.csv'
-> IGNORE
-> INTO TABLE orders
-> FIELDS TERMINATED BY ',';
****
Query OK, 4 rows affected (0.01 sec)
Note that 4 rows were inserted despite the invalid number of fields for two of the rows. To see how the errors were handled, you can display load warnings:
SHOW LOAD WARNINGS\G
****
*************************** 1. row ***************************
Unix Timestamp: 1485204497796029
Level: Warning
Code: 1261
Message: Row 2 doesn't contain data for all columns
Line Contents: 2,138,Pears
Line Number: 2
Host: 127.0.0.1
Port: 3307
*************************** 2. row ***************************
Unix Timestamp: 1485204497796035
Level: Warning
Code: 1262
Message: Row 4 was truncated; it contained more data than there were input columns
Line Contents: 4,307,Oranges,2016-07-31,Berries
Line Number: 4
Host: 127.0.0.1
Port: 3307
2 rows in set (0.00 sec)
Note that there is a warning for the missing value in row 2 and the extra value in row 4. To see how the data was inserted, query the table:
SELECT * FROM orders ORDER BY 1;
****
+----+-------------+------------------+---------------------+
| id | customer_id | item_description | order_time |
+----+-------------+------------------+---------------------+
| 1 | 372 | Apples | 2016-05-09 00:00:00 |
| 2 | 138 | Pears | 0000-00-00 00:00:00 |
| 3 | 236 | Bananas | 2016-06-23 00:00:00 |
| 4 | 307 | Oranges | 2016-07-31 00:00:00 |
+----+-------------+------------------+---------------------+
4 rows in set (0.02 sec)
Line 2 did not have a DATETIME value, so the default value for its type was inserted instead. Line 4's extra value was discarded, and otherwise the row was inserted with the expected data.
Performance Considerations
Shard Keys
Loading data into a table with a shard key requires reading the necessary columns on the aggregator to compute the shard key before sending data to the leaves. For CSV LOAD DATA only, it is recommended that columns included in the shard key appear earlier in input rows, if either the shard key design or the input format is flexible. Order does not significantly affect the performance of Avro or JSON LOAD DATA.
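For illustration, a sketch (table and file names are hypothetical) where the shard key column comes first in each input row:
CREATE TABLE clicks(
  user_id BIGINT,
  url VARCHAR(2048),
  ts DATETIME,
  SHARD KEY (user_id)
);
-- user_id, the shard key, appears first in each line of clicks.csv,
-- matching the recommendation above for CSV loads.
LOAD DATA INFILE 'clicks.csv'
INTO TABLE clicks
FIELDS TERMINATED BY ','
(user_id, url, ts);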
Keyless Sharding
Loading data into a keylessly sharded table (no shard key is declared, or shard() is specified) will result in batches of data loaded into different partitions, in a round-robin fashion.
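A minimal sketch of a keylessly sharded table (names hypothetical); LOAD DATA will spread its batches across partitions in round-robin order:
-- No shard key is declared, so the table is keylessly sharded.
CREATE TABLE events(event_time DATETIME, event_name VARCHAR(64));
LOAD DATA INFILE 'events.csv'
INTO TABLE events
FIELDS TERMINATED BY ',';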
Retrieve loading status
The information_schema.LMV_LOAD_DATA_STATUS table reports information about rows and bytes read by in-progress LOAD DATA queries.
It also reports activity and database names, which you can use to find corresponding rows in workload profiling tables. See Management View Reference for more details.
Result sets will only be returned if LMV_LOAD_DATA_STATUS is queried on the same aggregator as the in-progress LOAD DATA queries.
information_schema.LMV_LOAD_DATA_STATUS Table Schema
Column Name | Description
---|---
ID | The connection ID.
ACTIVITY_NAME | The name of the database activity.
DATABASE_NAME | The name of the database associated with the file being loaded into the cluster.
BYTES_READ | Bytes read from the input file stream.
ROWS_READ | A count of rows read in from the source file (including skipped rows).
SELECT * FROM information_schema.LMV_LOAD_DATA_STATUS;
****
+------+------------------------------------+---------------+------------+-----------+
| ID | ACTIVITY_NAME | DATABASE_NAME | BYTES_READ | ROWS_READ |
+------+------------------------------------+---------------+------------+-----------+
| 2351 | load_data_company_0e8dec6d07d9cba5 | trades | 94380647 | 700512 |
+------+------------------------------------+---------------+------------+-----------+
1 row in set (0.01 sec)