Outdated Version

You are viewing an older version of this section. View current production version.

LOAD DATA

Load data stored in a text file into a table.

Syntax

LOAD DATA [LOCAL] INFILE 'file_name'
  [REPLACE | IGNORE | SKIP { ALL | CONSTRAINT | DUPLICATE KEY } ERRORS]
  INTO TABLE tbl_name
  [CHARACTER SET charset_name]
  [{FIELDS | COLUMNS}
  [TERMINATED BY 'string']
      [[OPTIONALLY] ENCLOSED BY 'char']
      [ESCAPED BY 'char']
  ]
  [LINES
  [STARTING BY 'string']
  [TERMINATED BY 'string']
  ]
  [IGNORE number LINES]
  [ (col_name, ...) ]
  [SET col_name = expr,...]
  [WHERE expr,...]
  [MAX_ERRORS number]

Remarks

To understand LOAD DATA's error handling behavior, see the section below.

If no FIELDS or LINES clause is specified, then MemSQL uses the following defaults:

FIELDS TERMINATED BY '\t'
ENCLOSED BY ''
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
STARTING BY ''

The SET clause allows you to set columns using specific values or expressions with user defined variables. For example, if your input file has 9 columns but the table has a 10th column called foo, you can add SET foo=0 or SET foo=@myVariable. Note that column names may only be used on the left side of SET expressions.

The WHERE clause allows you to do filtering on incoming data. Only rows that satisfy the expression in the WHERE clause will be loaded into MemSQL. For an example of how to use the WHERE clause, see the examples section below.

The behavior of MemSQL’s LOAD DATA command has several functional differences from MySQL’s command:

  • LOAD DATA will load the data into MemSQL in parallel to maximize performance. This makes LOAD DATA in MemSQL much faster on machines with a larger number of processors.
  • LOAD DATA supports loading compressed .gz files.
  • The only supported charset_name is utf8.
  • While it is possible to execute LOAD DATA on leaf nodes, it should only be run on master aggregator or child aggregator node types. See Node Requirements for MemSQL Commands for more information. Note that when running this command on reference tables you must connect to the master aggregator.

The mysqlimport utility can also be used to import data into MemSQL. mysqlimport uses LOAD DATA internally.

MemSQL stores information about errors encountered during each LOAD DATA operation, but the number of errors is limited to 1000 by default. When this limit is reached, the load fails. This prevents out-of-memory issues when unintentionally loading large files with incorrect format or an incorrect LOAD DATA statement. Use MAX_ERRORS at the end of the statement to change this limit.

Writing to multiple databases in a transaction is not supported.

Examples

If the order of columns in the table is different from the order in the source file, you can name them explicitly. In this example, the columns are loaded in reverse order:

LOAD DATA INFILE 'foo.tsv'
INTO TABLE foo (fourth, third, second, first);

You can also skip columns in the source file using the “@” sign. In this example only the first and fourth columns are imported into table foo:

LOAD DATA INFILE 'foo.tsv'
INTO TABLE foo (bar, @, @, baz);

The default column delimiter is the tab (“t”) character, ASCII code 09. You can specify a different delimiter, even multi-character delimiters, with the COLUMNS TERMINATED BY clause:

LOAD DATA INFILE 'foo.csv'
INTO TABLE foo
COLUMNS TERMINATED BY ',';

The following example demonstrates loading a file that has unusual column separators (|||):

LOAD DATA INFILE 'foo.oddformat'
INTO TABLE foo
COLUMNS TERMINATED BY '|||';

You can also filter out unwanted rows using the WHERE clause. In this example only rows where bar is equal to 5 will be loaded. All other rows will be discarded:

LOAD DATA INFILE 'foo.oddformat'
INTO TABLE foo (bar, baz)
WHERE bar = 5;

Complex transformations can be performed in both the ‘SET’ and ‘WHERE’ clauses. For example, if you have an input file with a EventDate field and an EventId field:

10-1-2016,1
4-15-2016,2
1-10-2017,3
4-10-2017,4

You want to only load the rows with a date that is within three months from a certain date, 10/15/2016, for instance. This can be accomplished by the following:


CREATE TABLE foo (EventDate date, EventId int);

LOAD DATA INFILE 'date_event.csv'
INTO TABLE foo
FIELDS TERMINATED BY ','
(@EventDate, EventId)
SET EventDate = STR_TO_DATE(@EventDate, '%m-%d-%Y')
WHERE ABS(MONTHS_BETWEEN(EventDate, date('2016-10-15'))) < 3;

SELECT * FROM t;
+------------+---------+
| EventDate  | EventId |
+------------+---------+
| 2016-10-01 |       1 |
| 2017-01-10 |       3 |
+------------+---------+

While both column names and variables can be referenced in the WHERE clause column names can only be assigned to in the SET clause. The scope of these clauses is restricted to the current row and therefore SELECT statements cannot be evaluated.

Differences Between SKIP ERRORS and IGNORE

If you attempt to insert invalid data into a table, MemSQL’s default behavior is to return a client error. As a result, no data will be inserted into the table. This behavior applies to any LOAD DATA, INSERT, INSERT ... SELECT, or CREATE PIPELINE statement.

However, if you want MemSQL to automatically handle errors during insertion, the following options exist:

  • LOAD DATA [LOCAL] INFILE 'file_name' SKIP { ALL | CONSTRAINT | DUPLICATE KEY } ERRORS
  • LOAD DATA [LOCAL] INFILE 'file_name' IGNORE

The error handling behavior of SKIP ... ERRORS and IGNORE is very different, and may produce unexpected results. It’s important to understand how each option affects the way data is inserted into the destination table.

Info

For most scenarios, SKIP ... ERRORS should be used instead of IGNORE. If you use IGNORE and are uncertain how some error cases may be handled, unintended and unexpected data corruption in the destination table may occur. See the IGNORE Behavior section for more information.

SKIP ERRORS Behavior

MemSQL’s SKIP ... ERRORS behavior allows you to specify an error scenario that, when encountered, discards an offending row. Three kinds of error scenarios can be skipped:

  • SKIP DUPLICATE KEY ERRORS: Any row in the source data that contains a duplicate unique or primary key will be discarded. If the row contains invalid data other than a duplicate key, an error will be generated. See SKIP DUPLICATE KEY ERRORS below.
  • SKIP CONSTRAINT ERRORS: Inclusive of SKIP DUPLICATE KEY ERRORS. If a row violates a column’s NOT NULL constraint, or the row contains invalid JSON or Geospatial values, the row will be discarded. If the row contains invalid data outside the scope of constraint or invalid value errors, an error will be generated. See SKIP CONSTRAINT ERRORS below.
  • SKIP ALL ERRORS: Inclusive of SKIP DUPLICATE KEY ERRORS and SKIP CONSTRAINT ERRORS. Also includes any parsing errors in the row caused by issues such as an invalid number of fields. See SKIP ALL ERRORS below.

If the specified error type is encountered when inserting a row, it will generate a warning instead of an error, which can be displayed by executing a SHOW LOAD WARNINGS statement.

SKIP DUPLICATE KEY ERRORS

When SKIP DUPLICATE KEY ERRORS is specified, source files that contain duplicate unique or primary key values will be handled in the following way:

  • If the destination table’s schema specifies a unique or primary key column, and
  • The source file contains one or more rows with a duplicate key value that already exists in the destination table or exists elsewhere in the source file, then
  • Every duplicate row in the source file will be discarded and will not be inserted into the destination table.

SKIP DUPLICATE KEY ERRORS cannot be combined with REPLACE.

Example

Create a new table with a PRIMARY KEY column:

memsql> CREATE TABLE orders
    -> (
    -> id BIGINT PRIMARY KEY,
    -> customer_id INT,
    -> item_description VARCHAR(255),
    -> order_time TIMESTAMP NOT NULL
    -> );
Query OK, 0 rows affected (0.00 sec)

The following CSV file will loaded be into this table as orders.csv. Note the duplicate primary key value of 2 in line 3:

1,372,Apples,2016-05-09
2,138,Pears,2016-07-14
2,236,Bananas,2016-06-23
3,307,Oranges,2016-07-31

Load the data into the table:

memsql> LOAD DATA INFILE 'orders.csv'
    -> SKIP DUPLICATE KEY ERRORS
    -> INTO TABLE orders
    -> FIELDS TERMINATED BY ',';
Query OK, 3 rows affected (0.11 sec)

Note that only 3 rows were inserted even though 4 rows were present in the source file. Line 3 in the source file contained a duplicate primary key, and you can verify that it was not inserted by displaying load warnings:

memsql> SHOW LOAD WARNINGS\G;
*************************** 1. row ***************************
Unix Timestamp: 1485157780436644
         Level: Warning
          Code: 1062
       Message: Duplicate entry for unique key
 Line Contents: 2,236,Bananas,2016-06-23

   Line Number: 3
          Host: 127.0.0.1
          Port: 3307
1 row in set (0.00 sec)

SKIP CONSTRAINT ERRORS

SKIP CONSTRAINT ERRORS is inclusive of SKIP DUPLICATE KEY ERRORS if REPLACE is not specified. It also applies to rows that violate a column’s NOT NULL constraint and fields that contain invalid JSON or Geospatial values, and handles the offending rows in the following ways:

NOT NULL Constraint

  • If a column in the destination table specifies a NOT NULL constraint, and
  • The source file contains one or more rows with a null value for the constraint column, then
  • The offending row(s) will be discarded and will not be inserted into the destination table.

Invalid JSON or Geospatial Data

  • If a column in the destination table specifies a JSON, GEOGRAPHYPOINT, or GEOGRAPHY data type, and
  • The source file contains one or more rows with invalid values for fields of these types, then
  • The offending row(s) will be discarded and will not be inserted into the destination table.

SKIP CONSTRAINT ERRORS can also be combined with the REPLACE clause.

Example

Create a new table with a JSON column type that also has a NOT NULL constraint:

memsql> CREATE TABLE orders
    -> (
    -> id BIGINT PRIMARY KEY,
    -> customer_id INT,
    -> item_description VARCHAR(255),
    -> order_properties JSON NOT NULL
    -> );
Query OK, 0 rows affected (0.02 sec)

The following CSV file will loaded be into this table as orders.csv. Note the malformed JSON in line 2, as well as a null value (\N) for JSON in line 3:

1,372,Apples,{"order-date":"2016-05-09"}
2,138,Pears,{"order-date"}
3,236,Bananas,{"order-date":"2016-06-23"}
4,307,Oranges,\N

Load the data into the table:

memsql> LOAD DATA INFILE 'orders.csv'
    -> SKIP CONSTRAINT ERRORS
    -> INTO TABLE orders
    -> FIELDS TERMINATED BY ',';
Query OK, 2 rows affected (0.04 sec)

Note that only 2 rows were inserted even though 4 rows were present in the source file. Line 2 contained malformed JSON, and Line 4 contained an invalid null value. You can verify that both of these offending rows were not inserted by executing SHOW LOAD WARNINGS:

memsql> SHOW LOAD WARNINGS\G;
*************************** 1. row ***************************
Unix Timestamp: 1485158775521220
         Level: Warning
          Code: 1844
       Message: Invalid JSON value for column 'order_properties'
 Line Contents: 2,138,Pears,{"order-date"}

   Line Number: 2
          Host: 127.0.0.1
          Port: 3307
*************************** 2. row ***************************
Unix Timestamp: 1485158775521227
         Level: Warning
          Code: 1263
       Message: NULL supplied to NOT NULL column 'order_properties' at row 4
 Line Contents: 4,307,Oranges,\N

   Line Number: 4
          Host: 127.0.0.1
          Port: 3307
2 rows in set (0.00 sec)

SKIP ALL ERRORS

SKIP ALL ERRORS is inclusive of SKIP DUPLICATE KEY ERRORS and SKIP CONSTRAINT ERRORS in addition to any parsing error. Offending rows are handled in the following way:

  • If one or more rows in the source file cause ... DUPLICATE KEY ... or ... CONSTRAINT ... errors, or
  • If one or more rows in the source file cause parsing errors such as invalid delimiters or an invalid number of fields,
  • The offending row(s) will be discarded and will not be inserted into the destination table.

SKIP ALL ERRORS can also be combined with REPLACE.

Example

Create a new table with a JSON column type that also has a NOT NULL constraint:

memsql> CREATE TABLE orders
    -> (
    -> id BIGINT PRIMARY KEY,
    -> customer_id INT,
    -> item_description VARCHAR(255),
    -> order_properties JSON NOT NULL
    -> );
Query OK, 0 rows affected (0.02 sec)

The following CSV file will loaded be into this table as orders.csv. There are three things wrong with this file:

  • Line 2 contains only 3 fields
  • Line 3 has a duplicate primary key
  • Line 4 has a null value for a NOT NULL constraint
1,372,Apples,{"order-date":"2016-05-09"}
2,138,Pears
1,236,Bananas,{"order-date":"2016-06-23"}
4,307,Oranges,\N

Load the data into the table:

memsql> LOAD DATA INFILE 'orders.csv'
    -> SKIP ALL ERRORS
    -> INTO TABLE orders
    -> FIELDS TERMINATED BY ',';
Query OK, 1 row affected (0.04 sec)

Only 1 row was written, despite the source file containing 4 rows. Line 2 was dropped because it contained an invalid number of fields, Line 3 was dropped because it contained a duplicate primary key, and line 4 was dropped because it contained a null value for a NOT NULL constraint.

You can verify which lines were dropped by executing SHOW LOAD WARNINGS:

memsql> SHOW LOAD WARNINGS\G;
*************************** 1. row ***************************
Unix Timestamp: 1485160945860979
         Level: Warning
          Code: 1261
       Message: Row 2 doesn't contain data for all columns
 Line Contents: 2,138,Pears

   Line Number: 2
          Host: 127.0.0.1
          Port: 3307
*************************** 2. row ***************************
Unix Timestamp: 1485160945860986
         Level: Warning
          Code: 1263
       Message: NULL supplied to NOT NULL column 'order_properties' at row 4
 Line Contents: 4,307,Oranges,\N

   Line Number: 4
          Host: 127.0.0.1
          Port: 3307
*************************** 3. row ***************************
Unix Timestamp: 1485160945860990
         Level: Warning
          Code: 1062
       Message: Duplicate entry for unique key
 Line Contents: 1,236,Bananas,{"order-date":"2016-06-23"}

   Line Number: 3
          Host: 127.0.0.1
          Port: 3307
3 rows in set (0.00 sec)

IGNORE Behavior

MemSQL’s IGNORE behavior is identical to MySQL’s IGNORE behavior, and exists only to support backwards compatibility with applications written for MySQL. IGNORE either discards malformed rows, discards extra fields, or replaces invalid values with default data type values. In addition, if an inserted row would have produced an error if IGNORE was not specified, it will be converted to a warning instead.

Consequences of Using IGNORE Instead of SKIP ERRORS

Unlike SKIP ... ERRORS which discards offending rows, IGNORE may change the inserted row’s data to ensure that it adheres to the table schema. This behavior can have serious repercussions for the data integrity of the destination table.

In a best case scenario where a malformed row uses the proper delimiters and contains the correct number of fields, the row can be partially salvaged. Any invalid values are updated with default values, and the modified row is written to the destination table. The result is that at least some of the source data was written to the destination table.

However, the worst case scenario can be severe. For example, if a row’s values are separated by an invalid delimiter, each field is updated with meaningless default values and the modified row is written to the destination table. For the sake of the table’s data integrity, it would have been better if the offending row was discarded. But a row with meaningless data was inserted instead.

Due to the potential consequences of using IGNORE, in most cases SKIP ... ERRORS is a better option. To understand IGNORE's behavior for each error scenario, continue reading the sections below:

Duplicate Unique or Primary Key Values

When IGNORE is specified, source files that contain duplicate unique or primary key values will be handled in the following way:

  • If the destination table’s schema specifies a unique or primary key column, and
  • The source file contains one or more rows with a duplicate key value that already exists in the destination table or exists elsewhere in the source file, then
  • Every duplicate row in the source file will be discarded (ignored) and will not be inserted into the destination table.
Example

Create a new table with a PRIMARY KEY column:

memsql> CREATE TABLE orders
    -> (
    -> id BIGINT PRIMARY KEY,
    -> customer_id INT,
    -> item_description VARCHAR(255),
    -> order_time DATETIME NOT NULL
    -> );
Query OK, 0 rows affected (0.00 sec)

The following CSV file will loaded be into this table as orders.csv. Note the duplicate primary key value of 2 in line 3:

1,372,Apples,2016-05-09
2,138,Pears,2016-07-14
2,236,Bananas,2016-06-23
3,307,Oranges,2016-07-31

Load the data into the table:

memsql> LOAD DATA INFILE 'orders.csv'
    -> IGNORE
    -> INTO TABLE orders
    -> FIELDS TERMINATED BY ',';
Query OK, 3 rows affected (0.11 sec)

Note that only 3 rows were inserted even though 4 rows were present in the source file. Line 3 in the source file contained a duplicate primary key, and you can verify that it was not inserted by displaying load warnings:

memsql> SHOW LOAD WARNINGS\G;
*************************** 1. row ***************************
Unix Timestamp: 1484789864957010
         Level: Warning
          Code: 1062
       Message: Duplicate entry for unique key
 Line Contents: 2,236,Bananas,2016-06-23

   Line Number: 3
          Host: 127.0.0.1
          Port: 3307
1 row in set (0.00 sec)

Line 3 in the source file contained a duplicate primary key and was discarded because line 2 was inserted first. You can verify by checking the contents of orders:

memsql> SELECT * FROM orders ORDER BY 1;
+----+-------------+------------------+---------------------+
| id | customer_id | item_description | order_time          |
+----+-------------+------------------+---------------------+
|  1 |         372 | Apples           | 2016-05-09 00:00:00 |
|  2 |         138 | Pears            | 2016-07-14 00:00:00 |
|  3 |         307 | Oranges          | 2016-07-31 00:00:00 |
+----+-------------+------------------+---------------------+
3 rows in set (0.12 sec)

Values with Invalid Types According to the Destination Table’s Schema

When IGNORE is specified, source files that contain rows with invalid types that violate the destination table’s schema will be handled in the following way:

  • If the source file contains one or more rows with values that do not adhere to the destination table’s schema,
  • Each value of an invalid type in a row will be replaced with the default value of the appropriate type, and
  • The modified row(s) will be inserted into the destination table.

IGNORE behaves in a potentially unexpected way for columns that have a DEFAULT value specified. When an invalid value in the inserted row is replaced with the default value of the column’s type, the column’s DEFAULT value is ignored. Instead, the default value for the column’s data type is used.

Example

Create a new table with a PRIMARY KEY column:

memsql> CREATE TABLE orders
    -> (
    -> id BIGINT PRIMARY KEY,
    -> customer_id INT,
    -> item_description VARCHAR(255),
    -> order_time DATETIME NOT NULL
    -> );
Query OK, 0 rows affected (0.01 sec)

The following CSV file will loaded be into this table as orders.csv. Line 4 contains a NULL value for order_time, whereas the table schema does not allow NULL values for this field.

1,372,Apples,2016-05-09
2,138,Pears,2016-07-14
3,236,Bananas,2016-06-23
4,307,Oranges,\N

Load the data into the table:

memsql> LOAD DATA INFILE 'orders.csv'
    -> IGNORE
    -> INTO TABLE orders
    -> FIELDS TERMINATED BY ',';
Query OK, 4 rows affected (0.01 sec)

Note that 4 rows were inserted despite the fact that line 4 in the source file contained a null value for a NOT NULL column. To see how this error was handled, execute SHOW LOAD WARNINGS:

memsql> SHOW LOAD WARNINGS\G;
*************************** 1. row ***************************
Unix Timestamp: 1485204211720050
         Level: Warning
          Code: 1263
       Message: NULL supplied to NOT NULL column 'order_time' at row 4
 Line Contents: 4,307,Oranges,\N

   Line Number: 4
          Host: 127.0.0.1
          Port: 3307
1 row in set (0.00 sec)

To see what was inserted by replacing the invalid DATETIME value with a default value, query the table:

memsql> SELECT * FROM orders ORDER BY 1;
+----+-------------+------------------+---------------------+
| id | customer_id | item_description | order_time          |
+----+-------------+------------------+---------------------+
|  1 |         372 | Apples           | 2016-05-09 00:00:00 |
|  2 |         138 | Pears            | 2016-07-14 00:00:00 |
|  3 |         236 | Bananas          | 2016-06-23 00:00:00 |
|  4 |         307 | Oranges          | 0000-00-00 00:00:00 |
+----+-------------+------------------+---------------------+
4 rows in set (0.09 sec)

In this example, the invalid null DATETIME value was replaced with its default value: 0000-00-00 00:00:00.

Rows That Contain an Invalid Number of Fields

When IGNORE is specified, source files that contain rows with an invalid number of fields will be handled in one of two ways:

Too Few Fields

  • If the source file contains one or more rows with too few fields according to the destination table’s schema,
  • Each row’s empty field(s) will be updated with default values, and
  • The row will be inserted into the destination table.

Too Many Fields

  • If the source file contains one or more rows with too many fields according to the destination table’s schema,
  • Each extra field in the row(s) will be discarded (ignored), and
  • The row will be inserted into the destination table.
Example

Create a new table with a PRIMARY KEY column:

memsql> CREATE TABLE orders
    -> (
    -> id BIGINT PRIMARY KEY,
    -> customer_id INT,
    -> item_description VARCHAR(255),
    -> order_time DATETIME NOT NULL
    -> );
Query OK, 0 rows affected (0.00 sec)

The following CSV file will loaded be into this table as orders.csv. There are two things wrong with this file:

  • Line 2 contains only 3 fields instead of 4 and does not have a TIMESTAMP:
  • Line 4 contains an extra field, for a total of 5
1,372,Apples,2016-05-09
2,138,Pears
3,236,Bananas,2016-06-23
4,307,Oranges,2016-07-31,Berries

Load the data into the table:

memsql> LOAD DATA INFILE 'orders.csv'
    -> IGNORE
    -> INTO TABLE orders
    -> FIELDS TERMINATED BY ',';
Query OK, 4 rows affected (0.01 sec)

Note that 4 rows were inserted despite the invalid number of fields for two of the rows. To see how the errors were handled, you can display load warnings:

memsql> SHOW LOAD WARNINGS\G;
*************************** 1. row ***************************
Unix Timestamp: 1485204497796029
         Level: Warning
          Code: 1261
       Message: Row 2 doesn't contain data for all columns
 Line Contents: 2,138,Pears

   Line Number: 2
          Host: 127.0.0.1
          Port: 3307
*************************** 2. row ***************************
Unix Timestamp: 1485204497796035
         Level: Warning
          Code: 1262
       Message: Row 4 was truncated; it contained more data than there were input columns
 Line Contents: 4,307,Oranges,2016-07-31,Berries

   Line Number: 4
          Host: 127.0.0.1
          Port: 3307
2 rows in set (0.00 sec)

Note that there is a warning for the missing value in row 2 and the extra value in row 4. To see how the data was inserted, query the table:

memsql> select * from orders order by 1;
+----+-------------+------------------+---------------------+
| id | customer_id | item_description | order_time          |
+----+-------------+------------------+---------------------+
|  1 |         372 | Apples           | 2016-05-09 00:00:00 |
|  2 |         138 | Pears            | 0000-00-00 00:00:00 |
|  3 |         236 | Bananas          | 2016-06-23 00:00:00 |
|  4 |         307 | Oranges          | 2016-07-31 00:00:00 |
+----+-------------+------------------+---------------------+
4 rows in set (0.02 sec)

Line 2 did not have a DATETIME value, so the default value for its type was inserted instead. Line 4’s extra value was discarded, and otherwise the row was inserted with the expected data.

Performance Considerations

Shard keys Loading data into a table with a shard key requires reading the necessary columns on the aggregator to compute the shard key before sending data to the leaves. It is recommended that columns included in the shard key appear earlier in input rows, if either the shard key design or the input format is flexible.