This tutorial shows you how to build a sample database containing millions of rows of simulated stock trade data. You can use this database for testing the features and functionality of MemSQL. Some sample queries are also provided to show you how MemSQL performs when doing concurrent reads while data is being written to disk.
Step 1: Create the database
Run the following SQL statements to create a new database called trades
which has two tables: trade
and company
. The trades
table is a columnstore table meant to contain analytical data about specific trades on a given company and the company
table is a small rowstore table that provides metadata about that company.
CREATE DATABASE IF NOT EXISTS trades;
USE trades;
DROP TABLE IF EXISTS trade;
CREATE TABLE trade(
id BIGINT NOT NULL,
stock_symbol CHAR(5) NOT NULL,
shares DECIMAL(18,4) NOT NULL,
share_price DECIMAL(18,4) NOT NULL,
trade_time DATETIME(6) NOT NULL,
KEY(stock_symbol) USING CLUSTERED COLUMNSTORE,
SHARD KEY(stock_symbol)
);
DROP TABLE IF EXISTS company;
CREATE TABLE company(
symbol CHAR(5) NOT NULL,
name VARCHAR(50),
last_sale VARCHAR(10),
market_cap VARCHAR(15),
IPO_year INT,
sector VARCHAR(80),
industry VARCHAR(80),
summary_quote VARCHAR(50),
extra VARCHAR(50)
);
Step 2: Load company profile data
Download companylist.csv onto your local machine and then load it into the company
table by either using a MySQL client (such as Sequel Pro, MySQLWorkbench) or by using the LOAD DATA command.
If you are using a MySQL client, open your MySQL client and navigate to the menu item for importing data from a file. Import companylist.csv into the company
table you created in the previous step.
If you are using a terminal, run the LOAD DATA
command to load the data into the company table you created in the previous section.
LOAD DATA INFILE '/path/to/companylist.csv'
INTO TABLE company
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES;
Step 3: Create data generator functionality
The following are several functions used to generate data. The utility function, marketcap_to_DECIMAL
, is used by the seed_trades
stored procedure to convert a string value for the market capitalization of a stock into a decimal value.
DELIMITER //
CREATE OR REPLACE FUNCTION marketcap_to_DECIMAL(s VARCHAR(15))
RETURNS DECIMAL(18,2) AS
DECLARE
m CHAR(1) = SUBSTR(s, LENGTH(s), 1); -- M or B
raw_v DECIMAL(18,2) = SUBSTR(s, 2, LENGTH(s) - 1);
v DECIMAL(18,2) = NULL;
BEGIN
IF m = "B" THEN
v = raw_v * 1000;
ELSE
v = raw_v;
END IF;
RETURN v;
END //
DELIMITER ;
The seed_trades
stored procedure generates rows of “seed” trade data based on an integer input value. This provides a query to rank the companies by market capitalization limited to the top 200.
The stored procedure generates approximately the same number of trades for each company.
DELIMITER //
CREATE OR REPLACE PROCEDURE seed_trades(num_trades INT) RETURNS INT AS
DECLARE
ranked_companies ARRAY(RECORD(symbol CHAR(5), _rank INT));
DECLARE
q QUERY(symbol CHAR(5), _rank INT) =
SELECT symbol, rank() OVER (ORDER BY marketcap_to_DECIMAL(market_cap)) AS _rank
FROM company
WHERE LENGTH(symbol) < 5
ORDER BY _rank DESC LIMIT 200;
i INT = 0;
rank_num INT;
next_id INT = 1;
sym CHAR(5);
price_base DECIMAL(18,4);
current_prices ARRAY(INT);
l ARRAY(RECORD(symbol CHAR(5), _rank INT));
BEGIN
l = collect(q);
FOR r IN l LOOP
i += 1;
rank_num = r._rank;
sym = r.symbol;
price_base = FLOOR(rand() * 50) + 50;
FOR j IN 1..((rank_num / 10) + RAND() * 10) LOOP
INSERT trade VALUES(
next_id,
sym,
FLOOR(1 + RAND() * 10) * 100, -- shares
price_base, -- share_price
DATE_ADD(NOW(), INTERVAL RAND() * 6 HOUR)); -- random time during trading day, roughly
next_id += 1;
IF next_id > num_trades THEN RETURN(next_id); END IF;
END LOOP;
END LOOP;
RETURN(next_id);
END //
DELIMITER ;
This last stored procedure, iter_stocks
, generates additional trade events using a “random walk” to determine the next price for each company. It will store the current price for each company, then add a random number between -1 and 1 to set the new price for each iteration.
Each iteration inserts new records equal to num_shares
. When iter_stocks
completes, the total number of records in the trade
table will be equal to num_shares
* iterations
+ num_shares
.
DELIMITER //
CREATE OR REPLACE PROCEDURE iter_stocks(iterations INT) as
DECLARE
tickers ARRAY(CHAR(5));
prices ARRAY(DECIMAL(18,4));
last_ids ARRAY(bigINT);
counts ARRAY(INT);
next_id bigINT = 1;
ticker CHAR(5);
price DECIMAL(18,4);
c INT;
rand DECIMAL(18,4);
tickers_q QUERY(t CHAR(5), p DECIMAL(18,4), lid BIGINT, c INT) = SELECT stock_symbol, share_price, MIN(id), COUNT(*) FROM trade GROUP BY stock_symbol;
q ARRAY(RECORD(t CHAR(5), p DECIMAL(18,4), lid bigINT, c INT));
q_count QUERY(c INT) = SELECT COUNT(*) FROM trade;
total_c INT;
BEGIN
q = COLLECT(tickers_q);
tickers = CREATE_ARRAY(LENGTH(q));
prices = CREATE_ARRAY(LENGTH(q));
last_ids = CREATE_ARRAY(LENGTH(q));
counts = CREATE_ARRAY(LENGTH(q));
total_c = SCALAR(q_count);
FOR r IN 0..LENGTH(q)-1 LOOP
tickers[r] = q[r].t;
prices[r] = q[r].p;
last_ids[r] = q[r].lid;
counts[r] = q[r].c;
END LOOP;
FOR j IN 0..(iterations-1) LOOP
FOR i IN 0..LENGTH(tickers)-1 LOOP
ticker = tickers[i];
price = prices[i];
next_id = last_ids[i];
c = counts[i];
rand = POW(-1, FLOOR(RAND()*2)) * RAND();
INSERT INTO trade
SELECT id + total_c, stock_symbol, shares, share_price + rand, trade_time FROM trade WHERE stock_symbol = ticker AND id >= next_id;
prices[i] = price + rand;
last_ids[i] = next_id + total_c;
END LOOP;
END LOOP;
END //
DELIMITER ;
Step 4: Generate trade data
Now that you have created the functionality to generate stock ticker data, run the following commands to generate the data. Note, this process may take several minutes depending on your cluster topology.
CALL seed_trades(100000);
CALL iter_stocks(100);
Changing the input value for iter_stocks
defines the size of the table. For example, changing from 100 to 200 doubles the amount of rows.
Step 5: Run analytic queries
In a new shell/sequel pro window, run the following analytical queries while iter_stocks inserts records in the background.
Query 1
Finds the most traded stocks. This query is very fast over very large volumes of data.
SELECT stock_symbol, COUNT(*) AS c
FROM trade
GROUP BY stock_symbol
ORDER BY c DESC limit 10;
Query 2
Finds the most “volatile” stocks (highest variance in prices).
SELECT stock_symbol, VARIANCE(share_price) var
FROM trade
GROUP BY stock_symbol
ORDER BY var DESC;
Query 3
Finds the most “volatile” stocks (highest variance in prices) in the last 5 seconds.
SELECT stock_symbol, VARIANCE(share_price) var
FROM trade
WHERE trade_time * 1 > ( NOW() - 5 )
GROUP BY stock_symbol
ORDER BY var DESC;
Query 4
This is a portfolio aggregation query that uses Common Table Expression (CTE), JOIN, and window functions. It also computes min, max, standard deviation, weighted average, and percentiles for each company stock.
WITH folio AS (
SELECT id, stock_symbol, shares, share_price, trade_time
FROM trade
),
AggCalcs AS (
SELECT
stock_symbol AS ACsymb,
MAX(share_price) AS pmax,
MIN(share_price) AS pmin,
STD(share_price) AS pstd,
SUM(share_price*shares)/SUM(shares) AS avg_pps, ## Weighted Average
SUM(share_price*shares) AS total_pvalue
FROM trade
GROUP BY 1
)
SELECT
DISTINCT folio.stock_symbol,
avg_pps,
pmin,
pmax,
percentile_cont(.25) WITHIN group (ORDER BY share_price) OVER (PARTITION BY stock_symbol) AS Q1,
percentile_cont(.5) WITHIN group (ORDER BY share_price) OVER (PARTITION BY stock_symbol) AS median,
percentile_cont(.75) WITHIN group (ORDER BY share_price) OVER (PARTITION BY stock_symbol) AS Q3
FROM folio
JOIN AggCalcs ON (folio.stock_symbol = ACsymb)
ORDER BY folio.stock_symbol;