Outdated Version

You are viewing an older version of this section. View current production version.

Build a Sample Stock Trade Database min read


This tutorial shows you how to build a sample database containing millions of rows of simulated stock trade data. You can use this database for testing the features and functionality of MemSQL. Some sample queries are also provided to show you how MemSQL performs when doing concurrent reads while data is being written to disk.

Step 1: Create the database

Run the following SQL statements to create a new database called trades which has two tables: trade and company. The trades table is a columnstore table meant to contain analytical data about specific trades on a given company and the company table is a small rowstore table that provides metadata about that company.

CREATE DATABASE IF NOT EXISTS trades;
USE trades;

DROP TABLE IF EXISTS trade;
CREATE TABLE trade(
  id BIGINT NOT NULL, 
  stock_symbol CHAR(5) NOT NULL,
  shares DECIMAL(18,4) NOT NULL,
  share_price DECIMAL(18,4) NOT NULL,
  trade_time DATETIME(6) NOT NULL,
  KEY(stock_symbol) USING CLUSTERED COLUMNSTORE,
  SHARD KEY(stock_symbol)
);

DROP TABLE IF EXISTS company;
CREATE TABLE company(
  symbol CHAR(5) NOT NULL,
  name VARCHAR(50),
  last_sale VARCHAR(10),
  market_cap VARCHAR(15),
  IPO_year INT,
  sector VARCHAR(80),
  industry VARCHAR(80),
  summary_quote VARCHAR(50),
  extra VARCHAR(50)
);

Step 2: Load company profile data

Download companylist.csv onto your local machine and then load it into the company table by either using a MySQL client (such as Sequel Pro, MySQLWorkbench) or by using the LOAD DATA command.

If you are using a MySQL client, open your MySQL client and navigate to the menu item for importing data from a file. Import companylist.csv into the company table you created in the previous step.

If you are using a terminal, run the LOAD DATA command to load the data into the company table you created in the previous section.

LOAD DATA INFILE '/path/to/companylist.csv'
INTO TABLE company
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES;

Step 3: Create data generator functionality

The following are several functions used to generate data. The utility function, marketcap_to_DECIMAL, is used by the seed_trades stored procedure to convert a string value for the market capitalization of a stock into a decimal value.

DELIMITER //
CREATE OR REPLACE FUNCTION marketcap_to_DECIMAL(s VARCHAR(15)) 
  RETURNS DECIMAL(18,2) AS
DECLARE 
  m CHAR(1) = SUBSTR(s, LENGTH(s), 1); -- M or B
  raw_v DECIMAL(18,2) = SUBSTR(s, 2, LENGTH(s) - 1);
  v DECIMAL(18,2) = NULL;
BEGIN
  IF m = "B" THEN 
    v = raw_v * 1000;
  ELSE
    v = raw_v;
  END IF;
  RETURN v;
END //
DELIMITER ;

The seed_trades stored procedure generates rows of “seed” trade data based on an integer input value. This provides a query to rank the companies by market capitalization limited to the top 200.

The stored procedure generates approximately the same number of trades for each company.

DELIMITER //
CREATE OR REPLACE PROCEDURE seed_trades(num_trades INT) RETURNS INT AS
DECLARE
  ranked_companies ARRAY(RECORD(symbol CHAR(5), _rank INT)); 
DECLARE
  q QUERY(symbol CHAR(5), _rank INT) =
    SELECT symbol, rank() OVER (ORDER BY marketcap_to_DECIMAL(market_cap)) AS _rank
    FROM company
    WHERE LENGTH(symbol) < 5
    ORDER BY _rank DESC LIMIT 200;
  i INT = 0;
  rank_num INT;
  next_id INT = 1;
  sym CHAR(5);
  price_base DECIMAL(18,4);
  current_prices ARRAY(INT);
  l ARRAY(RECORD(symbol CHAR(5), _rank INT));
BEGIN
  l = collect(q);
  FOR r IN l LOOP
    i += 1;
    rank_num = r._rank;
    sym = r.symbol;
    price_base = FLOOR(rand() * 50) + 50;
    FOR j IN 1..((rank_num / 10) + RAND() * 10) LOOP
      INSERT trade VALUES(
        next_id, 
        sym, 
        FLOOR(1 + RAND() * 10) * 100, -- shares
        price_base,      -- share_price
        DATE_ADD(NOW(), INTERVAL RAND() * 6 HOUR)); -- random time during trading day, roughly
      next_id += 1;
      IF next_id > num_trades THEN RETURN(next_id); END IF;
    END LOOP;
  END LOOP;
  RETURN(next_id);
END //
DELIMITER ;

This last stored procedure, iter_stocks, generates additional trade events using a “random walk” to determine the next price for each company. It will store the current price for each company, then add a random number between -1 and 1 to set the new price for each iteration.

Each iteration inserts new records equal to num_shares. When iter_stocks completes, the total number of records in the trade table will be equal to num_shares * iterations + num_shares.

DELIMITER //
CREATE OR REPLACE PROCEDURE iter_stocks(iterations INT) as
DECLARE
    tickers ARRAY(CHAR(5));
    prices ARRAY(DECIMAL(18,4));
    last_ids ARRAY(bigINT);
    counts ARRAY(INT);
    next_id bigINT = 1;
    ticker CHAR(5);
    price DECIMAL(18,4);
    c INT;
    rand DECIMAL(18,4);
    tickers_q QUERY(t CHAR(5), p DECIMAL(18,4), lid BIGINT, c INT) = SELECT stock_symbol, share_price, MIN(id), COUNT(*) FROM trade GROUP BY stock_symbol;
    q ARRAY(RECORD(t CHAR(5), p DECIMAL(18,4), lid bigINT, c INT));
    q_count QUERY(c INT) = SELECT COUNT(*) FROM trade;
    total_c INT;
BEGIN
    q = COLLECT(tickers_q);
    tickers = CREATE_ARRAY(LENGTH(q));
    prices = CREATE_ARRAY(LENGTH(q));
    last_ids = CREATE_ARRAY(LENGTH(q));
    counts = CREATE_ARRAY(LENGTH(q));
    total_c = SCALAR(q_count);
    FOR r IN 0..LENGTH(q)-1 LOOP
        tickers[r] = q[r].t;
        prices[r] = q[r].p;
        last_ids[r] = q[r].lid;
        counts[r] = q[r].c;
    END LOOP;
    FOR j IN 0..(iterations-1) LOOP
        FOR i IN 0..LENGTH(tickers)-1 LOOP
            ticker = tickers[i];
            price = prices[i];
            next_id = last_ids[i];
            c = counts[i];
            rand = POW(-1, FLOOR(RAND()*2)) * RAND();
            INSERT INTO trade 
            SELECT id + total_c, stock_symbol, shares, share_price + rand, trade_time FROM trade WHERE stock_symbol = ticker AND id >= next_id;
            prices[i] = price + rand;
            last_ids[i] = next_id + total_c;
        END LOOP;
    END LOOP;
END //
DELIMITER ;

Step 4: Generate trade data

Now that you have created the functionality to generate stock ticker data, run the following commands to generate the data. Note, this process may take several minutes depending on your cluster topology.

CALL seed_trades(100000);
CALL iter_stocks(100);
Info

Changing the input value for iter_stocks defines the size of the table. For example, changing from 100 to 200 doubles the amount of rows.

Step 5: Run analytic queries

In a new shell/sequel pro window, run the following analytical queries while iter_stocks inserts records in the background.

Query 1

Finds the most traded stocks. This query is very fast over very large volumes of data.

SELECT stock_symbol, COUNT(*) AS c 
  FROM trade 
  GROUP BY stock_symbol 
  ORDER BY c DESC limit 10;

Query 2

Finds the most “volatile” stocks (highest variance in prices).

SELECT stock_symbol, VARIANCE(share_price) var 
  FROM trade 
  GROUP BY stock_symbol 
  ORDER BY var DESC;

Query 3

Finds the most “volatile” stocks (highest variance in prices) in the last 5 seconds.

SELECT stock_symbol, VARIANCE(share_price) var 
  FROM trade 
  WHERE trade_time * 1 > ( NOW() - 5 ) 
  GROUP BY stock_symbol 
  ORDER BY var DESC;

Query 4

This is a portfolio aggregation query that uses Common Table Expression (CTE), JOIN, and window functions. It also computes min, max, standard deviation, weighted average, and percentiles for each company stock.

WITH folio AS (
  SELECT id, stock_symbol, shares, share_price, trade_time
  FROM trade 
),
AggCalcs AS (
  SELECT
    stock_symbol AS ACsymb,
    MAX(share_price) AS pmax,
    MIN(share_price) AS pmin,
    STD(share_price) AS pstd,
    SUM(share_price*shares)/SUM(shares) AS avg_pps,  ## Weighted Average
    SUM(share_price*shares) AS total_pvalue
  FROM trade 
  GROUP BY 1
)
SELECT 
  DISTINCT folio.stock_symbol,
  avg_pps,
  pmin,
  pmax,
  percentile_cont(.25) WITHIN group (ORDER BY share_price) OVER (PARTITION BY stock_symbol) AS Q1,
  percentile_cont(.5) WITHIN group (ORDER BY share_price) OVER (PARTITION BY stock_symbol)  AS median,
  percentile_cont(.75) WITHIN group (ORDER BY share_price) OVER (PARTITION BY stock_symbol) AS Q3
FROM folio
JOIN AggCalcs ON (folio.stock_symbol = ACsymb)
ORDER BY folio.stock_symbol;