Reference tables are relatively small tables that do not need to be distributed and are present on every node in the cluster. Reference tables are implemented via primary-secondary replication to every node in the cluster from the master aggregator. Replication enables reference tables to be dynamic: updates that you perform to a reference table on the master aggregator are quickly reflected on every machine in the cluster. Since reference tables are replicated to every node in the cluster, it eliminates the need to transfer the table’s data across the network during query execution. Reference tables should mostly be used for tables that change rarely, because the write operations on reference tables consume a lot more resources.
aggregators can take advantage of reference tables’ ubiquity by pushing joins between reference tables and a distributed table onto the leaves. Imagine you have a distributed
clicks table storing billions of records and a smaller
customers table with just a few million records. Since the
customers table is small, it can be replicated on every node in the cluster. If you run a join between the
clicks table and the
customers table, then the bulk of the work for the join will occur on the leaves.
Reference tables are a convenient way to implement dimension tables.