Learning Data Layouts for Big Data Analytics

By Zongheng Yang, Badrish Chandramouli, Chi Wang, Johannes Gehrke, Yinan Li, Umar Farooq Minhas, Per-Ake Larson, Donald Kossmann, Rajeev Acharya

Query-data routing trees (qd-trees) are decision trees, generated by a neural network, that produce I/O-efficient data layouts for big data workloads. Qd-trees enable excellent workload-specific data skipping.

Paper

Methods

Qd-trees produce block-level layouts with excellent I/O performance. Their objective is to minimize the number of blocks/records accessed by a workload.
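For intuition, here is a minimal sketch of that objective, simplified to a single numeric column and range queries. The layout values and function names below are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the layout objective: count how many blocks a workload must read.
# Single numeric column, half-open ranges; names and data are assumptions for illustration.

def blocks_accessed(block_ranges, query):
    """Count blocks whose [lo, hi) range overlaps the query's [lo, hi) range."""
    q_lo, q_hi = query
    return sum(1 for b_lo, b_hi in block_ranges if b_lo < q_hi and q_lo < b_hi)

def workload_cost(block_ranges, workload):
    """Total number of blocks the workload must scan; lower is better."""
    return sum(blocks_accessed(block_ranges, q) for q in workload)

# Example: the same data split into blocks two different ways.
layout_a = [(0, 25), (25, 50), (50, 75), (75, 100)]   # partitioned on the queried column
layout_b = [(0, 100), (0, 100), (0, 100), (0, 100)]   # random placement: every block spans everything
workload = [(10, 20), (60, 70)]

print(workload_cost(layout_a, workload))  # 2 -> good data skipping
print(workload_cost(layout_b, workload))  # 8 -> every query reads every block
```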

Figure: an example qd-tree with 3 cuts and 4 leaf data blocks.

Deployed as a shim layer, qd-tree layouts can be used with any query engine (Spark, SQL Server, ...) and any data format (Parquet, CSV, custom, ...).
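As a rough illustration of this format-agnostic deployment, the sketch below materializes a layout as one Parquet file per leaf block using pandas; the `route` function, the temporary `leaf_id` column, and the file naming are assumptions made for the example.

```python
# Hypothetical sketch: materialize a qd-tree layout as one Parquet file per leaf block.
# Assumes a routing function `route(record) -> leaf_id` obtained from a qd-tree built elsewhere,
# and that pandas with a Parquet backend (pyarrow) is available.
import pandas as pd

def write_layout(df: pd.DataFrame, route, out_dir: str) -> None:
    df = df.copy()
    df["leaf_id"] = df.apply(lambda row: route(row), axis=1)   # route each record to its leaf block
    for leaf_id, block in df.groupby("leaf_id"):
        # Each leaf becomes an independent Parquet file; any Parquet-reading engine can then
        # skip whole files using its ordinary footer (min-max / dictionary) statistics.
        block.drop(columns="leaf_id").to_parquet(f"{out_dir}/block_{leaf_id}.parquet", index=False)
```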

Figure: Qd-tree system architecture.

Deep reinforcement learning. We use deep reinforcement learning to generate qd-trees: an RL agent learns, by trial and error, to construct better and better trees for a given workload and dataset. The tree-construction agent is trained with Proximal Policy Optimization (PPO).

The states are nodes in a qd-tree, each representing a subspace of the entire dataset. A node is described by two fields:

node.range: the value ranges of all columns under this node's subspace, stored as a 2N-dimensional array (a minimum and a maximum for each of the N columns). Example: [0, MAX_cpu), [0, MAX_mem).

node.categorical_mask: a bit string for each categorical column; a zero bit means the corresponding value is definitively not present under this node.

While range succinctly describes continuous dimensions, categorical_mask adds support for categorical columns, whose values have no natural order. As a bonus, these descriptions readily interoperate with the min-max and dictionary statistics used in popular big data formats (e.g., Parquet).

Actions are defined to be predicates, extracted from the query log, that can be pushed down to storage. Executing an action, or a cut, splits a node into two children.
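A minimal sketch of these two node fields and of executing one cut follows; the class and function names, and the restriction to one-sided numeric predicates, are assumptions made for illustration rather than the paper's implementation.

```python
# Illustrative sketch of a qd-tree node (state) and of one kind of cut (action):
# a one-sided range predicate on a numeric column. Names are assumptions, not the paper's code.
from dataclasses import dataclass, field
from typing import Optional, Tuple

@dataclass
class Node:
    # node.range: per-column [lo, hi) bounds, i.e. 2N values for N numeric columns.
    range: dict                                             # e.g. {"cpu": (0.0, 100.0), "mem": (0.0, 64.0)}
    # node.categorical_mask: one bit per distinct value of each categorical column;
    # a 0 bit means that value is definitively absent from this node's subspace.
    categorical_mask: dict = field(default_factory=dict)    # e.g. {"os": 0b111}
    cut: Optional[Tuple[str, float]] = None                 # the predicate used to split this node
    children: Optional[Tuple["Node", "Node"]] = None

def apply_cut(node: Node, column: str, value: float) -> None:
    """Execute the action `column < value`: split the node into two children."""
    lo, hi = node.range[column]
    node.cut = (column, value)
    node.children = (
        Node({**node.range, column: (lo, value)}, dict(node.categorical_mask)),   # predicate true
        Node({**node.range, column: (value, hi)}, dict(node.categorical_mask)),   # predicate false
    )

# Example usage (hypothetical columns and bounds):
root = Node({"cpu": (0.0, 100.0), "mem": (0.0, 64.0)}, {"os": 0b111})
apply_cut(root, "cpu", 50.0)   # root now has two children covering cpu < 50 and cpu >= 50
```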

In summary, the deep RL agent learns to construct I/O-efficient qd-trees. It does so by learning a policy that maps a state to a distribution over actions. By executing the learned policy, we obtain qd-trees that can be deployed. No query execution is required during learning, so this process usually finishes in minutes.
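The sketch below shows how such a learned policy might be executed greedily to materialize a tree, and how the resulting tree routes records to leaf blocks at load time. The `policy` callable stands in for the PPO-trained network, and the `Node`/`apply_cut` definitions come from the sketch above; both are assumptions for illustration.

```python
# Hypothetical sketch: execute a learned policy to build a qd-tree, then route records with it.
# `policy(node) -> (column, value) or None` stands in for the PPO-trained network.

def build_tree(root: "Node", policy, max_depth: int = 10) -> "Node":
    """Grow the tree top-down by repeatedly asking the policy for the next cut."""
    frontier = [(root, 0)]
    while frontier:
        node, depth = frontier.pop()
        action = policy(node) if depth < max_depth else None
        if action is None:                 # the policy declines to cut: node stays a leaf block
            continue
        apply_cut(node, *action)           # execute the chosen cut
        frontier.extend((child, depth + 1) for child in node.children)
    return root

def route(record: dict, node: "Node") -> "Node":
    """At data-loading time, walk the cuts to find the leaf block that stores a record."""
    while node.children:
        column, value = node.cut
        node = node.children[0] if record[column] < value else node.children[1]
    return node
```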

Talk

A 12-minute presentation given virtually at SIGMOD 2020 (June 2020):

Slides: here.