Learning Data Layouts for Big Data Analytics

By Zongheng Yang, Badrish Chandramouli, Chi Wang, Johannes Gehrke, Yinan Li, Umar Farooq Minhas, Per-Ake Larson, Donald Kossmann, Rajeev Acharya

Query-data routing trees (qd-trees) are decision trees, constructed by a neural network, that produce I/O-efficient data layouts for big data workloads. Qd-trees enable excellent workload-specific data skipping.

Qd-trees produce block-level layouts with excellent I/O performance. Their objective is to minimize the number of blocks/records accessed by a workload.
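This objective can be made concrete with a small sketch. The function and data below are illustrative assumptions, not the paper's implementation: each block carries a per-column min/max range, and a block must be read only if it overlaps a query's filter on every column.

```python
# Hypothetical sketch of the qd-tree objective: count how many data
# blocks a workload must touch under a given layout.

def block_overlaps(block, query):
    """A block must be read if its [min, max) range intersects the
    query's range on every filtered column."""
    for col, (q_lo, q_hi) in query.items():
        b_lo, b_hi = block[col]
        if q_hi <= b_lo or b_hi <= q_lo:   # disjoint on this column
            return False                    # -> block can be skipped
    return True

def blocks_accessed(blocks, workload):
    """Objective to minimize: total blocks scanned over all queries."""
    return sum(sum(block_overlaps(b, q) for b in blocks) for q in workload)

blocks = [{"cpu": (0, 50)}, {"cpu": (50, 100)}]
workload = [{"cpu": (60, 80)}]
# Only the second block overlaps [60, 80); the first is skipped.
```

A good layout clusters records so that each query's predicates rule out as many blocks as possible.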

Example qd-tree

Figure: an example qd-tree with 3 cuts and 4 leaf data blocks.

Implemented as a shim layer, qd-trees can produce layouts for any query engine (Spark, SQL Server, ...) and any data format (Parquet, CSV, custom, ...).

Qd-tree system architecture

Figure: Qd-tree system architecture.

Deep reinforcement learning. We use deep reinforcement learning to generate qd-trees: an RL agent learns, by trial and error, to construct better and better trees for a given workload and dataset. The tree-construction agent is trained with Proximal Policy Optimization (PPO).
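One tree-construction episode can be sketched as follows. This is a non-authoritative stand-in: a real agent (the paper uses PPO) would score candidate cuts with a learned policy network, whereas here `policy` is an ordinary function, and all names are illustrative.

```python
import random

def episode(node_range, cuts, policy=random.choice, depth=2):
    """Grow a qd-tree by up to `depth` levels of cuts. Internal nodes
    are (column, value, left, right) tuples; leaves are range dicts,
    each standing in for one data block."""
    valid = [(c, v) for c, v in cuts
             if node_range[c][0] < v < node_range[c][1]]
    if depth == 0 or not valid:
        return node_range                 # leaf: one data block
    col, val = policy(valid)              # the learned policy's job
    lo, hi = node_range[col]
    left = dict(node_range, **{col: (lo, val)})    # predicate holds
    right = dict(node_range, **{col: (val, hi)})   # predicate fails
    return (col, val, episode(left, cuts, policy, depth - 1),
                      episode(right, cuts, policy, depth - 1))

tree = episode({"cpu": (0, 100)}, [("cpu", 50), ("cpu", 25)],
               policy=lambda v: v[0], depth=1)
# tree == ("cpu", 50, {"cpu": (0, 50)}, {"cpu": (50, 100)})
```

During training, the reward for an episode would reflect how few blocks the resulting tree forces the workload to read.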

States are nodes in a qd-tree; each node represents a subspace of the entire dataset. A node is described by two fields.

node.range: Ranges of all columns under this node's subspace, a 2N-dimensional array for N columns. Example: [0, MAX_cpu), [0, MAX_mem).
node.categorical_mask: A bit string for each categorical column; a zero bit means the corresponding value is definitively not present under this node.

While range succinctly describes continuous dimensions, categorical_mask enhances support for categorical columns. As a bonus, these descriptions readily interoperate with the min-max and dictionary statistics used in popular big data formats (e.g., Parquet).
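A minimal sketch of such a node description follows. `QdNode` and the sample columns are assumptions for illustration, not the paper's code.

```python
from dataclasses import dataclass, field

@dataclass
class QdNode:
    # Half-open [lo, hi) range per continuous column: 2N values
    # for N columns.
    range: dict
    # One bit per distinct value of each categorical column; a zero
    # bit means that value is definitively absent under this node.
    categorical_mask: dict = field(default_factory=dict)

MAX_cpu, MAX_mem = 100, 64  # illustrative domain bounds
root = QdNode(range={"cpu": (0, MAX_cpu), "mem": (0, MAX_mem)},
              categorical_mask={"os": [1, 1, 1]})  # 3 distinct OS values
```

These fields map directly onto the min-max and dictionary statistics that formats like Parquet already maintain per block.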

Actions are defined to be predicates extracted from a query log that can be pushed down to storage. Executing an action, or a cut, splits a node into two children.
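A cut on a continuous column can be sketched as below, assuming the per-column range representation above; the function name and encoding are hypothetical.

```python
# Hedged sketch of a "cut": applying a pushed-down range predicate
# (e.g. cpu < 70) to a node splits its subspace into two children.

def cut(node_range, column, value):
    """Split a node's per-column range dict on `column < value`,
    returning (true_child, false_child) range dicts."""
    lo, hi = node_range[column]
    assert lo < value < hi, "cut must fall strictly inside the range"
    left = dict(node_range)
    left[column] = (lo, value)     # records where the predicate holds
    right = dict(node_range)
    right[column] = (value, hi)    # records where it fails
    return left, right

root = {"cpu": (0, 100), "mem": (0, 64)}
left, right = cut(root, "cpu", 70)
# left["cpu"] == (0, 70), right["cpu"] == (70, 100)
```

Restricting actions to predicates seen in the query log keeps the action space small and guarantees every cut is useful to at least one query.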

In summary, the deep RL agent learns to construct I/O-efficient qd-trees. It does so by learning a policy that maps a state to a distribution over actions. By executing the learned policy, we obtain qd-trees that can be deployed. No query execution is required during learning, so this process usually finishes in minutes.
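Deployment can be sketched as routing each record down the learned tree to a leaf data block. The nested-tuple tree encoding here is a hypothetical simplification, not the paper's format.

```python
def route(tree, record):
    """Return the leaf block id a record belongs to.
    An internal node is (column, value, true_subtree, false_subtree),
    where the cut predicate is `record[column] < value`;
    a leaf is an int block id."""
    while not isinstance(tree, int):
        col, val, yes, no = tree
        tree = yes if record[col] < val else no
    return tree

# Tree: cut on cpu < 70, then mem < 32 on the true side.
tree = ("cpu", 70, ("mem", 32, 0, 1), 2)
route(tree, {"cpu": 10, "mem": 40})   # lands in block 1
route(tree, {"cpu": 90, "mem": 5})    # lands in block 2
```

At query time the same tree is walked with the query's predicates instead of a record, pruning every subtree the query cannot touch.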


A 12-minute presentation, given virtually at SIGMOD 2020 (June 2020):

Slides: here.