Query-data routing trees (qd-trees) are neural network-generated decision trees that produce I/O-efficient data layouts for big data workloads. Qd-trees enable excellent workload-specific data skipping.
Qd-trees produce block-level layouts with excellent I/O performance. Their objective is to minimize the number of blocks/records accessed by a workload.
Figure: an example qd-tree with 3 cuts and 4 leaf data blocks.
When run as a shim, qd-tree layouts can be used by any query engines (Spark, SQL Server, ...) and any data formats (Parquet, CSV, custom, ...).
Figure: Qd-tree system architecture.
Deep reinforcement learning. We use deep reinforcement learning to emit qd-trees, where an RL agent learns to construct better and better trees for a given workload/dataset, by trial and error. Proximal Policy Optimization (PPO) is used for the tree-construction agent.
The states, nodes in a qd-tree, are subspaces of the entire dataset. Nodes are described by two fields.
||Ranges of all columns under this node's subspace. 2N-dimensional array. Example:
||A string of bits for each categorical column. Zero represents a value is definitively not present under this node.|
range succinctly describes continuous dimensions,
categorical_mask enhances support for categorical columns. As a bonus, these descriptions readily interoperate with the min-max and dictionary statistics used in popular big data formats (e.g., Parquet).
Actions are defined to be predicates extracted from a query log that can be pushed down to storage. Executing an action, or a cut, splits a node into two children.
In summary, the deep RL agent learns to construct I/O-efficient qd-trees. It does so by learning a policy that maps a state to a distribution over actions. By executing the learned policy, we obtain qd-trees that can be deployed. No query execution is required during learning, so this process usually finishes in minutes.