Posted to issues@drill.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2017/01/21 04:25:26 UTC

[jira] [Created] (DRILL-5209) Standardize Drill's batch size

Paul Rogers created DRILL-5209:
----------------------------------

             Summary: Standardize Drill's batch size
                 Key: DRILL-5209
                 URL: https://issues.apache.org/jira/browse/DRILL-5209
             Project: Apache Drill
          Issue Type: Improvement
    Affects Versions: 1.9.0
            Reporter: Paul Rogers
            Priority: Minor


Drill is columnar, implemented as a set of value vectors. Value vectors consume memory, which is a fixed resource on each Drillbit. Effective resource management requires the ability to control (or at least predict) resource usage.

Most data consists of more than one column. A collection of columns (or rows, depending on your perspective) is a record batch.

Many parts of Drill use 64K rows as the target size of a record batch. The Flatten operator targets batch sizes of 512 MB. The text scan operator appears to target batch sizes of 128 MB. Other operators may use other sizes.

Operators that target a fixed 64K rows consume an essentially unknown, and potentially unbounded, amount of memory, because row width varies. While 64K rows of a single INT column occupy only 256 KB, 64K rows of a Varchar column holding 50K bytes per value produce a batch of roughly 3.2 GB, which is far too large.
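The arithmetic behind the two cases above can be sketched as follows (illustrative code only, not Drill internals; the 4-byte INT and 50,000-byte Varchar widths come from the example in this ticket):

```java
// Illustrative arithmetic: batch memory footprint grows linearly with
// both row count and per-row width, so a fixed row-count target gives
// wildly different memory use for narrow vs. wide rows.
public class BatchSizeMath {
    static long batchBytes(long rowCount, long bytesPerRow) {
        return rowCount * bytesPerRow;
    }

    public static void main(String[] args) {
        long rows = 64 * 1024;                         // 64K-row batch target
        long intBatch = batchBytes(rows, 4);           // one 4-byte INT column
        long varcharBatch = batchBytes(rows, 50_000);  // one 50K-byte Varchar value per row

        System.out.println(intBatch);       // 262144 bytes = 256 KB
        System.out.println(varcharBatch);   // 3276800000 bytes, roughly 3.2 GB
    }
}
```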

This ticket requests three improvements.

1. Define a preferred batch size which is a balance between various needs: memory use, network efficiency, benefits of vector operations, etc.
2. Provide a reliable way to learn the size of each row as it is added to a batch.
3. Use the above to limit batches to the preferred batch size.
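Improvements 2 and 3 together suggest a writer that tracks accumulated bytes as rows arrive and closes the batch once a byte budget is hit, rather than waiting for a row-count cap. A minimal sketch, assuming a hypothetical 16 MB preferred size (the class, method names, and budget here are illustrative, not Drill's actual API):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a size-aware batch writer: flush on whichever
// limit is reached first, the row-count cap or the preferred byte budget.
public class SizeAwareBatchWriter {
    static final int MAX_ROWS = 64 * 1024;                 // traditional row-count cap
    static final long PREFERRED_BYTES = 16 * 1024 * 1024;  // assumed 16 MB target

    private final List<Object[]> rows = new ArrayList<>();
    private long batchBytes = 0;
    private int batchesFlushed = 0;

    /** Adds a row of the given size; returns true if a flush happened first. */
    public boolean addRow(Object[] row, long rowBytes) {
        boolean flushed = false;
        if (rows.size() >= MAX_ROWS || batchBytes + rowBytes > PREFERRED_BYTES) {
            flush();
            flushed = true;
        }
        rows.add(row);
        batchBytes += rowBytes;
        return flushed;
    }

    public void flush() {
        // A real operator would hand the completed batch downstream here.
        rows.clear();
        batchBytes = 0;
        batchesFlushed++;
    }

    public int batchesFlushed() { return batchesFlushed; }
    public long currentBytes() { return batchBytes; }
}
```

With this approach, wide rows simply produce fewer rows per batch, so the planner can budget memory per operator regardless of column widths.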

The above will go a long way toward easing the task of managing memory, because the planner will have some hope of predicting how much memory to allocate to each operator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)