Posted to issues@drill.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2017/01/22 02:39:27 UTC

[jira] [Commented] (DRILL-5209) Standardize Drill's batch size

    [ https://issues.apache.org/jira/browse/DRILL-5209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833242#comment-15833242 ] 

Paul Rogers commented on DRILL-5209:
------------------------------------

See DRILL-5211. It turns out that Drill uses a memory allocation scheme that caches blocks of 16 MB. If any single vector allocation is larger than this amount, Drill must allocate memory directly from the JVM. The result is that Drill can hit OOM due to memory fragmentation: plenty of memory exists as 16 MB blocks, but none at larger sizes.

As a result, every batch must be aware not just of row width, but also of _column_ width. No batch may contain so many rows that any single column vector grows beyond 16 MB. This logic does not exist anywhere in Drill today. As noted above, we instead allocate based on aggregate batch totals or row counts, leaving us susceptible to memory fragmentation with no good way to avoid the problem.
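The per-column constraint above can be sketched as follows. This is a minimal illustration, not Drill's actual API: `BatchSizer`, `maxRows`, and the per-column width array are hypothetical names, and the 16 MB cap is taken from the allocator behavior described in DRILL-5211.

```java
// Sketch (hypothetical, not Drill's API): cap the row count of a batch so
// that no single value vector exceeds the 16 MB block size the allocator
// caches, avoiding direct (fragmentation-prone) JVM allocations.
public class BatchSizer {
    static final int MAX_VECTOR_BYTES = 16 * 1024 * 1024; // 16 MB cached block

    /**
     * widthsBytes: estimated per-row width of each column's vector.
     * targetRows: the operator's nominal row target (e.g. 64K).
     * Returns the largest row count that keeps every vector <= 16 MB.
     */
    public static int maxRows(int[] widthsBytes, int targetRows) {
        int rows = targetRows;
        for (int w : widthsBytes) {
            // Widest columns dominate: a 50 KB Varchar column caps the
            // batch at only a few hundred rows.
            rows = Math.min(rows, MAX_VECTOR_BYTES / Math.max(w, 1));
        }
        return rows;
    }
}
```

For example, a batch with a 4-byte integer column and a 50,000-byte Varchar column would be limited to 335 rows, far below the usual 64K target, because the Varchar vector alone would otherwise exceed 16 MB.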

> Standardize Drill's batch size
> ------------------------------
>
>                 Key: DRILL-5209
>                 URL: https://issues.apache.org/jira/browse/DRILL-5209
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.9.0
>            Reporter: Paul Rogers
>            Priority: Minor
>
> Drill is columnar, implemented as a set of value vectors. Value vectors consume memory, which is a fixed resource on each Drillbit. Effective resource management requires the ability to control (or at least predict) resource usage.
> Most data consists of more than one column. A collection of columns (or rows, depending on your perspective) is a record batch.
> Many parts of Drill use 64K rows as the target size of a record batch. The Flatten operator targets batch sizes of 512 MB. The text scan operator appears to target batch sizes of 128 MB. Other operators may use other sizes.
> Operators that target 64K rows use, essentially, unknown and potentially unlimited amounts of memory. While 64K rows of a single integer column are fine, 64K rows of a Varchar column of 50K bytes each yield a batch of roughly 3.2 GB, which is far too large.
> This ticket requests three improvements.
> 1. Define a preferred batch size which is a balance between various needs: memory use, network efficiency, benefits of vector operations, etc.
> 2. Provide a reliable way to learn the size of each row as it is added to a batch.
> 3. Use the above to limit batches to the preferred batch size.
> The above will go a long way toward easing the task of managing memory, because the planner will have some hope of understanding how much memory to allocate to various operations.
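The sizing arithmetic in the description above can be checked directly. This is a hypothetical helper for illustration only (`BatchMath` is not part of Drill); it simply multiplies rows by per-row bytes.

```java
// Arithmetic from the ticket: 64K rows times 50K bytes per row.
public class BatchMath {
    public static long batchBytes(long rows, long bytesPerRow) {
        return rows * bytesPerRow;
    }

    public static void main(String[] args) {
        // 64 * 1024 rows, each carrying a 50 * 1024-byte Varchar value.
        long bytes = batchBytes(64 * 1024, 50 * 1024);
        System.out.println(bytes); // 3355443200 bytes, roughly 3.2 GB
    }
}
```

A batch that large dwarfs both the 16 MB allocator block size and the 128-512 MB targets other operators use, which is the gap this ticket asks to close.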



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)