Posted to issues@drill.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2017/02/21 16:57:44 UTC

[jira] [Created] (DRILL-5282) Rationalize record batch sizes in all readers and operators

Paul Rogers created DRILL-5282:
----------------------------------

             Summary: Rationalize record batch sizes in all readers and operators
                 Key: DRILL-5282
                 URL: https://issues.apache.org/jira/browse/DRILL-5282
             Project: Apache Drill
          Issue Type: Improvement
    Affects Versions: 1.10.0
            Reporter: Paul Rogers


Drill uses record batches to process data. A record batch consists of a "bundle" of vectors that, combined, hold the data for some number of records.
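
For illustration, here is a minimal conceptual sketch of a batch as a bundle of vectors plus a shared record count. The types below are hypothetical simplifications; Drill's real abstractions are ValueVector and VectorContainer.

{code:java}
// Hypothetical simplification of the vector-bundle model; not Drill's API.
import java.util.ArrayList;
import java.util.List;

class SimpleVector {
  final String name;      // column name
  final int valueWidth;   // bytes per value (fixed-width case)
  final byte[] data;      // backing buffer

  SimpleVector(String name, int valueWidth, int capacity) {
    this.name = name;
    this.valueWidth = valueWidth;
    this.data = new byte[valueWidth * capacity];
  }
}

class RecordBatchSketch {
  final List<SimpleVector> vectors = new ArrayList<>();
  int recordCount;        // one count governs every vector in the bundle

  // Memory consumed is the sum of all vector buffers, full or not.
  long memorySize() {
    long total = 0;
    for (SimpleVector v : vectors) {
      total += v.data.length;
    }
    return total;
  }
}
{code}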

The key consideration for a record batch is the memory it consumes. Various operators and readers have vastly different ideas of the proper size of a batch. The text reader can produce batches of a few hundred KB, while the flatten operator produces batches of half a GB; other operators fall arbitrarily in between. Some readers produce batches of effectively unbounded size, driven only by the average row width.

Another key consideration is record count. Batches have a hard physical limit of 64K records (the maximum number addressable by a two-byte selection vector). Some operators produce batches of that size; others produce far fewer records. In one case, we saw a reader that produced 64K+1 records, overflowing that limit.
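
The 64K figure follows directly from the index width: a two-byte (16-bit) selection vector offset can address at most 2^16 = 65,536 records. A sketch of the guard a reader needs (the constant and method names are illustrative, not Drill's actual API):

{code:java}
// Illustrative only; the constant and check are not Drill's actual API.
public class BatchLimits {
  // A two-byte (16-bit) selection vector index addresses at most
  // 2^16 = 65,536 records, hence the hard cap on batch record count.
  public static final int MAX_BATCH_RECORDS = 1 << 16;

  // A reader must stop adding rows once the cap is reached; the 64K+1
  // case above is an off-by-one in (or absence of) a check like this.
  public static boolean batchFull(int recordCount) {
    return recordCount >= MAX_BATCH_RECORDS;
  }
}
{code}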

A final consideration is the size of individual vectors. Drill incurs severe memory fragmentation when vectors grow beyond 16 MB.
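
As a worked example (taking the 16 MB figure above as the threshold), the record count at which a vector crosses the limit depends on value width: a BIGINT vector holds 16 MB / 8 B = 2,097,152 values, while a VarChar column averaging 256 bytes per value crosses the limit after just 65,536 records.

{code:java}
// Illustrative arithmetic; 16 MB is the threshold cited above, not a
// named Drill constant.
public class VectorSizeMath {
  static final long VECTOR_SIZE_LIMIT = 16 * 1024 * 1024; // 16 MB

  // Max values one vector holds without exceeding the limit.
  static long maxValues(long bytesPerValue) {
    return VECTOR_SIZE_LIMIT / bytesPerValue;
  }

  public static void main(String[] args) {
    System.out.println(maxValues(8));   // BIGINT: 2,097,152 values
    System.out.println(maxValues(256)); // 256-byte VarChar: 65,536 values
  }
}
{code}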

In some cases, operators (such as the Parquet reader) allocate large batches but only partially fill them, leaving a large amount of wasted space. That waste adds up when we must buffer many such batches, as during a sort.
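
The waste is the gap between allocated and used bytes, and a sort that buffers thousands of such batches multiplies it. A hypothetical sketch of the measurement (the numbers are made up for illustration):

{code:java}
// Hypothetical sketch; not a Drill API, and the numbers are examples.
public class BatchWaste {
  // Fraction of allocated memory actually holding data.
  static double fillRatio(long usedBytes, long allocatedBytes) {
    return (double) usedBytes / allocatedBytes;
  }

  public static void main(String[] args) {
    // A reader allocates room for 64K rows at ~100 bytes each,
    // but fills only 4K of them:
    long allocated = 64L * 1024 * 100;
    long used = 4L * 1024 * 100;
    System.out.printf("fill = %.1f%%%n", 100 * fillRatio(used, allocated));
    // Buffering 1,000 such batches in a sort holds ~6.1 GiB of memory,
    // of which ~5.7 GiB is empty allocation.
  }
}
{code}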

This ticket asks that we research an optimal batch size, create a framework for building batches of that size, and retrofit all operators that produce batches to use that framework, so that every operator produces uniform batches.
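
One possible shape for such a framework, purely as a sketch of the idea (names and limits below are hypothetical): a sizer that tracks both record count and accumulated bytes, and reports the batch full when either budget is exhausted, so that every producer emits batches of bounded, predictable size.

{code:java}
// Purely a sketch of the idea; names and limits are hypothetical.
public class BatchSizer {
  private final int maxRecords;   // e.g. the 64K hard cap, or lower
  private final long maxBytes;    // e.g. a few MB target batch size
  private int recordCount;
  private long byteCount;

  public BatchSizer(int maxRecords, long maxBytes) {
    this.maxRecords = maxRecords;
    this.maxBytes = maxBytes;
  }

  // Called by the producer for each row; returns false when the
  // batch should be flushed before writing this row.
  public boolean tryAddRow(long rowBytes) {
    if (recordCount + 1 > maxRecords || byteCount + rowBytes > maxBytes) {
      return false;  // batch full: flush downstream, then reset()
    }
    recordCount++;
    byteCount += rowBytes;
    return true;
  }

  public void reset() {
    recordCount = 0;
    byteCount = 0;
  }
}
{code}

A retrofitted reader or operator would call tryAddRow() per record, flush the batch downstream whenever it returns false, then reset() and continue.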


