You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2017/06/30 05:01:00 UTC

[jira] [Created] (DRILL-5635) External sort need not buffer two incoming batches; just two spill files' worth

Paul Rogers created DRILL-5635:
----------------------------------

             Summary: External sort need not buffer two incoming batches; just two spill files' worth
                 Key: DRILL-5635
                 URL: https://issues.apache.org/jira/browse/DRILL-5635
             Project: Apache Drill
          Issue Type: Improvement
    Affects Versions: 1.10.0
            Reporter: Paul Rogers
            Assignee: Paul Rogers
            Priority: Minor


The external sort has long had the rule that it must have space in memory to buffer two incoming batches. This rule stems from the merge/sort algorithm that requires a minimum of two *records* per file to make progress.

But, Drill is batch-based, not record-based. Drill must spill at least two records (preferably much more) to make progress.

The original code confused these issues and required room for two incoming batches. This restriction can be softened. In the worst case, Drill only needs two records: either one batch of (at least) two records, or two batches of one record each.

The difference in rule will better handle the case of very large incoming batches relative to memory available for sort. Suppose an incoming batch is 500 MB in size and the sort has 750 MB available. The code today fails the sort because two batches cannot fit in memory. But, the sort, by default, creates spill files of 250 MB. Since a single input batch can fill two spill files, there is no reason to wait for a second batch before spilling.

Special care is needed to handle this case. Divide the "jumbo" batch into spill files of near-equal size, but each must contain at least two records.

The only limit occurs if individual records (assuming one record per batch) exceed 25% of memory. (Because, today, the sort copies records prior to spilling. Another JIRA will address that fault.)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)