Posted to issues@drill.apache.org by "Aman Sinha (JIRA)" <ji...@apache.org> on 2015/01/14 01:48:38 UTC

[jira] [Commented] (DRILL-1993) Fix allocation issues in HashTable and HashAgg to reduce memory waste

    [ https://issues.apache.org/jira/browse/DRILL-1993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276290#comment-14276290 ] 

Aman Sinha commented on DRILL-1993:
-----------------------------------

The primary change for the allocation looks good to me, and as discussed, part of it will eventually be subsumed by the reallocation changes in DRILL-1960. A few comments:
Is there a reason to remove the following assert? The subsequent check for size == 0 does not subsume the assert.
{code}
-    assert batchIdx < batchHolders.size();
+    if (batchHolders.size() == 0) {
+      return true;
+    }
+
{code}
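For illustration only, a minimal sketch (assuming the early return for an empty hash table is the intended behavior) that keeps the assert alongside the new check:
{code}
// Sketch, not from the patch: bail out early when no BatchHolders exist yet,
// but keep the assert so an out-of-range batchIdx is still caught in debug runs.
if (batchHolders.size() == 0) {
  return true;
}
assert batchIdx < batchHolders.size();
{code}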
You could store VARIABLE_WIDTH_RECORD_SIZE * BATCH_SIZE as a static final constant to avoid recomputing it for each vector allocation. The change in outputValues() wasn't immediately clear (a comment would help): since outNumRecordsHolder.value is reset to 0 each time this method is invoked, will we start writing at output index 0 each time?
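Something along these lines (a sketch only; VARIABLE_WIDTH_VECTOR_SIZE is a made-up name here, and the two operands are assumed to be the existing constants referenced above):
{code}
// Sketch: compute the product once instead of on every vector allocation.
// VARIABLE_WIDTH_VECTOR_SIZE is a hypothetical name.
private static final int VARIABLE_WIDTH_VECTOR_SIZE =
    VARIABLE_WIDTH_RECORD_SIZE * BATCH_SIZE;
{code}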

> Fix allocation issues in HashTable and HashAgg to reduce memory waste
> ---------------------------------------------------------------------
>
>                 Key: DRILL-1993
>                 URL: https://issues.apache.org/jira/browse/DRILL-1993
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Operators
>            Reporter: Venki Korukanti
>            Assignee: Venki Korukanti
>             Fix For: 0.8.0
>
>         Attachments: DRILL-1993-1.patch
>
>
> Issues found:
> + Key container allocation issue in HashTable
> Currently we allocate memory with a capacity of 2^16 records for the "hashValues" and "links" vectors in each BatchHolder, but for the "key" holders we use allocateNew(), which by default allocates much lower capacity than the "hashValues" and "links" vectors (in case of an Integer key, the capacity is 4096 records). This causes the "key" holders to fill up much sooner even though the "hashValues" and "links" vectors still have a lot of free entries. As a result we create more BatchHolders than required, wasting space in the "links" and "hashValues" vectors of each BatchHolder. In addition, for each new BatchHolder we create an SV4 vector in HashJoinHelper, which is another overhead. (See the first sketch below.)
> + Allocation issues in HashAggTemplate
> HashAggTemplate has its own BatchHolders, whose vectors are allocated using allocateNew() (i.e. with small capacity). Whenever a BatchHolder in HashAggTemplate reaches its capacity, we add a new BatchHolder in HashTable. This makes the HashTable BatchHolders space inefficient.
> + Update HashAggTemplate.outputCurrentBatch to handle cases where all entries in a single BatchHolder can't be copied over to the output vectors in a single pass (the output vectors' capacity is lower than the number of records in the BatchHolder); see the second sketch below.
> + Lazy BatchHolder creation for both HashAgg and HashTable
> Don't allocate the BatchHolder until the first put request is received. This way we don't waste space in fragments that don't receive any input records. This can happen when the group-by key has very few distinct values (such as shipmode in TPCH), so only a few fragments receive data. Our current parallelization code does not consider the number of distinct values when parallelizing hash exchanges. (See the third sketch below.)
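For illustration, a minimal standalone sketch (plain Java, no Drill types; not from the attached patch) of the sizing mismatch described in the first two items above: a new BatchHolder is needed whenever the smallest vector in the current one fills up, so a default-capacity key holder multiplies the number of holders even though "hashValues" and "links" still have room.
{code}
// Standalone sketch: models why a small default key-holder capacity forces
// extra BatchHolders. The constants mirror the capacities mentioned above.
public class BatchHolderSizingSketch {
  static final int BATCH_SIZE = 1 << 16;        // capacity used for hashValues/links
  static final int DEFAULT_KEY_CAPACITY = 4096; // default capacity for an int key vector

  // A new BatchHolder is needed whenever the smallest vector in it fills up.
  static int holdersNeeded(int records, int keyCapacity) {
    int perHolder = Math.min(BATCH_SIZE, keyCapacity);
    return (records + perHolder - 1) / perHolder;
  }

  public static void main(String[] args) {
    int records = 1 << 16;
    System.out.println(holdersNeeded(records, DEFAULT_KEY_CAPACITY)); // 16 BatchHolders
    System.out.println(holdersNeeded(records, BATCH_SIZE));           // 1 BatchHolder
  }
}
{code}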
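A second sketch (again plain Java, assumptions only) of the multi-pass copy described for outputCurrentBatch: a persistent index lets each call resume where the previous pass stopped when the output capacity is smaller than the BatchHolder.
{code}
// Standalone sketch: drain a BatchHolder's entries over multiple passes.
// "outputIndex" persists across calls so each pass resumes where the last one stopped.
public class MultiPassOutputSketch {
  private final int[] holderValues;
  private int outputIndex = 0; // next entry in the holder to copy out

  MultiPassOutputSketch(int[] holderValues) {
    this.holderValues = holderValues;
  }

  /** Copies up to outputCapacity entries; returns true when the holder is fully drained. */
  boolean outputCurrentBatch(int[] output, int outputCapacity) {
    int copied = 0;
    while (outputIndex < holderValues.length && copied < outputCapacity) {
      output[copied++] = holderValues[outputIndex++];
    }
    return outputIndex == holderValues.length;
  }
}
{code}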
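And a third sketch of the lazy BatchHolder creation described in the last item: the first holder is allocated only when the first put arrives, so fragments that never receive input allocate nothing.
{code}
// Standalone sketch: defer creating the first BatchHolder until the first put().
import java.util.ArrayList;
import java.util.List;

public class LazyHolderSketch {
  static class BatchHolder { /* vectors would be allocated here */ }

  private final List<BatchHolder> batchHolders = new ArrayList<>();

  void put(int key) {
    if (batchHolders.isEmpty()) {
      batchHolders.add(new BatchHolder()); // allocated only on first use
    }
    // ... insert key into the current holder ...
  }
}
{code}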



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)