Posted to issues@drill.apache.org by "Boaz Ben-Zvi (JIRA)" <ji...@apache.org> on 2017/08/17 01:47:00 UTC

[jira] [Commented] (DRILL-5588) Hash Aggregate: Avoid copy on output of aggregate columns

    [ https://issues.apache.org/jira/browse/DRILL-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16129749#comment-16129749 ] 

Boaz Ben-Zvi commented on DRILL-5588:
-------------------------------------

For AVG(), the division does not take place in the Hash Agg operator; instead, both the SUM and the COUNT columns are returned downstream to the Project operator, which performs the division. For VARIANCE() and STDDEV(), however, the result is computed inside the Hash Agg itself.
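
To illustrate the AVG() split described above, here is a minimal sketch in plain Java. It is not Drill code: the class AvgSplitSketch, the Partial holder, and the aggregate() and project() methods are hypothetical names, and ordinary Java collections stand in for value vectors. The aggregation step only accumulates SUM and COUNT per key; the division happens in a separate downstream step.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AvgSplitSketch {
    /** Per-group partial state, standing in for the SUM and COUNT columns (hypothetical). */
    public static final class Partial {
        public double sum;
        public long count;
    }

    /** "Hash Agg" step: accumulates SUM and COUNT per key and performs no division. */
    public static Map<String, Partial> aggregate(String[] keys, double[] values) {
        Map<String, Partial> groups = new LinkedHashMap<>();
        for (int i = 0; i < keys.length; i++) {
            Partial p = groups.computeIfAbsent(keys[i], k -> new Partial());
            p.sum += values[i];
            p.count++;
        }
        return groups;
    }

    /** "Project" step: receives SUM and COUNT and performs the division downstream. */
    public static Map<String, Double> project(Map<String, Partial> groups) {
        Map<String, Double> averages = new LinkedHashMap<>();
        for (Map.Entry<String, Partial> e : groups.entrySet()) {
            Partial p = e.getValue();
            // A group only exists once it has seen a row, so count is at least 1 here.
            averages.put(e.getKey(), p.sum / p.count);
        }
        return averages;
    }

    public static void main(String[] args) {
        String[] keys = {"a", "b", "a"};
        double[] values = {1.0, 10.0, 3.0};
        System.out.println(project(aggregate(keys, values)));
    }
}
```

Under this split, the SUM and COUNT columns leave the aggregation step untouched, so for AVG() there is no division for the Hash Agg to do in place.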
  

> Hash Aggregate: Avoid copy on output of aggregate columns
> ---------------------------------------------------------
>
>                 Key: DRILL-5588
>                 URL: https://issues.apache.org/jira/browse/DRILL-5588
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Execution - Relational Operators
>    Affects Versions: 1.10.0
>            Reporter: Boaz Ben-Zvi
>
>  When the Hash Aggregate operator outputs its result batches downstream, the key columns (value vectors) are returned as is, but for the aggregate columns new value vectors are allocated and the values are copied (see the method allocateOutgoing()). This hurts performance. It also affects memory management, as this allocation is not planned for by the code that controls spilling, etc.
>    For some simple aggregate functions (e.g. SUM), the stored value vectors for the aggregate values can be returned as is. For functions like AVG, the SUM values must be divided by the COUNT values. Still, this can be done in place (overwriting the SUM values), avoiding a new allocation and copy.
>    For VarChar type aggregate values (only used by MAX or MIN), there is another issue: currently any such value vector is allocated as an ObjectVector (see BatchHolder()), which lives on the JVM heap rather than in direct memory. This is to manage the sizes of the values, which can change as the aggregation progresses (e.g., for MAX(name), the first record has 'abe', but the next record has 'benjamin', which is both greater ('b' > 'a') and longer). For the final output, this requires a new allocation and a copy in order to have a compact value vector in direct memory. Maybe the ObjectVector could be replaced with some direct memory implementation that is optimized for "good" values (e.g., all of similar size) but penalizes "bad" values (e.g., reallocates or moves values when needed)?
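
One possible reading of the direct-memory idea in the last paragraph can be sketched as follows. This is only an illustration under stated assumptions, not Drill's ObjectVector or any existing Drill class: DirectVarCharSlots and all of its methods are hypothetical, java.nio.ByteBuffer.allocateDirect() stands in for Drill's direct memory, and String.compareTo() stands in for the real VarChar comparison. A new value that fits in the group's existing slot is overwritten in place (the "good" case); a longer value is moved to a fresh slot at the end of the buffer (the "bad" case).

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class DirectVarCharSlots {
    private ByteBuffer data;            // direct memory holding all values
    private final int[] start;          // slot start offset per group
    private final int[] capacity;       // bytes reserved for each slot
    private final int[] length;         // bytes currently used in each slot
    private final boolean[] present;    // whether the group has a value yet
    private int end = 0;                // first free byte in data
    private int relocations = 0;        // how often a value had to move

    public DirectVarCharSlots(int groups, int initialBytes) {
        data = ByteBuffer.allocateDirect(initialBytes);
        start = new int[groups];
        capacity = new int[groups];
        length = new int[groups];
        present = new boolean[groups];
    }

    /** MAX() update: keeps the candidate only if it is greater than the stored value. */
    public void updateMax(int group, String candidate) {
        if (!present[group] || candidate.compareTo(get(group)) > 0) {
            set(group, candidate);
        }
    }

    private void set(int group, String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        if (bytes.length > capacity[group]) {
            // "Bad" case: the value outgrew its slot, so reserve a new slot at the end.
            if (capacity[group] > 0) {
                relocations++;
            }
            ensureRoom(bytes.length);
            start[group] = end;
            capacity[group] = bytes.length;
            end += bytes.length;
        }
        // "Good" case falls straight through: overwrite the existing slot in place.
        for (int i = 0; i < bytes.length; i++) {
            data.put(start[group] + i, bytes[i]);
        }
        length[group] = bytes.length;
        present[group] = true;
    }

    /** Grows the direct buffer (at least doubling it) when the next slot would not fit. */
    private void ensureRoom(int needed) {
        if (end + needed <= data.capacity()) {
            return;
        }
        int newCapacity = Math.max(data.capacity() * 2, end + needed);
        ByteBuffer bigger = ByteBuffer.allocateDirect(newCapacity);
        ByteBuffer used = data.duplicate();
        used.position(0);
        used.limit(end);
        bigger.put(used);   // copy only the bytes written so far
        data = bigger;
    }

    public String get(int group) {
        byte[] out = new byte[length[group]];
        for (int i = 0; i < out.length; i++) {
            out[i] = data.get(start[group] + i);
        }
        return new String(out, StandardCharsets.UTF_8);
    }

    public int relocations() {
        return relocations;
    }
}
```

In this sketch the abandoned slot of a relocated value is never reclaimed, so the buffer is not compact; a real implementation would still need some compaction or bookkeeping before returning the vector downstream, and that is where the penalty for "bad" values would show up.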



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)