Posted to issues-all@impala.apache.org by "Alex Rodoni (JIRA)" <ji...@apache.org> on 2018/08/30 18:47:00 UTC

[jira] [Updated] (IMPALA-1430) Codegen all aggregate functions, including UDAs

     [ https://issues.apache.org/jira/browse/IMPALA-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Rodoni updated IMPALA-1430:
--------------------------------
    Docs Text:   (was: This fix enables codegen for aggregate functions that previously didn't support it. This can lead to dramatic performance improvements (5x in some cases).

The first part enables codegen for all builtin aggregate functions except those with CHAR arguments, e.g. min(STRING), sum(TIMESTAMP), group_concat())

> Codegen all aggregate functions, including UDAs
> -----------------------------------------------
>
>                 Key: IMPALA-1430
>                 URL: https://issues.apache.org/jira/browse/IMPALA-1430
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>    Affects Versions: Impala 2.0
>            Reporter: Skye Wanderman-Milne
>            Assignee: Tim Armstrong
>            Priority: Minor
>              Labels: codegen, performance, ramp-up
>             Fix For: Impala 2.9.0
>
>
> Currently codegen is disabled for the entire aggregation operator if a single aggregate function can't be codegen'd. We should address this by making it so all aggregate functions can be codegen'd, including UDAs. For UDAs in .so's, the codegen'd function will call into the UDA library. This also affects aggregation operators on TIMESTAMP columns.
> This perf hit can be especially bad for COMPUTE STATS, which is heavily CPU-bound on the aggregation, because there is no easy way to exclude the TIMESTAMP columns when computing the column stats (i.e., there is no simple workaround).
> Even if the portions involving TIMESTAMP cannot be codegen'd, it would still be worthwhile to come up with a workaround where codegen for the other types is still enabled.
> *Workaround*
> If you are experiencing very slow COMPUTE STATS due to this issue, then you may be able to temporarily ALTER the TIMESTAMP columns to STRING or INT type before running COMPUTE STATS. After the command completes, the columns can be altered back to TIMESTAMP.
> Note that the workaround applies only to text data, not Parquet data. Parquet requires compatible data types: TIMESTAMP is stored as INT96, which is not compatible with STRING or BIGINT.
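
The workaround above could be scripted roughly as follows. This is only a minimal sketch: the helper name compute_stats_workaround and the table and column names (web_logs, event_time, load_time) are hypothetical and do not come from the issue. The helper only builds the statements as strings, in the order described (alter the TIMESTAMP columns, run COMPUTE STATS, alter them back); it does not connect to Impala, and it assumes a text-format table, since the note above says the workaround does not apply to Parquet.

```python
def compute_stats_workaround(table, timestamp_columns, temp_type="STRING"):
    """Build the statements for the temporary-ALTER workaround, in order.

    Hypothetical helper for illustration only: it returns SQL strings and
    does not execute anything against a cluster.
    """
    # The issue text only mentions STRING or INT as temporary types.
    if temp_type not in ("STRING", "INT"):
        raise ValueError("temp_type must be STRING or INT")

    statements = []

    # 1. Temporarily change each TIMESTAMP column to the substitute type.
    for col in timestamp_columns:
        statements.append(f"ALTER TABLE {table} CHANGE {col} {col} {temp_type};")

    # 2. Compute stats while no TIMESTAMP columns are present.
    statements.append(f"COMPUTE STATS {table};")

    # 3. Restore the original TIMESTAMP type on each column.
    for col in timestamp_columns:
        statements.append(f"ALTER TABLE {table} CHANGE {col} {col} TIMESTAMP;")

    return statements


if __name__ == "__main__":
    # Example table and column names are made up for this sketch.
    for stmt in compute_stats_workaround("web_logs", ["event_time", "load_time"]):
        print(stmt)
```

The generated statements could then be pasted into impala-shell or run through whatever client is already in use. Keeping the restore step in the same list makes it harder to forget to alter the columns back afterwards.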


