Posted to issues-all@impala.apache.org by "Tim Armstrong (Jira)" <ji...@apache.org> on 2020/06/13 00:04:00 UTC
[jira] [Resolved] (IMPALA-2522) Improve the reliability and effectiveness of ETL

     [ https://issues.apache.org/jira/browse/IMPALA-2522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Armstrong resolved IMPALA-2522.
-----------------------------------
    Resolution: Fixed

Will mark as fixed for now, since the vast majority of subtasks are completed and there hasn't been movement for a while.

> Improve the reliability and effectiveness of ETL
> ------------------------------------------------
>
>                 Key: IMPALA-2522
>                 URL: https://issues.apache.org/jira/browse/IMPALA-2522
>             Project: IMPALA
>          Issue Type: Epic
>          Components: Backend
>    Affects Versions: Impala 2.2, Impala 2.3.0, Impala 2.5.0, Impala 2.4.0, Impala 2.6.0, Impala 2.7.0
>            Reporter: Mostafa Mokhtar
>            Assignee: Lars Volker
>            Priority: Major
>              Labels: ETL, performance
>
> h4. Reduce the memory requirements of INSERTs into partitioned tables.
> Impala inserts into partitioned Parquet tables suffer from high memory requirements because each Impala Daemon will keep ~256MB of buffer space per open partition in the table sink. This often leads to large insert jobs hitting "Memory limit exceeded" errors. The behavior can be improved by pre-clustering the data such that only one partition needs to be buffered at a time in the table sink.
> Add a new "clustered" plan hint for insert statements. Example:
> {code}
> CREATE TABLE dst (...) PARTITIONED BY (year INT, month INT);
> INSERT INTO dst PARTITION(year,month) /*+ clustered */ SELECT * FROM src;
> {code}
> The hint specifies that the data fed into the table sink should be clustered based on the partition columns. For now, we'll use a sort to achieve clustering, and the plan should look like this:
> SCAN -> SORT (year,month) -> TABLE SINK
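
To illustrate why clustering helps, here is a minimal sketch of the buffering behavior described above. It is a toy model, not Impala's table sink: `simulate_table_sink`, the row layout, and the sample data are all hypothetical, and the 256 MB figure is only the approximate number quoted in the description.

```python
BUFFER_MB_PER_PARTITION = 256  # approximate per-partition buffer size quoted above


def simulate_table_sink(rows, clustered):
    """Return the peak number of partition buffers open at once (toy model)."""
    open_buffers = {}
    peak_open = 0
    previous_key = None
    for row in rows:
        key = (row["year"], row["month"])
        if clustered and previous_key is not None and key != previous_key:
            # Clustered input: the previous partition will not reappear,
            # so its buffer can be flushed and released.
            del open_buffers[previous_key]
        open_buffers.setdefault(key, []).append(row)
        peak_open = max(peak_open, len(open_buffers))
        previous_key = key
    return peak_open


# Hypothetical source rows that interleave 12 distinct (year, month) partitions.
src = [{"year": 2018 + i % 3, "month": 1 + i % 4, "value": i} for i in range(24)]

unclustered_peak = simulate_table_sink(src, clustered=False)

# The SORT step in the plan above corresponds to ordering by the partition columns.
clustered_rows = sorted(src, key=lambda r: (r["year"], r["month"]))
clustered_peak = simulate_table_sink(clustered_rows, clustered=True)

print(unclustered_peak * BUFFER_MB_PER_PARTITION, "MB unclustered")
print(clustered_peak * BUFFER_MB_PER_PARTITION, "MB clustered")
```

In this model the unclustered sink must hold every partition it has seen until the input ends, while the clustered sink holds one buffer at a time, at the cost of the extra sort.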
> h4. Give users additional control over the insertion order.
> To improve compression and/or the effectiveness of min/max pruning, it is desirable to control the order in which rows are inserted into a table (mostly for Parquet).
> Introduce a "sortby" plan hint for insert statements. Example:
> {code}
> CREATE TABLE dst (...) PARTITIONED BY (year INT, month INT);
> INSERT INTO dst PARTITION(year,month) /*+ clustered sortby(day,hour) */ SELECT * FROM src;
> {code}
> This would produce the following plan:
> SCAN -> SORT(year,month,day,hour) -> TABLE SINK
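
The effect of the insertion order on min/max pruning can be sketched as follows. This is an illustrative model only: the fixed-size "row groups", the helper names (`sort_key`, `row_group_stats`, `groups_to_scan`), and the sample data are assumptions made for the example, not Impala or Parquet APIs.

```python
def sort_key(row):
    # Partition columns first (clustered), then the sortby columns.
    return (row["year"], row["month"], row["day"], row["hour"])


def row_group_stats(rows, group_size, column):
    """Split rows into fixed-size groups and record (min, max) of one column per group."""
    stats = []
    for start in range(0, len(rows), group_size):
        values = [r[column] for r in rows[start:start + group_size]]
        stats.append((min(values), max(values)))
    return stats


def groups_to_scan(stats, value):
    """Count groups whose [min, max] range could contain the value."""
    return sum(1 for low, high in stats if low <= value <= high)


# Hypothetical arrival order for one partition: hour-major, so days are interleaved.
arrival = [
    {"year": 2020, "month": 6, "day": d, "hour": h}
    for h in range(24)
    for d in range(1, 31)
]
ordered = sorted(arrival, key=sort_key)

GROUP_SIZE = 72  # arbitrary group size chosen for the example
arrival_hits = groups_to_scan(row_group_stats(arrival, GROUP_SIZE, "day"), 15)
ordered_hits = groups_to_scan(row_group_stats(ordered, GROUP_SIZE, "day"), 15)
print(arrival_hits, "groups scanned in arrival order;", ordered_hits, "after sorting")
```

With interleaved input every group spans the full range of `day`, so a filter on `day = 15` cannot skip any group. After sorting, each group covers a narrow range and most groups can be skipped.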
> h4. Improve the sort efficiency
> The additional sorting step introduced by both solutions above should be as efficient as possible.
> Codegen TupleRowComparator and Tuple::MaterializeExprs.
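
As a rough analogy for what code generation buys in a comparator, the sketch below contrasts a generic comparator that walks a column list on every call with one whose column list is unrolled into generated source once. This is a Python illustration only and does not reflect how TupleRowComparator is implemented or codegen'd in Impala; both function names are hypothetical.

```python
from functools import cmp_to_key


def interpreted_comparator(columns):
    """Generic comparator: iterates over the column list on every comparison."""
    def compare(a, b):
        for col in columns:
            if a[col] < b[col]:
                return -1
            if a[col] > b[col]:
                return 1
        return 0
    return compare


def specialized_comparator(columns):
    """Generate source with the column list unrolled, then compile it once."""
    lines = ["def compare(a, b):"]
    for col in columns:
        lines.append(f"    if a[{col!r}] < b[{col!r}]: return -1")
        lines.append(f"    if a[{col!r}] > b[{col!r}]: return 1")
    lines.append("    return 0")
    namespace = {}
    exec("\n".join(lines), namespace)
    return namespace["compare"]


columns = ["year", "month", "day", "hour"]
rows = [
    {"year": 2020, "month": 1 + i % 3, "day": 1 + (i * 7) % 28, "hour": (i * 5) % 24}
    for i in range(50)
]

by_interpreted = sorted(rows, key=cmp_to_key(interpreted_comparator(columns)))
by_specialized = sorted(rows, key=cmp_to_key(specialized_comparator(columns)))
```

Both comparators produce the same ordering. The specialized one has no per-call loop over the column list, which is the kind of overhead that specializing code for a known sort key aims to remove.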
> h4. Summary
> With more predictable and resource-efficient ETL, users will extract more value out of Impala and rely less on slow legacy ETL tools like Hive.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
