You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "Csaba Ringhofer (Jira)" <ji...@apache.org> on 2021/01/11 12:02:00 UTC

[jira] [Created] (IMPALA-10431) Consider partitioning instead of sorting before inserts

Csaba Ringhofer created IMPALA-10431:
----------------------------------------

             Summary: Consider partitioning instead of sorting before inserts
                 Key: IMPALA-10431
                 URL: https://issues.apache.org/jira/browse/IMPALA-10431
             Project: IMPALA
          Issue Type: Improvement
          Components: Backend, Frontend
            Reporter: Csaba Ringhofer


Currently before partitioned inserts Impala can add a sort node to sort by the partitioning columns - the benefit of this is that the insert will only have to write one file at a time, which reduces memory requirements if the number of partitions is large. The drawback is the extra O(n*log n) CPU cost and the potential spilling.

As the order we write the partitions in doesn't matter, partitioning the rows in a spilling  hash table would give the same benefits but reduce the O(n*log) n CPU cost to O(n). It may be even possible to avoid spilling in some cases - e.g. if only a single partition is written, and the partitioner passes through the first partition it sees to the file sink.

The drawback is that unlike sort, we don't have  a partitioner node in Impala yet, but the algorithm is very similar to what's happening in grouping aggregation or hash join builders, so probably this could be done with medium amount of code.

Note that we can also sort by other columns than partitioning columns during insert to make min-max stat filtering efficient - in this case we should probably stick to the sorting before insert.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org