You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@impala.apache.org by "Thomas Tauber-Marshall (JIRA)" <ji...@apache.org> on 2017/07/24 14:32:00 UTC

[jira] [Resolved] (IMPALA-5498) Support for partial sorts

     [ https://issues.apache.org/jira/browse/IMPALA-5498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Tauber-Marshall resolved IMPALA-5498.
--------------------------------------------
       Resolution: Fixed
    Fix Version/s: Impala 2.10.0

commit ad0c6e7499534d70d5b7de8e38199a9c5cfcbb48
Author: Thomas Tauber-Marshall <tm...@cloudera.com>
Date:   Thu Jun 22 12:26:48 2017 -0700

    IMPALA-5498: Support for partial sorts in Kudu INSERTs
    
    Impala currently supports total sorts (the entire set of data
    is sorted) and top-n sorts (only the highest/lowest n elements
    are sorted). This patch adds the ability to do partial sorts,
    where the data is divided up into some number of subsets, each
    of which is sorted individually.
    
    It accomplishes this by adding a new exec node, PartialSortNode.
    When PartialSortNode::GetNext() is called, it retrieves input
    up to the query memory limit, uses the existing Sorter class to sort
    it, and outputs it. This is faster than a total sort with SortNode
    as it avoids the need to spill if the input is larger than the
    memory limit.
    
    Future work will look into setting a more restrictive memory limit
    on the PartialSortNode. (IMPALA-5669)
    
    In the planner, the SortNode plan node is used, with an enum value
    indicating if it is a total or partial sort.
    
    This also adds a new counter 'RunSize' to the runtime profile which
    tracks the min, max, and avg size of the generated runs, in tuples.
    
    As a first use case, partial sort is used where a total sort was
    used previously for inserts/upserts into Kudu tables only. Future
    work can extend this to other table sinks. (IMPALA-5649)
    
    Testing:
    - E2E test with a large INSERT into a Kudu table with a mem limit.
      Checks that no spills occurred.
    - Updated planner tests.
    - Existing E2E tests and stress test verify correctness of INSERT.
    - Perf tests on the 10 node cluster: inserting tpch_100.lineitem
      into a Kudu table with mem_limit=3gb:
      Previously: 5 runs are spilled, sort took 7m33s
      Now: no spills, sort takes 6m19s, for ~18% speedup
    
    Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
    Reviewed-on: http://gerrit.cloudera.org:8080/7267
    Reviewed-by: Thomas Tauber-Marshall <tm...@cloudera.com>
    Tested-by: Impala Public Jenkins

> Support for partial sorts
> -------------------------
>
>                 Key: IMPALA-5498
>                 URL: https://issues.apache.org/jira/browse/IMPALA-5498
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>    Affects Versions: Impala 2.10.0
>            Reporter: Thomas Tauber-Marshall
>            Assignee: Thomas Tauber-Marshall
>              Labels: kudu, performance
>             Fix For: Impala 2.10.0
>
>
> Impala's sorting code currently only allows for full sorts, but it could be extended to support partial sorts.
> This would be useful in situations where the sorting is being done for performance rather than correctness. For example, a recent change (IMPALA-3742) sorts rows for an INSERT before sending them to Kudu. Doing only a partial sort could speed this up and reduce the memory required while still retaining the primary benefit of reducing the load on Kudu.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)