Posted to dev@hive.apache.org by "Andrew Sherman (JIRA)" <ji...@apache.org> on 2017/10/03 00:23:00 UTC

[jira] [Created] (HIVE-17677) Investigate using hive statistics information to optimize HoS parallel order by

Andrew Sherman created HIVE-17677:
-------------------------------------

             Summary: Investigate using hive statistics information to optimize HoS parallel order by
                 Key: HIVE-17677
                 URL: https://issues.apache.org/jira/browse/HIVE-17677
             Project: Hive
          Issue Type: Improvement
    Affects Versions: 3.0.0
            Reporter: Andrew Sherman
            Assignee: Andrew Sherman


I think Spark's native parallel order by works similarly to what we do for Hive-on-MR. That is, it scans the RDD once, sampling the data to determine the ranges the data should be partitioned into, and then scans the RDD again to do the actual order by (with multiple reducers).
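The two-pass approach above can be sketched roughly as follows. This is an illustrative sketch only, not Spark's actual RangePartitioner internals: pass 1 samples the keys to pick (numPartitions - 1) boundary values, and pass 2 routes each key to the reducer that owns its range. All class and method names here are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical sketch of sampling-based range partitioning (the "two scans"
// described above). Not Spark's real implementation; names are illustrative.
public class RangeBoundaries {

    // Pass 1: sample the keys and pick (numPartitions - 1) evenly spaced
    // quantiles from the sorted sample as partition boundaries.
    static int[] boundariesFromSample(int[] keys, int sampleSize,
                                      int numPartitions, long seed) {
        Random rnd = new Random(seed);
        int[] sample = new int[sampleSize];
        for (int i = 0; i < sampleSize; i++) {
            sample[i] = keys[rnd.nextInt(keys.length)]; // sample with replacement
        }
        Arrays.sort(sample);
        int[] bounds = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            bounds[i - 1] = sample[i * sampleSize / numPartitions];
        }
        return bounds;
    }

    // Pass 2: route a key to the partition whose range contains it.
    // Sorting each partition locally then yields a total order overall.
    static int partitionFor(int key, int[] bounds) {
        int p = Arrays.binarySearch(bounds, key);
        return p >= 0 ? p : -p - 1; // insertion point = partition index
    }
}
```

The point of the sketch is that pass 1 exists only to compute `bounds`; anything that supplies equivalent boundaries up front makes that scan redundant.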

One optimization suggested by [~stakiar] is that if we have column stats for the column we are ordering by, then the first scan of the RDD is not necessary: if we have histogram data for that column, we already know what the ranges of the order by should be. This should work when running parallel order by on simple tables; it will be harder (although not impossible) when we run it on derived datasets.
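A minimal sketch of that optimization, assuming the metastore histogram already gives us ascending bucket boundaries for the order-by column (the `histogramBounds` input stands in for stats read from the metastore; it is an assumption, not a real Hive API): we derive the split points directly from the histogram and skip the sampling scan entirely.

```java
import java.util.Arrays;

// Hypothetical partitioner built from precomputed histogram boundaries
// instead of a sampling pass. Illustrative only; not an existing Hive or
// Spark class.
public class StatsBasedPartitioner {
    final long[] bounds; // ascending split points between partitions

    StatsBasedPartitioner(long[] histogramBounds, int numPartitions) {
        // Take every step-th histogram boundary so we end up with
        // (numPartitions - 1) split points covering the value range.
        bounds = new long[numPartitions - 1];
        int step = Math.max(1, (histogramBounds.length + 1) / numPartitions);
        for (int i = 0; i < bounds.length; i++) {
            int idx = Math.min((i + 1) * step - 1, histogramBounds.length - 1);
            bounds[i] = histogramBounds[idx];
        }
    }

    // Same routing as sampling-based range partitioning: binary search
    // the split points; the insertion point is the partition index.
    int getPartition(long key) {
        int p = Arrays.binarySearch(bounds, key);
        return p >= 0 ? p : -p - 1;
    }
}
```

In Spark terms, a partitioner like this could presumably be handed to repartitionAndSortWithinPartitions on the JavaPairRDD, which is where understanding the JavaPairRDD internals comes in.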

To do this we would have to understand more about the internals of JavaPairRDD.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)