You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Rui Li (JIRA)" <ji...@apache.org> on 2016/03/25 13:18:25 UTC

[jira] [Commented] (HIVE-13293) Query occurs performance degradation after enabling parallel order by for Hive on Spark

    [ https://issues.apache.org/jira/browse/HIVE-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15211748#comment-15211748 ] 

Rui Li commented on HIVE-13293:
-------------------------------

Just did some research about this. Actually the overhead is not so big as I thought. If a query is complicated, we'll have multiple stages. For spark, intermediate stages are called {{ShuffleMapStage}} and the last stage is called {{ResultStage}}. Suppose we have the following stage graph:
{noformat}
ShuffleMapStage1
      | (join)
ShuflleMapStage2
      | (groupBy)
ShuffleMapStage3
      | (sortByKey)
   ResultStage4
{noformat}
When calling sortByKey, spark launches the sampling job. The job triggers computation of ShuffleMapStage1, ShuflleMapStage2 and a ResultStage that shares most of ShuffleMapStage3. When we launch the real job, we'll submit the above stage graph. But at this point, spark will consider ShuffleMapStage1 and ShuflleMapStage2 as already computed because the shuffle outputs are still in local disk. Therefore what's re-computed is just ShuffleMapStage3. I have done some tests to verify this.
That being said, when ShuffleMapStage3 is complicated enough, we'll still have some considerable overhead. And I think that's the case for Q10 in TPCx-BB.
Rather than splitting the task, I think a better and easier way is to cache the RDD before calling sortByKey. We can use DISK_ONLY storage level if memory is a concern. I'll come up with a patch for review.

> Query occurs performance degradation after enabling parallel order by for Hive on Spark
> ---------------------------------------------------------------------------------------
>
>                 Key: HIVE-13293
>                 URL: https://issues.apache.org/jira/browse/HIVE-13293
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>    Affects Versions: 2.0.0
>            Reporter: Lifeng Wang
>            Assignee: Rui Li
>
> I use TPCx-BB to do some performance test on Hive on Spark engine. And found query 10 has performance degradation when enabling parallel order by.
> It seems that sampling cost much time before running the real query.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)