Posted to issues@hive.apache.org by "Xuefu Zhang (JIRA)" <ji...@apache.org> on 2016/05/11 15:49:12 UTC

[jira] [Commented] (HIVE-13293) Query occurs performance degradation after enabling parallel order by for Hive on Spark

    [ https://issues.apache.org/jira/browse/HIVE-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280312#comment-15280312 ] 

Xuefu Zhang commented on HIVE-13293:
------------------------------------

[~lirui], thanks for working on this. The patch looks good, but one thing I'm not sure about is the persistence level. Order by almost always comes at the end of the stages. Given that, does it make sense to use a mix of memory and disk?

As an aside (out of scope for this issue): do we need to explicitly call rdd.unpersist() on those cached RDDs once a query completes? Right now, RDDs are never reused across queries.
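
To make the unpersist question concrete, here is a minimal sketch in plain Java. It is not Hive or Spark code: CachedData and QueryCacheTracker are hypothetical names, and CachedData only stands in for a cached RDD so the sketch runs without Spark on the classpath. The assumption is that whatever a query caches gets recorded somewhere and is released when the query finishes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a cached dataset; in Spark an RDD would play this role. */
interface CachedData {
    void unpersist();
}

/** Hypothetical helper: remembers what a query cached and releases it afterwards. */
final class QueryCacheTracker implements AutoCloseable {
    private final List<CachedData> cached = new ArrayList<>();

    /** Record a dataset that was persisted on behalf of the current query. */
    void track(CachedData data) {
        cached.add(data);
    }

    /** Number of tracked datasets that have not been released yet. */
    int pending() {
        return cached.size();
    }

    /** Release everything once the query is done; calling it again is a no-op. */
    @Override
    public void close() {
        for (CachedData data : cached) {
            data.unpersist();
        }
        cached.clear();
    }
}
```

Because the tracker is AutoCloseable, a caller could wrap query execution in try-with-resources so the release also happens when the query fails. Whether Hive on Spark should do anything like this is exactly the open question above.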

> Query occurs performance degradation after enabling parallel order by for Hive on Spark
> ---------------------------------------------------------------------------------------
>
>                 Key: HIVE-13293
>                 URL: https://issues.apache.org/jira/browse/HIVE-13293
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>    Affects Versions: 2.0.0
>            Reporter: Lifeng Wang
>            Assignee: Rui Li
>         Attachments: HIVE-13293.1.patch, HIVE-13293.1.patch
>
>
> I used TPCx-BB to run some performance tests on the Hive on Spark engine, and found that query 10 shows a performance degradation when parallel order by is enabled.
> It seems that sampling takes a long time before the real query runs.


