Posted to issues@hive.apache.org by "Rui Li (JIRA)" <ji...@apache.org> on 2016/04/11 07:17:25 UTC

[jira] [Updated] (HIVE-13293) Query occurs performance degradation after enabling parallel order by for Hive on Spark

     [ https://issues.apache.org/jira/browse/HIVE-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rui Li updated HIVE-13293:
--------------------------
    Attachment: HIVE-13293.1.patch

I tried both splitting the task and caching the RDD, and chose the latter here because it is simpler and also works with queries that have only one ShuffleMapStage. The two solutions performed roughly the same in my local tests. I used DISK_ONLY as the storage level, which I think is good enough for performance and avoids additional memory overhead.
Lifeng, could you help test the patch with your data set? Thanks.
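
To illustrate why caching helps, here is a minimal sketch in plain Python. It is not the patch and does not use Spark. It assumes that parallel order by reads the upstream data twice: once to sample keys for the range boundaries, and once to do the actual sort. The names Upstream and parallel_order_by are hypothetical. The in-memory memoization only stands in for persisting the RDD, which in Spark would be a persist call with StorageLevel.DISK_ONLY.

```python
import bisect
import random


class Upstream:
    """Hypothetical stand-in for the expensive work that feeds the sort."""

    def __init__(self, rows, cached=False):
        self.rows = list(rows)
        self.cached = cached
        self.compute_count = 0  # how many times the expensive work ran
        self._stored = None

    def collect(self):
        # With caching on, reuse the stored result instead of recomputing.
        if self.cached and self._stored is not None:
            return self._stored
        self.compute_count += 1
        result = [r * 2 for r in self.rows]  # placeholder for the real work
        if self.cached:
            self._stored = result
        return result


def parallel_order_by(upstream, num_partitions, sample_size=20, seed=0):
    # Pass 1: sample keys to pick range boundaries for the partitions.
    rng = random.Random(seed)
    data = upstream.collect()
    sample = sorted(rng.sample(data, min(sample_size, len(data))))
    step = len(sample) / num_partitions
    bounds = [sample[int(step * i)] for i in range(1, num_partitions)]

    # Pass 2: read the input again, range-partition it, sort each partition.
    partitions = [[] for _ in range(num_partitions)]
    for key in upstream.collect():
        partitions[bisect.bisect_right(bounds, key)].append(key)
    return [key for part in partitions for key in sorted(part)]


rows = range(100, 0, -1)
plain = Upstream(rows)
cached = Upstream(rows, cached=True)
parallel_order_by(plain, 4)
parallel_order_by(cached, 4)
print(plain.compute_count, cached.compute_count)
```

In this toy model the uncached input is computed once for sampling and once more for the sort, while the cached input is computed only once. Whether that matches the real cost breakdown for query 10 still needs to be confirmed on the actual data set.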

> Query occurs performance degradation after enabling parallel order by for Hive on Spark
> ---------------------------------------------------------------------------------------
>
>                 Key: HIVE-13293
>                 URL: https://issues.apache.org/jira/browse/HIVE-13293
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>    Affects Versions: 2.0.0
>            Reporter: Lifeng Wang
>            Assignee: Rui Li
>         Attachments: HIVE-13293.1.patch
>
>
> I used TPCx-BB to run some performance tests on the Hive on Spark engine and found that query 10 degrades in performance when parallel order by is enabled.
> It seems that sampling takes a long time before the real query runs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)