
Posted to issues@hive.apache.org by "Sahil Takiar (JIRA)" <ji...@apache.org> on 2017/06/20 22:20:00 UTC

[jira] [Comment Edited] (HIVE-16923) Hive-on-Spark DPP Improvements

    [ https://issues.apache.org/jira/browse/HIVE-16923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16056570#comment-16056570 ] 

Sahil Takiar edited comment on HIVE-16923 at 6/20/17 10:19 PM:
---------------------------------------------------------------

Will post a design doc soon.

Two of the biggest limitations of the current DPP implementation are that it requires an additional Spark job and that it requires writing some intermediate data to HDFS. We should evaluate the overhead of these limitations and whether it's possible to remove them.

Ideally, DPP shouldn't hurt performance for any query. One way to ensure this is to build a cost-based model that predicts whether DPP will help performance. For example, a simple cost-based model could enable DPP for map-joins only. Since map-joins already require two Spark jobs and writing intermediate data to HDFS, running DPP with a map-join shouldn't add significant overhead.
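
As a rough illustration of the "map-joins only" rule, the decision could be sketched as below. This is a minimal sketch, not Hive's planner code: JoinType, JoinInfo, and should_enable_dpp are hypothetical names, and the check that the big table is partitioned is an added assumption (DPP has nothing to prune otherwise).

```python
from dataclasses import dataclass
from enum import Enum


class JoinType(Enum):
    # Hypothetical join categories; not Hive's actual operator classes.
    MAP_JOIN = "map_join"
    SHUFFLE_JOIN = "shuffle_join"


@dataclass(frozen=True)
class JoinInfo:
    # Hypothetical summary of a join as the planner might see it.
    join_type: JoinType
    big_table_partitioned: bool


def should_enable_dpp(join: JoinInfo) -> bool:
    """Simplest possible cost model: enable DPP only for map-joins.

    The reasoning is that a map-join already pays for a second Spark job
    and an intermediate write to HDFS, so DPP should add little extra cost.
    """
    # Assumption: DPP cannot help if there are no partitions to prune.
    if not join.big_table_partitioned:
        return False
    return join.join_type is JoinType.MAP_JOIN
```

A fuller cost-based model would presumably replace the final check with an estimate of pruning benefit versus the cost of the extra job and HDFS write.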


was (Author: stakiar):
Will post a design doc soon.

Two of the biggest limitations of the current DPP implementation are that it requires an additional Spark job and it requires writing some intermediate data to HDFS.

Ideally, DPP shouldn't hurt performance for any query. One way to ensure this is to build a cost-based model that predicts whether DPP will help performance. For example, a simple cost-based model could enable DPP for map-joins only. Since map-joins already require two Spark jobs and writing intermediate data to HDFS, running DPP with a map-join shouldn't add significant overhead.

> Hive-on-Spark DPP Improvements
> ------------------------------
>
>                 Key: HIVE-16923
>                 URL: https://issues.apache.org/jira/browse/HIVE-16923
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>
> Improvements to Hive-on-Spark DPP so that it is production ready.
> Hive-on-Spark DPP was implemented in HIVE-9152. However, it is disabled by default. The goal of this JIRA is to improve the DPP implementation so that it can be enabled by default.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)