Posted to issues@hive.apache.org by "Sahil Takiar (JIRA)" <ji...@apache.org> on 2017/06/20 22:19:00 UTC

[jira] [Commented] (HIVE-16923) Hive-on-Spark DPP Improvements

    [ https://issues.apache.org/jira/browse/HIVE-16923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16056570#comment-16056570 ] 

Sahil Takiar commented on HIVE-16923:
-------------------------------------

Will post a design doc soon.

Two of the biggest limitations of the current dynamic partition pruning (DPP) implementation are that it requires an additional Spark job and that it writes intermediate data to HDFS.

Ideally, DPP shouldn't hurt performance for any query. One way to ensure this is to build a cost-based model that predicts whether DPP will help performance. For example, a simple model could enable DPP for map-joins only. Since map-joins already require two Spark jobs and already write intermediate data to HDFS, running DPP with a map-join shouldn't add significant overhead.
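
To make the map-join-only rule above concrete, here is a rough sketch of what such a decision function might look like. It is written in Python purely for illustration and is not Hive code: JoinType, JoinInfo, and should_enable_dpp are hypothetical names, and the two partition checks are assumptions about when pruning would apply at all.

```python
from dataclasses import dataclass
from enum import Enum


class JoinType(Enum):
    MAP_JOIN = "map_join"          # small side is broadcast to every task
    SHUFFLE_JOIN = "shuffle_join"  # both sides are shuffled on the join key


@dataclass(frozen=True)
class JoinInfo:
    """Hypothetical summary of one join the planner is considering."""
    join_type: JoinType
    big_table_is_partitioned: bool    # assumption: pruning needs a partitioned table
    joins_on_partition_column: bool   # assumption: the join key is a partition column


def should_enable_dpp(join: JoinInfo) -> bool:
    """Return True only when DPP is unlikely to add overhead.

    This encodes the simple rule from the comment: map-joins already pay
    for a second Spark job and an HDFS write, so DPP adds little on top.
    Every other join type is rejected to avoid a possible regression.
    """
    if not (join.big_table_is_partitioned and join.joins_on_partition_column):
        return False  # nothing to prune
    return join.join_type is JoinType.MAP_JOIN


if __name__ == "__main__":
    map_join = JoinInfo(JoinType.MAP_JOIN, True, True)
    shuffle_join = JoinInfo(JoinType.SHUFFLE_JOIN, True, True)
    print(should_enable_dpp(map_join))
    print(should_enable_dpp(shuffle_join))
```

A fuller cost-based model would presumably replace the final check with an estimate that weighs the expected scan savings against the cost of the extra job, but that is beyond this sketch.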

> Hive-on-Spark DPP Improvements
> ------------------------------
>
>                 Key: HIVE-16923
>                 URL: https://issues.apache.org/jira/browse/HIVE-16923
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>
> Improvements to Hive-on-Spark DPP so that it is production ready.
> Hive-on-Spark DPP was implemented in HIVE-9152. However, it is disabled by default. The goal of this JIRA is to improve the DPP implementation so that it can be enabled by default.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)