Posted to dev@hive.apache.org by "Xuefu Zhang (JIRA)" <ji...@apache.org> on 2014/09/16 18:03:34 UTC

[jira] [Comment Edited] (HIVE-8118) SparkMapRecordHandler and SparkReduceRecordHandler should be initialized with multiple result collectors [Spark Branch]

    [ https://issues.apache.org/jira/browse/HIVE-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135517#comment-14135517 ] 

Xuefu Zhang edited comment on HIVE-8118 at 9/16/14 4:02 PM:
------------------------------------------------------------

Hi [~chengxiang li],

Thank you for your input. I'm not sure I've understood your thought correctly, so let me clarify the problem with a SparkWork like this:
{code}
MapWork1 -> ReduceWork1
          \-> ReduceWork2
{code}
This means that MapWork1 generates two different data sets, one feeding ReduceWork1 and the other feeding ReduceWork2. In the case of multi-insert, ReduceWork1 and ReduceWork2 will each have an FS (FileSink) operator. Inside MapWork1, there will be two operator branches consuming the same input data and pushing different data sets to two RS (ReduceSink) operators. (ReduceWork1 and ReduceWork2 have different HiveReduceFunctions.)
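
To make the shape concrete, here is a toy sketch (plain Java, not Hive code; all names are made up) of a map-side tree that scans the input once, forks it into two branches, and emits a different data set from each branch:
{code}
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;

// Toy model of MapWork1: one scan of the input, two branches, two outputs.
public class ForkedMapWorkDemo {
  public static void main(String[] args) {
    List<Entry<String, Integer>> toRS1 = new ArrayList<>(); // feeds ReduceWork1
    List<Entry<String, Integer>> toRS2 = new ArrayList<>(); // feeds ReduceWork2
    for (String row : new String[] {"a,1", "b,2", "a,3"}) {
      String[] cols = row.split(",");
      int value = Integer.parseInt(cols[1]);
      // Branch 1: key by name, e.g. for a GROUP BY in ReduceWork1.
      toRS1.add(new SimpleEntry<>(cols[0], value));
      // Branch 2: a different key, e.g. for an ORDER BY in ReduceWork2.
      toRS2.add(new SimpleEntry<>("all", value));
    }
    // Two distinct data sets leave the map side, one per child reduce work.
    System.out.println("RS1 output: " + toRS1);
    System.out.println("RS2 output: " + toRS2);
  }
}
{code}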

However, the current implementation takes only the first data set and feeds it to both reduce works. The same problem can also occur if MapWork1 were instead a reduce work following another ReduceWork or MapWork.

Given this problem, I'm not sure how we can get around it without letting MapWork1 generate two output RDDs, one for each following reduce work. One alternative is to duplicate MapWork1, giving the following diagram:
{code}
MapWork11 -> ReduceWork1
MapWork12 -> ReduceWork2
{code}
where MapWork11 and MapWork12 consume the same input table (the input table as an RDD); MapWork11 feeds its output RDD to ReduceWork1, and MapWork12 feeds its output RDD to ReduceWork2. This has its own complexity, but more importantly it wastes READ (unless Spark is smart enough to cache the input table, which is unlikely) and COMPUTATION (the same map-side work is computed twice). I feel it's unlikely that we'll get such optimizations from the Spark framework in the near term.
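
As a rough illustration of the double pass (a minimal Spark sketch in Java; the class name and wiring are made up, and real Hive-on-Spark plans are more involved), evaluating two separately built map pipelines over the same source re-runs the scan and the map logic once per child:
{code}
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class DuplicatedMapWorkDemo {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("demo").setMaster("local[2]");
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaRDD<String> input = sc.parallelize(Arrays.asList("a,1", "b,2", "a,3"));
    // MapWork11 and MapWork12: the same map logic, duplicated per child.
    JavaRDD<String> forReduceWork1 = input.map(r -> "rw1:" + r);
    JavaRDD<String> forReduceWork2 = input.map(r -> "rw2:" + r);
    // Two actions, two full evaluations of the source and the map logic;
    // with a file-backed input this also means reading the file twice.
    System.out.println(forReduceWork1.collect());
    System.out.println(forReduceWork2.collect());
    sc.stop();
  }
}
{code}
(An explicit input.cache() would avoid the re-read, but nothing in the framework does this for us automatically.)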

Thus, I think we have to take into account that a map work or a reduce work may generate multiple RDDs, one feeding each of its children. Since SparkMapRecordHandler and SparkReduceRecordHandler do the data processing on the map and reduce sides, they need a way to generate multiple outputs.
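
Concretely, something along these lines (a minimal sketch only; the interface and class names are made up, not Hive's actual API): the handler is initialized with one collector per child work, keyed by the child's name, and each operator branch emits to the collector of its target child:
{code}
import java.util.HashMap;
import java.util.Map;

// Hypothetical collector abstraction: one instance per child work.
interface ResultCollector<K, V> {
  void collect(K key, V value);
}

// Sketch of a record handler initialized with multiple collectors,
// instead of the single collector the current implementation accepts.
class MultiOutputRecordHandler<K, V> {
  private final Map<String, ResultCollector<K, V>> collectors = new HashMap<>();

  void init(Map<String, ResultCollector<K, V>> collectorsPerChild) {
    collectors.putAll(collectorsPerChild);
  }

  // An operator branch emits to the collector of its target child work,
  // so the second (and later) data sets are no longer dropped.
  void emit(String childWorkName, K key, V value) {
    ResultCollector<K, V> c = collectors.get(childWorkName);
    if (c == null) {
      throw new IllegalStateException("no collector for " + childWorkName);
    }
    c.collect(key, value);
  }
}
{code}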

Please correct me if I've misunderstood you. Thanks.



> SparkMapRecordHandler and SparkReduceRecordHandler should be initialized with multiple result collectors [Spark Branch]
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-8118
>                 URL: https://issues.apache.org/jira/browse/HIVE-8118
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>            Reporter: Xuefu Zhang
>            Assignee: Venki Korukanti
>              Labels: Spark-M1
>
> In the current implementation, both SparkMapRecordHandler and SparkReduceRecordHandler take only one result collector, which means the corresponding map or reduce task can have only one child. Yet it's very common in multi-insert queries for a map/reduce task to have more than one child. A query like the following has two map tasks as parents (of the union):
> {code}
> select name, sum(value) from dec group by name
> union all
> select name, value from dec order by name
> {code}
> It's possible that in the future an optimization may be implemented so that a map work is followed by two reduce works and then connected to a union work.
> Thus, we should treat this as the general case. Tez currently provides a collector for each child operator in the map-side or reduce-side operator tree; we can take Tez as a reference.
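> Roughly, the Tez-side pattern looks like this (a hedged sketch based on the Tez runtime API; the holder class here is made up, not Hive's exact code):
> {code}
> import java.util.HashMap;
> import java.util.Map;
> 
> import org.apache.tez.runtime.api.LogicalOutput;
> import org.apache.tez.runtime.library.api.KeyValueWriter;
> 
> // Made-up holder: one writer per child vertex, keyed by vertex name.
> public class PerChildWriters {
>   private final Map<String, KeyValueWriter> writers = new HashMap<>();
> 
>   public void init(Map<String, LogicalOutput> outputs) throws Exception {
>     for (Map.Entry<String, LogicalOutput> e : outputs.entrySet()) {
>       writers.put(e.getKey(), (KeyValueWriter) e.getValue().getWriter());
>     }
>   }
> 
>   // An operator branch writes to the writer of its target child vertex.
>   public void write(String child, Object key, Object value) throws Exception {
>     writers.get(child).write(key, value);
>   }
> }
> {code}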
> This is likely a big change, and subtasks may be needed.
> With this, we can have a simpler and cleaner multi-insert implementation. This is also the problem observed in HIVE-7731.


