Posted to dev@hive.apache.org by "Pradeep Kamath (JIRA)" <ji...@apache.org> on 2012/12/07 03:39:21 UTC
[jira] [Updated] (HIVE-3733) Improve Hive's logic for conditional merge

     [ https://issues.apache.org/jira/browse/HIVE-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath updated HIVE-3733:
---------------------------------

    Attachment: HIVE-3733.4.patch.txt

Changed the code that inspects the Operator Stack so that it looks for a ReduceSinkOperator instead of the exact CurrWork.getReducer() instance.

union19 no longer performs a conditional merge with this change. My hypothesis for this follows:

The union19 query is:
FROM (select 'tst1' as key, cast(count(1) as string) as value from src s1
UNION ALL
select s2.key as key, s2.value as value from src s2) unionsrc
INSERT OVERWRITE TABLE DEST1 SELECT unionsrc.key, count(unionsrc.value) group by unionsrc.key
INSERT OVERWRITE TABLE DEST2 SELECT unionsrc.key, unionsrc.value, unionsrc.value;

The FROM subquery has an implicit group by/ReduceSink because of the count. So although the second insert of the multi-insert has no group by/ReduceSink of its own, the subquery in the FROM clause causes a group by/ReduceSink to appear in the stack. We therefore decide not to do the conditional merge, since the FileSink will be in the reduce phase.
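
To illustrate the hypothesis, here is a minimal sketch of the two stack checks. It is not the patch itself: Op and ReduceSinkOp are hypothetical stand-ins rather than Hive's real operator classes, and the operator path for union19 is an assumed, simplified one.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for operator classes; these are NOT Hive's real classes.
class OperatorStackSketch {
    static class Op {
        final String name;
        Op(String name) { this.name = name; }
    }

    static class ReduceSinkOp extends Op {
        ReduceSinkOp(String name) { super(name); }
    }

    // Identity check: is this exact reducer instance on the walk path?
    static boolean containsInstance(List<Op> path, Op reducer) {
        for (Op op : path) {
            if (op == reducer) {
                return true;
            }
        }
        return false;
    }

    // Type check: is any reduce sink on the walk path?
    static boolean containsReduceSink(List<Op> path) {
        for (Op op : path) {
            if (op instanceof ReduceSinkOp) {
                return true;
            }
        }
        return false;
    }

    // Assumed, simplified path from the s1 table scan to the second insert's file sink.
    static List<Op> union19SecondInsertPath() {
        return Arrays.asList(
            new Op("TS[s1]"), new Op("GBY[count]"), new ReduceSinkOp("RS[subquery]"),
            new Op("GBY[count, final]"), new Op("UNION"), new Op("SEL"), new Op("FS[DEST2]"));
    }

    public static void main(String[] args) {
        List<Op> path = union19SecondInsertPath();
        // Hypothetical reducer of the current work; assumed not to be on this path.
        Op currentReducer = new Op("GBY[outer]");
        System.out.println("exact reducer instance on path: " + containsInstance(path, currentReducer));
        System.out.println("any ReduceSink on path: " + containsReduceSink(path));
    }
}
```

Under these assumptions the identity check finds nothing while the type check finds the subquery's ReduceSink, which would explain why union19 stops getting a conditional merge.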
                
> Improve Hive's logic for conditional merge
> ------------------------------------------
>
>                 Key: HIVE-3733
>                 URL: https://issues.apache.org/jira/browse/HIVE-3733
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Pradeep Kamath
>            Assignee: Pradeep Kamath
>         Attachments: HIVE-3733.1.patch.txt, HIVE-3733.3.patch.txt, HIVE-3733.4.patch.txt
>
>
> If the config hive.merge.mapfiles is set to true and hive.merge.mapredfiles is set to false, then when Hive encounters a FileSinkOperator while generating map reduce tasks, it looks at the entire job to see whether it has a reducer; if it does, it will not merge. Instead, it should check whether the FileSinkOperator is a child of the reducer. This means that outputs generated in the mapper will be merged and outputs generated in the reducer will not be, which is the intended effect of setting those configs.
> Simple repro:
> set hive.merge.mapfiles=true;
> set hive.merge.mapredfiles=false;
> EXPLAIN
> FROM <input_table>
> INSERT OVERWRITE TABLE <output_table1> SELECT key, COUNT(*) group by key
> INSERT OVERWRITE TABLE <output_table2> SELECT *;
> The output should contain a Conditional Operator, Mapred Stages, and Move tasks
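
The difference between the two behaviours in the description can be sketched as below. This is an illustration only, not Hive's code: the class and method names are hypothetical, and extending the rule to flag combinations other than the one in the repro is an assumption.

```java
// Hypothetical model of the merge decision; names are illustrative, not Hive's API.
class MergeDecisionSketch {
    // Behaviour described in the issue: the whole job is inspected for a reducer.
    static boolean mergeByJob(boolean mergeMapFiles, boolean mergeMapredFiles, boolean jobHasReducer) {
        return jobHasReducer ? mergeMapredFiles : mergeMapFiles;
    }

    // Proposed behaviour: only the position of this particular file sink matters.
    static boolean mergeBySink(boolean mergeMapFiles, boolean mergeMapredFiles, boolean sinkUnderReducer) {
        return sinkUnderReducer ? mergeMapredFiles : mergeMapFiles;
    }

    public static void main(String[] args) {
        boolean mapFiles = true;     // hive.merge.mapfiles=true
        boolean mapredFiles = false; // hive.merge.mapredfiles=false

        // Repro query: the job has a reducer because of the GROUP BY insert,
        // but the SELECT * insert writes its output on the map side.
        System.out.println("job-level, SELECT * sink:   " + mergeByJob(mapFiles, mapredFiles, true));
        System.out.println("sink-level, SELECT * sink:  " + mergeBySink(mapFiles, mapredFiles, false));
        System.out.println("sink-level, GROUP BY sink:  " + mergeBySink(mapFiles, mapredFiles, true));
    }
}
```

With the repro's settings, the job-level rule skips the merge for the SELECT * output, while the sink-level rule merges it and still leaves the GROUP BY output unmerged.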
