Posted to dev@hive.apache.org by "Ning Zhang (JIRA)" <ji...@apache.org> on 2010/05/12 19:48:46 UTC

[jira] Commented: (HIVE-1307) More generic and efficient merge method

    [ https://issues.apache.org/jira/browse/HIVE-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866655#action_12866655 ] 

Ning Zhang commented on HIVE-1307:
----------------------------------

Some design notes:

This task should benefit not only dynamic partition inserts, but any insert that requires merging (hive.merge.mapfiles=true or hive.merge.mapredfiles=true). The idea is as follows:

The current merge job is a MapReduce job, one per partition. The mappers just read the files and pass the records along to a single reducer, which consolidates all inputs into a single stream. The extra work at the mapper/reducer boundary (e.g., copying, shuffling, and sorting) is unnecessary.

With CombineHiveInputFormat, the merge job can be map-only and can take care of multiple partitions. The idea is to generate one mapper per partition. The input format for that mapper should be CombineHiveInputFormat, so that it reads multiple files and writes a single output file.
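
To make the one-mapper-per-partition idea concrete, here is a minimal Python sketch. It is not Hive code: the names plan_merge_tasks and run_merge_task are hypothetical, the read_file callback stands in for whatever reads a file's records, and it assumes a file's partition is simply its parent directory.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def plan_merge_tasks(file_paths):
    """Group small files by partition directory: one map task per partition.

    Assumption for illustration: the partition of a file is its parent directory.
    """
    tasks = defaultdict(list)
    for path in file_paths:
        partition_dir = str(PurePosixPath(path).parent)
        tasks[partition_dir].append(path)
    # Sort partitions and files so the plan is deterministic.
    return {part: sorted(files) for part, files in sorted(tasks.items())}


def run_merge_task(files, read_file):
    """Simulate one mapper: read every file of a partition, emit one stream.

    No shuffle or sort happens here; records are appended in file order.
    """
    merged = []
    for f in files:
        merged.extend(read_file(f))
    return merged
```

Each entry in the plan corresponds to one mapper, and each mapper writes exactly one merged output, so no reducer is needed.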

Since CombineHiveInputFormat depends on a Hadoop 0.20 feature, this feature relies on the shim layer to tell whether to use the new map-only merge job (M) or the old MapReduce one (MR). With this restriction, merging after a dynamic partition insert only works on Hadoop 0.20.
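
The shim decision could be sketched roughly as follows. This is an illustration in Python, not the actual shim API: choose_merge_job is a hypothetical name, and treating every version at or above 0.20 as having the required feature is an assumption (the note above only states that 0.20 has it).

```python
def choose_merge_job(hadoop_version: str) -> str:
    """Return "M" (map-only merge) or "MR" (old MapReduce merge).

    Assumption: versions >= 0.20 provide the feature that
    CombineHiveInputFormat depends on; older versions do not.
    """
    parts = hadoop_version.split(".")
    major, minor = int(parts[0]), int(parts[1])
    return "M" if (major, minor) >= (0, 20) else "MR"


def can_merge_dynamic_partitions(hadoop_version: str) -> bool:
    """Merging after a dynamic partition insert needs the map-only job."""
    return choose_merge_job(hadoop_version) == "M"
```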

> More generic and efficient merge method
> ---------------------------------------
>
>                 Key: HIVE-1307
>                 URL: https://issues.apache.org/jira/browse/HIVE-1307
>             Project: Hadoop Hive
>          Issue Type: New Feature
>    Affects Versions: 0.6.0
>            Reporter: Ning Zhang
>            Assignee: Ning Zhang
>             Fix For: 0.6.0
>
>
> Currently, if hive.merge.mapfiles/mapredfiles=true, a new MapReduce job is created to read the input files and send them to one reducer for merging. This MR job is created at compile time, with one MR job per partition. With dynamic partitions, multiple partitions could be created at execution time, so generating the merging MR jobs at compile time is impossible.
> We should generalize the merge framework to allow multiple partitions; most of the time a map-only job should be sufficient if we use CombineHiveInputFormat.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.