Posted to dev@hive.apache.org by "dhruba borthakur (JIRA)" <ji...@apache.org> on 2010/01/20 23:23:54 UTC
[jira] Commented: (HIVE-1071) Making RCFile "concatenatable" to
reduce the number of files of the output
[ https://issues.apache.org/jira/browse/HIVE-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803025#action_12803025 ]
dhruba borthakur commented on HIVE-1071:
----------------------------------------
We could create an API in HDFS that concatenates a set of files into one file. The partial last block of each file would be zero-filled; this is required because all blocks in a single HDFS file (except the last block) must have the same size.
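To make the zero-fill arithmetic concrete, here is a minimal sketch of where each input file would start in the concatenated file. The function name `padded_layout` and the tiny block size are made up for illustration; this is not an existing HDFS API. It assumes every file except the last is padded up to the next block boundary.

```python
def padded_layout(file_lengths, block_size):
    """Return (offsets, total_length) for files concatenated on block boundaries.

    Hypothetical helper: every file except the last is assumed to be
    zero-filled up to the next multiple of block_size.
    """
    offsets = []
    position = 0
    last_index = len(file_lengths) - 1
    for index, length in enumerate(file_lengths):
        offsets.append(position)
        position += length
        if index != last_index:
            remainder = position % block_size
            if remainder:
                # Zero-fill the partial last block of this file.
                position += block_size - remainder
    return offsets, position


# Three files of 100, 64 and 10 bytes with a toy 64-byte block size.
offsets, total = padded_layout([100, 64, 10], block_size=64)
print(offsets, total)
```

With these toy numbers the first file needs 28 bytes of padding, the second is already block-aligned, and the last is left unpadded.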
Once we have the above-mentioned HDFS API, we can merge a bunch of RC files into one single file without doing much physical I/O. The RC file format has to be such that it can safely ignore zero-filled areas in the middle of the file. Can it do this?
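As a rough illustration of what "safely ignore zero-filled areas" could mean for a reader, here is a toy sketch. The record layout below (a marker, a 4-byte length, then the payload) and the names `SYNC`, `write_record` and `read_records` are invented for this example and are not the actual RCFile layout. The only assumption that matters is that a record never starts with a zero byte, so a zero byte at a record boundary can only be padding.

```python
import struct

# Hypothetical marker for this toy format; it must not start with a zero byte.
SYNC = b"\xffSYNC"


def write_record(payload):
    """Encode one record as marker + big-endian length + payload."""
    return SYNC + struct.pack(">I", len(payload)) + payload


def read_records(data):
    """Read all records, skipping zero-filled gaps between them."""
    records = []
    pos = 0
    while pos < len(data):
        if data[pos] == 0:
            # Padding left behind by concatenation: skip the run of zeros.
            while pos < len(data) and data[pos] == 0:
                pos += 1
            continue
        if data[pos:pos + len(SYNC)] != SYNC:
            raise ValueError(f"expected sync marker at offset {pos}")
        pos += len(SYNC)
        (length,) = struct.unpack_from(">I", data, pos)
        pos += 4
        records.append(data[pos:pos + length])
        pos += length
    return records


file_a = write_record(b"a1") + write_record(b"a2")
file_b = write_record(b"b1")
merged = file_a + b"\x00" * 13 + file_b  # zero-fill between the two files
print(read_records(merged))
```

Zeros inside a payload are not a problem here because payloads are length-delimited; only zeros at a record boundary are treated as padding. Whether RCFile can offer the same guarantee is exactly the open question.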
> Making RCFile "concatenatable" to reduce the number of files of the output
> --------------------------------------------------------------------------
>
> Key: HIVE-1071
> URL: https://issues.apache.org/jira/browse/HIVE-1071
> Project: Hadoop Hive
> Issue Type: Improvement
> Reporter: Zheng Shao
>
> Hive automatically determines the number of reducers most of the time.
> Sometimes, we create a lot of small files.
> Hive has an option to "merge" those small files through a map-reduce job.
> Dhruba has an idea that can fix this even faster:
> if we can make RCFile concatenatable, then we can simply tell the namenode to "merge" these files.
> Pros: This approach does not do any I/O so it's faster.
> Cons: We have to zero-fill the files to make sure they can be concatenated (all blocks except the last have to be full HDFS blocks).