Posted to dev@hive.apache.org by "Zheng Shao (JIRA)" <ji...@apache.org> on 2010/01/20 23:15:54 UTC

[jira] Created: (HIVE-1071) Making RCFile "concatenatable" to reduce the number of files of the output

Making RCFile "concatenatable" to reduce the number of files of the output
--------------------------------------------------------------------------

                 Key: HIVE-1071
                 URL: https://issues.apache.org/jira/browse/HIVE-1071
             Project: Hadoop Hive
          Issue Type: Improvement
            Reporter: Zheng Shao


Hive automatically determines the number of reducers most of the time.
Sometimes, this creates a lot of small files.

Hive has an option to "merge" those small files through a map-reduce job.

Dhruba has an idea that could fix this even faster:
if we can make RCFile concatenatable, then we can simply tell the namenode to "merge" these files.

Pros: This approach does not do any I/O, so it's faster.
Cons: We have to zero-fill the files to make sure they can be concatenated (all blocks except the last have to be full HDFS blocks).
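To make the zero-fill cost concrete, here is a small sketch (illustrative only, not Hive or HDFS code; the 256MB block size is an assumption) of how much padding each file would need before concatenation:

```python
# Illustrative sketch: bytes of zero-fill needed so that every block of a
# file is a full HDFS block, which is what block-level concatenation requires.
BLOCK_SIZE = 256 * 1024 * 1024  # assumed 256 MB HDFS block size

def padding_needed(file_size, block_size=BLOCK_SIZE):
    """Bytes of zeros required to round a file up to a block boundary."""
    remainder = file_size % block_size
    return 0 if remainder == 0 else block_size - remainder

# A 10 MB RCFile would need 246 MB of zero padding before concatenation.
print(padding_needed(10 * 1024 * 1024))
```

This shows why the approach is cheap on I/O but potentially expensive in wasted block space for very small files.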




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-1071) Making RCFile "concatenatable" to reduce the number of files of the output

Posted by "Jeff Hammerbacher (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803040#action_12803040 ] 

Jeff Hammerbacher commented on HIVE-1071:
-----------------------------------------

bq. we could create an API in HDFS that concatenates a set of files into one file.

That would be a fantastic primitive to add to HDFS.



[jira] Commented: (HIVE-1071) Making RCFile "concatenatable" to reduce the number of files of the output

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803479#action_12803479 ] 

He Yongqiang commented on HIVE-1071:
------------------------------------

Concatenating files into a single file has two problems to solve:
1) The partial last block of each middle file needs to be zero-filled (why does HDFS assume all blocks in a single file have the same size? Will the DFSClient check that?).
2) The file header of each middle file needs to be removed.
1) is easy to do, but how do we do 2)?
Another possibility is to use something like HAR. We can pack files into a single file, and let HDFS/the namenode know only about the packed file. That way, we can even pack files with different file formats together.
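The HAR-like alternative can be sketched as follows (purely illustrative; the in-memory packing and index here stand in for a real archive format written to HDFS):

```python
# Sketch of packing many small files into one physical file plus a tiny
# (name -> offset, length) index, so the namenode only sees the packed file.
def pack(files):
    """files: dict mapping name -> bytes. Returns (blob, index)."""
    blob, index, offset = b"", {}, 0
    for name, data in files.items():
        index[name] = (offset, len(data))  # record where each file starts
        blob += data
        offset += len(data)
    return blob, index

def read(blob, index, name):
    """Recover one original file from the packed blob via the index."""
    off, length = index[name]
    return blob[off:off + length]
```

No zero-fill is needed here, since the reader locates files by index rather than by block boundaries, but the index itself must be stored and maintained somewhere.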



[jira] Commented: (HIVE-1071) Making RCFile "concatenatable" to reduce the number of files of the output

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803492#action_12803492 ] 

Ning Zhang commented on HIVE-1071:
----------------------------------

@Zheng and Dhruba, if a lot of them are small files (say, smaller than the block size), would it be more efficient to merge them in a compact way rather than filling them with zeros? Say we have 1000 files of 10MB each. With the zero-fill approach we end up with 1000 blocks, whereas the same data would fit into ~40 blocks with a 256MB block size.
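The arithmetic behind this comment, written out as a quick sketch (numbers taken from the comment itself; ceiling division is the only subtlety):

```python
# Block usage: zero-fill concat vs. a compact merge, for 1000 x 10 MB files.
MB = 1024 * 1024
block_size = 256 * MB        # assumed HDFS block size
num_files, file_size = 1000, 10 * MB

# Zero-fill concat: each padded 10 MB file still occupies a whole block.
blocks_zero_fill = num_files

# Compact merge: total data rounded up to whole blocks (ceiling division).
blocks_compact = -(-num_files * file_size // block_size)

print(blocks_zero_fill, blocks_compact)  # 1000 vs 40
```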



[jira] Commented: (HIVE-1071) Making RCFile "concatenatable" to reduce the number of files of the output

Posted by "Eli Collins (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803374#action_12803374 ] 

Eli Collins commented on HIVE-1071:
-----------------------------------

Nice idea. It might be worthwhile to make HDFS handle partial intermediate blocks, so that applications are not responsible for zero-filling blocks (they may not know they need to at write time) or for modifying their file formats.



[jira] Commented: (HIVE-1071) Making RCFile "concatenatable" to reduce the number of files of the output

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803054#action_12803054 ] 

Namit Jain commented on HIVE-1071:
----------------------------------

If the table happens to be bucketed, sampling queries may not work after concatenation.
The bucket offsets need to be stored in the metastore, and those offsets should be used to calculate the splits.
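A minimal sketch of this idea (the offset list and helper are hypothetical, not an actual Hive metastore API): with the start offset of each original bucket recorded, the planner can rebuild one split per bucket inside the merged file.

```python
# Hypothetical helper: rebuild per-bucket splits from stored byte offsets.
def bucket_splits(offsets, total_length):
    """offsets: sorted start offsets of each bucket in the merged file.
    Returns (start, length) pairs, one split per original bucket."""
    ends = offsets[1:] + [total_length]          # each bucket ends where the next starts
    return [(start, end - start) for start, end in zip(offsets, ends)]

print(bucket_splits([0, 300, 700], 1000))  # [(0, 300), (300, 400), (700, 300)]
```

Sampling queries could then select whole buckets by split, as they did before concatenation.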



[jira] Commented: (HIVE-1071) Making RCFile "concatenatable" to reduce the number of files of the output

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803025#action_12803025 ] 

dhruba borthakur commented on HIVE-1071:
----------------------------------------

We could create an API in HDFS that concatenates a set of files into one file. The partial last block of each file would be zero-filled; this is required because all blocks (except the last) in a single HDFS file must have the same size.

Once we have the above-mentioned HDFS API, we can merge a bunch of RCFiles into one single file without doing much physical I/O. The RCFile format has to be able to safely ignore zero-filled areas in the middle of a file. Can it do this?
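A toy model of the proposed operation (illustrative only, not HDFS code): at the namenode, concatenation is just splicing the files' block lists together, after rounding the short final block of every non-last file up to the full block size via zero-fill.

```python
# Toy metadata-level model of the proposed HDFS concat: merge per-file block
# lists, zero-filling the partial last block of each file except the final one.
def concat_block_lists(files, block_size):
    """files: list of per-file block-length lists. Returns the merged list."""
    merged = []
    for i, blocks in enumerate(files):
        blocks = list(blocks)
        if i < len(files) - 1 and blocks and blocks[-1] < block_size:
            blocks[-1] = block_size  # zero-fill the partial last block
        merged.extend(blocks)
    return merged

print(concat_block_lists([[256, 10], [256, 7]], 256))  # [256, 256, 256, 7]
```

Since only block metadata changes (plus writing zeros into the padded tail), no data blocks are copied, which is why the approach avoids most physical I/O.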
