Posted to dev@hive.apache.org by "Ning Zhang (JIRA)" <ji...@apache.org> on 2010/05/14 01:01:44 UTC

[jira] Created: (HIVE-1343) add an interface in RCFile to support concatenation of two files without (de)compression

add an interface in RCFile to support concatenation of two files without (de)compression
----------------------------------------------------------------------------------------

                 Key: HIVE-1343
                 URL: https://issues.apache.org/jira/browse/HIVE-1343
             Project: Hadoop Hive
          Issue Type: New Feature
    Affects Versions: 0.6.0
            Reporter: Ning Zhang
            Assignee: He Yongqiang
             Fix For: 0.6.0


To concatenate two files today, we need to read each record in these files and write it back to the destination file. The IO cost is mostly unavoidable due to the lack of append functionality in HDFS. However, the CPU cost could be significantly reduced by avoiding compression and decompression of the files.

The file format layer should provide an API that implements block-level concatenation.
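
To make the idea concrete, here is a minimal sketch of block-level concatenation on a toy container format. It is not RCFile's actual on-disk layout: the 4-byte length prefix, the use of zlib, and every function name below are assumptions made for illustration. The point is only that the copy path moves compressed bytes and never calls the codec.

```python
import io
import struct
import zlib

# Toy layout (an assumption, not RCFile's): each block is a 4-byte
# big-endian length followed by that many bytes of zlib-compressed records.
HEADER = struct.Struct(">I")


def write_block(out, records):
    """Compress a list of byte-string records into one length-prefixed block."""
    payload = zlib.compress(b"\n".join(records))
    out.write(HEADER.pack(len(payload)))
    out.write(payload)


def iter_raw_blocks(src):
    """Yield each block's compressed payload without decompressing it."""
    while True:
        header = src.read(HEADER.size)
        if not header:
            return
        (length,) = HEADER.unpack(header)
        yield src.read(length)


def concat_blocks(sources, dest):
    """Copy blocks from every source to dest; no zlib call happens here."""
    copied = 0
    for src in sources:
        for payload in iter_raw_blocks(src):
            dest.write(HEADER.pack(len(payload)))
            dest.write(payload)
            copied += 1
    return copied


def read_records(src):
    """Ordinary read path: decompress every block and return all records."""
    records = []
    for payload in iter_raw_blocks(src):
        records.extend(zlib.decompress(payload).split(b"\n"))
    return records


def demo():
    """Build two small files, concatenate them block by block, read back."""
    a, b, merged = io.BytesIO(), io.BytesIO(), io.BytesIO()
    write_block(a, [b"r1", b"r2"])
    write_block(a, [b"r3"])
    write_block(b, [b"r4", b"r5"])
    a.seek(0)
    b.seek(0)
    copied = concat_blocks([a, b], merged)
    merged.seek(0)
    return copied, read_records(merged)
```

A real file format would also need to handle whatever per-file headers and metadata it carries, and the inputs would have to agree on codec and schema; the sketch ignores all of that.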



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-1343) add an interface in RCFile to support concatenation of two files without (de)compression

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-1343:
-------------------------------

    Attachment: HIVE-1343.1.patch




[jira] Commented: (HIVE-1343) add an interface in RCFile to support concatenation of two files without (de)compression

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867557#action_12867557 ] 

Ning Zhang commented on HIVE-1343:
----------------------------------

Yongqiang, this patch only exposes the FileInputReader to the client, and the client has to merge the files locally. This won't scale. What we should do is run the merge as a map-only job so that it can run in parallel.

I talked with Dhruba and he thinks it would be possible to make it a map-only job. The idea is to define a new RecordReader that does not decompress the data or iterate over records. Instead, it iterates over whole blocks and leaves them compressed.
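
A rough sketch of that shape, under stated assumptions: the block layout (4-byte length prefix plus zlib payload) is a toy stand-in for RCFile, `RawBlockReader` is a hypothetical name and not Hadoop's RecordReader API, and a thread pool stands in for the map tasks. Each task copies the raw blocks of its own inputs to its own output, and there is no reduce step.

```python
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor

LEN = struct.Struct(">I")  # toy 4-byte length prefix (assumption)


def make_file(record_groups):
    """Build a toy file: one compressed, length-prefixed block per group."""
    out = bytearray()
    for records in record_groups:
        payload = zlib.compress(b"\n".join(records))
        out += LEN.pack(len(payload)) + payload
    return bytes(out)


class RawBlockReader:
    """Iterates over blocks and yields their bytes still compressed."""

    def __init__(self, data):
        self.data = data
        self.pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.pos >= len(self.data):
            raise StopIteration
        (length,) = LEN.unpack_from(self.data, self.pos)
        end = self.pos + LEN.size + length
        block = self.data[self.pos:end]  # length prefix plus compressed payload
        self.pos = end
        return block


def merge_task(input_files):
    """One stand-in 'map task': append every raw block of its inputs."""
    return b"".join(
        block for data in input_files for block in RawBlockReader(data)
    )


def run_merge(task_inputs, workers=2):
    """Run independent merge tasks in parallel; there is no reduce step."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(merge_task, task_inputs))


def read_all(data):
    """Normal read path, used here only to check the merged output."""
    records = []
    for block in RawBlockReader(data):
        records.extend(zlib.decompress(block[LEN.size:]).split(b"\n"))
    return records
```

For example, `run_merge([[f1, f2], [f3]])` returns two outputs, one per task; the first is byte-for-byte `f1 + f2` in this toy format, since only compressed blocks are copied.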

