Posted to dev@hive.apache.org by "Zheng Shao (JIRA)" <ji...@apache.org> on 2010/02/05 06:38:29 UTC

[jira] Created: (HIVE-1133) Refactor InputFormat and OutputFormat for Hive

Refactor InputFormat and OutputFormat for Hive
----------------------------------------------

                 Key: HIVE-1133
                 URL: https://issues.apache.org/jira/browse/HIVE-1133
             Project: Hadoop Hive
          Issue Type: Improvement
    Affects Versions: 0.6.0
            Reporter: Zheng Shao


We have run into several problems with FileInputFormat/OutputFormat in Hive.

The requirements are:
R1. We want to support HBase: HIVE-806
R2. We want to selectively include files based on file names: HIVE-951
R3. We want to optionally recurse into the directory structure: HIVE-108
R4. We want to pass the filter condition into the storage layer (very useful for HBase and indexed data formats)
R5. We want to pass the column selection information into the storage layer (already done as part of RCFile, but we can do it better)

We need to organize these requirements and the code in a way that keeps it extensible.
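
To make R2 and R3 concrete, here is a minimal sketch of file selection by name with optional recursion. It uses only java.nio on a local file system rather than Hadoop's FileSystem API, and the class and method names (InputFileSelector, select) are hypothetical, not existing Hive code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Hive: selects input files under a root directory.
final class InputFileSelector {
    private InputFileSelector() {}

    // Returns the regular files under root whose names satisfy nameFilter (R2).
    // When recursive is false, only direct children of root are considered (R3).
    static List<Path> select(Path root, Predicate<String> nameFilter, boolean recursive)
            throws IOException {
        int maxDepth = recursive ? Integer.MAX_VALUE : 1;
        try (Stream<Path> paths = Files.walk(root, maxDepth)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> nameFilter.test(p.getFileName().toString()))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }
}
```

A caller might pass a filter such as name -> name.startsWith("part-") to skip log or marker files, and set recursive to true only for tables whose data is spread over subdirectories.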


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-1133) Refactor InputFormat and OutputFormat for Hive

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao updated HIVE-1133:
-----------------------------

    Description: 
We have run into several problems with FileInputFormat/OutputFormat in Hive.

The requirements are:
R1. We want to support HBase: HIVE-806
R2. We want to selectively include files based on file names: HIVE-951
R3. We want to optionally recurse into the directory structure: HIVE-1083
R4. We want to pass the filter condition into the storage layer (very useful for HBase and indexed data formats)
R5. We want to pass the column selection information into the storage layer (already done as part of RCFile, but we can do it better)

We need to organize these requirements and the code in a way that keeps it extensible.


  was:
We have run into several problems with FileInputFormat/OutputFormat in Hive.

The requirements are:
R1. We want to support HBase: HIVE-806
R2. We want to selectively include files based on file names: HIVE-951
R3. We want to optionally recurse into the directory structure: HIVE-108
R4. We want to pass the filter condition into the storage layer (very useful for HBase and indexed data formats)
R5. We want to pass the column selection information into the storage layer (already done as part of RCFile, but we can do it better)

We need to organize these requirements and the code in a way that keeps it extensible.





[jira] Commented: (HIVE-1133) Refactor InputFormat and OutputFormat for Hive

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829977#action_12829977 ] 

Zheng Shao commented on HIVE-1133:
----------------------------------

Thanks for the note, Bennie. In the future, please assign the issue to yourself and click "submit patch" so that we know it's ready for review (we will "cancel patch" if we have comments).





[jira] Commented: (HIVE-1133) Refactor InputFormat and OutputFormat for Hive

Posted by "Bennie Schut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829975#action_12829975 ] 

Bennie Schut commented on HIVE-1133:
------------------------------------

This could conflict with some changes I made for HIVE-1019. A patch is available for that one.



[jira] Commented: (HIVE-1133) Refactor InputFormat and OutputFormat for Hive

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830214#action_12830214 ] 

Ning Zhang commented on HIVE-1133:
----------------------------------

R4 (pushing down simple predicates) is also useful for RCFile or any FileFormat internal to Hive since we can implement a faster "search-based" HiveRecordReader that takes a set of predicates and only returns satisfying records. 
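
As a rough illustration of a reader that takes a set of predicates and returns only satisfying records, here is a minimal sketch. FilteringReader is a hypothetical name and is not Hive's HiveRecordReader. It scans and filters every row, whereas a truly search-based reader would presumably use an index or the file layout to skip non-matching data.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

// Hypothetical sketch, not Hive code: wraps a row source and returns only
// the rows that satisfy every pushed-down predicate.
final class FilteringReader<T> implements Iterator<T> {
    private final Iterator<T> source;
    private final List<Predicate<T>> predicates;
    private T nextRow;
    private boolean hasNextRow;

    FilteringReader(Iterator<T> source, List<Predicate<T>> predicates) {
        this.source = source;
        this.predicates = predicates;
        advance();
    }

    // Moves to the next row that passes all predicates, if there is one.
    private void advance() {
        hasNextRow = false;
        while (source.hasNext()) {
            T candidate = source.next();
            if (predicates.stream().allMatch(p -> p.test(candidate))) {
                nextRow = candidate;
                hasNextRow = true;
                return;
            }
        }
    }

    @Override
    public boolean hasNext() {
        return hasNextRow;
    }

    @Override
    public T next() {
        if (!hasNextRow) {
            throw new NoSuchElementException();
        }
        T row = nextRow;
        advance();
        return row;
    }
}
```

The operators above such a reader never see rejected rows, which is where the saving would come from.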



[jira] Updated: (HIVE-1133) Refactor InputFormat and OutputFormat for Hive

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carl Steinbach updated HIVE-1133:
---------------------------------

    Component/s: HBase Handler
                 Serializers/Deserializers



[jira] Commented: (HIVE-1133) Refactor InputFormat and OutputFormat for Hive

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831054#action_12831054 ] 

He Yongqiang commented on HIVE-1133:
------------------------------------

Another possible requirement: add support for Zebra's file format.



[jira] Commented: (HIVE-1133) Refactor InputFormat and OutputFormat for Hive

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830280#action_12830280 ] 

Namit Jain commented on HIVE-1133:
----------------------------------

R4. We can even exploit the sorted characteristics of the data. We know when a table is sorted/bucketed,
but never make use of it.
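
One way sortedness could be exploited for a range filter is sketched below, assuming the keys are already in ascending order and held in memory. SortedRangeScan and its methods are hypothetical names, not Hive code. A binary search finds the range boundaries so rows outside the range are never examined.

```java
import java.util.List;

// Hypothetical sketch: range lookup over keys that are already sorted ascending.
final class SortedRangeScan {
    private SortedRangeScan() {}

    // Index of the first element >= key, or keys.size() if there is none.
    static <K extends Comparable<K>> int lowerBound(List<K> keys, K key) {
        int lo = 0;
        int hi = keys.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (keys.get(mid).compareTo(key) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    // Keys with low <= key < high, located with two binary searches.
    static <K extends Comparable<K>> List<K> range(List<K> sortedKeys, K low, K high) {
        int from = lowerBound(sortedKeys, low);
        int to = lowerBound(sortedKeys, high);
        return sortedKeys.subList(from, Math.max(from, to));
    }
}
```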



[jira] Commented: (HIVE-1133) Refactor InputFormat and OutputFormat for Hive

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829965#action_12829965 ] 

Zheng Shao commented on HIVE-1133:
----------------------------------

Functions related to this refactoring:

{code}
ExecDriver.addInputPaths
HiveInputFormat.getSplits
CombineHiveInputFormat.getSplits
ExecMap.configure
{code}

