Posted to dev@hive.apache.org by "Schubert Zhang (JIRA)" <ji...@apache.org> on 2009/08/29 07:54:32 UTC

[jira] Created: (HIVE-806) Hive with HBase as data store to support MapReduce and direct query

Hive with HBase as data store to support MapReduce and direct query
-------------------------------------------------------------------

                 Key: HIVE-806
                 URL: https://issues.apache.org/jira/browse/HIVE-806
             Project: Hadoop Hive
          Issue Type: New Feature
            Reporter: Schubert Zhang


Hive currently uses only HDFS as its underlying data store; it can query and analyze files in HDFS via MapReduce.
But in some engineering cases, our data is stored, organized, and indexed in HBase or other data stores. This JIRA issue will implement HBase as a data store for Hive. In addition to supporting MapReduce on HBase, we will support direct queries on HBase.

This is a sibling JIRA issue of HIVE-705 (Let Hive can analyse hbase's tables, https://issues.apache.org/jira/browse/HIVE-705). Because the implementation and use cases differ somewhat from HIVE-705, this issue was created to avoid confusion. The two issues may be combined in the future.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-806) Hive with HBase as data store to support MapReduce and direct query

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779785#action_12779785 ] 

Namit Jain commented on HIVE-806:
---------------------------------

Is someone working on this?

[jira] Updated: (HIVE-806) Hive with HBase as data store to support MapReduce and direct query

Posted by "John Sichi (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Sichi updated HIVE-806:
----------------------------

    Component/s: HBase Handler

[jira] Updated: (HIVE-806) Hive with HBase as data store to support MapReduce and direct query

Posted by "Schubert Zhang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Schubert Zhang updated HIVE-806:
--------------------------------

    Description: 
Hive currently uses only HDFS as its underlying data store; it can query and analyze files in HDFS via MapReduce.
But in some engineering cases, our data is stored, organized, and indexed in HBase or other data stores. This JIRA issue will implement HBase as a data store for Hive. In addition to supporting MapReduce on HBase, we will support direct queries on HBase.

This is a sibling JIRA issue of HIVE-705 (Let Hive can analyse hbase's tables, https://issues.apache.org/jira/browse/HIVE-705). Because the implementation and use cases differ somewhat from HIVE-705, this issue was created to avoid confusion. The two issues may be combined in the future.

Initial developers: Kula Liao, Stephen Xie, Tao Jiang and Schubert Zhang.

  was:
Hive currently uses only HDFS as its underlying data store; it can query and analyze files in HDFS via MapReduce.
But in some engineering cases, our data is stored, organized, and indexed in HBase or other data stores. This JIRA issue will implement HBase as a data store for Hive. In addition to supporting MapReduce on HBase, we will support direct queries on HBase.

This is a sibling JIRA issue of HIVE-705 (Let Hive can analyse hbase's tables, https://issues.apache.org/jira/browse/HIVE-705). Because the implementation and use cases differ somewhat from HIVE-705, this issue was created to avoid confusion. The two issues may be combined in the future.


[jira] Commented: (HIVE-806) Hive with HBase as data store to support MapReduce and direct query

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779866#action_12779866 ] 

He Yongqiang commented on HIVE-806:
-----------------------------------

I think Schubert is on vacation right now. Will try to contact him.

[jira] Commented: (HIVE-806) Hive with HBase as data store to support MapReduce and direct query

Posted by "Schubert Zhang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752877#action_12752877 ] 

Schubert Zhang commented on HIVE-806:
-------------------------------------

@Zheng, we are designing and coding now, and we had a talk with Samuel a few days ago. Because this work is part of one of our ongoing projects, I am sorry that updates will not be very quick.
I describe some of our considerations below, and will update when we complete our implementation and verification.

1. A new HBaseInputFormat.

The current TableInputFormat always scans the whole HBase HTable, which is usually unnecessary, especially when we know one or more row ranges.
A new HBaseInputFormat will be implemented to provide more parameters to control the behavior of the HTable scan, e.g.:
(1) row ranges (one or more startRow and endRow pairs)
(2) column list (sometimes we need not read all columns; HBase is a column-oriented store)
(3) filter tree (predicate pushdown; filter rows/columns at the region server)
(4) maybe some computation on the region server (optional)
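
A minimal sketch of how parameters (1) to (3) could narrow a scan, using a plain in-memory dict in place of an HTable. The names ScanSpec and scan are hypothetical and are not part of HBase or Hive; row ranges are assumed to be half-open [startRow, endRow) and non-overlapping.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List, Optional, Tuple

Row = Dict[str, str]  # "family:qualifier" -> value


@dataclass
class ScanSpec:
    """Hypothetical scan parameters; not an HBase or Hive class."""
    row_ranges: List[Tuple[str, str]]        # [start_row, end_row) pairs
    columns: Optional[List[str]] = None      # None means all columns
    row_filter: Callable[[str, Row], bool] = lambda key, row: True


def scan(table: Dict[str, Row], spec: ScanSpec) -> Iterator[Tuple[str, Row]]:
    """Yield only the rows and columns that the spec asks for."""
    keys = sorted(table)  # stands in for the rowkey ordering of an HTable
    for start, end in sorted(spec.row_ranges):
        # Seek to the first key >= start instead of reading from the top.
        i = bisect_left(keys, start)
        while i < len(keys) and keys[i] < end:
            key = keys[i]
            row = table[key]
            if spec.row_filter(key, row):
                if spec.columns is not None:
                    # Project the row down to the requested column list.
                    row = {c: row[c] for c in spec.columns if c in row}
                yield key, row
            i += 1
```

The filter is evaluated on the full row before the column list is applied, so a predicate may refer to columns that are not returned.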

2. SerDe

We use a more flexible SerDe design for engineering practice.
(1) We will support the MAP data type to map to HBase's (sparse) column family:column qualifiers. This is a rigid mapping between the Hive table schema and the HTable schema, and sometimes it is not very effective for structured data.
(2) Use a nested SerDe to implement the codec for the rowkey and columns. Usually the rowkey in an HTable is a combination of more than one Hive column. We also support storing a list of columns in one HTable column family without using HBase's column qualifier feature; instead, the columns in the column family are self-encoded (for example, with a comma delimiter).
      RowSerDe { RowKeySerDe,  ColumnSerDe}

This is an example of the above SerDe design.

CREATE TABLE t1(rowkey1 int, rowkey2 string, value1 string, value2 int, value3 long, value4 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.hbase.HBaseRowSerDe'
WITH SERDEPROPERTIES (

"rowkey.serde.class"="org.apache.hadoop.hive.serde2.hbase.SimpleRowkeySerDe",  -- this will be a built-in SerDe for the rowkey
"rowkey.columns"="rowkey2,rowkey1",  -- the rowkey in the HTable is a combination of two Hive columns
"rowkey.column.lengths"="12,2",      -- the lengths of the two Hive columns in the rowkey
"rowkey.column.delimiter"=",",       -- the delimiter in the rowkey (may be omitted if not defined)

"column.families"="cf1:(value1,value2); cf2:(value3,value4)",  -- there are two column families in the HTable; cf1 and cf2 have two columns each
"column.family.serde.class"="cf1:org.apache.hadoop.hive.serde2.hbase.SimpleColumnSerDe;
cf2:org.apache.hadoop.hive.serde2.hbase.ColumnSerDe1",  -- cf1 and cf2 can use different SerDes
"column.family.cf1.delimiter"=","

) STORED AS HBASETABLE;

(Note: we have completed and verified the above code.)
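
To illustrate how the rowkey.* properties in the example might combine two Hive columns into one HTable rowkey, here is a small sketch. It is not the SimpleRowkeySerDe code, which is not shown in this thread; the function names are hypothetical, and padding each field with trailing spaces up to its declared length is an assumption.

```python
from typing import Dict, List


def encode_rowkey(values: Dict[str, object], columns: List[str],
                  lengths: List[int], delimiter: str = "") -> str:
    """Join the listed columns into one fixed-width rowkey string."""
    parts = []
    for name, width in zip(columns, lengths):
        text = str(values[name])
        if len(text) > width:
            raise ValueError(f"{name}={text!r} exceeds {width} characters")
        # Assumption: left-align and pad with spaces to the declared length.
        parts.append(text.ljust(width))
    return delimiter.join(parts)


def decode_rowkey(rowkey: str, columns: List[str],
                  lengths: List[int], delimiter: str = "") -> Dict[str, str]:
    """Split a rowkey produced by encode_rowkey back into its columns."""
    values = {}
    pos = 0
    for name, width in zip(columns, lengths):
        values[name] = rowkey[pos:pos + width].rstrip()
        pos += width + len(delimiter)  # skip the field and the delimiter
    return values
```

With "rowkey.columns"="rowkey2,rowkey1" and "rowkey.column.lengths"="12,2", rowkey2 comes first in the key, so rows sort by rowkey2 and then by rowkey1. Space padding does not preserve numeric order for integers of different widths; a real codec would need to handle that.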

We shall also support the rigid mapping (MAP) like HIVE-705, e.g.

CREATE TABLE hbase_table_1(rowkey1 int, rowkey2 string, value1 string, value2 int,  abcd MAP<string, string>)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.hbase.HBaseRowSerDe'
WITH SERDEPROPERTIES (

"rowkey.serde.class"="org.apache.hadoop.hive.serde2.hbase.SimpleRowkeySerDe",
"rowkey.columns"="rowkey2,rowkey1",
"rowkey.column.lengths"="12,2",
"rowkey.column.delimiter"=",",

"column.families"="cf1:(value1,value2); cf2:=abcd",
"column.family.serde.class"="cf1:org.apache.hadoop.hive.serde2.hbase.SimpleColumnSerDe;
cf2:org.apache.hadoop.hive.serde2.hbase.QualiferColumnSerDe",
"column.family.cf1.delimiter"=","

) STORED AS HBASETABLE;
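
The "column.families" value in the two examples appears to use two forms: cf:(a,b) for several Hive columns packed into one column family, and cf:=m for a whole family exposed as one Hive MAP column. A rough sketch of a parser for that string follows; the grammar is inferred from the two examples only, and parse_column_families is a hypothetical name.

```python
from typing import Dict, List, Union


def parse_column_families(spec: str) -> Dict[str, Union[List[str], str]]:
    """Map each column family to a list of packed columns or one MAP column."""
    mapping: Dict[str, Union[List[str], str]] = {}
    for entry in spec.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        family, _, target = entry.partition(":")
        family, target = family.strip(), target.strip()
        if target.startswith("(") and target.endswith(")"):
            # cf:(a,b) -> several Hive columns packed into one family
            mapping[family] = [c.strip() for c in target[1:-1].split(",")]
        elif target.startswith("="):
            # cf:=m -> the whole family is exposed as one Hive MAP column
            mapping[family] = target[1:].strip()
        else:
            raise ValueError(f"unrecognized column family entry: {entry!r}")
    return mapping
```

A list value means the family stores self-encoded columns; a string value names the MAP column whose keys become column qualifiers.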

3. To support direct query (scan or get) from HBase HTable

Some straightforward queries that target an HTable need not use MapReduce; we can directly scan or get rows from the HTable, since an HTable is a globally indexed store. We can use some features of HBase to improve performance:
(1) rowkey or rowkey ranges
(2) column list
(3) filter tree (predicate pushdown)
(4) .....

(Note: we have completed and verified the above code.)
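
One way to picture the choice between a direct query and MapReduce is a small decision function over the rowkey constraints of a query. This is only a sketch under assumed rules (exact rowkey gives a get, a rowkey range gives a scan, anything else falls back to MapReduce); the thread does not describe the actual rules, and choose_access_path is a hypothetical name.

```python
from typing import Optional, Tuple


def choose_access_path(rowkey_eq: Optional[str] = None,
                       rowkey_range: Optional[Tuple[str, str]] = None) -> str:
    """Pick an access path from the rowkey constraints of a query."""
    if rowkey_eq is not None:
        return "get"        # single-row lookup by exact rowkey
    if rowkey_range is not None:
        return "scan"       # bounded scan over [start, end)
    return "mapreduce"      # assumed fallback when the rowkey is unconstrained
```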

4. other...




[jira] Commented: (HIVE-806) Hive with HBase as data store to support MapReduce and direct query

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752790#action_12752790 ] 

Zheng Shao commented on HIVE-806:
---------------------------------

Any update on this?

[jira] Commented: (HIVE-806) Hive with HBase as data store to support MapReduce and direct query

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774130#action_12774130 ] 

Namit Jain commented on HIVE-806:
---------------------------------

Any updates on this?

[jira] Commented: (HIVE-806) Hive with HBase as data store to support MapReduce and direct query

Posted by "Schubert Zhang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780979#action_12780979 ] 

Schubert Zhang commented on HIVE-806:
-------------------------------------

@yongqiang,
I am on vacation now. I will try to contact someone to update it.

Guangxian,

Could you please do something about this issue to contribute it to Hive when you have time?


Sent from my iPhone

On 2009-11-19, at 16:17, "He Yongqiang (JIRA)" <ji...@apache.org> wrote:



[jira] Resolved: (HIVE-806) Hive with HBase as data store to support MapReduce and direct query

Posted by "John Sichi (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Sichi resolved HIVE-806.
-----------------------------

    Resolution: Incomplete

Marking this one incomplete.  If there's still interest in any of the material here, please create new JIRA issue(s) with the details.
