You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@spark.apache.org by "Weichen Xu (JIRA)" <ji...@apache.org> on 2016/06/10 12:44:21 UTC

[jira] [Commented] (SPARK-15874) HBase rowkey optimization support for Hbase-handler

    [ https://issues.apache.org/jira/browse/SPARK-15874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15324385#comment-15324385 ] 

Weichen Xu commented on SPARK-15874:
------------------------------------

[~rxin]What do you think about it ?

> HBase rowkey optimization support for Hbase-handler
> ---------------------------------------------------
>
>                 Key: SPARK-15874
>                 URL: https://issues.apache.org/jira/browse/SPARK-15874
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Weichen Xu
>   Original Estimate: 720h
>  Remaining Estimate: 720h
>
> Currently, Spark-SQL use  `org.apache.hadoop.hive.hbase.HBaseStorageHandler` for Hbase table support, which has poor optimization. for example, query such as
> select * from hbase_tab1 where rowkey_col = 'abc';
> will cause full table scan(each table region turn into a scan split and do full region scan).
> In fact, it is easy to implement the following optimization:
> 1.
> SQL such as
> `select * from hbase_tab1 where rowkey_col = 'abc';`
> or
> `select * from hbase_tab1 where rowkey_col = 'abc' or rowkey_col = 'abd' or ...;`
> can use hbase rowkey `Get`/`multiGet` API to execute efficiently.
> 2.
> SQL such as
> `select * from hbase_tab1 where rowkey_col = 'abc%';`
> can use hbase rowkey `Scan` API to execute efficiently.
> Higher-level SQL optimization will benefit from such optimization, for example, there is a very small table(such as incremental Data) `small_tab1`,
> SQL such as
> `select * from small_tab1 join hbase_tab1 on small_tab1.key1 = hbase_tab1.rowkey_col`
> can use classic small-table driven join optimization:
> loop each record of small_tab1, and exact each small_tab1.key1 as hbase_tab1's rowkey, and use hbase Get API, the join will execute efficiently.
> The scenario described above is very common, manay business system may have several tables which has main-key such as userID, and they often
> store them in HBase. But, several times people have requirement to do some analysis with SQL, and these SQL will have good optimization if the SQL execution plan has a good support to HBase rowkey.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org