Posted to issues@spark.apache.org by "Weichen Xu (JIRA)" <ji...@apache.org> on 2016/06/10 12:44:21 UTC
[jira] [Commented] (SPARK-15874) HBase rowkey optimization support for Hbase-handler
[ https://issues.apache.org/jira/browse/SPARK-15874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15324385#comment-15324385 ]
Weichen Xu commented on SPARK-15874:
------------------------------------
[~rxin] What do you think about it?
> HBase rowkey optimization support for Hbase-handler
> ---------------------------------------------------
>
> Key: SPARK-15874
> URL: https://issues.apache.org/jira/browse/SPARK-15874
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Reporter: Weichen Xu
> Original Estimate: 720h
> Remaining Estimate: 720h
>
> Currently, Spark SQL uses `org.apache.hadoop.hive.hbase.HBaseStorageHandler` for HBase table support, which is poorly optimized. For example, a query such as
> select * from hbase_tab1 where rowkey_col = 'abc';
> causes a full table scan (each table region becomes a scan split, and each split scans its whole region).
> In fact, the following optimizations are easy to implement:
> 1.
> SQL such as
> `select * from hbase_tab1 where rowkey_col = 'abc';`
> or
> `select * from hbase_tab1 where rowkey_col = 'abc' or rowkey_col = 'abd' or ...;`
> can be executed efficiently with the HBase rowkey `Get`/`multiGet` API.
> 2.
> SQL such as
> `select * from hbase_tab1 where rowkey_col like 'abc%';`
> can be executed efficiently with the HBase rowkey `Scan` API, restricted to the rowkey prefix.
> Higher-level SQL optimizations also benefit from this. For example, given a very small table (such as incremental data) `small_tab1`,
> SQL such as
> `select * from small_tab1 join hbase_tab1 on small_tab1.key1 = hbase_tab1.rowkey_col`
> can use the classic small-table-driven join optimization:
> loop over each record of small_tab1, extract each small_tab1.key1 as a rowkey of hbase_tab1, and look it up with the HBase Get API, so that the join executes efficiently.
> The scenario described above is very common: many business systems have several tables keyed by a primary key such as userID, and they often
> store them in HBase. People frequently need to analyze this data with SQL, and those queries would be well optimized if the SQL execution plan had good support for HBase rowkeys.
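A rough sketch of the three access paths described above may help the discussion. It is not HBase or Spark code. A `TreeMap` stands in for an HBase table (rows sorted by rowkey), and every class and method name here (`RowkeyAccessSketch`, `get`, `multiGet`, `prefixScan`, `smallTableJoin`) is hypothetical. It only illustrates why a rowkey equality predicate, a rowkey prefix predicate, and a small-table-driven join can each avoid a full scan.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative stand-in only: a sorted map plays the role of an HBase table.
// None of these names are real HBase or Spark APIs.
public class RowkeyAccessSketch {

    // Analogous to a Get: rowkey_col = 'abc' becomes a single keyed lookup.
    static String get(NavigableMap<String, String> table, String rowkey) {
        return table.get(rowkey);
    }

    // Analogous to a batched Get: rowkey_col = 'abc' or rowkey_col = 'abd' ...
    static List<String> multiGet(NavigableMap<String, String> table, List<String> rowkeys) {
        List<String> rows = new ArrayList<>();
        for (String rowkey : rowkeys) {
            String row = table.get(rowkey);
            if (row != null) {
                rows.add(row);
            }
        }
        return rows;
    }

    // Analogous to a range-restricted Scan: rowkey_col like 'abc%'.
    // Keys sharing a prefix are contiguous in sorted order, so the loop
    // starts at the prefix and stops at the first key that does not match.
    static List<String> prefixScan(NavigableMap<String, String> table, String prefix) {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, String> e : table.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break;
            }
            rows.add(e.getValue());
        }
        return rows;
    }

    // Small-table-driven join: each key of the small side drives one lookup
    // into the large table instead of scanning the large table.
    static List<String> smallTableJoin(List<String> smallKeys, NavigableMap<String, String> table) {
        List<String> joined = new ArrayList<>();
        for (String key : smallKeys) {
            String row = get(table, key);
            if (row != null) {
                joined.add(key + "=" + row);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> hbaseTab1 = new TreeMap<>();
        hbaseTab1.put("abc", "row-abc");
        hbaseTab1.put("abcd", "row-abcd");
        hbaseTab1.put("abd", "row-abd");
        hbaseTab1.put("xyz", "row-xyz");

        System.out.println(get(hbaseTab1, "abc"));
        System.out.println(multiGet(hbaseTab1, Arrays.asList("abc", "abd")));
        System.out.println(prefixScan(hbaseTab1, "abc"));
        System.out.println(smallTableJoin(Arrays.asList("abd", "missing"), hbaseTab1));
    }
}
```

In a real implementation the planner would have to recognize these predicate shapes on the rowkey column and translate them into the corresponding HBase client calls. The sketch only shows the shape of the lookups, not how that translation would be done.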