Posted to issues@spark.apache.org by "Reynold Xin (JIRA)" <ji...@apache.org> on 2016/06/11 22:52:20 UTC

[jira] [Commented] (SPARK-15874) HBase rowkey optimization support for Hbase-Storage-handler

    [ https://issues.apache.org/jira/browse/SPARK-15874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326069#comment-15326069 ] 

Reynold Xin commented on SPARK-15874:
-------------------------------------

I'm confused -- Apache Spark's code base itself does not include an HBase connector. Which one are you referring to?

> HBase rowkey optimization support for Hbase-Storage-handler
> -----------------------------------------------------------
>
>                 Key: SPARK-15874
>                 URL: https://issues.apache.org/jira/browse/SPARK-15874
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Weichen Xu
>   Original Estimate: 720h
>  Remaining Estimate: 720h
>
> Currently, Spark SQL uses `org.apache.hadoop.hive.hbase.HBaseStorageHandler` for HBase table support, which is poorly optimized. For example, a query such as
> `select * from hbase_tab1 where rowkey_col = 'abc';`
> causes a full table scan (each table region becomes a scan split and is scanned in full).
> In fact, the following optimizations are easy to implement:
> 1.
> SQL such as
> `select * from hbase_tab1 where rowkey_col = 'abc';`
> or
> `select * from hbase_tab1 where rowkey_col = 'abc' or rowkey_col = 'abd' or ...;`
> can be executed efficiently with the HBase rowkey `Get`/`multiGet` API.
> 2.
> SQL such as
> `select * from hbase_tab1 where rowkey_col like 'abc%';`
> can be executed efficiently with the HBase rowkey `Scan` API.
> Higher-level SQL optimizations benefit from this as well. For example, given a very small table `small_tab1` (such as incremental data),
> SQL such as
> `select * from small_tab1 join hbase_tab1 on small_tab1.key1 = hbase_tab1.rowkey_col`
> can use the classic small-table-driven join optimization:
> loop over each record of small_tab1, extract small_tab1.key1 as the hbase_tab1 rowkey, and fetch the matching row with the HBase `Get` API, so the join executes efficiently.
> This scenario is very common: many business systems have several tables with a primary key such as userID, and they often store these tables in HBase. People frequently need to analyze this data with SQL, and such queries would be well optimized if the SQL execution plan had good support for the HBase rowkey.
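
The access paths proposed in the description can be illustrated with a toy model. The sketch below is not the HBase client API and not Spark code: `ToyRowKeyStore`, `small_table_join`, and the `cost` counter are hypothetical names made up for illustration. It only assumes that rows are kept sorted by rowkey, which is what makes point lookups and prefix scans cheaper than a full scan.

```python
from bisect import bisect_left


class ToyRowKeyStore:
    """Hypothetical in-memory stand-in for a table sorted by rowkey."""

    def __init__(self, rows):
        # rows: dict mapping rowkey (str) -> row value
        self._rows = dict(rows)
        self._keys = sorted(self._rows)
        self.cost = 0  # number of keys probed or rows read so far

    def get(self, key):
        """Point lookup, analogous in spirit to a single-row Get."""
        self.cost += 1
        return self._rows.get(key)

    def multi_get(self, keys):
        """Batch of point lookups, for `rowkey = a or rowkey = b or ...`."""
        found = {}
        for key in keys:
            value = self.get(key)
            if value is not None:
                found[key] = value
        return found

    def prefix_scan(self, prefix):
        """Bounded range scan, for `rowkey like 'prefix%'`."""
        # Keys sharing a prefix are contiguous in sorted order,
        # so seek to the first candidate and stop at the first mismatch.
        i = bisect_left(self._keys, prefix)
        out = []
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            self.cost += 1
            out.append((self._keys[i], self._rows[self._keys[i]]))
            i += 1
        return out

    def full_scan(self, predicate):
        """Baseline: read every row and filter afterwards."""
        self.cost += len(self._keys)
        return [(k, self._rows[k]) for k in self._keys if predicate(k)]


def small_table_join(small_rows, key_field, store):
    """Small-table-driven join: one point lookup per small-table record."""
    joined = []
    for rec in small_rows:
        match = store.get(rec[key_field])
        if match is not None:
            joined.append((rec, match))
    return joined
```

In this model, `store.prefix_scan("abc")` reads only the rows whose key starts with `abc`, while `store.full_scan(lambda k: k.startswith("abc"))` returns the same rows but reads the whole table. A real implementation would have to push the rowkey predicate down from the SQL plan into the storage handler, which this sketch does not attempt.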



