You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@spark.apache.org by "Weichen Xu (JIRA)" <ji...@apache.org> on 2016/06/10 12:43:21 UTC

[jira] [Created] (SPARK-15874) HBase rowkey optimization support for Hbase-handler

Weichen Xu created SPARK-15874:
----------------------------------

             Summary: HBase rowkey optimization support for Hbase-handler
                 Key: SPARK-15874
                 URL: https://issues.apache.org/jira/browse/SPARK-15874
             Project: Spark
          Issue Type: New Feature
          Components: SQL
            Reporter: Weichen Xu


Currently, Spark-SQL use  `org.apache.hadoop.hive.hbase.HBaseStorageHandler` for Hbase table support, which has poor optimization. for example, query such as

select * from hbase_tab1 where rowkey_col = 'abc';

will cause full table scan(each table region turn into a scan split and do full region scan).

In fact, it is easy to implement the following optimization:

1.
SQL such as
`select * from hbase_tab1 where rowkey_col = 'abc';`
or
`select * from hbase_tab1 where rowkey_col = 'abc' or rowkey_col = 'abd' or ...;`
can use hbase rowkey `Get`/`multiGet` API to execute efficiently.

2.
SQL such as
`select * from hbase_tab1 where rowkey_col = 'abc%';`
can use hbase rowkey `Scan` API to execute efficiently.

Higher-level SQL optimization will benefit from such optimization, for example, there is a very small table(such as incremental Data) `small_tab1`,
SQL such as
`select * from small_tab1 join hbase_tab1 on small_tab1.key1 = hbase_tab1.rowkey_col`
can use classic small-table driven join optimization:
loop each record of small_tab1, and exact each small_tab1.key1 as hbase_tab1's rowkey, and use hbase Get API, the join will execute efficiently.

The scenario described above is very common, manay business system may have several tables which has main-key such as userID, and they often
store them in HBase. But, several times people have requirement to do some analysis with SQL, and these SQL will have good optimization if the SQL execution plan has a good support to HBase rowkey.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org