Posted to dev@hbase.apache.org by "bill.yunfu" <gu...@alibaba-inc.com> on 2018/07/29 07:53:08 UTC

May I take this issue --hbase-spark


Hi community,
   I work on an HBase team that serves hundreds of customers. We find
that as the amount of data in HBase grows, many customers need to run
analysis on their HBase data. For example, they want to use Spark for
analysis that may query large amounts of data from HBase and may also
join with other tables, which may live in HBase or in Spark.
   HBase alone does not support this scenario very well, so we plan to
use Spark to support it.
   We found that Apache HBase already has a module called hbase-spark, but
this module has not been updated recently and has not been formally
released. We also found other projects that support SQL on HBase, for
example Hive on HBase, which offers good SQL syntax support.
   Although there are many projects for Spark on HBase, I think none of
them is widely known to users. Because our customers have more and more
requirements for Spark on HBase, we want to take on this issue. The
initial goal is to provide a standard, widely known Spark on HBase
integration in the Apache HBase community.
   Our initial ideas are:
   SQL support: currently the hbase-spark module cannot use a spark-sql
command to create a table. We want it to support SQL commands, possibly
modeled on the SQL syntax of Hive on HBase or the SQL syntax of SHC.
   Performance improvement: this part is not very clear yet; the goal is
for Spark SQL queries on HBase data to perform well.

We would like to get suggestions from the community. Then I will file a
JIRA to track the work and attach a design document.

Best Regards
Bill 




--
Sent from: http://apache-hbase.679495.n3.nabble.com/HBase-Developer-f679493.html

Re: May I take this issue --hbase-spark

Posted by nurseryboy <zg...@163.com>.
Hi Ted,
   You are right, one sample is from Hive; our Hive version is 2.3.3.
   The Hive sample only shows the CREATE TABLE syntax; it will not
affect the hbase-spark part.

   I am preparing the demo, thank you.
  
Regards
Bill 



Re: May I take this issue --hbase-spark

Posted by Ted Yu <yu...@gmail.com>.
bq. ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe'

The above implies dependency on some class from Hive.

Which Hive release would you use if you choose the above route?

Looking forward to your demo.


Re: May I take this issue --hbase-spark

Posted by "bill.yunfu" <gu...@alibaba-inc.com>.
Hi Ted,
   Thank you for replying.
SQL support means that users can directly use Spark SQL to create tables
and query data from HBase. We found two existing approaches to SQL support
on HBase.
SHC uses the following command to create a table in Spark SQL:
CREATE TABLE spark_hbase
USING org.apache.spark.sql.execution.datasources.hbase
OPTIONS ('catalog' =
  '{"table":{"namespace":"default", "name":"test", "tableCoder":"PrimitiveType"},
    "rowkey":"key",
    "columns":{
      "col0":{"cf":"rowkey", "col":"key", "type":"string"},
      "col1":{"cf":"cf", "col":"a", "type":"string"}}}'
)
(SHC is a project; details are available at:
https://github.com/hortonworks-spark/shc)
In Spark SQL, a Hive command can also be used to create the table:
CREATE TABLE spark_hbase (col0 string, col1 string)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe'
    WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:a")
    STORED AS
        INPUTFORMAT 'org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat'
        OUTPUTFORMAT 'org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat'
    TBLPROPERTIES ("hbase.table.name" = "test");

So we want to provide a similar DDL to create the table for the
hbase-spark module and query it with Spark SQL.

As for the Spark release, we suggest first targeting Spark 2.y, for
example Spark 2.2.2, which is stable now.

We will build a local demo based on the hbase-spark module with SQL
support, then share it here for discussion.

Regards
Bill
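
For readers who want to experiment with the SHC-style statement in this
message, here is a minimal sketch that generates the catalog JSON and the
CREATE TABLE text instead of hand-writing them. It only builds strings and
does not talk to Spark or HBase. The helper names build_catalog and
build_create_table are hypothetical, not part of SHC or hbase-spark; the
data source name and catalog fields are copied from the example above.

```python
import json

# Data source name taken from the SHC example in this thread.
SHC_SOURCE = "org.apache.spark.sql.execution.datasources.hbase"


def build_catalog(namespace, table, rowkey, columns):
    """Return an SHC-style catalog as a JSON string (hypothetical helper).

    `columns` maps a Spark column name to a
    (column family, qualifier, type) tuple.
    """
    catalog = {
        "table": {
            "namespace": namespace,
            "name": table,
            "tableCoder": "PrimitiveType",
        },
        "rowkey": rowkey,
        "columns": {
            name: {"cf": cf, "col": col, "type": typ}
            for name, (cf, col, typ) in columns.items()
        },
    }
    return json.dumps(catalog)


def build_create_table(spark_table, catalog_json):
    """Wrap the catalog in a CREATE TABLE ... USING ... OPTIONS statement."""
    # The catalog sits inside a single-quoted SQL literal, so reject any
    # single quote that would end the literal early rather than escaping it.
    if "'" in catalog_json:
        raise ValueError("catalog JSON must not contain single quotes")
    return (
        f"CREATE TABLE {spark_table} USING {SHC_SOURCE} "
        f"OPTIONS ('catalog' = '{catalog_json}')"
    )


# Same table, row key and columns as the example in this message.
catalog = build_catalog(
    "default",
    "test",
    "key",
    {"col0": ("rowkey", "key", "string"), "col1": ("cf", "a", "string")},
)
ddl = build_create_table("spark_hbase", catalog)
print(ddl)
```

With SHC on the classpath, the resulting string could presumably be passed
to spark.sql(...); that step is not shown because it depends on the cluster
setup.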



Re: May I take this issue --hbase-spark

Posted by Ted Yu <yu...@gmail.com>.
For SQL support, can you be more specific on how the SQL support would be
added?

Maybe you can illustrate some examples showing the enhanced SQL syntax.

Also, which Spark release(s) would be targeted?

Thanks


Re: May I take this issue --hbase-spark

Posted by Sean Busbey <bu...@apache.org>.
Hi Bill!

Please check out the scope document attached to HBASE-18405 "Track
scope for HBase-Spark module". It's the result of the last time the
community went through discussing what was needed for a release-worthy
integration.

pdf: https://s.apache.org/fejd

I haven't gotten to take a look at the scope specifically in about a
year, but it'd be great to get renewed effort going again. It'd be
simpler and faster to propose things in terms of updating that scope
document.

Also please note that recently the issue of moving the spark
integration out of the main repo came up again as a part of a wider
discussion about moving integration with various other systems into a
different repo (thread on dev@hbase with subject "[DISCUSS] Kafka
Connection, HBASE-15320").


Re: May I take this issue --hbase-spark

Posted by Stack <st...@duboce.net>.

Thanks for showing up to help, Bill. I second what our Sean said.

I owe work on moving connectors out of hbase core. It was suggested we
might move Spark out as part of the hbase-connector effort. Let's see.
It'd be deserving of its own dedicated repository if its rate of change
is far in excess of that of other connectors. Let's see.

Spark SQL support is a long-time ask. Anything you could do to forward
this project would be much appreciated.

S


