Posted to user@spark.apache.org by Benjamin Kim <bb...@gmail.com> on 2016/10/07 14:56:20 UTC

Spark SQL Thriftserver with HBase

Does anyone know if Spark can work with HBase tables using Spark SQL? I know in Hive we are able to create tables on top of an underlying HBase table that can be accessed using MapReduce jobs. Can the same be done using HiveContext or SQLContext? We are trying to set up a way to GET and POST data to and from the HBase table using the Spark SQL JDBC thriftserver from our RESTful API endpoints and/or HTTP web farms. If we can get this to work, then we can load balance the thriftservers. In addition, this will benefit us by giving us a way to abstract the data storage layer away from the presentation layer code. There is a chance that we will swap out the data storage technology in the future. We are currently experimenting with Kudu.

Thanks,
Ben
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Spark SQL Thriftserver with HBase

Posted by Michael Segel <ms...@hotmail.com>.
You forgot to mention that if you roll your own… you can toss your own level of security on top of it.

For most, that’s not important.
For those working with PII type of information… kinda important, especially when the rules can get convoluted.


On Oct 17, 2016, at 12:14 PM, vincent gromakowski <vi...@gmail.com>> wrote:

I would suggest coding your own Spark thriftserver, which seems to be very easy.
http://stackoverflow.com/questions/27108863/accessing-spark-sql-rdd-tables-through-the-thrift-server

I am starting to test it. The big advantage is that you can implement any logic, because it's a Spark job, and then start a thrift server on temporary tables. For example, you can query a micro-batch RDD from a Kafka stream, or preload some tables and implement a rolling cache to periodically refresh the Spark in-memory tables from the persistent store...
It's not part of the public API and I don't know yet what the issues are in doing this, but I think the Spark community should look at this path: making the thriftserver instantiable in any Spark job.
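
For reference, a minimal sketch of that pattern (assuming Spark 2.x with Hive support on the classpath; the application name, Parquet path and view name are made-up placeholders):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object EmbeddedThriftServer {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("embedded-thriftserver")
      .enableHiveSupport()
      .getOrCreate()

    // Load whatever data should be exposed and cache it as an in-memory table.
    // The Parquet path and view name are hypothetical.
    val df = spark.read.parquet("/data/dim/dimension_acamp")
    df.createOrReplaceTempView("dimension_acamp")
    spark.catalog.cacheTable("dimension_acamp")

    // Start a Thrift/JDBC server inside this job; clients can now query the
    // cached view over JDBC while the job keeps running. A reload loop could
    // periodically refresh the cache here.
    HiveThriftServer2.startWithContext(spark.sqlContext)

    Thread.currentThread().join()
  }
}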

2016-10-17 18:17 GMT+02:00 Michael Segel <ms...@hotmail.com>>:
Guys,
Sorry for jumping in late to the game…

If memory serves (which may not be a good thing…) :

You can use HiveServer2 as a connection point to HBase.
While this doesn’t perform well, it’s probably the cleanest solution.
I’m not keen on Phoenix… wouldn’t recommend it….


The issue is that you’re trying to make HBase, a key/value object store, into a relational engine… it’s not.

There are some considerations which make HBase not ideal for all use cases and you may find better performance with Parquet files.

One thing missing is the use of secondary indexing and query optimizations that you have in RDBMSs and are lacking in HBase / MapRDB / etc …  so your performance will vary.

With respect to Tableau… their entire interface into the big data world revolves around the JDBC/ODBC interface. So if you don’t have that piece as part of your solution, you’re DOA with respect to Tableau.

Have you considered Drill as your JDBC connection point?  (YAAP: Yet another Apache project)


On Oct 9, 2016, at 12:23 PM, Benjamin Kim <bb...@gmail.com>> wrote:

Thanks for all the suggestions. It would seem you guys are right about the Tableau side of things. The reports don’t need to be real-time, and they won’t be directly feeding off of the main DMP HBase data. Instead, it’ll be batched to Parquet or Kudu/Impala or even PostgreSQL.

I originally thought that we needed two-way data retrieval from the DMP HBase for ID generation, but after further investigation into the use-case and architecture, the ID generation needs to happen local to the Ad Servers, where we generate a unique ID and store it in an ID linking table. Even better, many of the 3rd party services supply this ID. So, data only needs to flow in one direction. We will use Kafka as the bus for this. No JDBC required. This also goes for the REST Endpoints. 3rd party services will hit ours to update our data, with no need to read from our data. And, when we want to update their data, we will hit theirs using a triggered job.

This all boils down to just integrating with Kafka.

Once again, thanks for all the help.

Cheers,
Ben


On Oct 9, 2016, at 3:16 AM, Jörn Franke <jo...@gmail.com>> wrote:

Please also keep in mind that Tableau Server has the capability to store data in-memory and refresh the in-memory data only when needed. This means you can import it from any source and let your users work only on the in-memory data in Tableau Server.

On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke <jo...@gmail.com>> wrote:
Cloudera 5.8 has a very old version of Hive without Tez, but Mich already provided a good alternative. However, you should check if it contains a recent version of HBase and Phoenix. That being said, I just wonder what the dataflow, the data model and the analysis you plan to do are. Maybe completely different solutions are possible. In particular, single inserts, upserts etc. should be avoided as much as possible in the Big Data (analysis) world with any technology, because they do not perform well.

Hive with LLAP will provide an in-memory cache for interactive analytics. You can put full tables in-memory with Hive using the Ignite HDFS in-memory solution. All of this only makes sense if you do not use MR as the engine, and if you use the right input format (ORC, Parquet) and a recent Hive version.

On 8 Oct 2016, at 21:55, Benjamin Kim <bb...@gmail.com>> wrote:

Mich,

Unfortunately, we are moving away from Hive and unifying on Spark, using CDH 5.8 as our distro. And Tableau has released a Spark ODBC/JDBC driver too. I will either try the Phoenix JDBC Server for HBase or push to move faster to Kudu with Impala. We will use Impala as the JDBC in-between until the Kudu team completes Spark SQL support for JDBC.

Thanks for the advice.

Cheers,
Ben


On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh <mi...@gmail.com>> wrote:

Sure. But essentially you are looking at batch data for analytics for your Tableau users, so Hive may be a better choice, with its rich SQL and existing ODBC/JDBC connection to Tableau.

I would go for Hive, especially as the new release will have an in-memory offering as well for frequently accessed data :)


Dr Mich Talebzadeh



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.



On 8 October 2016 at 20:15, Benjamin Kim <bb...@gmail.com>> wrote:
Mich,

First and foremost, we have visualization servers that run Tableau for external user reports. Second, we have servers that are ad servers and REST endpoints for cookie sync and segmentation data exchange. These will use JDBC directly within the same data-center. When not colocated in the same data-center, they will connect to the database server using JDBC. Either way, using JDBC everywhere simplifies and unifies the code on the JDBC industry standard.

Does this make sense?

Thanks,
Ben


On Oct 8, 2016, at 11:47 AM, Mich Talebzadeh <mi...@gmail.com>> wrote:

Like any other design what is your presentation layer and end users?

Are they SQL-centric users from a Tableau background, or might they use Spark functional programming?

It is best to describe the use case.

HTH

Dr Mich Talebzadeh



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.



On 8 October 2016 at 19:40, Felix Cheung <fe...@hotmail.com>> wrote:
I wouldn't be too surprised if Spark SQL -> JDBC data source -> Phoenix JDBC server -> HBASE would work better.

Without naming specifics, there are at least 4 or 5 different implementations of HBASE sources, each at a varying level of development and with different requirements (HBASE release version, Kerberos support, etc.)
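
A rough, untested sketch of that chain from the Spark side, using the generic JDBC data source pointed at the Phoenix Query Server (the driver class, URL and the "tsco" table, borrowed from elsewhere in this thread, are assumptions; the Phoenix thin-client jar would need to be on the classpath):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("phoenix-over-jdbc").getOrCreate()

// Read a Phoenix table through the Phoenix Query Server with Spark's plain
// JDBC data source; host, port and table name are illustrative only.
val tsco = spark.read
  .format("jdbc")
  .option("driver", "org.apache.phoenix.queryserver.client.Driver")
  .option("url", "jdbc:phoenix:thin:url=http://rhes564:8765;serialization=PROTOBUF")
  .option("dbtable", "\"tsco\"")
  .load()

tsco.createOrReplaceTempView("tsco")
spark.sql("SELECT count(*) FROM tsco").show()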


_____________________________
From: Benjamin Kim <bb...@gmail.com>>
Sent: Saturday, October 8, 2016 11:26 AM
Subject: Re: Spark SQL Thriftserver with HBase
To: Mich Talebzadeh <mi...@gmail.com>>
Cc: <us...@spark.apache.org>>, Felix Cheung <fe...@hotmail.com>>



Mich,

Are you talking about the Phoenix JDBC Server? If so, I forgot about that alternative.

Thanks,
Ben


On Oct 8, 2016, at 11:21 AM, Mich Talebzadeh <mi...@gmail.com>> wrote:

I don't think it will work

you can use phoenix on top of hbase

hbase(main):336:0> scan 'tsco', 'LIMIT' => 1
ROW                                                       COLUMN+CELL
 TSCO-1-Apr-08                                            column=stock_daily:Date, timestamp=1475866783376, value=1-Apr-08
 TSCO-1-Apr-08                                            column=stock_daily:close, timestamp=1475866783376, value=405.25
 TSCO-1-Apr-08                                            column=stock_daily:high, timestamp=1475866783376, value=406.75
 TSCO-1-Apr-08                                            column=stock_daily:low, timestamp=1475866783376, value=379.25
 TSCO-1-Apr-08                                            column=stock_daily:open, timestamp=1475866783376, value=380.00
 TSCO-1-Apr-08                                            column=stock_daily:stock, timestamp=1475866783376, value=TESCO PLC
 TSCO-1-Apr-08                                            column=stock_daily:ticker, timestamp=1475866783376, value=TSCO
 TSCO-1-Apr-08                                            column=stock_daily:volume, timestamp=1475866783376, value=49664486

And the same on Phoenix on top of the HBase table

0: jdbc:phoenix:thin:url=http://rhes564:8765> select substr(to_char(to_date("Date",'dd-MMM-yy')),1,10) AS TradeDate, "close" AS "Day's close", "high" AS "Day's High", "low" AS "Day's Low", "open" AS "Day's Open", "ticker", "volume", (to_number("low")+to_number("high"))/2 AS "AverageDailyPrice" from "tsco" where to_number("volume") > 0 and "high" != '-' and to_date("Date",'dd-MMM-yy') > to_date('2015-10-06','yyyy-MM-dd') order by  to_date("Date",'dd-MMM-yy') limit 1;
+-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
|  TRADEDATE  | Day's close  | Day's High  | Day's Low  | Day's Open  | ticker  |  volume   | AverageDailyPrice  |
+-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
| 2015-10-07  | 197.00       | 198.05      | 184.84     | 192.20      | TSCO    | 30046994  | 191.445            |


HTH




Dr Mich Talebzadeh



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.



On 8 October 2016 at 19:05, Felix Cheung <fe...@hotmail.com>> wrote:
Great, then I think those packages, as Spark data sources, should allow you to do exactly that (replace org.apache.spark.sql.jdbc with an HBASE one)

I do think it will be great to get more examples around this though. Would be great if you could share your experience with this!
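
As an illustration only, here is a sketch of what that swap might look like, wrapped in a Spark job (the data source class and option names are assumptions modelled on the hbase-spark module and differ between connector packages and versions; the table, column family and columns are hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hbase-using-sketch").enableHiveSupport().getOrCreate()

// Register an HBase-backed table with a USING clause, mirroring the JDBC
// example below but pointing at an HBase connector instead.
spark.sql("""
  CREATE TABLE dim_acamp_hbase
  USING org.apache.hadoop.hbase.spark
  OPTIONS (
    'hbase.table' 'dim_acamp',
    'hbase.columns.mapping' 'id STRING :key, name STRING cf:name, status STRING cf:status'
  )
""")

// Once registered, the table is queryable like any other Spark SQL table,
// including from the Thriftserver.
spark.sql("SELECT id, name FROM dim_acamp_hbase LIMIT 10").show()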


_____________________________
From: Benjamin Kim <bb...@gmail.com>>
Sent: Saturday, October 8, 2016 11:00 AM
Subject: Re: Spark SQL Thriftserver with HBase
To: Felix Cheung <fe...@hotmail.com>>
Cc: <us...@spark.apache.org>>


Felix,

My goal is to use Spark SQL JDBC Thriftserver to access HBase tables using just SQL. I have been able to CREATE tables using this statement below in the past:

CREATE TABLE <table-name>
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:postgresql://<hostname>:<port>/dm?user=<username>&password=<password>",
  dbtable "dim.dimension_acamp"
);

After doing this, I can access the PostgreSQL table through the Spark SQL JDBC Thriftserver using SQL statements (SELECT, UPDATE, INSERT, etc.). I want to do the same with HBase tables. We tried this using Hive and HiveServer2, but the response times are just too long.
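
For reference, client-side access of that kind looks roughly like the sketch below; the Spark Thriftserver speaks the HiveServer2 wire protocol, so a Hive JDBC driver and a jdbc:hive2:// URL are assumed, and the host, port, credentials and table name are placeholders.

import java.sql.DriverManager

// Connect to the Spark SQL Thriftserver over JDBC and run a plain SQL query;
// the table queried here is whatever was registered with a USING statement.
val conn = DriverManager.getConnection(
  "jdbc:hive2://thriftserver-host:10000/default", "user", "")
try {
  val stmt = conn.createStatement()
  val rs = stmt.executeQuery("SELECT * FROM dimension_acamp LIMIT 10")
  while (rs.next()) {
    println(rs.getString(1))
  }
} finally {
  conn.close()
}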

Thanks,
Ben


On Oct 8, 2016, at 10:53 AM, Felix Cheung <fe...@hotmail.com>> wrote:

Ben,

I'm not sure I'm following completely.

Is your goal to use Spark to create or access tables in HBASE? If so, the link below and several packages out there support that by providing an HBASE data source for Spark. There are some examples of what the Spark code looks like in that link as well. On that note, you should also be able to use the HBASE data source from a pure SQL (Spark SQL) query as well, which should work in the case of the Spark SQL JDBC Thrift Server (with USING, http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10).


_____________________________
From: Benjamin Kim <bb...@gmail.com>>
Sent: Saturday, October 8, 2016 10:40 AM
Subject: Re: Spark SQL Thriftserver with HBase
To: Felix Cheung <fe...@hotmail.com>>
Cc: <us...@spark.apache.org>>


Felix,

The only alternative way is to create a stored procedure (UDF, in database terms) that would run Spark Scala code underneath. In this way, I can use the Spark SQL JDBC Thriftserver to execute it using SQL code, passing the key/values I want to UPSERT. I wonder if this is possible, since I cannot CREATE a wrapper table on top of an HBase table in Spark SQL?

What do you think? Is this the right approach?

Thanks,
Ben

On Oct 8, 2016, at 10:33 AM, Felix Cheung <fe...@hotmail.com>> wrote:

HBase has released support for Spark
http://hbase.apache.org/book.html#spark

And if you search you should find several alternative approaches.
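
A rough sketch of the DataFrame read path that chapter describes might look like this (the format string and option names follow the hbase-spark module and should be treated as assumptions that vary by release; the table and column family reuse the "tsco"/"stock_daily" example from this thread):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hbase-spark-sketch").getOrCreate()

// Map HBase columns to DataFrame fields and load them through the connector.
val stocks = spark.read
  .format("org.apache.hadoop.hbase.spark")
  .option("hbase.table", "tsco")
  .option("hbase.columns.mapping",
    "rowkey STRING :key, close STRING stock_daily:close, volume STRING stock_daily:volume")
  .load()

// Expose it as a temp view so it can be queried with plain SQL
// (or from the Thriftserver, per the earlier messages).
stocks.createOrReplaceTempView("tsco")
spark.sql("SELECT rowkey, close FROM tsco LIMIT 5").show()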





On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" <bb...@gmail.com>> wrote:

Does anyone know if Spark can work with HBase tables using Spark SQL? I know in Hive we are able to create tables on top of an underlying HBase table that can be accessed using MapReduce jobs. Can the same be done using HiveContext or SQLContext? We are trying to setup a way to GET and POST data to and from the HBase table using the Spark SQL JDBC thriftserver from our RESTful API endpoints and/or HTTP web farms. If we can get this to work, then we can load balance the thriftservers. In addition, this will benefit us in giving us a way to abstract the data storage layer away from the presentation layer code. There is a chance that we will swap out the data storage technology in the future. We are currently experimenting with Kudu.

Thanks,
Ben
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Spark SQL Thriftserver with HBase

Posted by Mich Talebzadeh <mi...@gmail.com>.
Ben,



Also look at Phoenix (Apache project) which provides a better (one of the
best) SQL/JDBC layer on top of HBase.

http://phoenix.apache.org/


I am afraid this does not work with Spark 2!

Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 17 October 2016 at 20:20, Thakrar, Jayesh <jt...@conversantmedia.com>
wrote:

> Ben,
>
>
>
> Also look at Phoenix (Apache project) which provides a better (one of the
> best) SQL/JDBC layer on top of HBase.
>
> http://phoenix.apache.org/
>
>
>
> Cheers,
>
> Jayesh
>
>
>
>
>
> *From: *vincent gromakowski <vi...@gmail.com>
> *Date: *Monday, October 17, 2016 at 1:53 PM
> *To: *Benjamin Kim <bb...@gmail.com>
> *Cc: *Michael Segel <ms...@hotmail.com>, Jörn Franke <
> jornfranke@gmail.com>, Mich Talebzadeh <mi...@gmail.com>, Felix
> Cheung <fe...@hotmail.com>, "user@spark.apache.org" <
> user@spark.apache.org>
>
> *Subject: *Re: Spark SQL Thriftserver with HBase
>
>
>
> Instead of (or additionally to) saving results somewhere, you just start a
> thriftserver that expose the Spark tables of the SQLContext (or
> SparkSession now). That means you can implement any logic (and maybe use
> structured streaming) to expose your data. Today using the thriftserver
> means reading data from the persistent store every query, so if the data
> modeling doesn't fit the query it can be quite long.  What you generally do
> in a common spark job is to load the data and cache spark table in a
> in-memory columnar table which is quite efficient for any kind of query,
> the counterpart is that the cache isn't updated you have to implement a
> reload mechanism, and this solution isn't available using the thriftserver.
>
> What I propose is to mix the two world: periodically/delta load data in
> spark table cache and expose it through the thriftserver. But you have to
> implement the loading logic, it can be very simple to very complex
> depending on your needs.
>
>
>
>
>
> 2016-10-17 19:48 GMT+02:00 Benjamin Kim <bb...@gmail.com>:
>
> Is this technique similar to what Kinesis is offering or what Structured
> Streaming is going to have eventually?
>
>
>
> Just curious.
>
>
>
> Cheers,
>
> Ben
>
>
>
>
>
> On Oct 17, 2016, at 10:14 AM, vincent gromakowski <
> vincent.gromakowski@gmail.com> wrote:
>
>
>
> I would suggest to code your own Spark thriftserver which seems to be very
> easy.
> http://stackoverflow.com/questions/27108863/accessing-
> spark-sql-rdd-tables-through-the-thrift-server
>
> I am starting to test it. The big advantage is that you can implement any
> logic because it's a spark job and then start a thrift server on temporary
> table. For example you can query a micro batch rdd from a kafka stream, or
> pre load some tables and implement a rolling cache to periodically update
> the spark in memory tables with persistent store...
>
> It's not part of the public API and I don't know yet what are the issues
> doing this but I think Spark community should look at this path: making the
> thriftserver be instantiable in any spark job.
>
>
>
> 2016-10-17 18:17 GMT+02:00 Michael Segel <ms...@hotmail.com>:
>
> Guys,
>
> Sorry for jumping in late to the game…
>
>
>
> If memory serves (which may not be a good thing…) :
>
>
>
> You can use HiveServer2 as a connection point to HBase.
>
> While this doesn’t perform well, its probably the cleanest solution.
>
> I’m not keen on Phoenix… wouldn’t recommend it….
>
>
>
>
>
> The issue is that you’re trying to make HBase, a key/value object store, a
> Relational Engine… its not.
>
>
>
> There are some considerations which make HBase not ideal for all use cases
> and you may find better performance with Parquet files.
>
>
>
> One thing missing is the use of secondary indexing and query optimizations
> that you have in RDBMSs and are lacking in HBase / MapRDB / etc …  so your
> performance will vary.
>
>
>
> With respect to Tableau… their entire interface in to the big data world
> revolves around the JDBC/ODBC interface. So if you don’t have that piece as
> part of your solution, you’re DOA w respect to Tableau.
>
>
>
> Have you considered Drill as your JDBC connection point?  (YAAP: Yet
> another Apache project)
>
>
>
>
>
> On Oct 9, 2016, at 12:23 PM, Benjamin Kim <bb...@gmail.com> wrote:
>
>
>
> Thanks for all the suggestions. It would seem you guys are right about the
> Tableau side of things. The reports don’t need to be real-time, and they
> won’t be directly feeding off of the main DMP HBase data. Instead, it’ll be
> batched to Parquet or Kudu/Impala or even PostgreSQL.
>
>
>
> I originally thought that we needed two-way data retrieval from the DMP
> HBase for ID generation, but after further investigation into the use-case
> and architecture, the ID generation needs to happen local to the Ad Servers
> where we generate a unique ID and store it in a ID linking table. Even
> better, many of the 3rd party services supply this ID. So, data only needs
> to flow in one direction. We will use Kafka as the bus for this. No JDBC
> required. This is also goes for the REST Endpoints. 3rd party services will
> hit ours to update our data with no need to read from our data. And, when
> we want to update their data, we will hit theirs to update their data using
> a triggered job.
>
>
>
> This all boils down to just integrating with Kafka.
>
>
>
> Once again, thanks for all the help.
>
>
>
> Cheers,
>
> Ben
>
>
>
>
>
> On Oct 9, 2016, at 3:16 AM, Jörn Franke <jo...@gmail.com> wrote:
>
>
>
> please keep also in mind that Tableau Server has the capabilities to store
> data in-memory and refresh only when needed the in-memory data. This means
> you can import it from any source and let your users work only on the
> in-memory data in Tableau Server.
>
>
>
> On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke <jo...@gmail.com> wrote:
>
> Cloudera 5.8 has a very old version of Hive without Tez, but Mich provided
> already a good alternative. However, you should check if it contains a
> recent version of Hbase and Phoenix. That being said, I just wonder what is
> the dataflow, data model and the analysis you plan to do. Maybe there are
> completely different solutions possible. Especially these single inserts,
> upserts etc. should be avoided as much as possible in the Big Data
> (analysis) world with any technology, because they do not perform well.
>
>
>
> Hive with Llap will provide an in-memory cache for interactive analytics.
> You can put full tables in-memory with Hive using Ignite HDFS in-memory
> solution. All this does only make sense if you do not use MR as an engine,
> the right input format (ORC, parquet) and a recent Hive version.
>
>
> On 8 Oct 2016, at 21:55, Benjamin Kim <bb...@gmail.com> wrote:
>
> Mich,
>
>
>
> Unfortunately, we are moving away from Hive and unifying on Spark using
> CDH 5.8 as our distro. And, the Tableau released a Spark ODBC/JDBC driver
> too. I will either try Phoenix JDBC Server for HBase or push to move faster
> to Kudu with Impala. We will use Impala as the JDBC in-between until the
> Kudu team completes Spark SQL support for JDBC.
>
>
>
> Thanks for the advice.
>
>
>
> Cheers,
>
> Ben
>
>
>
>
>
> On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh <mi...@gmail.com>
> wrote:
>
>
>
> Sure. But essentially you are looking at batch data for analytics for your
> tableau users so Hive may be a better choice with its rich SQL and
> ODBC.JDBC connection to Tableau already.
>
>
>
> I would go for Hive especially the new release will have an in-memory
> offering as well for frequently accessed data :)
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
> On 8 October 2016 at 20:15, Benjamin Kim <bb...@gmail.com> wrote:
>
> Mich,
>
>
>
> First and foremost, we have visualization servers that run Tableau for
> external user reports. Second, we have servers that are ad servers and REST
> endpoints for cookie sync and segmentation data exchange. These will use
> JDBC directly within the same data-center. When not colocated in the same
> data-center, they will connected to a located database server using JDBC.
> Either way, by using JDBC everywhere, it simplifies and unifies the code on
> the JDBC industry standard.
>
>
>
> Does this make sense?
>
>
>
> Thanks,
>
> Ben
>
>
>
>
>
> On Oct 8, 2016, at 11:47 AM, Mich Talebzadeh <mi...@gmail.com>
> wrote:
>
>
>
> Like any other design what is your presentation layer and end users?
>
>
>
> Are they SQL centric users from Tableau background or they may use spark
> functional programming.
>
>
>
> It is best to describe the use case.
>
>
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
> On 8 October 2016 at 19:40, Felix Cheung <fe...@hotmail.com>
> wrote:
>
> I wouldn't be too surprised Spark SQL - JDBC data source - Phoenix JDBC
> server - HBASE would work better.
>
>
>
> Without naming specifics, there are at least 4 or 5 different
> implementations of HBASE sources, each at varying level of development and
> different requirements (HBASE release version, Kerberos support etc)
>
>
>
>
>
> _____________________________
> From: Benjamin Kim <bb...@gmail.com>
> Sent: Saturday, October 8, 2016 11:26 AM
> Subject: Re: Spark SQL Thriftserver with HBase
> To: Mich Talebzadeh <mi...@gmail.com>
> Cc: <us...@spark.apache.org>, Felix Cheung <fe...@hotmail.com>
>
>
>
>
> Mich,
>
>
>
> Are you talking about the Phoenix JDBC Server? If so, I forgot about that
> alternative.
>
>
>
> Thanks,
>
> Ben
>
>
>
>
>
> On Oct 8, 2016, at 11:21 AM, Mich Talebzadeh <mi...@gmail.com>
> wrote:
>
>
>
> I don't think it will work
>
>
>
> you can use phoenix on top of hbase
>
>
>
> hbase(main):336:0> scan 'tsco', 'LIMIT' => 1
> ROW                                                       COLUMN+CELL
>  TSCO-1-Apr-08
> column=stock_daily:Date, timestamp=1475866783376, value=1-Apr-08
>  TSCO-1-Apr-08
> column=stock_daily:close, timestamp=1475866783376, value=405.25
>  TSCO-1-Apr-08
> column=stock_daily:high, timestamp=1475866783376, value=406.75
>  TSCO-1-Apr-08
> column=stock_daily:low, timestamp=1475866783376, value=379.25
>  TSCO-1-Apr-08
> column=stock_daily:open, timestamp=1475866783376, value=380.00
>  TSCO-1-Apr-08
> column=stock_daily:stock, timestamp=1475866783376, value=TESCO PLC
>  TSCO-1-Apr-08
> column=stock_daily:ticker, timestamp=1475866783376, value=TSCO
>  TSCO-1-Apr-08
> column=stock_daily:volume, timestamp=1475866783376, value=49664486
>
>
>
> And the same on Phoenix on top of the HBase table
>
>
>
> 0: jdbc:phoenix:thin:url=http://rhes564:8765> select
> substr(to_char(to_date("Date",'dd-MMM-yy')),1,10) AS TradeDate, "close"
> AS "Day's close", "high" AS "Day's High", "low" AS "Day's Low", "open" AS
> "Day's Open", "ticker", "volume", (to_number("low")+to_number("high"))/2
> AS "AverageDailyPrice" from "tsco" where to_number("volume") > 0 and "high"
> != '-' and to_date("Date",'dd-MMM-yy') > to_date('2015-10-06','yyyy-MM-dd')
> order by  to_date("Date",'dd-MMM-yy') limit 1;
> +-------------+--------------+-------------+------------+---
> ----------+---------+-----------+--------------------+
> |  TRADEDATE  | Day's close  | Day's High  | Day's Low  | Day's Open  |
> ticker  |  volume   | AverageDailyPrice  |
> +-------------+--------------+-------------+------------+---
> ----------+---------+-----------+--------------------+
> | 2015-10-07  | 197.00       | 198.05      | 184.84     | 192.20      |
> TSCO    | 30046994  | 191.445            |
>
>
>
> HTH
>
>
>
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destructionof data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.The
> author will in no case be liable for any monetary damages arising from
> suchloss, damage or destruction.
>
>
>
>
>
> On 8 October 2016 at 19:05, Felix Cheung <fe...@hotmail.com>
> wrote:
>
> Great, then I think those packages as Spark data source should allow you
> to do exactly that (replace org.apache.spark.sql.jdbc with HBASE one)
>
>
>
> I do think it will be great to get more examples around this though. Would
> be great if you could share your experience with this!
>
>
>
>
>
> _____________________________
> From: Benjamin Kim <bb...@gmail.com>
> Sent: Saturday, October 8, 2016 11:00 AM
> Subject: Re: Spark SQL Thriftserver with HBase
> To: Felix Cheung <fe...@hotmail.com>
> Cc: <us...@spark.apache.org>
>
>
> Felix,
>
>
>
> My goal is to use Spark SQL JDBC Thriftserver to access HBase tables using
> just SQL. I have been able to CREATE tables using this statement below in
> the past:
>
>
>
> CREATE TABLE <table-name>
>
> USING org.apache.spark.sql.jdbc
>
> OPTIONS (
>
>   url "jdbc:postgresql://<hostname>:<port>/dm?user=<username>&
> password=<password>",
>
>   dbtable "dim.dimension_acamp"
>
> );
>
>
>
> After doing this, I can access the PostgreSQL table using Spark SQL JDBC
> Thriftserver using SQL statements (SELECT, UPDATE, INSERT, etc.). I want to
> do the same with HBase tables. We tried this using Hive and HiveServer2,
> but the response times are just too long.
>
>
>
> Thanks,
>
> Ben
>
>
>
>
>
> On Oct 8, 2016, at 10:53 AM, Felix Cheung <fe...@hotmail.com>
> wrote:
>
>
>
> Ben,
>
>
>
> I'm not sure I'm following completely.
>
>
>
> Is your goal to use Spark to create or access tables in HBASE? If so the
> link below and several packages out there support that by having a HBASE
> data source for Spark. There are some examples on how the Spark code look
> like in that link as well. On that note, you should also be able to use the
> HBASE data source from pure SQL (Spark SQL) query as well, which should
> work in the case with the Spark SQL JDBC Thrift Server (with USING,
> http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10
> ).
>
>
>
> _____________________________
> From: Benjamin Kim <bb...@gmail.com>
> Sent: Saturday, October 8, 2016 10:40 AM
> Subject: Re: Spark SQL Thriftserver with HBase
> To: Felix Cheung <fe...@hotmail.com>
> Cc: <us...@spark.apache.org>
>
>
> Felix,
>
>
>
> The only alternative way is to create a stored procedure (udf) in database
> terms that would run Spark scala code underneath. In this way, I can use
> Spark SQL JDBC Thriftserver to execute it using SQL code passing the key,
> values I want to UPSERT. I wonder if this is possible since I cannot CREATE
> a wrapper table on top of a HBase table in Spark SQL?
>
>
>
> What do you think? Is this the right approach?
>
>
>
> Thanks,
>
> Ben
>
>
>
> On Oct 8, 2016, at 10:33 AM, Felix Cheung <fe...@hotmail.com>
> wrote:
>
>
>
> HBase has released support for Spark
>
> hbase.apache.org/book.html#spark
>
>
>
> And if you search you should find several alternative approaches.
>
>
>
>
>
> On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" <bb...@gmail.com>
> wrote:
>
> Does anyone know if Spark can work with HBase tables using Spark SQL? I
> know in Hive we are able to create tables on top of an underlying HBase
> table that can be accessed using MapReduce jobs. Can the same be done using
> HiveContext or SQLContext? We are trying to setup a way to GET and POST
> data to and from the HBase table using the Spark SQL JDBC thriftserver from
> our RESTful API endpoints and/or HTTP web farms. If we can get this to
> work, then we can load balance the thriftservers. In addition, this will
> benefit us in giving us a way to abstract the data storage layer away from
> the presentation layer code. There is a chance that we will swap out the
> data storage technology in the future. We are currently experimenting with
> Kudu.
>
> Thanks,
> Ben
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>

Re: Spark SQL Thriftserver with HBase

Posted by Michael Segel <ms...@hotmail.com>.
Skip Phoenix

On Oct 17, 2016, at 2:20 PM, Thakrar, Jayesh <jt...@conversantmedia.com>> wrote:

Ben,

Also look at Phoenix (Apache project) which provides a better (one of the best) SQL/JDBC layer on top of HBase.
http://phoenix.apache.org/

Cheers,
Jayesh


From: vincent gromakowski <vi...@gmail.com>>
Date: Monday, October 17, 2016 at 1:53 PM
To: Benjamin Kim <bb...@gmail.com>>
Cc: Michael Segel <ms...@hotmail.com>>, Jörn Franke <jo...@gmail.com>>, Mich Talebzadeh <mi...@gmail.com>>, Felix Cheung <fe...@hotmail.com>>, "user@spark.apache.org<ma...@spark.apache.org>" <us...@spark.apache.org>>
Subject: Re: Spark SQL Thriftserver with HBase

Instead of (or in addition to) saving results somewhere, you just start a thriftserver that exposes the Spark tables of the SQLContext (or SparkSession now). That means you can implement any logic (and maybe use structured streaming) to expose your data. Today, using the thriftserver means reading data from the persistent store on every query, so if the data modeling doesn't fit the query it can be quite slow. What you generally do in a common Spark job is load the data and cache it in an in-memory columnar table, which is quite efficient for any kind of query; the counterpart is that the cache isn't updated, so you have to implement a reload mechanism, and this solution isn't available using the standard thriftserver.
What I propose is to mix the two worlds: periodically/delta load data into the Spark table cache and expose it through the thriftserver. But you have to implement the loading logic, which can be very simple to very complex depending on your needs.


2016-10-17 19:48 GMT+02:00 Benjamin Kim <bb...@gmail.com>>:
Is this technique similar to what Kinesis is offering or what Structured Streaming is going to have eventually?

Just curious.

Cheers,
Ben


On Oct 17, 2016, at 10:14 AM, vincent gromakowski <vi...@gmail.com>> wrote:

I would suggest to code your own Spark thriftserver which seems to be very easy.
http://stackoverflow.com/questions/27108863/accessing-spark-sql-rdd-tables-through-the-thrift-server

I am starting to test it. The big advantage is that you can implement any logic because it's a spark job and then start a thrift server on temporary table. For example you can query a micro batch rdd from a kafka stream, or pre load some tables and implement a rolling cache to periodically update the spark in memory tables with persistent store...
It's not part of the public API and I don't know yet what are the issues doing this but I think Spark community should look at this path: making the thriftserver be instantiable in any spark job.

2016-10-17 18:17 GMT+02:00 Michael Segel <ms...@hotmail.com>>:
Guys,
Sorry for jumping in late to the game…

If memory serves (which may not be a good thing…) :

You can use HiveServer2 as a connection point to HBase.
While this doesn’t perform well, its probably the cleanest solution.
I’m not keen on Phoenix… wouldn’t recommend it….


The issue is that you’re trying to make HBase, a key/value object store, a Relational Engine… its not.

There are some considerations which make HBase not ideal for all use cases and you may find better performance with Parquet files.

One thing missing is the use of secondary indexing and query optimizations that you have in RDBMSs and are lacking in HBase / MapRDB / etc …  so your performance will vary.

With respect to Tableau… their entire interface in to the big data world revolves around the JDBC/ODBC interface. So if you don’t have that piece as part of your solution, you’re DOA w respect to Tableau.

Have you considered Drill as your JDBC connection point?  (YAAP: Yet another Apache project)


On Oct 9, 2016, at 12:23 PM, Benjamin Kim <bb...@gmail.com>> wrote:

Thanks for all the suggestions. It would seem you guys are right about the Tableau side of things. The reports don’t need to be real-time, and they won’t be directly feeding off of the main DMP HBase data. Instead, it’ll be batched to Parquet or Kudu/Impala or even PostgreSQL.

I originally thought that we needed two-way data retrieval from the DMP HBase for ID generation, but after further investigation into the use-case and architecture, the ID generation needs to happen local to the Ad Servers where we generate a unique ID and store it in a ID linking table. Even better, many of the 3rd party services supply this ID. So, data only needs to flow in one direction. We will use Kafka as the bus for this. No JDBC required. This is also goes for the REST Endpoints. 3rd party services will hit ours to update our data with no need to read from our data. And, when we want to update their data, we will hit theirs to update their data using a triggered job.

This all boils down to just integrating with Kafka.

Once again, thanks for all the help.

Cheers,
Ben


On Oct 9, 2016, at 3:16 AM, Jörn Franke <jo...@gmail.com>> wrote:

please keep also in mind that Tableau Server has the capabilities to store data in-memory and refresh only when needed the in-memory data. This means you can import it from any source and let your users work only on the in-memory data in Tableau Server.

On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke <jo...@gmail.com>> wrote:
Cloudera 5.8 has a very old version of Hive without Tez, but Mich provided already a good alternative. However, you should check if it contains a recent version of Hbase and Phoenix. That being said, I just wonder what is the dataflow, data model and the analysis you plan to do. Maybe there are completely different solutions possible. Especially these single inserts, upserts etc. should be avoided as much as possible in the Big Data (analysis) world with any technology, because they do not perform well.

Hive with Llap will provide an in-memory cache for interactive analytics. You can put full tables in-memory with Hive using Ignite HDFS in-memory solution. All this does only make sense if you do not use MR as an engine, the right input format (ORC, parquet) and a recent Hive version.

On 8 Oct 2016, at 21:55, Benjamin Kim <bb...@gmail.com>> wrote:
Mich,

Unfortunately, we are moving away from Hive and unifying on Spark using CDH 5.8 as our distro. And, the Tableau released a Spark ODBC/JDBC driver too. I will either try Phoenix JDBC Server for HBase or push to move faster to Kudu with Impala. We will use Impala as the JDBC in-between until the Kudu team completes Spark SQL support for JDBC.

Thanks for the advice.

Cheers,
Ben


On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh <mi...@gmail.com>> wrote:

Sure. But essentially you are looking at batch data for analytics for your tableau users so Hive may be a better choice with its rich SQL and ODBC.JDBC connection to Tableau already.

I would go for Hive especially the new release will have an in-memory offering as well for frequently accessed data :)


Dr Mich Talebzadeh

LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.


On 8 October 2016 at 20:15, Benjamin Kim <bb...@gmail.com>> wrote:
Mich,

First and foremost, we have visualization servers that run Tableau for external user reports. Second, we have servers that are ad servers and REST endpoints for cookie sync and segmentation data exchange. These will use JDBC directly within the same data-center. When not colocated in the same data-center, they will connected to a located database server using JDBC. Either way, by using JDBC everywhere, it simplifies and unifies the code on the JDBC industry standard.

Does this make sense?

Thanks,
Ben


On Oct 8, 2016, at 11:47 AM, Mich Talebzadeh <mi...@gmail.com>> wrote:

Like any other design what is your presentation layer and end users?

Are they SQL centric users from Tableau background or they may use spark functional programming.

It is best to describe the use case.

HTH

Dr Mich Talebzadeh

LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.


On 8 October 2016 at 19:40, Felix Cheung <fe...@hotmail.com>> wrote:
I wouldn't be too surprised Spark SQL - JDBC data source - Phoenix JDBC server - HBASE would work better.

Without naming specifics, there are at least 4 or 5 different implementations of HBASE sources, each at varying level of development and different requirements (HBASE release version, Kerberos support etc)


_____________________________
From: Benjamin Kim <bb...@gmail.com>>
Sent: Saturday, October 8, 2016 11:26 AM
Subject: Re: Spark SQL Thriftserver with HBase
To: Mich Talebzadeh <mi...@gmail.com>>
Cc: <us...@spark.apache.org>>, Felix Cheung <fe...@hotmail.com>>



Mich,

Are you talking about the Phoenix JDBC Server? If so, I forgot about that alternative.

Thanks,
Ben


On Oct 8, 2016, at 11:21 AM, Mich Talebzadeh <mi...@gmail.com>> wrote:

I don't think it will work

you can use phoenix on top of hbase

hbase(main):336:0> scan 'tsco', 'LIMIT' => 1
ROW                                                       COLUMN+CELL
 TSCO-1-Apr-08                                            column=stock_daily:Date, timestamp=1475866783376, value=1-Apr-08
 TSCO-1-Apr-08                                            column=stock_daily:close, timestamp=1475866783376, value=405.25
 TSCO-1-Apr-08                                            column=stock_daily:high, timestamp=1475866783376, value=406.75
 TSCO-1-Apr-08                                            column=stock_daily:low, timestamp=1475866783376, value=379.25
 TSCO-1-Apr-08                                            column=stock_daily:open, timestamp=1475866783376, value=380.00
 TSCO-1-Apr-08                                            column=stock_daily:stock, timestamp=1475866783376, value=TESCO PLC
 TSCO-1-Apr-08                                            column=stock_daily:ticker, timestamp=1475866783376, value=TSCO
 TSCO-1-Apr-08                                            column=stock_daily:volume, timestamp=1475866783376, value=49664486

And the same on Phoenix on top of the HBase table

0: jdbc:phoenix:thin:url=http://rhes564:8765<http://rhes564:8765/>> select substr(to_char(to_date("Date",'dd-MMM-yy')),1,10) AS TradeDate, "close" AS "Day's close", "high" AS "Day's High", "low" AS "Day's Low", "open" AS "Day's Open", "ticker", "volume", (to_number("low")+to_number("high"))/2 AS "AverageDailyPrice" from "tsco" where to_number("volume") > 0 and "high" != '-' and to_date("Date",'dd-MMM-yy') > to_date('2015-10-06','yyyy-MM-dd') order by  to_date("Date",'dd-MMM-yy') limit 1;
+-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
|  TRADEDATE  | Day's close  | Day's High  | Day's Low  | Day's Open  | ticker  |  volume   | AverageDailyPrice  |
+-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
| 2015-10-07  | 197.00       | 198.05      | 184.84     | 192.20      | TSCO    | 30046994  | 191.445            |


HTH







Dr Mich Talebzadeh

LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.


On 8 October 2016 at 19:05, Felix Cheung <fe...@hotmail.com>> wrote:
Great, then I think those packages as Spark data source should allow you to do exactly that (replace org.apache.spark.sql.jdbc with HBASE one)

I do think it will be great to get more examples around this though. Would be great if you could share your experience with this!


_____________________________
From: Benjamin Kim <bb...@gmail.com>>
Sent: Saturday, October 8, 2016 11:00 AM
Subject: Re: Spark SQL Thriftserver with HBase
To: Felix Cheung <fe...@hotmail.com>>
Cc: <us...@spark.apache.org>>


Felix,

My goal is to use Spark SQL JDBC Thriftserver to access HBase tables using just SQL. I have been able to CREATE tables using this statement below in the past:

CREATE TABLE <table-name>
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:postgresql://<hostname>:<port>/dm?user=<username>&password=<password>",
  dbtable "dim.dimension_acamp"
);

After doing this, I can access the PostgreSQL table using Spark SQL JDBC Thriftserver using SQL statements (SELECT, UPDATE, INSERT, etc.). I want to do the same with HBase tables. We tried this using Hive and HiveServer2, but the response times are just too long.

Thanks,
Ben


On Oct 8, 2016, at 10:53 AM, Felix Cheung <fe...@hotmail.com>> wrote:

Ben,

I'm not sure I'm following completely.

Is your goal to use Spark to create or access tables in HBASE? If so the link below and several packages out there support that by having a HBASE data source for Spark. There are some examples on how the Spark code look like in that link as well. On that note, you should also be able to use the HBASE data source from pure SQL (Spark SQL) query as well, which should work in the case with the Spark SQL JDBC Thrift Server (with USING,http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10).

_____________________________
From: Benjamin Kim <bb...@gmail.com>>
Sent: Saturday, October 8, 2016 10:40 AM
Subject: Re: Spark SQL Thriftserver with HBase
To: Felix Cheung <fe...@hotmail.com>>
Cc: <us...@spark.apache.org>>


Felix,

The only alternative way is to create a stored procedure (udf) in database terms that would run Spark scala code underneath. In this way, I can use Spark SQL JDBC Thriftserver to execute it using SQL code passing the key, values I want to UPSERT. I wonder if this is possible since I cannot CREATE a wrapper table on top of a HBase table in Spark SQL?

What do you think? Is this the right approach?

Thanks,
Ben

On Oct 8, 2016, at 10:33 AM, Felix Cheung <fe...@hotmail.com>> wrote:

HBase has released support for Spark
hbase.apache.org/book.html#spark<http://hbase.apache.org/book.html#spark>

And if you search you should find several alternative approaches.



On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" <bb...@gmail.com>> wrote:
Does anyone know if Spark can work with HBase tables using Spark SQL? I know in Hive we are able to create tables on top of an underlying HBase table that can be accessed using MapReduce jobs. Can the same be done using HiveContext or SQLContext? We are trying to setup a way to GET and POST data to and from the HBase table using the Spark SQL JDBC thriftserver from our RESTful API endpoints and/or HTTP web farms. If we can get this to work, then we can load balance the thriftservers. In addition, this will benefit us in giving us a way to abstract the data storage layer away from the presentation layer code. There is a chance that we will swap out the data storage technology in the future. We are currently experimenting with Kudu.

Thanks,
Ben
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Spark SQL Thriftserver with HBase

Posted by "Thakrar, Jayesh" <jt...@conversantmedia.com>.
Ben,

Also look at Phoenix (Apache project) which provides a better (one of the best) SQL/JDBC layer on top of HBase.
http://phoenix.apache.org/

Cheers,
Jayesh


From: vincent gromakowski <vi...@gmail.com>
Date: Monday, October 17, 2016 at 1:53 PM
To: Benjamin Kim <bb...@gmail.com>
Cc: Michael Segel <ms...@hotmail.com>, Jörn Franke <jo...@gmail.com>, Mich Talebzadeh <mi...@gmail.com>, Felix Cheung <fe...@hotmail.com>, "user@spark.apache.org" <us...@spark.apache.org>
Subject: Re: Spark SQL Thriftserver with HBase

Instead of (or in addition to) saving results somewhere, you just start a thriftserver that exposes the Spark tables of the SQLContext (or SparkSession now). That means you can implement any logic (and maybe use structured streaming) to expose your data. Today, using the thriftserver means reading data from the persistent store on every query, so if the data modeling doesn't fit the query it can be quite slow. What you generally do in a common Spark job is load the data and cache it in an in-memory columnar table, which is quite efficient for any kind of query; the counterpart is that the cache isn't updated, so you have to implement a reload mechanism, and this solution isn't available using the standard thriftserver.
What I propose is to mix the two worlds: periodically/delta load data into the Spark table cache and expose it through the thriftserver. But you have to implement the loading logic, which can be very simple to very complex depending on your needs.


2016-10-17 19:48 GMT+02:00 Benjamin Kim <bb...@gmail.com>>:
Is this technique similar to what Kinesis is offering or what Structured Streaming is going to have eventually?

Just curious.

Cheers,
Ben


On Oct 17, 2016, at 10:14 AM, vincent gromakowski <vi...@gmail.com>> wrote:

I would suggest to code your own Spark thriftserver which seems to be very easy.
http://stackoverflow.com/questions/27108863/accessing-spark-sql-rdd-tables-through-the-thrift-server

I am starting to test it. The big advantage is that you can implement any logic because it's a spark job and then start a thrift server on temporary table. For example you can query a micro batch rdd from a kafka stream, or pre load some tables and implement a rolling cache to periodically update the spark in memory tables with persistent store...
It's not part of the public API and I don't know yet what are the issues doing this but I think Spark community should look at this path: making the thriftserver be instantiable in any spark job.

2016-10-17 18:17 GMT+02:00 Michael Segel <ms...@hotmail.com>>:
Guys,
Sorry for jumping in late to the game…

If memory serves (which may not be a good thing…) :

You can use HiveServer2 as a connection point to HBase.
While this doesn’t perform well, its probably the cleanest solution.
I’m not keen on Phoenix… wouldn’t recommend it….


The issue is that you’re trying to make HBase, a key/value object store, a Relational Engine… its not.

There are some considerations which make HBase not ideal for all use cases and you may find better performance with Parquet files.

One thing missing is the use of secondary indexing and query optimizations that you have in RDBMSs and are lacking in HBase / MapRDB / etc …  so your performance will vary.

With respect to Tableau… their entire interface in to the big data world revolves around the JDBC/ODBC interface. So if you don’t have that piece as part of your solution, you’re DOA w respect to Tableau.

Have you considered Drill as your JDBC connection point?  (YAAP: Yet another Apache project)


On Oct 9, 2016, at 12:23 PM, Benjamin Kim <bb...@gmail.com>> wrote:

Thanks for all the suggestions. It would seem you guys are right about the Tableau side of things. The reports don’t need to be real-time, and they won’t be directly feeding off of the main DMP HBase data. Instead, it’ll be batched to Parquet or Kudu/Impala or even PostgreSQL.

I originally thought that we needed two-way data retrieval from the DMP HBase for ID generation, but after further investigation into the use-case and architecture, the ID generation needs to happen local to the Ad Servers where we generate a unique ID and store it in a ID linking table. Even better, many of the 3rd party services supply this ID. So, data only needs to flow in one direction. We will use Kafka as the bus for this. No JDBC required. This is also goes for the REST Endpoints. 3rd party services will hit ours to update our data with no need to read from our data. And, when we want to update their data, we will hit theirs to update their data using a triggered job.

This all boils down to just integrating with Kafka.

Once again, thanks for all the help.

Cheers,
Ben


On Oct 9, 2016, at 3:16 AM, Jörn Franke <jo...@gmail.com>> wrote:

please keep also in mind that Tableau Server has the capabilities to store data in-memory and refresh only when needed the in-memory data. This means you can import it from any source and let your users work only on the in-memory data in Tableau Server.

On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke <jo...@gmail.com>> wrote:
Cloudera 5.8 has a very old version of Hive without Tez, but Mich provided already a good alternative. However, you should check if it contains a recent version of Hbase and Phoenix. That being said, I just wonder what is the dataflow, data model and the analysis you plan to do. Maybe there are completely different solutions possible. Especially these single inserts, upserts etc. should be avoided as much as possible in the Big Data (analysis) world with any technology, because they do not perform well.

Hive with Llap will provide an in-memory cache for interactive analytics. You can put full tables in-memory with Hive using Ignite HDFS in-memory solution. All this does only make sense if you do not use MR as an engine, the right input format (ORC, parquet) and a recent Hive version.

On 8 Oct 2016, at 21:55, Benjamin Kim <bb...@gmail.com>> wrote:
Mich,

Unfortunately, we are moving away from Hive and unifying on Spark using CDH 5.8 as our distro. And, the Tableau released a Spark ODBC/JDBC driver too. I will either try Phoenix JDBC Server for HBase or push to move faster to Kudu with Impala. We will use Impala as the JDBC in-between until the Kudu team completes Spark SQL support for JDBC.

Thanks for the advice.

Cheers,
Ben


On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh <mi...@gmail.com>> wrote:

Sure. But essentially you are looking at batch data for analytics for your tableau users so Hive may be a better choice with its rich SQL and ODBC.JDBC connection to Tableau already.

I would go for Hive especially the new release will have an in-memory offering as well for frequently accessed data :)


Dr Mich Talebzadeh


LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw


http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.



On 8 October 2016 at 20:15, Benjamin Kim <bb...@gmail.com>> wrote:
Mich,

First and foremost, we have visualization servers that run Tableau for external user reports. Second, we have servers that are ad servers and REST endpoints for cookie sync and segmentation data exchange. These will use JDBC directly within the same data-center. When not colocated in the same data-center, they will connected to a located database server using JDBC. Either way, by using JDBC everywhere, it simplifies and unifies the code on the JDBC industry standard.

Does this make sense?

Thanks,
Ben


On Oct 8, 2016, at 11:47 AM, Mich Talebzadeh <mi...@gmail.com>> wrote:

Like any other design, what are your presentation layer and end users?

Are they SQL-centric users from a Tableau background, or might they use Spark functional programming?

It is best to describe the use case.

HTH

Dr Mich Talebzadeh


LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw


http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.



On 8 October 2016 at 19:40, Felix Cheung <fe...@hotmail.com>> wrote:
I wouldn't be too surprised if Spark SQL - JDBC data source - Phoenix JDBC server - HBASE would work better.

Without naming specifics, there are at least 4 or 5 different implementations of HBASE sources, each at a varying level of development and with different requirements (HBASE release version, Kerberos support etc)
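
For what it's worth, a minimal sketch of that first path (Spark reading Phoenix through the Phoenix Query Server's thin JDBC client) might look like the following; the host comes from the example later in this thread, and the driver class and URL options are assumptions to verify against your Phoenix version (assuming a Spark 2.x SparkSession named spark):

// Sketch only: register a Phoenix table as a Spark SQL view via the thin JDBC driver.
val tsco = spark.read.format("jdbc")
  .option("url", "jdbc:phoenix:thin:url=http://rhes564:8765;serialization=PROTOBUF")
  .option("driver", "org.apache.phoenix.queryserver.client.Driver")
  .option("dbtable", "\"tsco\"")
  .load()

tsco.createOrReplaceTempView("tsco")   // now queryable from the same Spark SQL session

Whether Spark's generic JDBC dialect pushes predicates down to Phoenix efficiently is exactly the kind of thing that would need testing.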


_____________________________
From: Benjamin Kim <bb...@gmail.com>>
Sent: Saturday, October 8, 2016 11:26 AM
Subject: Re: Spark SQL Thriftserver with HBase
To: Mich Talebzadeh <mi...@gmail.com>>
Cc: <us...@spark.apache.org>>, Felix Cheung <fe...@hotmail.com>>



Mich,

Are you talking about the Phoenix JDBC Server? If so, I forgot about that alternative.

Thanks,
Ben


On Oct 8, 2016, at 11:21 AM, Mich Talebzadeh <mi...@gmail.com>> wrote:

I don't think it will work

you can use phoenix on top of hbase

hbase(main):336:0> scan 'tsco', 'LIMIT' => 1
ROW                                                       COLUMN+CELL
 TSCO-1-Apr-08                                            column=stock_daily:Date, timestamp=1475866783376, value=1-Apr-08
 TSCO-1-Apr-08                                            column=stock_daily:close, timestamp=1475866783376, value=405.25
 TSCO-1-Apr-08                                            column=stock_daily:high, timestamp=1475866783376, value=406.75
 TSCO-1-Apr-08                                            column=stock_daily:low, timestamp=1475866783376, value=379.25
 TSCO-1-Apr-08                                            column=stock_daily:open, timestamp=1475866783376, value=380.00
 TSCO-1-Apr-08                                            column=stock_daily:stock, timestamp=1475866783376, value=TESCO PLC
 TSCO-1-Apr-08                                            column=stock_daily:ticker, timestamp=1475866783376, value=TSCO
 TSCO-1-Apr-08                                            column=stock_daily:volume, timestamp=1475866783376, value=49664486

And the same on Phoenix on top of the HBase table

0: jdbc:phoenix:thin:url=http://rhes564:8765<http://rhes564:8765/>> select substr(to_char(to_date("Date",'dd-MMM-yy')),1,10) AS TradeDate, "close" AS "Day's close", "high" AS "Day's High", "low" AS "Day's Low", "open" AS "Day's Open", "ticker", "volume", (to_number("low")+to_number("high"))/2 AS "AverageDailyPrice" from "tsco" where to_number("volume") > 0 and "high" != '-' and to_date("Date",'dd-MMM-yy') > to_date('2015-10-06','yyyy-MM-dd') order by  to_date("Date",'dd-MMM-yy') limit 1;
+-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
|  TRADEDATE  | Day's close  | Day's High  | Day's Low  | Day's Open  | ticker  |  volume   | AverageDailyPrice  |
+-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
| 2015-10-07  | 197.00       | 198.05      | 184.84     | 192.20      | TSCO    | 30046994  | 191.445            |


HTH





Dr Mich Talebzadeh


LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw


http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.



On 8 October 2016 at 19:05, Felix Cheung <fe...@hotmail.com>> wrote:
Great, then I think those packages, as Spark data sources, should allow you to do exactly that (replace org.apache.spark.sql.jdbc with an HBASE one)

I do think it will be great to get more examples around this though. Would be great if you could share your experience with this!


_____________________________
From: Benjamin Kim <bb...@gmail.com>>
Sent: Saturday, October 8, 2016 11:00 AM
Subject: Re: Spark SQL Thriftserver with HBase
To: Felix Cheung <fe...@hotmail.com>>
Cc: <us...@spark.apache.org>>


Felix,

My goal is to use Spark SQL JDBC Thriftserver to access HBase tables using just SQL. I have been able to CREATE tables using this statement below in the past:

CREATE TABLE <table-name>
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:postgresql://<hostname>:<port>/dm?user=<username>&password=<password>",
  dbtable "dim.dimension_acamp"
);

After doing this, I can access the PostgreSQL table through the Spark SQL JDBC Thriftserver using SQL statements (SELECT, UPDATE, INSERT, etc.). I want to do the same with HBase tables. We tried this using Hive and HiveServer2, but the response times are just too long.

Thanks,
Ben


On Oct 8, 2016, at 10:53 AM, Felix Cheung <fe...@hotmail.com>> wrote:

Ben,

I'm not sure I'm following completely.

Is your goal to use Spark to create or access tables in HBASE? If so, the link below and several packages out there support that by having an HBASE data source for Spark. There are some examples of what the Spark code looks like in that link as well. On that note, you should also be able to use the HBASE data source from a pure SQL (Spark SQL) query as well, which should work in the case with the Spark SQL JDBC Thrift Server (with USING, http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10).

_____________________________
From: Benjamin Kim <bb...@gmail.com>>
Sent: Saturday, October 8, 2016 10:40 AM
Subject: Re: Spark SQL Thriftserver with HBase
To: Felix Cheung <fe...@hotmail.com>>
Cc: <us...@spark.apache.org>>


Felix,

The only alternative way is to create a stored procedure (UDF, in database terms) that would run Spark Scala code underneath. In this way, I can use the Spark SQL JDBC Thriftserver to execute it using SQL code, passing the key and values I want to UPSERT. I wonder if this is possible, since I cannot CREATE a wrapper table on top of an HBase table in Spark SQL?
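
A minimal sketch of that idea: a UDF registered on the session the thriftserver exposes, which issues the HBase Put. The table, column family and qualifier names here are made up for illustration, and a Spark 2.x session named spark is assumed (sqlContext.udf.register on 1.x):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

// Sketch: callable from any JDBC client as SELECT hbase_upsert('row-key', 'value')
spark.udf.register("hbase_upsert", (rowKey: String, value: String) => {
  val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
  try {
    val table = conn.getTable(TableName.valueOf("dmp_profiles"))                        // made-up table name
    val put = new Put(Bytes.toBytes(rowKey))
    put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("segment"), Bytes.toBytes(value))   // made-up cf:qualifier
    table.put(put)
    table.close()
    "OK"
  } finally conn.close()
})

This only helps if the thriftserver is started on that same session (e.g. via HiveThriftServer2.startWithContext, as discussed elsewhere in this thread), and a real version would reuse a pooled HBase connection rather than opening one per call.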

What do you think? Is this the right approach?

Thanks,
Ben

On Oct 8, 2016, at 10:33 AM, Felix Cheung <fe...@hotmail.com>> wrote:

HBase has released support for Spark
hbase.apache.org/book.html#spark<http://hbase.apache.org/book.html#spark>

And if you search you should find several alternative approaches.



On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" <bb...@gmail.com>> wrote:
Does anyone know if Spark can work with HBase tables using Spark SQL? I know in Hive we are able to create tables on top of an underlying HBase table that can be accessed using MapReduce jobs. Can the same be done using HiveContext or SQLContext? We are trying to setup a way to GET and POST data to and from the HBase table using the Spark SQL JDBC thriftserver from our RESTful API endpoints and/or HTTP web farms. If we can get this to work, then we can load balance the thriftservers. In addition, this will benefit us in giving us a way to abstract the data storage layer away from the presentation layer code. There is a chance that we will swap out the data storage technology in the future. We are currently experimenting with Kudu.

Thanks,
Ben
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org<ma...@spark.apache.org>


Re: Spark SQL Thriftserver with HBase

Posted by Benjamin Kim <bb...@gmail.com>.
This will give me an opportunity to start using Structured Streaming. Then, I can try adding more functionality. If all goes well, then we could transition off of HBase to a more in-memory data solution that can “spill-over” data for us.
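
For a sense of what that might look like, here is a minimal Structured Streaming sketch (Spark 2.0.2+/2.1 with the spark-sql-kafka-0-10 package on the classpath; broker and topic names are placeholders) that lands a Kafka feed in an in-memory table the same SQL session can query:

// Sketch: stream a Kafka topic into a named in-memory table.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka1:9092")
  .option("subscribe", "id_links")
  .load()
  .selectExpr("CAST(key AS STRING) AS id", "CAST(value AS STRING) AS payload")

val query = events.writeStream
  .format("memory")            // keeps all results in driver memory, so only for modest volumes
  .queryName("id_links_live")
  .outputMode("append")
  .start()

// spark.sql("SELECT * FROM id_links_live") now reflects the stream as new micro-batches arrive.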

> On Oct 17, 2016, at 11:53 AM, vincent gromakowski <vi...@gmail.com> wrote:
> 
> Instead of (or additionally to) saving results somewhere, you just start a thriftserver that exposes the Spark tables of the SQLContext (or SparkSession now). That means you can implement any logic (and maybe use Structured Streaming) to expose your data. Today, using the thriftserver means reading data from the persistent store on every query, so if the data modeling doesn't fit the query it can be quite slow. What you generally do in a common Spark job is to load the data and cache it as a Spark table in an in-memory columnar format, which is quite efficient for any kind of query; the counterpart is that the cache isn't updated, so you have to implement a reload mechanism, and this solution isn't available using the thriftserver.
> What I propose is to mix the two worlds: periodically/delta load data into the Spark table cache and expose it through the thriftserver. But you have to implement the loading logic, which can be very simple or very complex depending on your needs.
> 
> 
> 2016-10-17 19:48 GMT+02:00 Benjamin Kim <bbuild11@gmail.com <ma...@gmail.com>>:
> Is this technique similar to what Kinesis is offering or what Structured Streaming is going to have eventually?
> 
> Just curious.
> 
> Cheers,
> Ben
> 
>  
>> On Oct 17, 2016, at 10:14 AM, vincent gromakowski <vincent.gromakowski@gmail.com <ma...@gmail.com>> wrote:
>> 
>> I would suggest to code your own Spark thriftserver which seems to be very easy.
>> http://stackoverflow.com/questions/27108863/accessing-spark-sql-rdd-tables-through-the-thrift-server <http://stackoverflow.com/questions/27108863/accessing-spark-sql-rdd-tables-through-the-thrift-server>
>> 
>> I am starting to test it. The big advantage is that you can implement any logic because it's a spark job and then start a thrift server on temporary table. For example you can query a micro batch rdd from a kafka stream, or pre load some tables and implement a rolling cache to periodically update the spark in memory tables with persistent store...
>> It's not part of the public API and I don't know yet what are the issues doing this but I think Spark community should look at this path: making the thriftserver be instantiable in any spark job.
>> 
>> 2016-10-17 18:17 GMT+02:00 Michael Segel <msegel_hadoop@hotmail.com <ma...@hotmail.com>>:
>> Guys, 
>> Sorry for jumping in late to the game… 
>> 
>> If memory serves (which may not be a good thing…) :
>> 
>> You can use HiveServer2 as a connection point to HBase.  
>> While this doesn’t perform well, its probably the cleanest solution. 
>> I’m not keen on Phoenix… wouldn’t recommend it…. 
>> 
>> 
>> The issue is that you’re trying to make HBase, a key/value object store, a Relational Engine… its not. 
>> 
>> There are some considerations which make HBase not ideal for all use cases and you may find better performance with Parquet files. 
>> 
>> One thing missing is the use of secondary indexing and query optimizations that you have in RDBMSs and are lacking in HBase / MapRDB / etc …  so your performance will vary. 
>> 
>> With respect to Tableau… their entire interface in to the big data world revolves around the JDBC/ODBC interface. So if you don’t have that piece as part of your solution, you’re DOA w respect to Tableau. 
>> 
>> Have you considered Drill as your JDBC connection point?  (YAAP: Yet another Apache project) 
>> 
>> 
>>> On Oct 9, 2016, at 12:23 PM, Benjamin Kim <bbuild11@gmail.com <ma...@gmail.com>> wrote:
>>> 
>>> Thanks for all the suggestions. It would seem you guys are right about the Tableau side of things. The reports don’t need to be real-time, and they won’t be directly feeding off of the main DMP HBase data. Instead, it’ll be batched to Parquet or Kudu/Impala or even PostgreSQL.
>>> 
>>> I originally thought that we needed two-way data retrieval from the DMP HBase for ID generation, but after further investigation into the use-case and architecture, the ID generation needs to happen local to the Ad Servers where we generate a unique ID and store it in a ID linking table. Even better, many of the 3rd party services supply this ID. So, data only needs to flow in one direction. We will use Kafka as the bus for this. No JDBC required. This is also goes for the REST Endpoints. 3rd party services will hit ours to update our data with no need to read from our data. And, when we want to update their data, we will hit theirs to update their data using a triggered job.
>>> 
>>> This al boils down to just integrating with Kafka.
>>> 
>>> Once again, thanks for all the help.
>>> 
>>> Cheers,
>>> Ben
>>> 
>>> 
>>>> On Oct 9, 2016, at 3:16 AM, Jörn Franke <jornfranke@gmail.com <ma...@gmail.com>> wrote:
>>>> 
>>>> please keep also in mind that Tableau Server has the capabilities to store data in-memory and refresh only when needed the in-memory data. This means you can import it from any source and let your users work only on the in-memory data in Tableau Server.
>>>> 
>>>> On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke <jornfranke@gmail.com <ma...@gmail.com>> wrote:
>>>> Cloudera 5.8 has a very old version of Hive without Tez, but Mich provided already a good alternative. However, you should check if it contains a recent version of Hbase and Phoenix. That being said, I just wonder what is the dataflow, data model and the analysis you plan to do. Maybe there are completely different solutions possible. Especially these single inserts, upserts etc. should be avoided as much as possible in the Big Data (analysis) world with any technology, because they do not perform well. 
>>>> 
>>>> Hive with Llap will provide an in-memory cache for interactive analytics. You can put full tables in-memory with Hive using Ignite HDFS in-memory solution. All this does only make sense if you do not use MR as an engine, the right input format (ORC, parquet) and a recent Hive version.
>>>> 
>>>> On 8 Oct 2016, at 21:55, Benjamin Kim <bbuild11@gmail.com <ma...@gmail.com>> wrote:
>>>> 
>>>>> Mich,
>>>>> 
>>>>> Unfortunately, we are moving away from Hive and unifying on Spark using CDH 5.8 as our distro. And, the Tableau released a Spark ODBC/JDBC driver too. I will either try Phoenix JDBC Server for HBase or push to move faster to Kudu with Impala. We will use Impala as the JDBC in-between until the Kudu team completes Spark SQL support for JDBC.
>>>>> 
>>>>> Thanks for the advice.
>>>>> 
>>>>> Cheers,
>>>>> Ben
>>>>> 
>>>>> 
>>>>>> On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>>>>>> 
>>>>>> Sure. But essentially you are looking at batch data for analytics for your tableau users so Hive may be a better choice with its rich SQL and ODBC.JDBC connection to Tableau already.
>>>>>> 
>>>>>> I would go for Hive especially the new release will have an in-memory offering as well for frequently accessed data :)
>>>>>> 
>>>>>> 
>>>>>> Dr Mich Talebzadeh
>>>>>>  
>>>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>>>>>  
>>>>>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>>>>> 
>>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>>>>>  
>>>>>> 
>>>>>> On 8 October 2016 at 20:15, Benjamin Kim <bbuild11@gmail.com <ma...@gmail.com>> wrote:
>>>>>> Mich,
>>>>>> 
>>>>>> First and foremost, we have visualization servers that run Tableau for external user reports. Second, we have servers that are ad servers and REST endpoints for cookie sync and segmentation data exchange. These will use JDBC directly within the same data-center. When not colocated in the same data-center, they will connected to a located database server using JDBC. Either way, by using JDBC everywhere, it simplifies and unifies the code on the JDBC industry standard.
>>>>>> 
>>>>>> Does this make sense?
>>>>>> 
>>>>>> Thanks,
>>>>>> Ben
>>>>>> 
>>>>>> 
>>>>>>> On Oct 8, 2016, at 11:47 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>>>>>>> 
>>>>>>> Like any other design what is your presentation layer and end users?
>>>>>>> 
>>>>>>> Are they SQL centric users from Tableau background or they may use spark functional programming.
>>>>>>> 
>>>>>>> It is best to describe the use case.
>>>>>>> 
>>>>>>> HTH
>>>>>>> 
>>>>>>> Dr Mich Talebzadeh
>>>>>>>  
>>>>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>>>>>>  
>>>>>>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>>>>>> 
>>>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>>>>>>  
>>>>>>> 
>>>>>>> On 8 October 2016 at 19:40, Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>> wrote:
>>>>>>> I wouldn't be too surprised Spark SQL - JDBC data source - Phoenix JDBC server - HBASE would work better.
>>>>>>> 
>>>>>>> Without naming specifics, there are at least 4 or 5 different implementations of HBASE sources, each at varying level of development and different requirements (HBASE release version, Kerberos support etc)
>>>>>>> 
>>>>>>> 
>>>>>>> _____________________________
>>>>>>> From: Benjamin Kim <bbuild11@gmail.com <ma...@gmail.com>>
>>>>>>> Sent: Saturday, October 8, 2016 11:26 AM
>>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>>> To: Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>>
>>>>>>> Cc: <user@spark.apache.org <ma...@spark.apache.org>>, Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>>
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Mich,
>>>>>>> 
>>>>>>> Are you talking about the Phoenix JDBC Server? If so, I forgot about that alternative.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Ben
>>>>>>> 
>>>>>>> 
>>>>>>> On Oct 8, 2016, at 11:21 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>>>>>>> 
>>>>>>> I don't think it will work
>>>>>>> 
>>>>>>> you can use phoenix on top of hbase
>>>>>>> 
>>>>>>> hbase(main):336:0> scan 'tsco', 'LIMIT' => 1
>>>>>>> ROW                                                       COLUMN+CELL
>>>>>>>  TSCO-1-Apr-08                                            column=stock_daily:Date, timestamp=1475866783376, value=1-Apr-08
>>>>>>>  TSCO-1-Apr-08                                            column=stock_daily:close, timestamp=1475866783376, value=405.25
>>>>>>>  TSCO-1-Apr-08                                            column=stock_daily:high, timestamp=1475866783376, value=406.75
>>>>>>>  TSCO-1-Apr-08                                            column=stock_daily:low, timestamp=1475866783376, value=379.25
>>>>>>>  TSCO-1-Apr-08                                            column=stock_daily:open, timestamp=1475866783376, value=380.00
>>>>>>>  TSCO-1-Apr-08                                            column=stock_daily:stock, timestamp=1475866783376, value=TESCO PLC
>>>>>>>  TSCO-1-Apr-08                                            column=stock_daily:ticker, timestamp=1475866783376, value=TSCO
>>>>>>>  TSCO-1-Apr-08                                            column=stock_daily:volume, timestamp=1475866783376, value=49664486
>>>>>>> 
>>>>>>> And the same on Phoenix on top of Hvbase table
>>>>>>> 
>>>>>>> 0: jdbc:phoenix:thin:url=http://rhes564:8765 <http://rhes564:8765/>> select substr(to_char(to_date("Date",'dd-MMM-yy')),1,10) AS TradeDate, "close" AS "Day's close", "high" AS "Day's High", "low" AS "Day's Low", "open" AS "Day's Open", "ticker", "volume", (to_number("low")+to_number("high"))/2 AS "AverageDailyPrice" from "tsco" where to_number("volume") > 0 and "high" != '-' and to_date("Date",'dd-MMM-yy') > to_date('2015-10-06','yyyy-MM-dd') order by  to_date("Date",'dd-MMM-yy') limit 1;
>>>>>>> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
>>>>>>> |  TRADEDATE  | Day's close  | Day's High  | Day's Low  | Day's Open  | ticker  |  volume   | AverageDailyPrice  |
>>>>>>> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
>>>>>>> | 2015-10-07  | 197.00       | 198.05      | 184.84     | 192.20      | TSCO    | 30046994  | 191.445            |
>>>>>>> 
>>>>>>> HTH
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Dr Mich Talebzadeh
>>>>>>>  
>>>>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>>>>>>  
>>>>>>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>>>>>> 
>>>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>>>>>>  
>>>>>>> 
>>>>>>> On 8 October 2016 at 19:05, Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>> wrote:
>>>>>>> Great, then I think those packages as Spark data source should allow you to do exactly that (replace org.apache.spark.sql.jdbc with HBASE one)
>>>>>>> 
>>>>>>> I do think it will be great to get more examples around this though. Would be great if you could share your experience with this!
>>>>>>> 
>>>>>>> 
>>>>>>> _____________________________
>>>>>>> From: Benjamin Kim <bbuild11@gmail.com <ma...@gmail.com>>
>>>>>>> Sent: Saturday, October 8, 2016 11:00 AM
>>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>>> To: Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>>
>>>>>>> Cc: <user@spark.apache.org <ma...@spark.apache.org>>
>>>>>>> 
>>>>>>> 
>>>>>>> Felix,
>>>>>>> 
>>>>>>> My goal is to use Spark SQL JDBC Thriftserver to access HBase tables using just SQL. I have been able to CREATE tables using this statement below in the past:
>>>>>>> 
>>>>>>> CREATE TABLE <table-name>
>>>>>>> USING org.apache.spark.sql.jdbc
>>>>>>> OPTIONS (
>>>>>>>   url "jdbc:postgresql://<hostname>:<port>/dm?user=<username>&password=<password>",
>>>>>>>   dbtable "dim.dimension_acamp"
>>>>>>> );
>>>>>>> 
>>>>>>> After doing this, I can access the PostgreSQL table using Spark SQL JDBC Thriftserver using SQL statements (SELECT, UPDATE, INSERT, etc.). I want to do the same with HBase tables. We tried this using Hive and HiveServer2, but the response times are just too long.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Ben
>>>>>>> 
>>>>>>> 
>>>>>>> On Oct 8, 2016, at 10:53 AM, Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>> wrote:
>>>>>>> 
>>>>>>> Ben,
>>>>>>> 
>>>>>>> I'm not sure I'm following completely.
>>>>>>> 
>>>>>>> Is your goal to use Spark to create or access tables in HBASE? If so the link below and several packages out there support that by having a HBASE data source for Spark. There are some examples on how the Spark code look like in that link as well. On that note, you should also be able to use the HBASE data source from pure SQL (Spark SQL) query as well, which should work in the case with the Spark SQL JDBC Thrift Server (with USING,http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10 <http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10>).
>>>>>>> 
>>>>>>> 
>>>>>>> _____________________________
>>>>>>> From: Benjamin Kim <bbuild11@gmail.com <ma...@gmail.com>>
>>>>>>> Sent: Saturday, October 8, 2016 10:40 AM
>>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>>> To: Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>>
>>>>>>> Cc: <user@spark.apache.org <ma...@spark.apache.org>>
>>>>>>> 
>>>>>>> 
>>>>>>> Felix,
>>>>>>> 
>>>>>>> The only alternative way is to create a stored procedure (udf) in database terms that would run Spark scala code underneath. In this way, I can use Spark SQL JDBC Thriftserver to execute it using SQL code passing the key, values I want to UPSERT. I wonder if this is possible since I cannot CREATE a wrapper table on top of a HBase table in Spark SQL?
>>>>>>> 
>>>>>>> What do you think? Is this the right approach?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Ben
>>>>>>> 
>>>>>>> On Oct 8, 2016, at 10:33 AM, Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>> wrote:
>>>>>>> 
>>>>>>> HBase has released support for Spark
>>>>>>> hbase.apache.org/book.html#spark <http://hbase.apache.org/book.html#spark>
>>>>>>> 
>>>>>>> And if you search you should find several alternative approaches.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" <bbuild11@gmail.com <ma...@gmail.com>> wrote:
>>>>>>> 
>>>>>>> Does anyone know if Spark can work with HBase tables using Spark SQL? I know in Hive we are able to create tables on top of an underlying HBase table that can be accessed using MapReduce jobs. Can the same be done using HiveContext or SQLContext? We are trying to setup a way to GET and POST data to and from the HBase table using the Spark SQL JDBC thriftserver from our RESTful API endpoints and/or HTTP web farms. If we can get this to work, then we can load balance the thriftservers. In addition, this will benefit us in giving us a way to abstract the data storage layer away from the presentation layer code. There is a chance that we will swap out the data storage technology in the future. We are currently experimenting with Kudu.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Ben
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org <ma...@spark.apache.org>
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
>> 
> 
> 


Re: Spark SQL Thriftserver with HBase

Posted by Michael Segel <ms...@hotmail.com>.
@Mitch

You don’t have a schema in HBase other than the table name and the list of associated column families.

So you can’t really infer a schema easily…
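
Which is why the packages that do expose HBase through Spark SQL make you declare the schema and mapping by hand. As a hedged example (data source name and option keys as I recall them from the hbase-spark module in the HBase book linked earlier in this thread; verify against your connector, and a Spark 2.x session named spark is assumed):

// Rough sketch: the row key and column-family:qualifier mapping is declared, not inferred.
// Assumes the HBase configuration is available (hbase-site.xml on the classpath / an HBaseContext set up beforehand).
val df = spark.read
  .format("org.apache.hadoop.hbase.spark")
  .option("hbase.table", "tsco")
  .option("hbase.columns.mapping",
    "rowkey STRING :key, ticker STRING stock_daily:ticker, close STRING stock_daily:close")
  .load()

df.createOrReplaceTempView("tsco")   // now visible to Spark SQL in this session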


On Oct 17, 2016, at 2:17 PM, Mich Talebzadeh <mi...@gmail.com>> wrote:

How about this method of creating DataFrames on HBase tables directly?

I define an RDD for each column in the column family as below. In this case, the column is trade_info:ticker

//create rdd
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result])
val rdd1 = hBaseRDD.map(tuple => tuple._2).map(result => (result.getRow, result.getColumn("price_info".getBytes(), "ticker".getBytes()))).map(row => {
(
  row._1.map(_.toChar).mkString,
  row._2.asScala.reduceLeft {
    (a, b) => if (a.getTimestamp > b.getTimestamp) a else b
  }.getValue.map(_.toChar).mkString
)
})
case class columns (key: String, ticker: String)
val dfticker = rdd1.toDF.map(p => columns(p(0).toString,p(1).toString))

Note that the end result is a DataFrame with the RowKey -> key and column -> ticker

I use the same approach to create two other DataFrames, namely dftimecreated and dfprice for the two other columns.

Note that if I don't need a column, then I do not create a DF for it. So there is a DF for each column I use. I am not sure how this compares to reading the full row through other methods, if any.

Anyway, all I need to do after creating a DataFrame for each column is to join them through the RowKey to slice and dice data. Like below.

Get me the latest prices ordered by timecreated and ticker (ticker is stock)

val rs = dfticker.join(dftimecreated,"key").join(dfprice,"key").orderBy('timecreated desc, 'price desc).select('timecreated, 'ticker, 'price.cast("Float").as("Latest price"))
rs.show(10)

+-------------------+------+------------+
|        timecreated|ticker|Latest price|
+-------------------+------+------------+
|2016-10-16T18:44:57|   S16|   97.631966|
|2016-10-16T18:44:57|   S13|    92.11406|
|2016-10-16T18:44:57|   S19|    85.93021|
|2016-10-16T18:44:57|   S09|   85.714645|
|2016-10-16T18:44:57|   S15|    82.38932|
|2016-10-16T18:44:57|   S17|    80.77747|
|2016-10-16T18:44:57|   S06|    79.81854|
|2016-10-16T18:44:57|   S18|    74.10128|
|2016-10-16T18:44:57|   S07|    66.13622|
|2016-10-16T18:44:57|   S20|    60.35727|
+-------------------+------+------------+
only showing top 10 rows

Dr Mich Talebzadeh



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.



On 17 October 2016 at 19:53, vincent gromakowski <vi...@gmail.com>> wrote:
Instead of (or additionally to) saving results somewhere, you just start a thriftserver that exposes the Spark tables of the SQLContext (or SparkSession now). That means you can implement any logic (and maybe use Structured Streaming) to expose your data. Today, using the thriftserver means reading data from the persistent store on every query, so if the data modeling doesn't fit the query it can be quite slow. What you generally do in a common Spark job is to load the data and cache it as a Spark table in an in-memory columnar format, which is quite efficient for any kind of query; the counterpart is that the cache isn't updated, so you have to implement a reload mechanism, and this solution isn't available using the thriftserver.
What I propose is to mix the two worlds: periodically/delta load data into the Spark table cache and expose it through the thriftserver. But you have to implement the loading logic, which can be very simple or very complex depending on your needs.
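
A minimal sketch of that pattern, assuming a Hive-enabled SparkSession named spark and the spark-hive-thriftserver module on the classpath (the source path and table name are placeholders):

import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

// Load (or periodically reload) a snapshot into a cached in-memory table...
val snapshot = spark.read.parquet("/data/dmp/profiles")      // placeholder source
snapshot.createOrReplaceTempView("profiles")
spark.catalog.cacheTable("profiles")
spark.table("profiles").count()                               // force the cache to materialize

// ...then expose this very session over JDBC/ODBC instead of the stock start-thriftserver.sh.
HiveThriftServer2.startWithContext(spark.sqlContext)

// beeline / Tableau can now query "profiles" from the cache; a refresh job would re-read,
// re-register and re-cache the view on whatever schedule fits.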


2016-10-17 19:48 GMT+02:00 Benjamin Kim <bb...@gmail.com>>:
Is this technique similar to what Kinesis is offering or what Structured Streaming is going to have eventually?

Just curious.

Cheers,
Ben


On Oct 17, 2016, at 10:14 AM, vincent gromakowski <vi...@gmail.com>> wrote:

I would suggest to code your own Spark thriftserver which seems to be very easy.
http://stackoverflow.com/questions/27108863/accessing-spark-sql-rdd-tables-through-the-thrift-server

I am starting to test it. The big advantage is that you can implement any logic because it's a spark job and then start a thrift server on temporary table. For example you can query a micro batch rdd from a kafka stream, or pre load some tables and implement a rolling cache to periodically update the spark in memory tables with persistent store...
It's not part of the public API and I don't know yet what are the issues doing this but I think Spark community should look at this path: making the thriftserver be instantiable in any spark job.

2016-10-17 18:17 GMT+02:00 Michael Segel <ms...@hotmail.com>>:
Guys,
Sorry for jumping in late to the game…

If memory serves (which may not be a good thing…) :

You can use HiveServer2 as a connection point to HBase.
While this doesn’t perform well, its probably the cleanest solution.
I’m not keen on Phoenix… wouldn’t recommend it….


The issue is that you’re trying to make HBase, a key/value object store, a Relational Engine… its not.

There are some considerations which make HBase not ideal for all use cases and you may find better performance with Parquet files.

One thing missing is the use of secondary indexing and query optimizations that you have in RDBMSs and are lacking in HBase / MapRDB / etc …  so your performance will vary.

With respect to Tableau… their entire interface in to the big data world revolves around the JDBC/ODBC interface. So if you don’t have that piece as part of your solution, you’re DOA w respect to Tableau.

Have you considered Drill as your JDBC connection point?  (YAAP: Yet another Apache project)


On Oct 9, 2016, at 12:23 PM, Benjamin Kim <bb...@gmail.com>> wrote:

Thanks for all the suggestions. It would seem you guys are right about the Tableau side of things. The reports don’t need to be real-time, and they won’t be directly feeding off of the main DMP HBase data. Instead, it’ll be batched to Parquet or Kudu/Impala or even PostgreSQL.

I originally thought that we needed two-way data retrieval from the DMP HBase for ID generation, but after further investigation into the use-case and architecture, the ID generation needs to happen local to the Ad Servers where we generate a unique ID and store it in a ID linking table. Even better, many of the 3rd party services supply this ID. So, data only needs to flow in one direction. We will use Kafka as the bus for this. No JDBC required. This is also goes for the REST Endpoints. 3rd party services will hit ours to update our data with no need to read from our data. And, when we want to update their data, we will hit theirs to update their data using a triggered job.

This al boils down to just integrating with Kafka.

Once again, thanks for all the help.

Cheers,
Ben


On Oct 9, 2016, at 3:16 AM, Jörn Franke <jo...@gmail.com>> wrote:

please keep also in mind that Tableau Server has the capabilities to store data in-memory and refresh only when needed the in-memory data. This means you can import it from any source and let your users work only on the in-memory data in Tableau Server.

On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke <jo...@gmail.com>> wrote:
Cloudera 5.8 has a very old version of Hive without Tez, but Mich provided already a good alternative. However, you should check if it contains a recent version of Hbase and Phoenix. That being said, I just wonder what is the dataflow, data model and the analysis you plan to do. Maybe there are completely different solutions possible. Especially these single inserts, upserts etc. should be avoided as much as possible in the Big Data (analysis) world with any technology, because they do not perform well.

Hive with Llap will provide an in-memory cache for interactive analytics. You can put full tables in-memory with Hive using Ignite HDFS in-memory solution. All this does only make sense if you do not use MR as an engine, the right input format (ORC, parquet) and a recent Hive version.

On 8 Oct 2016, at 21:55, Benjamin Kim <bb...@gmail.com>> wrote:

Mich,

Unfortunately, we are moving away from Hive and unifying on Spark using CDH 5.8 as our distro. And, the Tableau released a Spark ODBC/JDBC driver too. I will either try Phoenix JDBC Server for HBase or push to move faster to Kudu with Impala. We will use Impala as the JDBC in-between until the Kudu team completes Spark SQL support for JDBC.

Thanks for the advice.

Cheers,
Ben


On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh <mi...@gmail.com>> wrote:

Sure. But essentially you are looking at batch data for analytics for your tableau users so Hive may be a better choice with its rich SQL and ODBC.JDBC connection to Tableau already.

I would go for Hive especially the new release will have an in-memory offering as well for frequently accessed data :)


Dr Mich Talebzadeh



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.



On 8 October 2016 at 20:15, Benjamin Kim <bb...@gmail.com>> wrote:
Mich,

First and foremost, we have visualization servers that run Tableau for external user reports. Second, we have servers that are ad servers and REST endpoints for cookie sync and segmentation data exchange. These will use JDBC directly within the same data-center. When not colocated in the same data-center, they will connected to a located database server using JDBC. Either way, by using JDBC everywhere, it simplifies and unifies the code on the JDBC industry standard.

Does this make sense?

Thanks,
Ben


On Oct 8, 2016, at 11:47 AM, Mich Talebzadeh <mi...@gmail.com>> wrote:

Like any other design what is your presentation layer and end users?

Are they SQL centric users from Tableau background or they may use spark functional programming.

It is best to describe the use case.

HTH

Dr Mich Talebzadeh



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.



On 8 October 2016 at 19:40, Felix Cheung <fe...@hotmail.com>> wrote:
I wouldn't be too surprised Spark SQL - JDBC data source - Phoenix JDBC server - HBASE would work better.

Without naming specifics, there are at least 4 or 5 different implementations of HBASE sources, each at varying level of development and different requirements (HBASE release version, Kerberos support etc)


_____________________________
From: Benjamin Kim <bb...@gmail.com>>
Sent: Saturday, October 8, 2016 11:26 AM
Subject: Re: Spark SQL Thriftserver with HBase
To: Mich Talebzadeh <mi...@gmail.com>>
Cc: <us...@spark.apache.org>>, Felix Cheung <fe...@hotmail.com>>



Mich,

Are you talking about the Phoenix JDBC Server? If so, I forgot about that alternative.

Thanks,
Ben


On Oct 8, 2016, at 11:21 AM, Mich Talebzadeh <mi...@gmail.com>> wrote:

I don't think it will work

you can use phoenix on top of hbase

hbase(main):336:0> scan 'tsco', 'LIMIT' => 1
ROW                                                       COLUMN+CELL
 TSCO-1-Apr-08                                            column=stock_daily:Date, timestamp=1475866783376, value=1-Apr-08
 TSCO-1-Apr-08                                            column=stock_daily:close, timestamp=1475866783376, value=405.25
 TSCO-1-Apr-08                                            column=stock_daily:high, timestamp=1475866783376, value=406.75
 TSCO-1-Apr-08                                            column=stock_daily:low, timestamp=1475866783376, value=379.25
 TSCO-1-Apr-08                                            column=stock_daily:open, timestamp=1475866783376, value=380.00
 TSCO-1-Apr-08                                            column=stock_daily:stock, timestamp=1475866783376, value=TESCO PLC
 TSCO-1-Apr-08                                            column=stock_daily:ticker, timestamp=1475866783376, value=TSCO
 TSCO-1-Apr-08                                            column=stock_daily:volume, timestamp=1475866783376, value=49664486

And the same on Phoenix on top of Hvbase table

0: jdbc:phoenix:thin:url=http://rhes564:8765<http://rhes564:8765/>> select substr(to_char(to_date("Date",'dd-MMM-yy')),1,10) AS TradeDate, "close" AS "Day's close", "high" AS "Day's High", "low" AS "Day's Low", "open" AS "Day's Open", "ticker", "volume", (to_number("low")+to_number("high"))/2 AS "AverageDailyPrice" from "tsco" where to_number("volume") > 0 and "high" != '-' and to_date("Date",'dd-MMM-yy') > to_date('2015-10-06','yyyy-MM-dd') order by  to_date("Date",'dd-MMM-yy') limit 1;
+-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
|  TRADEDATE  | Day's close  | Day's High  | Day's Low  | Day's Open  | ticker  |  volume   | AverageDailyPrice  |
+-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
| 2015-10-07  | 197.00       | 198.05      | 184.84     | 192.20      | TSCO    | 30046994  | 191.445            |


HTH




Dr Mich Talebzadeh



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.



On 8 October 2016 at 19:05, Felix Cheung <fe...@hotmail.com>> wrote:
Great, then I think those packages as Spark data source should allow you to do exactly that (replace org.apache.spark.sql.jdbc with HBASE one)

I do think it will be great to get more examples around this though. Would be great if you could share your experience with this!


_____________________________
From: Benjamin Kim <bb...@gmail.com>>
Sent: Saturday, October 8, 2016 11:00 AM
Subject: Re: Spark SQL Thriftserver with HBase
To: Felix Cheung <fe...@hotmail.com>>
Cc: <us...@spark.apache.org>>


Felix,

My goal is to use Spark SQL JDBC Thriftserver to access HBase tables using just SQL. I have been able to CREATE tables using this statement below in the past:

CREATE TABLE <table-name>
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:postgresql://<hostname>:<port>/dm?user=<username>&password=<password>",
  dbtable "dim.dimension_acamp"
);

After doing this, I can access the PostgreSQL table using Spark SQL JDBC Thriftserver using SQL statements (SELECT, UPDATE, INSERT, etc.). I want to do the same with HBase tables. We tried this using Hive and HiveServer2, but the response times are just too long.

Thanks,
Ben


On Oct 8, 2016, at 10:53 AM, Felix Cheung <fe...@hotmail.com>> wrote:

Ben,

I'm not sure I'm following completely.

Is your goal to use Spark to create or access tables in HBASE? If so the link below and several packages out there support that by having a HBASE data source for Spark. There are some examples on how the Spark code look like in that link as well. On that note, you should also be able to use the HBASE data source from pure SQL (Spark SQL) query as well, which should work in the case with the Spark SQL JDBC Thrift Server (with USING,http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10).


_____________________________
From: Benjamin Kim <bb...@gmail.com>>
Sent: Saturday, October 8, 2016 10:40 AM
Subject: Re: Spark SQL Thriftserver with HBase
To: Felix Cheung <fe...@hotmail.com>>
Cc: <us...@spark.apache.org>>


Felix,

The only alternative way is to create a stored procedure (udf) in database terms that would run Spark scala code underneath. In this way, I can use Spark SQL JDBC Thriftserver to execute it using SQL code passing the key, values I want to UPSERT. I wonder if this is possible since I cannot CREATE a wrapper table on top of a HBase table in Spark SQL?

What do you think? Is this the right approach?

Thanks,
Ben

On Oct 8, 2016, at 10:33 AM, Felix Cheung <fe...@hotmail.com>> wrote:

HBase has released support for Spark
hbase.apache.org/book.html#spark<http://hbase.apache.org/book.html#spark>

And if you search you should find several alternative approaches.





On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" <bb...@gmail.com>> wrote:

Does anyone know if Spark can work with HBase tables using Spark SQL? I know in Hive we are able to create tables on top of an underlying HBase table that can be accessed using MapReduce jobs. Can the same be done using HiveContext or SQLContext? We are trying to setup a way to GET and POST data to and from the HBase table using the Spark SQL JDBC thriftserver from our RESTful API endpoints and/or HTTP web farms. If we can get this to work, then we can load balance the thriftservers. In addition, this will benefit us in giving us a way to abstract the data storage layer away from the presentation layer code. There is a chance that we will swap out the data storage technology in the future. We are currently experimenting with Kudu.

Thanks,
Ben
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org<ma...@spark.apache.org>


Re: Spark SQL Thriftserver with HBase

Posted by Mich Talebzadeh <mi...@gmail.com>.
How about this method of creating DataFrames on HBase tables directly?

I define an RDD for each column in the column family as below. In this case
column trade_info:ticker

//create rdd (assumes conf is an HBaseConfiguration with TableInputFormat.INPUT_TABLE
//set to the source table, and that the implicits needed for toDF are in scope,
//e.g. import spark.implicits._ on Spark 2.x)
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import scala.collection.JavaConverters._

val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])

// keep only the Result, then pull out the row key and the latest cell of price_info:ticker
val rdd1 = hBaseRDD.map(tuple => tuple._2).map(result =>
  (result.getRow, result.getColumn("price_info".getBytes(), "ticker".getBytes()))).map(row => {
  (
    row._1.map(_.toChar).mkString,
    row._2.asScala.reduceLeft {
      (a, b) => if (a.getTimestamp > b.getTimestamp) a else b
    }.getValue.map(_.toChar).mkString
  )
})
case class columns (key: String, ticker: String)
val dfticker = rdd1.toDF.map(p => columns(p(0).toString,p(1).toString))

Note that the end result is a DataFrame with the RowKey -> key and column
-> ticker

I use the same approach to create two other DataFrames, namely dftimecreated
and dfprice for the two other columns.

Note that if I don't need a column, then I do not create a DF for it. So a
DF with each column I use. I am not sure how this compares if I read the
full row through other methods if any.

Anyway all I need to do after creating a DataFrame for each column is to
join them through the RowKey to slice and dice data. Like below.

Get me the latest prices ordered by timecreated and ticker (ticker is stock)

val rs = dfticker.join(dftimecreated, "key").join(dfprice, "key")
  .orderBy('timecreated desc, 'price desc)
  .select('timecreated, 'ticker, 'price.cast("Float").as("Latest price"))
rs.show(10)

+-------------------+------+------------+
|        timecreated|ticker|Latest price|
+-------------------+------+------------+
|2016-10-16T18:44:57|   S16|   97.631966|
|2016-10-16T18:44:57|   S13|    92.11406|
|2016-10-16T18:44:57|   S19|    85.93021|
|2016-10-16T18:44:57|   S09|   85.714645|
|2016-10-16T18:44:57|   S15|    82.38932|
|2016-10-16T18:44:57|   S17|    80.77747|
|2016-10-16T18:44:57|   S06|    79.81854|
|2016-10-16T18:44:57|   S18|    74.10128|
|2016-10-16T18:44:57|   S07|    66.13622|
|2016-10-16T18:44:57|   S20|    60.35727|
+-------------------+------+------------+
only showing top 10 rows

Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 17 October 2016 at 19:53, vincent gromakowski <
vincent.gromakowski@gmail.com> wrote:

> Instead of (or additionally to) saving results somewhere, you just start a
> thriftserver that expose the Spark tables of the SQLContext (or
> SparkSession now). That means you can implement any logic (and maybe use
> structured streaming) to expose your data. Today using the thriftserver
> means reading data from the persistent store every query, so if the data
> modeling doesn't fit the query it can be quite long.  What you generally do
> in a common spark job is to load the data and cache spark table in a
> in-memory columnar table which is quite efficient for any kind of query,
> the counterpart is that the cache isn't updated you have to implement a
> reload mechanism, and this solution isn't available using the thriftserver.
> What I propose is to mix the two world: periodically/delta load data in
> spark table cache and expose it through the thriftserver. But you have to
> implement the loading logic, it can be very simple to very complex
> depending on your needs.
>
>
> 2016-10-17 19:48 GMT+02:00 Benjamin Kim <bb...@gmail.com>:
>
>> Is this technique similar to what Kinesis is offering or what Structured
>> Streaming is going to have eventually?
>>
>> Just curious.
>>
>> Cheers,
>> Ben
>>
>>
>>
>> On Oct 17, 2016, at 10:14 AM, vincent gromakowski <
>> vincent.gromakowski@gmail.com> wrote:
>>
>> I would suggest to code your own Spark thriftserver which seems to be
>> very easy.
>> http://stackoverflow.com/questions/27108863/accessing-spark-
>> sql-rdd-tables-through-the-thrift-server
>>
>> I am starting to test it. The big advantage is that you can implement any
>> logic because it's a spark job and then start a thrift server on temporary
>> table. For example you can query a micro batch rdd from a kafka stream, or
>> pre load some tables and implement a rolling cache to periodically update
>> the spark in memory tables with persistent store...
>> It's not part of the public API and I don't know yet what are the issues
>> doing this but I think Spark community should look at this path: making the
>> thriftserver be instantiable in any spark job.
>>
>> 2016-10-17 18:17 GMT+02:00 Michael Segel <ms...@hotmail.com>:
>>
>>> Guys,
>>> Sorry for jumping in late to the game…
>>>
>>> If memory serves (which may not be a good thing…) :
>>>
>>> You can use HiveServer2 as a connection point to HBase.
>>> While this doesn’t perform well, its probably the cleanest solution.
>>> I’m not keen on Phoenix… wouldn’t recommend it….
>>>
>>>
>>> The issue is that you’re trying to make HBase, a key/value object store,
>>> a Relational Engine… its not.
>>>
>>> There are some considerations which make HBase not ideal for all use
>>> cases and you may find better performance with Parquet files.
>>>
>>> One thing missing is the use of secondary indexing and query
>>> optimizations that you have in RDBMSs and are lacking in HBase / MapRDB /
>>> etc …  so your performance will vary.
>>>
>>> With respect to Tableau… their entire interface in to the big data world
>>> revolves around the JDBC/ODBC interface. So if you don’t have that piece as
>>> part of your solution, you’re DOA w respect to Tableau.
>>>
>>> Have you considered Drill as your JDBC connection point?  (YAAP: Yet
>>> another Apache project)
>>>
>>>
>>> On Oct 9, 2016, at 12:23 PM, Benjamin Kim <bb...@gmail.com> wrote:
>>>
>>> Thanks for all the suggestions. It would seem you guys are right about
>>> the Tableau side of things. The reports don’t need to be real-time, and
>>> they won’t be directly feeding off of the main DMP HBase data. Instead,
>>> it’ll be batched to Parquet or Kudu/Impala or even PostgreSQL.
>>>
>>> I originally thought that we needed two-way data retrieval from the DMP
>>> HBase for ID generation, but after further investigation into the use-case
>>> and architecture, the ID generation needs to happen local to the Ad Servers
>>> where we generate a unique ID and store it in a ID linking table. Even
>>> better, many of the 3rd party services supply this ID. So, data only needs
>>> to flow in one direction. We will use Kafka as the bus for this. No JDBC
>>> required. This is also goes for the REST Endpoints. 3rd party services will
>>> hit ours to update our data with no need to read from our data. And, when
>>> we want to update their data, we will hit theirs to update their data using
>>> a triggered job.
>>>
>>> This al boils down to just integrating with Kafka.
>>>
>>> Once again, thanks for all the help.
>>>
>>> Cheers,
>>> Ben
>>>
>>>
>>> On Oct 9, 2016, at 3:16 AM, Jörn Franke <jo...@gmail.com> wrote:
>>>
>>> please keep also in mind that Tableau Server has the capabilities to
>>> store data in-memory and refresh only when needed the in-memory data. This
>>> means you can import it from any source and let your users work only on the
>>> in-memory data in Tableau Server.
>>>
>>> On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke <jo...@gmail.com>
>>> wrote:
>>>
>>>> Cloudera 5.8 has a very old version of Hive without Tez, but Mich
>>>> provided already a good alternative. However, you should check if it
>>>> contains a recent version of Hbase and Phoenix. That being said, I just
>>>> wonder what is the dataflow, data model and the analysis you plan to do.
>>>> Maybe there are completely different solutions possible. Especially these
>>>> single inserts, upserts etc. should be avoided as much as possible in the
>>>> Big Data (analysis) world with any technology, because they do not perform
>>>> well.
>>>>
>>>> Hive with Llap will provide an in-memory cache for interactive
>>>> analytics. You can put full tables in-memory with Hive using Ignite HDFS
>>>> in-memory solution. All this does only make sense if you do not use MR as
>>>> an engine, the right input format (ORC, parquet) and a recent Hive version.
>>>>
>>>> On 8 Oct 2016, at 21:55, Benjamin Kim <bb...@gmail.com> wrote:
>>>>
>>>> Mich,
>>>>
>>>> Unfortunately, we are moving away from Hive and unifying on Spark using
>>>> CDH 5.8 as our distro. And, the Tableau released a Spark ODBC/JDBC driver
>>>> too. I will either try Phoenix JDBC Server for HBase or push to move faster
>>>> to Kudu with Impala. We will use Impala as the JDBC in-between until the
>>>> Kudu team completes Spark SQL support for JDBC.
>>>>
>>>> Thanks for the advice.
>>>>
>>>> Cheers,
>>>> Ben
>>>>
>>>>
>>>> On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh <mi...@gmail.com>
>>>> wrote:
>>>>
>>>> Sure. But essentially you are looking at batch data for analytics for
>>>> your tableau users so Hive may be a better choice with its rich SQL and
>>>> ODBC.JDBC connection to Tableau already.
>>>>
>>>> I would go for Hive especially the new release will have an in-memory
>>>> offering as well for frequently accessed data :)
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>> On 8 October 2016 at 20:15, Benjamin Kim <bb...@gmail.com> wrote:
>>>>
>>>>> Mich,
>>>>>
>>>>> First and foremost, we have visualization servers that run Tableau for
>>>>> external user reports. Second, we have servers that are ad servers and REST
>>>>> endpoints for cookie sync and segmentation data exchange. These will use
>>>>> JDBC directly within the same data-center. When not colocated in the same
>>>>> data-center, they will connected to a located database server using JDBC.
>>>>> Either way, by using JDBC everywhere, it simplifies and unifies the code on
>>>>> the JDBC industry standard.
>>>>>
>>>>> Does this make sense?
>>>>>
>>>>> Thanks,
>>>>> Ben
>>>>>
>>>>>
>>>>> On Oct 8, 2016, at 11:47 AM, Mich Talebzadeh <
>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>
>>>>> Like any other design what is your presentation layer and end users?
>>>>>
>>>>> Are they SQL centric users from Tableau background or they may use
>>>>> spark functional programming.
>>>>>
>>>>> It is best to describe the use case.
>>>>>
>>>>> HTH
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>> On 8 October 2016 at 19:40, Felix Cheung <fe...@hotmail.com>
>>>>> wrote:
>>>>>
>>>>>> I wouldn't be too surprised Spark SQL - JDBC data source - Phoenix
>>>>>> JDBC server - HBASE would work better.
>>>>>>
>>>>>> Without naming specifics, there are at least 4 or 5 different
>>>>>> implementations of HBASE sources, each at varying level of development and
>>>>>> different requirements (HBASE release version, Kerberos support etc)
>>>>>>
>>>>>>
>>>>>> _____________________________
>>>>>> From: Benjamin Kim <bb...@gmail.com>
>>>>>> Sent: Saturday, October 8, 2016 11:26 AM
>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>> To: Mich Talebzadeh <mi...@gmail.com>
>>>>>> Cc: <us...@spark.apache.org>, Felix Cheung <fe...@hotmail.com>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Mich,
>>>>>>
>>>>>> Are you talking about the Phoenix JDBC Server? If so, I forgot about
>>>>>> that alternative.
>>>>>>
>>>>>> Thanks,
>>>>>> Ben
>>>>>>
>>>>>>
>>>>>> On Oct 8, 2016, at 11:21 AM, Mich Talebzadeh <
>>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>>
>>>>>> I don't think it will work
>>>>>>
>>>>>> you can use phoenix on top of hbase
>>>>>>
>>>>>> hbase(main):336:0> scan 'tsco', 'LIMIT' => 1
>>>>>> ROW                                                       COLUMN+CELL
>>>>>>  TSCO-1-Apr-08
>>>>>> column=stock_daily:Date, timestamp=1475866783376, value=1-Apr-08
>>>>>>  TSCO-1-Apr-08
>>>>>> column=stock_daily:close, timestamp=1475866783376, value=405.25
>>>>>>  TSCO-1-Apr-08
>>>>>> column=stock_daily:high, timestamp=1475866783376, value=406.75
>>>>>>  TSCO-1-Apr-08
>>>>>> column=stock_daily:low, timestamp=1475866783376, value=379.25
>>>>>>  TSCO-1-Apr-08
>>>>>> column=stock_daily:open, timestamp=1475866783376, value=380.00
>>>>>>  TSCO-1-Apr-08
>>>>>> column=stock_daily:stock, timestamp=1475866783376, value=TESCO PLC
>>>>>>  TSCO-1-Apr-08
>>>>>> column=stock_daily:ticker, timestamp=1475866783376, value=TSCO
>>>>>>  TSCO-1-Apr-08
>>>>>> column=stock_daily:volume, timestamp=1475866783376, value=49664486
>>>>>>
>>>>>> And the same on Phoenix on top of the HBase table
>>>>>>
>>>>>> 0: jdbc:phoenix:thin:url=http://rhes564:8765> select
>>>>>> substr(to_char(to_date("Date",'dd-MMM-yy')),1,10) AS TradeDate,
>>>>>> "close" AS "Day's close", "high" AS "Day's High", "low" AS "Day's Low",
>>>>>> "open" AS "Day's Open", "ticker", "volume", (to_number("low")+to_number("high"))/2
>>>>>> AS "AverageDailyPrice" from "tsco" where to_number("volume") > 0 and "high"
>>>>>> != '-' and to_date("Date",'dd-MMM-yy') > to_date('2015-10-06','yyyy-MM-dd')
>>>>>> order by  to_date("Date",'dd-MMM-yy') limit 1;
>>>>>> +-------------+--------------+-------------+------------+---
>>>>>> ----------+---------+-----------+--------------------+
>>>>>> |  TRADEDATE  | Day's close  | Day's High  | Day's Low  | Day's Open
>>>>>> | ticker  |  volume   | AverageDailyPrice  |
>>>>>> +-------------+--------------+-------------+------------+---
>>>>>> ----------+---------+-----------+--------------------+
>>>>>> | 2015-10-07  | 197.00       | 198.05      | 184.84     | 192.20
>>>>>> | TSCO    | 30046994  | 191.445            |
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>>
>>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>>
>>>>>>
>>>>>> http://talebzadehmich.wordpress.com
>>>>>>
>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>> for any loss, damage or destruction of data or any other property which may
>>>>>> arise from relying on this email's technical content is explicitly
>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>> arising from such loss, damage or destruction.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 8 October 2016 at 19:05, Felix Cheung <fe...@hotmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Great, then I think those packages as Spark data source should allow
>>>>>>> you to do exactly that (replace org.apache.spark.sql.jdbc with HBASE one)
>>>>>>>
>>>>>>> I do think it will be great to get more examples around this though.
>>>>>>> Would be great if you could share your experience with this!
>>>>>>>
>>>>>>>
>>>>>>> _____________________________
>>>>>>> From: Benjamin Kim <bb...@gmail.com>
>>>>>>> Sent: Saturday, October 8, 2016 11:00 AM
>>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>>> To: Felix Cheung <fe...@hotmail.com>
>>>>>>> Cc: <us...@spark.apache.org>
>>>>>>>
>>>>>>>
>>>>>>> Felix,
>>>>>>>
>>>>>>> My goal is to use Spark SQL JDBC Thriftserver to access HBase tables
>>>>>>> using just SQL. I have been able to CREATE tables using this statement
>>>>>>> below in the past:
>>>>>>>
>>>>>>> CREATE TABLE <table-name>
>>>>>>> USING org.apache.spark.sql.jdbc
>>>>>>> OPTIONS (
>>>>>>>   url "jdbc:postgresql://<hostname>:<port>/dm?user=<username>&pass
>>>>>>> word=<password>",
>>>>>>>   dbtable "dim.dimension_acamp"
>>>>>>> );
>>>>>>>
>>>>>>>
>>>>>>> After doing this, I can access the PostgreSQL table using Spark SQL
>>>>>>> JDBC Thriftserver using SQL statements (SELECT, UPDATE, INSERT, etc.). I
>>>>>>> want to do the same with HBase tables. We tried this using Hive and
>>>>>>> HiveServer2, but the response times are just too long.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ben
>>>>>>>
>>>>>>>
>>>>>>> On Oct 8, 2016, at 10:53 AM, Felix Cheung <fe...@hotmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Ben,
>>>>>>>
>>>>>>> I'm not sure I'm following completely.
>>>>>>>
>>>>>>> Is your goal to use Spark to create or access tables in HBASE? If so
>>>>>>> the link below and several packages out there support that by having a
>>>>>>> HBASE data source for Spark. There are some examples on how the Spark code
>>>>>>> look like in that link as well. On that note, you should also be able to
>>>>>>> use the HBASE data source from pure SQL (Spark SQL) query as well, which
>>>>>>> should work in the case with the Spark SQL JDBC Thrift Server (with USING,
>>>>>>> http://spark.apache.org/docs/latest/sql-programming-gu
>>>>>>> ide.html#tab_sql_10).
>>>>>>>
>>>>>>>
>>>>>>> _____________________________
>>>>>>> From: Benjamin Kim <bb...@gmail.com>
>>>>>>> Sent: Saturday, October 8, 2016 10:40 AM
>>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>>> To: Felix Cheung <fe...@hotmail.com>
>>>>>>> Cc: <us...@spark.apache.org>
>>>>>>>
>>>>>>>
>>>>>>> Felix,
>>>>>>>
>>>>>>> The only alternative way is to create a stored procedure (udf) in
>>>>>>> database terms that would run Spark scala code underneath. In this way, I
>>>>>>> can use Spark SQL JDBC Thriftserver to execute it using SQL code passing
>>>>>>> the key, values I want to UPSERT. I wonder if this is possible since I
>>>>>>> cannot CREATE a wrapper table on top of a HBase table in Spark SQL?
>>>>>>>
>>>>>>> What do you think? Is this the right approach?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ben
>>>>>>>
>>>>>>> On Oct 8, 2016, at 10:33 AM, Felix Cheung <fe...@hotmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> HBase has released support for Spark
>>>>>>> hbase.apache.org/book.html#spark
>>>>>>>
>>>>>>> And if you search you should find several alternative approaches.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" <
>>>>>>> bbuild11@gmail.com> wrote:
>>>>>>>
>>>>>>> Does anyone know if Spark can work with HBase tables using Spark
>>>>>>> SQL? I know in Hive we are able to create tables on top of an underlying
>>>>>>> HBase table that can be accessed using MapReduce jobs. Can the same be done
>>>>>>> using HiveContext or SQLContext? We are trying to setup a way to GET and
>>>>>>> POST data to and from the HBase table using the Spark SQL JDBC thriftserver
>>>>>>> from our RESTful API endpoints and/or HTTP web farms. If we can get this to
>>>>>>> work, then we can load balance the thriftservers. In addition, this will
>>>>>>> benefit us in giving us a way to abstract the data storage layer away from
>>>>>>> the presentation layer code. There is a chance that we will swap out the
>>>>>>> data storage technology in the future. We are currently experimenting with
>>>>>>> Kudu.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ben
>>>>>>> ------------------------------------------------------------
>>>>>>> ---------
>>>>>>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>

Re: Spark SQL Thriftserver with HBase

Posted by vincent gromakowski <vi...@gmail.com>.
Instead of (or in addition to) saving results somewhere, you just start a
thriftserver that exposes the Spark tables of the SQLContext (or
SparkSession now). That means you can implement any logic (and maybe use
Structured Streaming) to expose your data. Today, using the thriftserver
means reading data from the persistent store on every query, so if the
data model doesn't fit the query it can be quite slow. What you generally
do in a common Spark job is load the data and cache the Spark table in an
in-memory columnar table, which is quite efficient for any kind of query;
the downside is that the cache isn't updated, so you have to implement a
reload mechanism, and that isn't available with the stock thriftserver.
What I propose is to mix the two worlds: periodically/delta load data into
the Spark table cache and expose it through the thriftserver. But you have
to implement the loading logic yourself; it can be very simple or very
complex depending on your needs.
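
As a rough sketch of what I mean (Scala, Spark 2.x; the table name, Parquet
path and refresh interval are just placeholders, and this is untested):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object CachedTableThriftServer {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("custom-thriftserver")
      .enableHiveSupport()
      .getOrCreate()

    // Load from whatever persistent store fits (Parquet here as a placeholder;
    // it could just as well be an HBase connector, JDBC, Kafka micro-batches...),
    // register the result as a temp view and pin it in the in-memory columnar cache.
    def reload(): Unit = {
      val df = spark.read.parquet("/data/dim/dimension_acamp")
      df.createOrReplaceTempView("dimension_acamp")
      spark.catalog.cacheTable("dimension_acamp")
    }

    reload()

    // Expose this session's temp views over JDBC/ODBC (port 10000 by default).
    HiveThriftServer2.startWithContext(spark.sqlContext)

    // Naive rolling refresh; a real job would handle errors and swap atomically.
    while (true) {
      Thread.sleep(15 * 60 * 1000L)
      spark.catalog.uncacheTable("dimension_acamp")
      reload()
    }
  }
}

Clients then connect with the usual Hive JDBC driver (jdbc:hive2://<host>:10000)
and every query against that view hits the cache instead of the persistent store.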


2016-10-17 19:48 GMT+02:00 Benjamin Kim <bb...@gmail.com>:

> Is this technique similar to what Kinesis is offering or what Structured
> Streaming is going to have eventually?
>
> Just curious.
>
> Cheers,
> Ben
>
>
>
> On Oct 17, 2016, at 10:14 AM, vincent gromakowski <
> vincent.gromakowski@gmail.com> wrote:
>
> I would suggest to code your own Spark thriftserver which seems to be very
> easy.
> http://stackoverflow.com/questions/27108863/accessing-
> spark-sql-rdd-tables-through-the-thrift-server
>
> I am starting to test it. The big advantage is that you can implement any
> logic because it's a spark job and then start a thrift server on temporary
> table. For example you can query a micro batch rdd from a kafka stream, or
> pre load some tables and implement a rolling cache to periodically update
> the spark in memory tables with persistent store...
> It's not part of the public API and I don't know yet what are the issues
> doing this but I think Spark community should look at this path: making the
> thriftserver be instantiable in any spark job.
>
> 2016-10-17 18:17 GMT+02:00 Michael Segel <ms...@hotmail.com>:
>
>> Guys,
>> Sorry for jumping in late to the game…
>>
>> If memory serves (which may not be a good thing…) :
>>
>> You can use HiveServer2 as a connection point to HBase.
>> While this doesn’t perform well, its probably the cleanest solution.
>> I’m not keen on Phoenix… wouldn’t recommend it….
>>
>>
>> The issue is that you’re trying to make HBase, a key/value object store,
>> a Relational Engine… its not.
>>
>> There are some considerations which make HBase not ideal for all use
>> cases and you may find better performance with Parquet files.
>>
>> One thing missing is the use of secondary indexing and query
>> optimizations that you have in RDBMSs and are lacking in HBase / MapRDB /
>> etc …  so your performance will vary.
>>
>> With respect to Tableau… their entire interface in to the big data world
>> revolves around the JDBC/ODBC interface. So if you don’t have that piece as
>> part of your solution, you’re DOA w respect to Tableau.
>>
>> Have you considered Drill as your JDBC connection point?  (YAAP: Yet
>> another Apache project)
>>
>>
>> On Oct 9, 2016, at 12:23 PM, Benjamin Kim <bb...@gmail.com> wrote:
>>
>> Thanks for all the suggestions. It would seem you guys are right about
>> the Tableau side of things. The reports don’t need to be real-time, and
>> they won’t be directly feeding off of the main DMP HBase data. Instead,
>> it’ll be batched to Parquet or Kudu/Impala or even PostgreSQL.
>>
>> I originally thought that we needed two-way data retrieval from the DMP
>> HBase for ID generation, but after further investigation into the use-case
>> and architecture, the ID generation needs to happen local to the Ad Servers
>> where we generate a unique ID and store it in an ID linking table. Even
>> better, many of the 3rd party services supply this ID. So, data only needs
>> to flow in one direction. We will use Kafka as the bus for this. No JDBC
>> required. This also goes for the REST Endpoints. 3rd party services will
>> hit ours to update our data with no need to read from our data. And, when
>> we want to update their data, we will hit theirs to update their data using
>> a triggered job.
>>
>> This all boils down to just integrating with Kafka.
>>
>> Once again, thanks for all the help.
>>
>> Cheers,
>> Ben
>>
>>
>> On Oct 9, 2016, at 3:16 AM, Jörn Franke <jo...@gmail.com> wrote:
>>
>> please keep also in mind that Tableau Server has the capabilities to
>> store data in-memory and refresh only when needed the in-memory data. This
>> means you can import it from any source and let your users work only on the
>> in-memory data in Tableau Server.
>>
>> On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke <jo...@gmail.com> wrote:
>>
>>> Cloudera 5.8 has a very old version of Hive without Tez, but Mich
>>> provided already a good alternative. However, you should check if it
>>> contains a recent version of Hbase and Phoenix. That being said, I just
>>> wonder what is the dataflow, data model and the analysis you plan to do.
>>> Maybe there are completely different solutions possible. Especially these
>>> single inserts, upserts etc. should be avoided as much as possible in the
>>> Big Data (analysis) world with any technology, because they do not perform
>>> well.
>>>
>>> Hive with LLAP will provide an in-memory cache for interactive
>>> analytics. You can put full tables in-memory with Hive using the Ignite HDFS
>>> in-memory solution. All this only makes sense if you do not use MR as
>>> an engine, and use the right input format (ORC, Parquet) and a recent Hive version.
>>>
>>> On 8 Oct 2016, at 21:55, Benjamin Kim <bb...@gmail.com> wrote:
>>>
>>> Mich,
>>>
>>> Unfortunately, we are moving away from Hive and unifying on Spark using
>>> CDH 5.8 as our distro. And, Tableau has released a Spark ODBC/JDBC driver
>>> too. I will either try Phoenix JDBC Server for HBase or push to move faster
>>> to Kudu with Impala. We will use Impala as the JDBC in-between until the
>>> Kudu team completes Spark SQL support for JDBC.
>>>
>>> Thanks for the advice.
>>>
>>> Cheers,
>>> Ben
>>>
>>>
>>> On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh <mi...@gmail.com>
>>> wrote:
>>>
>>> Sure. But essentially you are looking at batch data for analytics for
>>> your tableau users so Hive may be a better choice with its rich SQL and
>>> ODBC.JDBC connection to Tableau already.
>>>
>>> I would go for Hive especially the new release will have an in-memory
>>> offering as well for frequently accessed data :)
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 8 October 2016 at 20:15, Benjamin Kim <bb...@gmail.com> wrote:
>>>
>>>> Mich,
>>>>
>>>> First and foremost, we have visualization servers that run Tableau for
>>>> external user reports. Second, we have servers that are ad servers and REST
>>>> endpoints for cookie sync and segmentation data exchange. These will use
>>>> JDBC directly within the same data-center. When not colocated in the same
>>>> data-center, they will connect to a colocated database server using JDBC.
>>>> Either way, using JDBC everywhere simplifies and unifies the code around
>>>> the JDBC industry standard.
>>>>
>>>> Does this make sense?
>>>>
>>>> Thanks,
>>>> Ben
>>>>
>>>>
>>>> On Oct 8, 2016, at 11:47 AM, Mich Talebzadeh <mi...@gmail.com>
>>>> wrote:
>>>>
>>>> Like any other design what is your presentation layer and end users?
>>>>
>>>> Are they SQL centric users from Tableau background or they may use
>>>> spark functional programming.
>>>>
>>>> It is best to describe the use case.
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>> On 8 October 2016 at 19:40, Felix Cheung <fe...@hotmail.com>
>>>> wrote:
>>>>
>>>>> I wouldn't be too surprised Spark SQL - JDBC data source - Phoenix
>>>>> JDBC server - HBASE would work better.
>>>>>
>>>>> Without naming specifics, there are at least 4 or 5 different
>>>>> implementations of HBASE sources, each at varying level of development and
>>>>> different requirements (HBASE release version, Kerberos support etc)
>>>>>
>>>>>
>>>>> _____________________________
>>>>> From: Benjamin Kim <bb...@gmail.com>
>>>>> Sent: Saturday, October 8, 2016 11:26 AM
>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>> To: Mich Talebzadeh <mi...@gmail.com>
>>>>> Cc: <us...@spark.apache.org>, Felix Cheung <fe...@hotmail.com>
>>>>>
>>>>>
>>>>>
>>>>> Mich,
>>>>>
>>>>> Are you talking about the Phoenix JDBC Server? If so, I forgot about
>>>>> that alternative.
>>>>>
>>>>> Thanks,
>>>>> Ben
>>>>>
>>>>>
>>>>> On Oct 8, 2016, at 11:21 AM, Mich Talebzadeh <
>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>
>>>>> I don't think it will work
>>>>>
>>>>> you can use phoenix on top of hbase
>>>>>
>>>>> hbase(main):336:0> scan 'tsco', 'LIMIT' => 1
>>>>> ROW                                                       COLUMN+CELL
>>>>>  TSCO-1-Apr-08
>>>>> column=stock_daily:Date, timestamp=1475866783376, value=1-Apr-08
>>>>>  TSCO-1-Apr-08
>>>>> column=stock_daily:close, timestamp=1475866783376, value=405.25
>>>>>  TSCO-1-Apr-08
>>>>> column=stock_daily:high, timestamp=1475866783376, value=406.75
>>>>>  TSCO-1-Apr-08
>>>>> column=stock_daily:low, timestamp=1475866783376, value=379.25
>>>>>  TSCO-1-Apr-08
>>>>> column=stock_daily:open, timestamp=1475866783376, value=380.00
>>>>>  TSCO-1-Apr-08
>>>>> column=stock_daily:stock, timestamp=1475866783376, value=TESCO PLC
>>>>>  TSCO-1-Apr-08
>>>>> column=stock_daily:ticker, timestamp=1475866783376, value=TSCO
>>>>>  TSCO-1-Apr-08
>>>>> column=stock_daily:volume, timestamp=1475866783376, value=49664486
>>>>>
>>>>> And the same on Phoenix on top of the HBase table
>>>>>
>>>>> 0: jdbc:phoenix:thin:url=http://rhes564:8765> select
>>>>> substr(to_char(to_date("Date",'dd-MMM-yy')),1,10) AS TradeDate,
>>>>> "close" AS "Day's close", "high" AS "Day's High", "low" AS "Day's Low",
>>>>> "open" AS "Day's Open", "ticker", "volume", (to_number("low")+to_number("high"))/2
>>>>> AS "AverageDailyPrice" from "tsco" where to_number("volume") > 0 and "high"
>>>>> != '-' and to_date("Date",'dd-MMM-yy') > to_date('2015-10-06','yyyy-MM-dd')
>>>>> order by  to_date("Date",'dd-MMM-yy') limit 1;
>>>>> +-------------+--------------+-------------+------------+---
>>>>> ----------+---------+-----------+--------------------+
>>>>> |  TRADEDATE  | Day's close  | Day's High  | Day's Low  | Day's Open
>>>>> | ticker  |  volume   | AverageDailyPrice  |
>>>>> +-------------+--------------+-------------+------------+---
>>>>> ----------+---------+-----------+--------------------+
>>>>> | 2015-10-07  | 197.00       | 198.05      | 184.84     | 192.20
>>>>> | TSCO    | 30046994  | 191.445            |
>>>>>
>>>>> HTH
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>> On 8 October 2016 at 19:05, Felix Cheung <fe...@hotmail.com>
>>>>> wrote:
>>>>>
>>>>>> Great, then I think those packages as Spark data source should allow
>>>>>> you to do exactly that (replace org.apache.spark.sql.jdbc with HBASE one)
>>>>>>
>>>>>> I do think it will be great to get more examples around this though.
>>>>>> Would be great if you could share your experience with this!
>>>>>>
>>>>>>
>>>>>> _____________________________
>>>>>> From: Benjamin Kim <bb...@gmail.com>
>>>>>> Sent: Saturday, October 8, 2016 11:00 AM
>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>> To: Felix Cheung <fe...@hotmail.com>
>>>>>> Cc: <us...@spark.apache.org>
>>>>>>
>>>>>>
>>>>>> Felix,
>>>>>>
>>>>>> My goal is to use Spark SQL JDBC Thriftserver to access HBase tables
>>>>>> using just SQL. I have been able to CREATE tables using this statement
>>>>>> below in the past:
>>>>>>
>>>>>> CREATE TABLE <table-name>
>>>>>> USING org.apache.spark.sql.jdbc
>>>>>> OPTIONS (
>>>>>>   url "jdbc:postgresql://<hostname>:<port>/dm?user=<username>&pass
>>>>>> word=<password>",
>>>>>>   dbtable "dim.dimension_acamp"
>>>>>> );
>>>>>>
>>>>>>
>>>>>> After doing this, I can access the PostgreSQL table using Spark SQL
>>>>>> JDBC Thriftserver using SQL statements (SELECT, UPDATE, INSERT, etc.). I
>>>>>> want to do the same with HBase tables. We tried this using Hive and
>>>>>> HiveServer2, but the response times are just too long.
>>>>>>
>>>>>> Thanks,
>>>>>> Ben
>>>>>>
>>>>>>
>>>>>> On Oct 8, 2016, at 10:53 AM, Felix Cheung <fe...@hotmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Ben,
>>>>>>
>>>>>> I'm not sure I'm following completely.
>>>>>>
>>>>>> Is your goal to use Spark to create or access tables in HBASE? If so
>>>>>> the link below and several packages out there support that by having a
>>>>>> HBASE data source for Spark. There are some examples on how the Spark code
>>>>>> look like in that link as well. On that note, you should also be able to
>>>>>> use the HBASE data source from pure SQL (Spark SQL) query as well, which
>>>>>> should work in the case with the Spark SQL JDBC Thrift Server (with USING,
>>>>>> http://spark.apache.org/docs/latest/sql-programming-gu
>>>>>> ide.html#tab_sql_10).
>>>>>>
>>>>>>
>>>>>> _____________________________
>>>>>> From: Benjamin Kim <bb...@gmail.com>
>>>>>> Sent: Saturday, October 8, 2016 10:40 AM
>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>> To: Felix Cheung <fe...@hotmail.com>
>>>>>> Cc: <us...@spark.apache.org>
>>>>>>
>>>>>>
>>>>>> Felix,
>>>>>>
>>>>>> The only alternative way is to create a stored procedure (udf) in
>>>>>> database terms that would run Spark scala code underneath. In this way, I
>>>>>> can use Spark SQL JDBC Thriftserver to execute it using SQL code passing
>>>>>> the key, values I want to UPSERT. I wonder if this is possible since I
>>>>>> cannot CREATE a wrapper table on top of a HBase table in Spark SQL?
>>>>>>
>>>>>> What do you think? Is this the right approach?
>>>>>>
>>>>>> Thanks,
>>>>>> Ben
>>>>>>
>>>>>> On Oct 8, 2016, at 10:33 AM, Felix Cheung <fe...@hotmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> HBase has released support for Spark
>>>>>> hbase.apache.org/book.html#spark
>>>>>>
>>>>>> And if you search you should find several alternative approaches.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" <
>>>>>> bbuild11@gmail.com> wrote:
>>>>>>
>>>>>> Does anyone know if Spark can work with HBase tables using Spark SQL?
>>>>>> I know in Hive we are able to create tables on top of an underlying HBase
>>>>>> table that can be accessed using MapReduce jobs. Can the same be done using
>>>>>> HiveContext or SQLContext? We are trying to setup a way to GET and POST
>>>>>> data to and from the HBase table using the Spark SQL JDBC thriftserver from
>>>>>> our RESTful API endpoints and/or HTTP web farms. If we can get this to
>>>>>> work, then we can load balance the thriftservers. In addition, this will
>>>>>> benefit us in giving us a way to abstract the data storage layer away from
>>>>>> the presentation layer code. There is a chance that we will swap out the
>>>>>> data storage technology in the future. We are currently experimenting with
>>>>>> Kudu.
>>>>>>
>>>>>> Thanks,
>>>>>> Ben
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>
>

Re: Spark SQL Thriftserver with HBase

Posted by Benjamin Kim <bb...@gmail.com>.
Is this technique similar to what Kinesis is offering or what Structured Streaming is going to have eventually?

Just curious.

Cheers,
Ben

 
> On Oct 17, 2016, at 10:14 AM, vincent gromakowski <vi...@gmail.com> wrote:
> 
> I would suggest to code your own Spark thriftserver which seems to be very easy.
> http://stackoverflow.com/questions/27108863/accessing-spark-sql-rdd-tables-through-the-thrift-server <http://stackoverflow.com/questions/27108863/accessing-spark-sql-rdd-tables-through-the-thrift-server>
> 
> I am starting to test it. The big advantage is that you can implement any logic because it's a spark job and then start a thrift server on temporary table. For example you can query a micro batch rdd from a kafka stream, or pre load some tables and implement a rolling cache to periodically update the spark in memory tables with persistent store...
> It's not part of the public API and I don't know yet what are the issues doing this but I think Spark community should look at this path: making the thriftserver be instantiable in any spark job.
> 
> 2016-10-17 18:17 GMT+02:00 Michael Segel <msegel_hadoop@hotmail.com <ma...@hotmail.com>>:
> Guys, 
> Sorry for jumping in late to the game… 
> 
> If memory serves (which may not be a good thing…) :
> 
> You can use HiveServer2 as a connection point to HBase.  
> While this doesn’t perform well, its probably the cleanest solution. 
> I’m not keen on Phoenix… wouldn’t recommend it…. 
> 
> 
> The issue is that you’re trying to make HBase, a key/value object store, a Relational Engine… its not. 
> 
> There are some considerations which make HBase not ideal for all use cases and you may find better performance with Parquet files. 
> 
> One thing missing is the use of secondary indexing and query optimizations that you have in RDBMSs and are lacking in HBase / MapRDB / etc …  so your performance will vary. 
> 
> With respect to Tableau… their entire interface in to the big data world revolves around the JDBC/ODBC interface. So if you don’t have that piece as part of your solution, you’re DOA w respect to Tableau. 
> 
> Have you considered Drill as your JDBC connection point?  (YAAP: Yet another Apache project) 
> 
> 
>> On Oct 9, 2016, at 12:23 PM, Benjamin Kim <bbuild11@gmail.com <ma...@gmail.com>> wrote:
>> 
>> Thanks for all the suggestions. It would seem you guys are right about the Tableau side of things. The reports don’t need to be real-time, and they won’t be directly feeding off of the main DMP HBase data. Instead, it’ll be batched to Parquet or Kudu/Impala or even PostgreSQL.
>> 
>> I originally thought that we needed two-way data retrieval from the DMP HBase for ID generation, but after further investigation into the use-case and architecture, the ID generation needs to happen local to the Ad Servers where we generate a unique ID and store it in an ID linking table. Even better, many of the 3rd party services supply this ID. So, data only needs to flow in one direction. We will use Kafka as the bus for this. No JDBC required. This also goes for the REST Endpoints. 3rd party services will hit ours to update our data with no need to read from our data. And, when we want to update their data, we will hit theirs to update their data using a triggered job.
>> 
>> This all boils down to just integrating with Kafka.
>> 
>> Once again, thanks for all the help.
>> 
>> Cheers,
>> Ben
>> 
>> 
>>> On Oct 9, 2016, at 3:16 AM, Jörn Franke <jornfranke@gmail.com <ma...@gmail.com>> wrote:
>>> 
>>> please keep also in mind that Tableau Server has the capabilities to store data in-memory and refresh only when needed the in-memory data. This means you can import it from any source and let your users work only on the in-memory data in Tableau Server.
>>> 
>>> On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke <jornfranke@gmail.com <ma...@gmail.com>> wrote:
>>> Cloudera 5.8 has a very old version of Hive without Tez, but Mich provided already a good alternative. However, you should check if it contains a recent version of Hbase and Phoenix. That being said, I just wonder what is the dataflow, data model and the analysis you plan to do. Maybe there are completely different solutions possible. Especially these single inserts, upserts etc. should be avoided as much as possible in the Big Data (analysis) world with any technology, because they do not perform well. 
>>> 
>>> Hive with LLAP will provide an in-memory cache for interactive analytics. You can put full tables in-memory with Hive using the Ignite HDFS in-memory solution. All this only makes sense if you do not use MR as an engine, and use the right input format (ORC, Parquet) and a recent Hive version.
>>> 
>>> On 8 Oct 2016, at 21:55, Benjamin Kim <bbuild11@gmail.com <ma...@gmail.com>> wrote:
>>> 
>>>> Mich,
>>>> 
>>>> Unfortunately, we are moving away from Hive and unifying on Spark using CDH 5.8 as our distro. And, Tableau has released a Spark ODBC/JDBC driver too. I will either try Phoenix JDBC Server for HBase or push to move faster to Kudu with Impala. We will use Impala as the JDBC in-between until the Kudu team completes Spark SQL support for JDBC.
>>>> 
>>>> Thanks for the advice.
>>>> 
>>>> Cheers,
>>>> Ben
>>>> 
>>>> 
>>>>> On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>>>>> 
>>>>> Sure. But essentially you are looking at batch data for analytics for your tableau users so Hive may be a better choice with its rich SQL and ODBC.JDBC connection to Tableau already.
>>>>> 
>>>>> I would go for Hive especially the new release will have an in-memory offering as well for frequently accessed data :)
>>>>> 
>>>>> 
>>>>> Dr Mich Talebzadeh
>>>>>  
>>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>>>>  
>>>>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>>>> 
>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>>>>  
>>>>> 
>>>>> On 8 October 2016 at 20:15, Benjamin Kim <bbuild11@gmail.com <ma...@gmail.com>> wrote:
>>>>> Mich,
>>>>> 
>>>>> First and foremost, we have visualization servers that run Tableau for external user reports. Second, we have servers that are ad servers and REST endpoints for cookie sync and segmentation data exchange. These will use JDBC directly within the same data-center. When not colocated in the same data-center, they will connect to a colocated database server using JDBC. Either way, using JDBC everywhere simplifies and unifies the code around the JDBC industry standard.
>>>>> 
>>>>> Does this make sense?
>>>>> 
>>>>> Thanks,
>>>>> Ben
>>>>> 
>>>>> 
>>>>>> On Oct 8, 2016, at 11:47 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>>>>>> 
>>>>>> Like any other design what is your presentation layer and end users?
>>>>>> 
>>>>>> Are they SQL centric users from Tableau background or they may use spark functional programming.
>>>>>> 
>>>>>> It is best to describe the use case.
>>>>>> 
>>>>>> HTH
>>>>>> 
>>>>>> Dr Mich Talebzadeh
>>>>>>  
>>>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>>>>>  
>>>>>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>>>>> 
>>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>>>>>  
>>>>>> 
>>>>>> On 8 October 2016 at 19:40, Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>> wrote:
>>>>>> I wouldn't be too surprised Spark SQL - JDBC data source - Phoenix JDBC server - HBASE would work better.
>>>>>> 
>>>>>> Without naming specifics, there are at least 4 or 5 different implementations of HBASE sources, each at varying level of development and different requirements (HBASE release version, Kerberos support etc)
>>>>>> 
>>>>>> 
>>>>>> _____________________________
>>>>>> From: Benjamin Kim <bbuild11@gmail.com <ma...@gmail.com>>
>>>>>> Sent: Saturday, October 8, 2016 11:26 AM
>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>> To: Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>>
>>>>>> Cc: <user@spark.apache.org <ma...@spark.apache.org>>, Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>>
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Mich,
>>>>>> 
>>>>>> Are you talking about the Phoenix JDBC Server? If so, I forgot about that alternative.
>>>>>> 
>>>>>> Thanks,
>>>>>> Ben
>>>>>> 
>>>>>> 
>>>>>> On Oct 8, 2016, at 11:21 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>>>>>> 
>>>>>> I don't think it will work
>>>>>> 
>>>>>> you can use phoenix on top of hbase
>>>>>> 
>>>>>> hbase(main):336:0> scan 'tsco', 'LIMIT' => 1
>>>>>> ROW                                                       COLUMN+CELL
>>>>>>  TSCO-1-Apr-08                                            column=stock_daily:Date, timestamp=1475866783376, value=1-Apr-08
>>>>>>  TSCO-1-Apr-08                                            column=stock_daily:close, timestamp=1475866783376, value=405.25
>>>>>>  TSCO-1-Apr-08                                            column=stock_daily:high, timestamp=1475866783376, value=406.75
>>>>>>  TSCO-1-Apr-08                                            column=stock_daily:low, timestamp=1475866783376, value=379.25
>>>>>>  TSCO-1-Apr-08                                            column=stock_daily:open, timestamp=1475866783376, value=380.00
>>>>>>  TSCO-1-Apr-08                                            column=stock_daily:stock, timestamp=1475866783376, value=TESCO PLC
>>>>>>  TSCO-1-Apr-08                                            column=stock_daily:ticker, timestamp=1475866783376, value=TSCO
>>>>>>  TSCO-1-Apr-08                                            column=stock_daily:volume, timestamp=1475866783376, value=49664486
>>>>>> 
>>>>>> And the same on Phoenix on top of the HBase table
>>>>>> 
>>>>>> 0: jdbc:phoenix:thin:url=http://rhes564:8765 <http://rhes564:8765/>> select substr(to_char(to_date("Date",'dd-MMM-yy')),1,10) AS TradeDate, "close" AS "Day's close", "high" AS "Day's High", "low" AS "Day's Low", "open" AS "Day's Open", "ticker", "volume", (to_number("low")+to_number("high"))/2 AS "AverageDailyPrice" from "tsco" where to_number("volume") > 0 and "high" != '-' and to_date("Date",'dd-MMM-yy') > to_date('2015-10-06','yyyy-MM-dd') order by  to_date("Date",'dd-MMM-yy') limit 1;
>>>>>> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
>>>>>> |  TRADEDATE  | Day's close  | Day's High  | Day's Low  | Day's Open  | ticker  |  volume   | AverageDailyPrice  |
>>>>>> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
>>>>>> | 2015-10-07  | 197.00       | 198.05      | 184.84     | 192.20      | TSCO    | 30046994  | 191.445            |
>>>>>> 
>>>>>> HTH
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Dr Mich Talebzadeh
>>>>>>  
>>>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>>>>>  
>>>>>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>>>>> 
>>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>>>>>  
>>>>>> 
>>>>>> On 8 October 2016 at 19:05, Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>> wrote:
>>>>>> Great, then I think those packages as Spark data source should allow you to do exactly that (replace org.apache.spark.sql.jdbc with HBASE one)
>>>>>> 
>>>>>> I do think it will be great to get more examples around this though. Would be great if you could share your experience with this!
>>>>>> 
>>>>>> 
>>>>>> _____________________________
>>>>>> From: Benjamin Kim <bbuild11@gmail.com <ma...@gmail.com>>
>>>>>> Sent: Saturday, October 8, 2016 11:00 AM
>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>> To: Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>>
>>>>>> Cc: <user@spark.apache.org <ma...@spark.apache.org>>
>>>>>> 
>>>>>> 
>>>>>> Felix,
>>>>>> 
>>>>>> My goal is to use Spark SQL JDBC Thriftserver to access HBase tables using just SQL. I have been able to CREATE tables using this statement below in the past:
>>>>>> 
>>>>>> CREATE TABLE <table-name>
>>>>>> USING org.apache.spark.sql.jdbc
>>>>>> OPTIONS (
>>>>>>   url "jdbc:postgresql://<hostname>:<port>/dm?user=<username>&password=<password>",
>>>>>>   dbtable "dim.dimension_acamp"
>>>>>> );
>>>>>> 
>>>>>> After doing this, I can access the PostgreSQL table using Spark SQL JDBC Thriftserver using SQL statements (SELECT, UPDATE, INSERT, etc.). I want to do the same with HBase tables. We tried this using Hive and HiveServer2, but the response times are just too long.
>>>>>> 
>>>>>> Thanks,
>>>>>> Ben
>>>>>> 
>>>>>> 
>>>>>> On Oct 8, 2016, at 10:53 AM, Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>> wrote:
>>>>>> 
>>>>>> Ben,
>>>>>> 
>>>>>> I'm not sure I'm following completely.
>>>>>> 
>>>>>> Is your goal to use Spark to create or access tables in HBASE? If so the link below and several packages out there support that by having a HBASE data source for Spark. There are some examples on how the Spark code look like in that link as well. On that note, you should also be able to use the HBASE data source from pure SQL (Spark SQL) query as well, which should work in the case with the Spark SQL JDBC Thrift Server (with USING,http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10 <http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10>).
>>>>>> 
>>>>>> 
>>>>>> _____________________________
>>>>>> From: Benjamin Kim <bbuild11@gmail.com <ma...@gmail.com>>
>>>>>> Sent: Saturday, October 8, 2016 10:40 AM
>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>> To: Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>>
>>>>>> Cc: <user@spark.apache.org <ma...@spark.apache.org>>
>>>>>> 
>>>>>> 
>>>>>> Felix,
>>>>>> 
>>>>>> The only alternative way is to create a stored procedure (udf) in database terms that would run Spark scala code underneath. In this way, I can use Spark SQL JDBC Thriftserver to execute it using SQL code passing the key, values I want to UPSERT. I wonder if this is possible since I cannot CREATE a wrapper table on top of a HBase table in Spark SQL?
>>>>>> 
>>>>>> What do you think? Is this the right approach?
>>>>>> 
>>>>>> Thanks,
>>>>>> Ben
>>>>>> 
>>>>>> On Oct 8, 2016, at 10:33 AM, Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>> wrote:
>>>>>> 
>>>>>> HBase has released support for Spark
>>>>>> hbase.apache.org/book.html#spark <http://hbase.apache.org/book.html#spark>
>>>>>> 
>>>>>> And if you search you should find several alternative approaches.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" <bbuild11@gmail.com <ma...@gmail.com>> wrote:
>>>>>> 
>>>>>> Does anyone know if Spark can work with HBase tables using Spark SQL? I know in Hive we are able to create tables on top of an underlying HBase table that can be accessed using MapReduce jobs. Can the same be done using HiveContext or SQLContext? We are trying to setup a way to GET and POST data to and from the HBase table using the Spark SQL JDBC thriftserver from our RESTful API endpoints and/or HTTP web farms. If we can get this to work, then we can load balance the thriftservers. In addition, this will benefit us in giving us a way to abstract the data storage layer away from the presentation layer code. There is a chance that we will swap out the data storage technology in the future. We are currently experimenting with Kudu.
>>>>>> 
>>>>>> Thanks,
>>>>>> Ben
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org <ma...@spark.apache.org>
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 
> 


Re: Spark SQL Thriftserver with HBase

Posted by vincent gromakowski <vi...@gmail.com>.
I would suggest coding your own Spark thriftserver, which seems to be very
easy.
http://stackoverflow.com/questions/27108863/accessing-spark-sql-rdd-tables-through-the-thrift-server

I am starting to test it. The big advantage is that because it's a Spark
job you can implement any logic and then start a thrift server on the
temporary tables. For example you can query a micro-batch RDD from a Kafka
stream, or pre-load some tables and implement a rolling cache to
periodically update the Spark in-memory tables from the persistent store...
It's not part of the public API and I don't know yet what issues doing
this raises, but I think the Spark community should look at this path:
making the thriftserver instantiable in any Spark job.
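
The core of the approach boils down to roughly this (a sketch with the
Spark 2.x API; the names are placeholders, not taken from the linked post):

import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

// df is any DataFrame your job has built (from Kafka, HBase, Parquet, ...)
df.createOrReplaceTempView("my_cached_table")
spark.catalog.cacheTable("my_cached_table")        // optional: keep it in memory
HiveThriftServer2.startWithContext(spark.sqlContext)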

2016-10-17 18:17 GMT+02:00 Michael Segel <ms...@hotmail.com>:

> Guys,
> Sorry for jumping in late to the game…
>
> If memory serves (which may not be a good thing…) :
>
> You can use HiveServer2 as a connection point to HBase.
> While this doesn’t perform well, its probably the cleanest solution.
> I’m not keen on Phoenix… wouldn’t recommend it….
>
>
> The issue is that you’re trying to make HBase, a key/value object store, a
> Relational Engine… its not.
>
> There are some considerations which make HBase not ideal for all use cases
> and you may find better performance with Parquet files.
>
> One thing missing is the use of secondary indexing and query optimizations
> that you have in RDBMSs and are lacking in HBase / MapRDB / etc …  so your
> performance will vary.
>
> With respect to Tableau… their entire interface in to the big data world
> revolves around the JDBC/ODBC interface. So if you don’t have that piece as
> part of your solution, you’re DOA w respect to Tableau.
>
> Have you considered Drill as your JDBC connection point?  (YAAP: Yet
> another Apache project)
>
>
> On Oct 9, 2016, at 12:23 PM, Benjamin Kim <bb...@gmail.com> wrote:
>
> Thanks for all the suggestions. It would seem you guys are right about the
> Tableau side of things. The reports don’t need to be real-time, and they
> won’t be directly feeding off of the main DMP HBase data. Instead, it’ll be
> batched to Parquet or Kudu/Impala or even PostgreSQL.
>
> I originally thought that we needed two-way data retrieval from the DMP
> HBase for ID generation, but after further investigation into the use-case
> and architecture, the ID generation needs to happen local to the Ad Servers
> where we generate a unique ID and store it in an ID linking table. Even
> better, many of the 3rd party services supply this ID. So, data only needs
> to flow in one direction. We will use Kafka as the bus for this. No JDBC
> required. This also goes for the REST Endpoints. 3rd party services will
> hit ours to update our data with no need to read from our data. And, when
> we want to update their data, we will hit theirs to update their data using
> a triggered job.
>
> This all boils down to just integrating with Kafka.
>
> Once again, thanks for all the help.
>
> Cheers,
> Ben
>
>
> On Oct 9, 2016, at 3:16 AM, Jörn Franke <jo...@gmail.com> wrote:
>
> please keep also in mind that Tableau Server has the capabilities to store
> data in-memory and refresh only when needed the in-memory data. This means
> you can import it from any source and let your users work only on the
> in-memory data in Tableau Server.
>
> On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke <jo...@gmail.com> wrote:
>
>> Cloudera 5.8 has a very old version of Hive without Tez, but Mich
>> provided already a good alternative. However, you should check if it
>> contains a recent version of Hbase and Phoenix. That being said, I just
>> wonder what is the dataflow, data model and the analysis you plan to do.
>> Maybe there are completely different solutions possible. Especially these
>> single inserts, upserts etc. should be avoided as much as possible in the
>> Big Data (analysis) world with any technology, because they do not perform
>> well.
>>
>> Hive with LLAP will provide an in-memory cache for interactive analytics.
>> You can put full tables in-memory with Hive using the Ignite HDFS in-memory
>> solution. All this only makes sense if you do not use MR as an engine,
>> and use the right input format (ORC, Parquet) and a recent Hive version.
>>
>> On 8 Oct 2016, at 21:55, Benjamin Kim <bb...@gmail.com> wrote:
>>
>> Mich,
>>
>> Unfortunately, we are moving away from Hive and unifying on Spark using
>> CDH 5.8 as our distro. And, Tableau has released a Spark ODBC/JDBC driver
>> too. I will either try Phoenix JDBC Server for HBase or push to move faster
>> to Kudu with Impala. We will use Impala as the JDBC in-between until the
>> Kudu team completes Spark SQL support for JDBC.
>>
>> Thanks for the advice.
>>
>> Cheers,
>> Ben
>>
>>
>> On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh <mi...@gmail.com>
>> wrote:
>>
>> Sure. But essentially you are looking at batch data for analytics for
>> your tableau users so Hive may be a better choice with its rich SQL and
>> ODBC.JDBC connection to Tableau already.
>>
>> I would go for Hive especially the new release will have an in-memory
>> offering as well for frequently accessed data :)
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 8 October 2016 at 20:15, Benjamin Kim <bb...@gmail.com> wrote:
>>
>>> Mich,
>>>
>>> First and foremost, we have visualization servers that run Tableau for
>>> external user reports. Second, we have servers that are ad servers and REST
>>> endpoints for cookie sync and segmentation data exchange. These will use
>>> JDBC directly within the same data-center. When not colocated in the same
>>> data-center, they will connect to a colocated database server using JDBC.
>>> Either way, using JDBC everywhere simplifies and unifies the code around
>>> the JDBC industry standard.
>>>
>>> Does this make sense?
>>>
>>> Thanks,
>>> Ben
>>>
>>>
>>> On Oct 8, 2016, at 11:47 AM, Mich Talebzadeh <mi...@gmail.com>
>>> wrote:
>>>
>>> Like any other design what is your presentation layer and end users?
>>>
>>> Are they SQL centric users from Tableau background or they may use spark
>>> functional programming.
>>>
>>> It is best to describe the use case.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 8 October 2016 at 19:40, Felix Cheung <fe...@hotmail.com>
>>> wrote:
>>>
>>>> I wouldn't be too surprised Spark SQL - JDBC data source - Phoenix JDBC
>>>> server - HBASE would work better.
>>>>
>>>> Without naming specifics, there are at least 4 or 5 different
>>>> implementations of HBASE sources, each at varying level of development and
>>>> different requirements (HBASE release version, Kerberos support etc)
>>>>
>>>>
>>>> _____________________________
>>>> From: Benjamin Kim <bb...@gmail.com>
>>>> Sent: Saturday, October 8, 2016 11:26 AM
>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>> To: Mich Talebzadeh <mi...@gmail.com>
>>>> Cc: <us...@spark.apache.org>, Felix Cheung <fe...@hotmail.com>
>>>>
>>>>
>>>>
>>>> Mich,
>>>>
>>>> Are you talking about the Phoenix JDBC Server? If so, I forgot about
>>>> that alternative.
>>>>
>>>> Thanks,
>>>> Ben
>>>>
>>>>
>>>> On Oct 8, 2016, at 11:21 AM, Mich Talebzadeh <mi...@gmail.com>
>>>> wrote:
>>>>
>>>> I don't think it will work
>>>>
>>>> you can use phoenix on top of hbase
>>>>
>>>> hbase(main):336:0> scan 'tsco', 'LIMIT' => 1
>>>> ROW                                                       COLUMN+CELL
>>>>  TSCO-1-Apr-08
>>>> column=stock_daily:Date, timestamp=1475866783376, value=1-Apr-08
>>>>  TSCO-1-Apr-08
>>>> column=stock_daily:close, timestamp=1475866783376, value=405.25
>>>>  TSCO-1-Apr-08
>>>> column=stock_daily:high, timestamp=1475866783376, value=406.75
>>>>  TSCO-1-Apr-08
>>>> column=stock_daily:low, timestamp=1475866783376, value=379.25
>>>>  TSCO-1-Apr-08
>>>> column=stock_daily:open, timestamp=1475866783376, value=380.00
>>>>  TSCO-1-Apr-08
>>>> column=stock_daily:stock, timestamp=1475866783376, value=TESCO PLC
>>>>  TSCO-1-Apr-08
>>>> column=stock_daily:ticker, timestamp=1475866783376, value=TSCO
>>>>  TSCO-1-Apr-08
>>>> column=stock_daily:volume, timestamp=1475866783376, value=49664486
>>>>
>>>> And the same on Phoenix on top of the HBase table
>>>>
>>>> 0: jdbc:phoenix:thin:url=http://rhes564:8765> select
>>>> substr(to_char(to_date("Date",'dd-MMM-yy')),1,10) AS TradeDate,
>>>> "close" AS "Day's close", "high" AS "Day's High", "low" AS "Day's Low",
>>>> "open" AS "Day's Open", "ticker", "volume", (to_number("low")+to_number("high"))/2
>>>> AS "AverageDailyPrice" from "tsco" where to_number("volume") > 0 and "high"
>>>> != '-' and to_date("Date",'dd-MMM-yy') > to_date('2015-10-06','yyyy-MM-dd')
>>>> order by  to_date("Date",'dd-MMM-yy') limit 1;
>>>> +-------------+--------------+-------------+------------+---
>>>> ----------+---------+-----------+--------------------+
>>>> |  TRADEDATE  | Day's close  | Day's High  | Day's Low  | Day's Open  |
>>>> ticker  |  volume   | AverageDailyPrice  |
>>>> +-------------+--------------+-------------+------------+---
>>>> ----------+---------+-----------+--------------------+
>>>> | 2015-10-07  | 197.00       | 198.05      | 184.84     | 192.20      |
>>>> TSCO    | 30046994  | 191.445            |
>>>>
>>>> HTH
>>>>
>>>>
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>> On 8 October 2016 at 19:05, Felix Cheung <fe...@hotmail.com>
>>>> wrote:
>>>>
>>>>> Great, then I think those packages as Spark data source should allow
>>>>> you to do exactly that (replace org.apache.spark.sql.jdbc with HBASE one)
>>>>>
>>>>> I do think it will be great to get more examples around this though.
>>>>> Would be great if you could share your experience with this!
>>>>>
>>>>>
>>>>> _____________________________
>>>>> From: Benjamin Kim <bb...@gmail.com>
>>>>> Sent: Saturday, October 8, 2016 11:00 AM
>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>> To: Felix Cheung <fe...@hotmail.com>
>>>>> Cc: <us...@spark.apache.org>
>>>>>
>>>>>
>>>>> Felix,
>>>>>
>>>>> My goal is to use Spark SQL JDBC Thriftserver to access HBase tables
>>>>> using just SQL. I have been able to CREATE tables using this statement
>>>>> below in the past:
>>>>>
>>>>> CREATE TABLE <table-name>
>>>>> USING org.apache.spark.sql.jdbc
>>>>> OPTIONS (
>>>>>   url "jdbc:postgresql://<hostname>:<port>/dm?user=<username>&pass
>>>>> word=<password>",
>>>>>   dbtable "dim.dimension_acamp"
>>>>> );
>>>>>
>>>>>
>>>>> After doing this, I can access the PostgreSQL table using Spark SQL
>>>>> JDBC Thriftserver using SQL statements (SELECT, UPDATE, INSERT, etc.). I
>>>>> want to do the same with HBase tables. We tried this using Hive and
>>>>> HiveServer2, but the response times are just too long.
>>>>>
>>>>> Thanks,
>>>>> Ben
>>>>>
>>>>>
>>>>> On Oct 8, 2016, at 10:53 AM, Felix Cheung <fe...@hotmail.com>
>>>>> wrote:
>>>>>
>>>>> Ben,
>>>>>
>>>>> I'm not sure I'm following completely.
>>>>>
>>>>> Is your goal to use Spark to create or access tables in HBASE? If so
>>>>> the link below and several packages out there support that by having a
>>>>> HBASE data source for Spark. There are some examples on how the Spark code
>>>>> look like in that link as well. On that note, you should also be able to
>>>>> use the HBASE data source from pure SQL (Spark SQL) query as well, which
>>>>> should work in the case with the Spark SQL JDBC Thrift Server (with USING,
>>>>> http://spark.apache.org/docs/latest/sql-programming-gu
>>>>> ide.html#tab_sql_10).
>>>>>
>>>>>
>>>>> _____________________________
>>>>> From: Benjamin Kim <bb...@gmail.com>
>>>>> Sent: Saturday, October 8, 2016 10:40 AM
>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>> To: Felix Cheung <fe...@hotmail.com>
>>>>> Cc: <us...@spark.apache.org>
>>>>>
>>>>>
>>>>> Felix,
>>>>>
>>>>> The only alternative way is to create a stored procedure (udf) in
>>>>> database terms that would run Spark scala code underneath. In this way, I
>>>>> can use Spark SQL JDBC Thriftserver to execute it using SQL code passing
>>>>> the key, values I want to UPSERT. I wonder if this is possible since I
>>>>> cannot CREATE a wrapper table on top of a HBase table in Spark SQL?
>>>>>
>>>>> What do you think? Is this the right approach?
>>>>>
>>>>> Thanks,
>>>>> Ben
>>>>>
>>>>> On Oct 8, 2016, at 10:33 AM, Felix Cheung <fe...@hotmail.com>
>>>>> wrote:
>>>>>
>>>>> HBase has released support for Spark
>>>>> hbase.apache.org/book.html#spark
>>>>>
>>>>> And if you search you should find several alternative approaches.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" <
>>>>> bbuild11@gmail.com> wrote:
>>>>>
>>>>> Does anyone know if Spark can work with HBase tables using Spark SQL?
>>>>> I know in Hive we are able to create tables on top of an underlying HBase
>>>>> table that can be accessed using MapReduce jobs. Can the same be done using
>>>>> HiveContext or SQLContext? We are trying to setup a way to GET and POST
>>>>> data to and from the HBase table using the Spark SQL JDBC thriftserver from
>>>>> our RESTful API endpoints and/or HTTP web farms. If we can get this to
>>>>> work, then we can load balance the thriftservers. In addition, this will
>>>>> benefit us in giving us a way to abstract the data storage layer away from
>>>>> the presentation layer code. There is a chance that we will swap out the
>>>>> data storage technology in the future. We are currently experimenting with
>>>>> Kudu.
>>>>>
>>>>> Thanks,
>>>>> Ben
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>
>

Re: Spark SQL Thriftserver with HBase

Posted by Michael Segel <ms...@hotmail.com>.
It's a quasi-columnar store.
Sort of a hybrid approach.


On Oct 17, 2016, at 4:30 PM, Mich Talebzadeh <mi...@gmail.com>> wrote:

I assume that Hbase is more of a columnar data store by virtue of it storing column data together.

Many interpretations of this are all over the place. However, it is not columnar in the sense of a column-based (as opposed to row-based) implementation of the relational model.



Dr Mich Talebzadeh



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.



On 17 October 2016 at 22:14, Jörn Franke <jo...@gmail.com>> wrote:
An OLTP use case scenario does not necessarily mean traditional OLTP. See also Apache Hawk etc.; they can indeed fit some use cases well, others less so.

On 17 Oct 2016, at 23:02, Michael Segel <ms...@hotmail.com>> wrote:

You really don’t want to do OLTP on a distributed NoSQL engine.
Remember Big Data isn’t relational; it’s more of a hierarchical or record model. Think IMS or Pick (Dick Pick’s Revelation, U2, Universe, etc …)


On Oct 17, 2016, at 3:45 PM, Jörn Franke <jo...@gmail.com>> wrote:

It has some implications because it imposes the SQL model on Hbase. Internally it translates the SQL queries into custom Hbase coprocessors. Keep also in mind that Hbase needs a proper key design and how Phoenix designs those keys to get the best performance out of it. I think for OLTP it is a workable model and I think they plan to offer Phoenix as a default interface as part of Hbase anyway.
For OLAP it depends.


On 17 Oct 2016, at 22:34, ayan guha <gu...@gmail.com>> wrote:


Hi

Any reason not to recommend Phoenix? I haven't used it myself so I am curious about the pros and cons of using it.

On 18 Oct 2016 03:17, "Michael Segel" <ms...@hotmail.com>> wrote:
Guys,
Sorry for jumping in late to the game…

If memory serves (which may not be a good thing…) :

You can use HiveServer2 as a connection point to HBase.
While this doesn’t perform well, it’s probably the cleanest solution.
I’m not keen on Phoenix… wouldn’t recommend it….


The issue is that you’re trying to make HBase, a key/value object store, a Relational Engine… it’s not.

There are some considerations which make HBase not ideal for all use cases and you may find better performance with Parquet files.

One thing missing is the use of secondary indexing and query optimizations that you have in RDBMSs and are lacking in HBase / MapRDB / etc …  so your performance will vary.

With respect to Tableau… their entire interface in to the big data world revolves around the JDBC/ODBC interface. So if you don’t have that piece as part of your solution, you’re DOA w respect to Tableau.

Have you considered Drill as your JDBC connection point?  (YAAP: Yet another Apache project)


On Oct 9, 2016, at 12:23 PM, Benjamin Kim <bb...@gmail.com>> wrote:

Thanks for all the suggestions. It would seem you guys are right about the Tableau side of things. The reports don’t need to be real-time, and they won’t be directly feeding off of the main DMP HBase data. Instead, it’ll be batched to Parquet or Kudu/Impala or even PostgreSQL.

I originally thought that we needed two-way data retrieval from the DMP HBase for ID generation, but after further investigation into the use-case and architecture, the ID generation needs to happen local to the Ad Servers where we generate a unique ID and store it in an ID linking table. Even better, many of the 3rd party services supply this ID. So, data only needs to flow in one direction. We will use Kafka as the bus for this. No JDBC required. This also goes for the REST Endpoints. 3rd party services will hit ours to update our data with no need to read from our data. And, when we want to update their data, we will hit theirs using a triggered job.

This all boils down to just integrating with Kafka.

Once again, thanks for all the help.

Cheers,
Ben


On Oct 9, 2016, at 3:16 AM, Jörn Franke <jo...@gmail.com>> wrote:

Please keep also in mind that Tableau Server has the capability to store data in-memory and refresh the in-memory data only when needed. This means you can import it from any source and let your users work only on the in-memory data in Tableau Server.

On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke <jo...@gmail.com>> wrote:
Cloudera 5.8 has a very old version of Hive without Tez, but Mich already provided a good alternative. However, you should check if it contains a recent version of Hbase and Phoenix. That being said, I just wonder what the dataflow, data model and analysis you plan to do look like. Maybe completely different solutions are possible. In particular, single inserts, upserts etc. should be avoided as much as possible in the Big Data (analysis) world with any technology, because they do not perform well.

Hive with LLAP will provide an in-memory cache for interactive analytics. You can put full tables in-memory with Hive using the Ignite HDFS in-memory solution. All this only makes sense if you do not use MR as an engine, and use the right input format (ORC, Parquet) and a recent Hive version.
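
To make the input format point concrete, here is a small illustrative HiveQL sketch; the table and column names are invented for the example and not taken from this thread:

CREATE TABLE dim_acamp_orc (
  id     BIGINT,
  name   STRING,
  status STRING
)
STORED AS ORC;

-- bulk load from a staging table rather than issuing single-row inserts
INSERT INTO TABLE dim_acamp_orc
SELECT id, name, status FROM staging_dim_acamp;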

On 8 Oct 2016, at 21:55, Benjamin Kim <bb...@gmail.com>> wrote:

Mich,

Unfortunately, we are moving away from Hive and unifying on Spark using CDH 5.8 as our distro. And Tableau released a Spark ODBC/JDBC driver too. I will either try the Phoenix JDBC Server for HBase or push to move faster to Kudu with Impala. We will use Impala as the JDBC in-between until the Kudu team completes Spark SQL support for JDBC.

Thanks for the advice.

Cheers,
Ben


On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh <mi...@gmail.com>> wrote:

Sure. But essentially you are looking at batch data for analytics for your Tableau users, so Hive may be a better choice with its rich SQL and existing ODBC/JDBC connection to Tableau.

I would go for Hive, especially as the new release will have an in-memory offering as well for frequently accessed data :)


Dr Mich Talebzadeh



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.



On 8 October 2016 at 20:15, Benjamin Kim <bb...@gmail.com>> wrote:
Mich,

First and foremost, we have visualization servers that run Tableau for external user reports. Second, we have servers that are ad servers and REST endpoints for cookie sync and segmentation data exchange. These will use JDBC directly within the same data-center. When not colocated in the same data-center, they will connect to a remotely located database server using JDBC. Either way, using JDBC everywhere simplifies and unifies the code around the JDBC industry standard.

Does this make sense?

Thanks,
Ben


On Oct 8, 2016, at 11:47 AM, Mich Talebzadeh <mi...@gmail.com>> wrote:

Like any other design, what are your presentation layer and end users?

Are they SQL-centric users from a Tableau background, or might they use Spark functional programming?

It is best to describe the use case.

HTH

Dr Mich Talebzadeh



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.



On 8 October 2016 at 19:40, Felix Cheung <fe...@hotmail.com>> wrote:
I wouldn't be too surprised if Spark SQL - JDBC data source - Phoenix JDBC server - HBASE would work better.

Without naming specifics, there are at least 4 or 5 different implementations of HBASE sources, each at varying level of development and different requirements (HBASE release version, Kerberos support etc)
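
As a rough sketch of that chain (the host name and table name below are placeholders, and the thin-client driver class and URL format should be verified against the Phoenix version in use), the Phoenix Query Server could be registered as an ordinary JDBC source in Spark SQL:

CREATE TABLE tsco_via_phoenix
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:phoenix:thin:url=http://<phoenix-query-server-host>:8765;serialization=PROTOBUF",
  driver "org.apache.phoenix.queryserver.client.Driver",
  dbtable "tsco"
);

-- then, through the Spark SQL JDBC thriftserver:
SELECT * FROM tsco_via_phoenix LIMIT 10;

The Phoenix thin-client jar would also need to be on the thriftserver's classpath for the driver class to resolve.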


_____________________________
From: Benjamin Kim <bb...@gmail.com>>
Sent: Saturday, October 8, 2016 11:26 AM
Subject: Re: Spark SQL Thriftserver with HBase
To: Mich Talebzadeh <mi...@gmail.com>>
Cc: <us...@spark.apache.org>>, Felix Cheung <fe...@hotmail.com>>



Mich,

Are you talking about the Phoenix JDBC Server? If so, I forgot about that alternative.

Thanks,
Ben


On Oct 8, 2016, at 11:21 AM, Mich Talebzadeh <mi...@gmail.com>> wrote:

I don't think it will work

you can use phoenix on top of hbase

hbase(main):336:0> scan 'tsco', 'LIMIT' => 1
ROW                                                       COLUMN+CELL
 TSCO-1-Apr-08                                            column=stock_daily:Date, timestamp=1475866783376, value=1-Apr-08
 TSCO-1-Apr-08                                            column=stock_daily:close, timestamp=1475866783376, value=405.25
 TSCO-1-Apr-08                                            column=stock_daily:high, timestamp=1475866783376, value=406.75
 TSCO-1-Apr-08                                            column=stock_daily:low, timestamp=1475866783376, value=379.25
 TSCO-1-Apr-08                                            column=stock_daily:open, timestamp=1475866783376, value=380.00
 TSCO-1-Apr-08                                            column=stock_daily:stock, timestamp=1475866783376, value=TESCO PLC
 TSCO-1-Apr-08                                            column=stock_daily:ticker, timestamp=1475866783376, value=TSCO
 TSCO-1-Apr-08                                            column=stock_daily:volume, timestamp=1475866783376, value=49664486

And the same on Phoenix on top of the Hbase table

0: jdbc:phoenix:thin:url=http://rhes564:8765> select substr(to_char(to_date("Date",'dd-MMM-yy')),1,10) AS TradeDate, "close" AS "Day's close", "high" AS "Day's High", "low" AS "Day's Low", "open" AS "Day's Open", "ticker", "volume", (to_number("low")+to_number("high"))/2 AS "AverageDailyPrice" from "tsco" where to_number("volume") > 0 and "high" != '-' and to_date("Date",'dd-MMM-yy') > to_date('2015-10-06','yyyy-MM-dd') order by  to_date("Date",'dd-MMM-yy') limit 1;
+-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
|  TRADEDATE  | Day's close  | Day's High  | Day's Low  | Day's Open  | ticker  |  volume   | AverageDailyPrice  |
+-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
| 2015-10-07  | 197.00       | 198.05      | 184.84     | 192.20      | TSCO    | 30046994  | 191.445            |


HTH




Dr Mich Talebzadeh



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.



On 8 October 2016 at 19:05, Felix Cheung <fe...@hotmail.com>> wrote:
Great, then I think those packages as Spark data source should allow you to do exactly that (replace org.apache.spark.sql.jdbc with HBASE one)

I do think it will be great to get more examples around this though. Would be great if you could share your experience with this!
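
One hedged example of what such a statement could look like with the hbase-spark module that ships with HBase; the format class and option keys follow that module's documentation but differ between the various HBASE connector packages, and the table, column family and column names are made up for illustration:

CREATE TABLE hbase_dim_acamp
USING org.apache.hadoop.hbase.spark
OPTIONS (
  hbase.table "dim_acamp",
  hbase.columns.mapping "id STRING :key, name STRING cf:name, status STRING cf:status"
);

SELECT id, name FROM hbase_dim_acamp WHERE status = 'active';

Whichever connector is used, its jar and the HBase client configuration have to be visible to the thriftserver.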


_____________________________
From: Benjamin Kim <bb...@gmail.com>>
Sent: Saturday, October 8, 2016 11:00 AM
Subject: Re: Spark SQL Thriftserver with HBase
To: Felix Cheung <fe...@hotmail.com>>
Cc: <us...@spark.apache.org>>


Felix,

My goal is to use Spark SQL JDBC Thriftserver to access HBase tables using just SQL. I have been able to CREATE tables using this statement below in the past:

CREATE TABLE <table-name>
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:postgresql://<hostname>:<port>/dm?user=<username>&password=<password>",
  dbtable "dim.dimension_acamp"
);

After doing this, I can access the PostgreSQL table through the Spark SQL JDBC Thriftserver using SQL statements (SELECT, UPDATE, INSERT, etc.). I want to do the same with HBase tables. We tried this using Hive and HiveServer2, but the response times are just too long.

Thanks,
Ben


On Oct 8, 2016, at 10:53 AM, Felix Cheung <fe...@hotmail.com>> wrote:

Ben,

I'm not sure I'm following completely.

Is your goal to use Spark to create or access tables in HBASE? If so, the link below and several packages out there support that by having an HBASE data source for Spark. There are some examples of what the Spark code looks like in that link as well. On that note, you should also be able to use the HBASE data source from a pure SQL (Spark SQL) query as well, which should work in the case of the Spark SQL JDBC Thrift Server (with USING, http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10).


_____________________________
From: Benjamin Kim <bb...@gmail.com>>
Sent: Saturday, October 8, 2016 10:40 AM
Subject: Re: Spark SQL Thriftserver with HBase
To: Felix Cheung <fe...@hotmail.com>>
Cc: <us...@spark.apache.org>>


Felix,

The only alternative way is to create a stored procedure (UDF) in database terms that would run Spark Scala code underneath. In this way, I can use the Spark SQL JDBC Thriftserver to execute it using SQL code, passing the key and values I want to UPSERT. I wonder if this is possible, since I cannot CREATE a wrapper table on top of an HBase table in Spark SQL?
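
If that route were tried, one very rough sketch through the thriftserver would be a Hive-style UDF registered from a jar; the jar path, class name and function name are hypothetical placeholders, and whether a side-effecting UPSERT function like this is advisable is exactly the open question here:

ADD JAR hdfs:///libs/hbase-upsert-udf.jar;
CREATE TEMPORARY FUNCTION hbase_upsert AS 'com.example.HBaseUpsertUDF';
SELECT hbase_upsert('row-key-123', 'cf:segment', 'value-to-write');

The UDF class would wrap the HBase client Put call, so the SQL layer only ever sees a function invocation.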

What do you think? Is this the right approach?

Thanks,
Ben

On Oct 8, 2016, at 10:33 AM, Felix Cheung <fe...@hotmail.com>> wrote:

HBase has released support for Spark
hbase.apache.org/book.html#spark

And if you search you should find several alternative approaches.





On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" <bb...@gmail.com>> wrote:

Does anyone know if Spark can work with HBase tables using Spark SQL? I know in Hive we are able to create tables on top of an underlying HBase table that can be accessed using MapReduce jobs. Can the same be done using HiveContext or SQLContext? We are trying to setup a way to GET and POST data to and from the HBase table using the Spark SQL JDBC thriftserver from our RESTful API endpoints and/or HTTP web farms. If we can get this to work, then we can load balance the thriftservers. In addition, this will benefit us in giving us a way to abstract the data storage layer away from the presentation layer code. There is a chance that we will swap out the data storage technology in the future. We are currently experimenting with Kudu.

Thanks,
Ben
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org





















Re: Spark SQL Thriftserver with HBase

Posted by Mich Talebzadeh <mi...@gmail.com>.
I assume that Hbase is more of a columnar data store by virtue of it storing
column data together.

Many interpretations of this are all over the place. However, it is not
columnar in the sense of a column-based (as opposed to row-based)
implementation of the relational model.



Dr Mich Talebzadeh



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 17 October 2016 at 22:14, Jörn Franke <jo...@gmail.com> wrote:

> An OLTP use case scenario does not necessarily mean traditional OLTP. See
> also Apache Hawk etc.; they can indeed fit some use cases well, others
> less so.
>
> On 17 Oct 2016, at 23:02, Michael Segel <ms...@hotmail.com> wrote:
>
> You really don’t want to do OLTP on a distributed NoSQL engine.
> Remember Big Data isn’t relational; it’s more of a hierarchical or record
> model. Think IMS or Pick (Dick Pick’s Revelation, U2, Universe, etc …)
>
>
>
> On Oct 17, 2016, at 3:45 PM, Jörn Franke <jo...@gmail.com> wrote:
>
> It has some implications because it imposes the SQL model on Hbase.
> Internally it translates the SQL queries into custom Hbase coprocessors. Keep
> also in mind that Hbase needs a proper key design and how Phoenix
> designs those keys to get the best performance out of it. I think for oltp
> it is a workable model and I think they plan to offer Phoenix as a default
> interface as part of Hbase anyway.
> For OLAP it depends.
>
>
> On 17 Oct 2016, at 22:34, ayan guha <gu...@gmail.com> wrote:
>
> Hi
>
> Any reason not to recommend Phoenix? I haven't used it myself so I am curious
> about the pros and cons of using it.
> On 18 Oct 2016 03:17, "Michael Segel" <ms...@hotmail.com> wrote:
>
>> Guys,
>> Sorry for jumping in late to the game…
>>
>> If memory serves (which may not be a good thing…) :
>>
>> You can use HiveServer2 as a connection point to HBase.
>> While this doesn’t perform well, it’s probably the cleanest solution.
>> I’m not keen on Phoenix… wouldn’t recommend it….
>>
>>
>> The issue is that you’re trying to make HBase, a key/value object store,
>> a Relational Engine… it’s not.
>>
>> There are some considerations which make HBase not ideal for all use
>> cases and you may find better performance with Parquet files.
>>
>> One thing missing is the use of secondary indexing and query
>> optimizations that you have in RDBMSs and are lacking in HBase / MapRDB /
>> etc …  so your performance will vary.
>>
>> With respect to Tableau… their entire interface in to the big data world
>> revolves around the JDBC/ODBC interface. So if you don’t have that piece as
>> part of your solution, you’re DOA w respect to Tableau.
>>
>> Have you considered Drill as your JDBC connection point?  (YAAP: Yet
>> another Apache project)
>>
>>
>> On Oct 9, 2016, at 12:23 PM, Benjamin Kim <bb...@gmail.com> wrote:
>>
>> Thanks for all the suggestions. It would seem you guys are right about
>> the Tableau side of things. The reports don’t need to be real-time, and
>> they won’t be directly feeding off of the main DMP HBase data. Instead,
>> it’ll be batched to Parquet or Kudu/Impala or even PostgreSQL.
>>
>> I originally thought that we needed two-way data retrieval from the DMP
>> HBase for ID generation, but after further investigation into the use-case
>> and architecture, the ID generation needs to happen local to the Ad Servers
>> where we generate a unique ID and store it in an ID linking table. Even
>> better, many of the 3rd party services supply this ID. So, data only needs
>> to flow in one direction. We will use Kafka as the bus for this. No JDBC
>> required. This also goes for the REST Endpoints. 3rd party services will
>> hit ours to update our data with no need to read from our data. And, when
>> we want to update their data, we will hit theirs to update their data using
>> a triggered job.
>>
>> This all boils down to just integrating with Kafka.
>>
>> Once again, thanks for all the help.
>>
>> Cheers,
>> Ben
>>
>>
>> On Oct 9, 2016, at 3:16 AM, Jörn Franke <jo...@gmail.com> wrote:
>>
>> please keep also in mind that Tableau Server has the capabilities to
>> store data in-memory and refresh the in-memory data only when needed. This
>> means you can import it from any source and let your users work only on the
>> in-memory data in Tableau Server.
>>
>> On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke <jo...@gmail.com> wrote:
>>
>>> Cloudera 5.8 has a very old version of Hive without Tez, but Mich
>>> provided already a good alternative. However, you should check if it
>>> contains a recent version of Hbase and Phoenix. That being said, I just
>>> wonder what is the dataflow, data model and the analysis you plan to do.
>>> Maybe there are completely different solutions possible. Especially these
>>> single inserts, upserts etc. should be avoided as much as possible in the
>>> Big Data (analysis) world with any technology, because they do not perform
>>> well.
>>>
>>> Hive with Llap will provide an in-memory cache for interactive
>>> analytics. You can put full tables in-memory with Hive using Ignite HDFS
>>> in-memory solution. All this only makes sense if you do not use MR as
>>> an engine, and use the right input format (ORC, Parquet) and a recent Hive version.
>>>
>>> On 8 Oct 2016, at 21:55, Benjamin Kim <bb...@gmail.com> wrote:
>>>
>>> Mich,
>>>
>>> Unfortunately, we are moving away from Hive and unifying on Spark using
>>> CDH 5.8 as our distro. And Tableau released a Spark ODBC/JDBC driver
>>> too. I will either try Phoenix JDBC Server for HBase or push to move faster
>>> to Kudu with Impala. We will use Impala as the JDBC in-between until the
>>> Kudu team completes Spark SQL support for JDBC.
>>>
>>> Thanks for the advice.
>>>
>>> Cheers,
>>> Ben
>>>
>>>
>>> On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh <mi...@gmail.com>
>>> wrote:
>>>
>>> Sure. But essentially you are looking at batch data for analytics for
>>> your Tableau users so Hive may be a better choice with its rich SQL and
>>> ODBC/JDBC connection to Tableau already.
>>>
>>> I would go for Hive especially the new release will have an in-memory
>>> offering as well for frequently accessed data :)
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 8 October 2016 at 20:15, Benjamin Kim <bb...@gmail.com> wrote:
>>>
>>>> Mich,
>>>>
>>>> First and foremost, we have visualization servers that run Tableau for
>>>> external user reports. Second, we have servers that are ad servers and REST
>>>> endpoints for cookie sync and segmentation data exchange. These will use
>>>> JDBC directly within the same data-center. When not colocated in the same
>>>> data-center, they will connect to a remotely located database server using JDBC.
>>>> Either way, by using JDBC everywhere, it simplifies and unifies the code on
>>>> the JDBC industry standard.
>>>>
>>>> Does this make sense?
>>>>
>>>> Thanks,
>>>> Ben
>>>>
>>>>
>>>> On Oct 8, 2016, at 11:47 AM, Mich Talebzadeh <mi...@gmail.com>
>>>> wrote:
>>>>
>>>> Like any other design what is your presentation layer and end users?
>>>>
>>>> Are they SQL centric users from Tableau background or they may use
>>>> spark functional programming.
>>>>
>>>> It is best to describe the use case.
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>> On 8 October 2016 at 19:40, Felix Cheung <fe...@hotmail.com>
>>>> wrote:
>>>>
>>>>> I wouldn't be too surprised Spark SQL - JDBC data source - Phoenix
>>>>> JDBC server - HBASE would work better.
>>>>>
>>>>> Without naming specifics, there are at least 4 or 5 different
>>>>> implementations of HBASE sources, each at varying level of development and
>>>>> different requirements (HBASE release version, Kerberos support etc)
>>>>>
>>>>>
>>>>> _____________________________
>>>>> From: Benjamin Kim <bb...@gmail.com>
>>>>> Sent: Saturday, October 8, 2016 11:26 AM
>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>> To: Mich Talebzadeh <mi...@gmail.com>
>>>>> Cc: <us...@spark.apache.org>, Felix Cheung <fe...@hotmail.com>
>>>>>
>>>>>
>>>>>
>>>>> Mich,
>>>>>
>>>>> Are you talking about the Phoenix JDBC Server? If so, I forgot about
>>>>> that alternative.
>>>>>
>>>>> Thanks,
>>>>> Ben
>>>>>
>>>>>
>>>>> On Oct 8, 2016, at 11:21 AM, Mich Talebzadeh <
>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>
>>>>> I don't think it will work
>>>>>
>>>>> you can use phoenix on top of hbase
>>>>>
>>>>> hbase(main):336:0> scan 'tsco', 'LIMIT' => 1
>>>>> ROW                                                       COLUMN+CELL
>>>>>  TSCO-1-Apr-08
>>>>> column=stock_daily:Date, timestamp=1475866783376, value=1-Apr-08
>>>>>  TSCO-1-Apr-08
>>>>> column=stock_daily:close, timestamp=1475866783376, value=405.25
>>>>>  TSCO-1-Apr-08
>>>>> column=stock_daily:high, timestamp=1475866783376, value=406.75
>>>>>  TSCO-1-Apr-08
>>>>> column=stock_daily:low, timestamp=1475866783376, value=379.25
>>>>>  TSCO-1-Apr-08
>>>>> column=stock_daily:open, timestamp=1475866783376, value=380.00
>>>>>  TSCO-1-Apr-08
>>>>> column=stock_daily:stock, timestamp=1475866783376, value=TESCO PLC
>>>>>  TSCO-1-Apr-08
>>>>> column=stock_daily:ticker, timestamp=1475866783376, value=TSCO
>>>>>  TSCO-1-Apr-08
>>>>> column=stock_daily:volume, timestamp=1475866783376, value=49664486
>>>>>
>>>>> And the same on Phoenix on top of the Hbase table
>>>>>
>>>>> 0: jdbc:phoenix:thin:url=http://rhes564:8765> select
>>>>> substr(to_char(to_date("Date",'dd-MMM-yy')),1,10) AS TradeDate,
>>>>> "close" AS "Day's close", "high" AS "Day's High", "low" AS "Day's Low",
>>>>> "open" AS "Day's Open", "ticker", "volume", (to_number("low")+to_number("high"))/2
>>>>> AS "AverageDailyPrice" from "tsco" where to_number("volume") > 0 and "high"
>>>>> != '-' and to_date("Date",'dd-MMM-yy') > to_date('2015-10-06','yyyy-MM-dd')
>>>>> order by  to_date("Date",'dd-MMM-yy') limit 1;
>>>>> +-------------+--------------+-------------+------------+---
>>>>> ----------+---------+-----------+--------------------+
>>>>> |  TRADEDATE  | Day's close  | Day's High  | Day's Low  | Day's Open
>>>>> | ticker  |  volume   | AverageDailyPrice  |
>>>>> +-------------+--------------+-------------+------------+---
>>>>> ----------+---------+-----------+--------------------+
>>>>> | 2015-10-07  | 197.00       | 198.05      | 184.84     | 192.20
>>>>> | TSCO    | 30046994  | 191.445            |
>>>>>
>>>>> HTH
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>> On 8 October 2016 at 19:05, Felix Cheung <fe...@hotmail.com>
>>>>> wrote:
>>>>>
>>>>>> Great, then I think those packages as Spark data source should allow
>>>>>> you to do exactly that (replace org.apache.spark.sql.jdbc with HBASE one)
>>>>>>
>>>>>> I do think it will be great to get more examples around this though.
>>>>>> Would be great if you could share your experience with this!
>>>>>>
>>>>>>
>>>>>> _____________________________
>>>>>> From: Benjamin Kim <bb...@gmail.com>
>>>>>> Sent: Saturday, October 8, 2016 11:00 AM
>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>> To: Felix Cheung <fe...@hotmail.com>
>>>>>> Cc: <us...@spark.apache.org>
>>>>>>
>>>>>>
>>>>>> Felix,
>>>>>>
>>>>>> My goal is to use Spark SQL JDBC Thriftserver to access HBase tables
>>>>>> using just SQL. I have been able to CREATE tables using this statement
>>>>>> below in the past:
>>>>>>
>>>>>> CREATE TABLE <table-name>
>>>>>> USING org.apache.spark.sql.jdbc
>>>>>> OPTIONS (
>>>>>>   url "jdbc:postgresql://<hostname>:<port>/dm?user=<username>&pass
>>>>>> word=<password>",
>>>>>>   dbtable "dim.dimension_acamp"
>>>>>> );
>>>>>>
>>>>>>
>>>>>> After doing this, I can access the PostgreSQL table using Spark SQL
>>>>>> JDBC Thriftserver using SQL statements (SELECT, UPDATE, INSERT, etc.). I
>>>>>> want to do the same with HBase tables. We tried this using Hive and
>>>>>> HiveServer2, but the response times are just too long.
>>>>>>
>>>>>> Thanks,
>>>>>> Ben
>>>>>>
>>>>>>
>>>>>> On Oct 8, 2016, at 10:53 AM, Felix Cheung <fe...@hotmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Ben,
>>>>>>
>>>>>> I'm not sure I'm following completely.
>>>>>>
>>>>>> Is your goal to use Spark to create or access tables in HBASE? If so
>>>>>> the link below and several packages out there support that by having a
>>>>>> HBASE data source for Spark. There are some examples on how the Spark code
>>>>>> look like in that link as well. On that note, you should also be able to
>>>>>> use the HBASE data source from pure SQL (Spark SQL) query as well, which
>>>>>> should work in the case with the Spark SQL JDBC Thrift Server (with USING,
>>>>>> http://spark.apache.org/docs/latest/sql-programming-gu
>>>>>> ide.html#tab_sql_10).
>>>>>>
>>>>>>
>>>>>> _____________________________
>>>>>> From: Benjamin Kim <bb...@gmail.com>
>>>>>> Sent: Saturday, October 8, 2016 10:40 AM
>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>> To: Felix Cheung <fe...@hotmail.com>
>>>>>> Cc: <us...@spark.apache.org>
>>>>>>
>>>>>>
>>>>>> Felix,
>>>>>>
>>>>>> The only alternative way is to create a stored procedure (udf) in
>>>>>> database terms that would run Spark scala code underneath. In this way, I
>>>>>> can use Spark SQL JDBC Thriftserver to execute it using SQL code passing
>>>>>> the key, values I want to UPSERT. I wonder if this is possible since I
>>>>>> cannot CREATE a wrapper table on top of a HBase table in Spark SQL?
>>>>>>
>>>>>> What do you think? Is this the right approach?
>>>>>>
>>>>>> Thanks,
>>>>>> Ben
>>>>>>
>>>>>> On Oct 8, 2016, at 10:33 AM, Felix Cheung <fe...@hotmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> HBase has released support for Spark
>>>>>> hbase.apache.org/book.html#spark
>>>>>>
>>>>>> And if you search you should find several alternative approaches.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" <
>>>>>> bbuild11@gmail.com> wrote:
>>>>>>
>>>>>> Does anyone know if Spark can work with HBase tables using Spark SQL?
>>>>>> I know in Hive we are able to create tables on top of an underlying HBase
>>>>>> table that can be accessed using MapReduce jobs. Can the same be done using
>>>>>> HiveContext or SQLContext? We are trying to setup a way to GET and POST
>>>>>> data to and from the HBase table using the Spark SQL JDBC thriftserver from
>>>>>> our RESTful API endpoints and/or HTTP web farms. If we can get this to
>>>>>> work, then we can load balance the thriftservers. In addition, this will
>>>>>> benefit us in giving us a way to abstract the data storage layer away from
>>>>>> the presentation layer code. There is a chance that we will swap out the
>>>>>> data storage technology in the future. We are currently experimenting with
>>>>>> Kudu.
>>>>>>
>>>>>> Thanks,
>>>>>> Ben
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>

Re: Spark SQL Thriftserver with HBase

Posted by Jörn Franke <jo...@gmail.com>.
An OLTP use case scenario does not necessarily mean traditional OLTP. See also Apache Hawk etc.; they can indeed fit some use cases well, others less so.

> On 17 Oct 2016, at 23:02, Michael Segel <ms...@hotmail.com> wrote:
> 
> You really don’t want to do OLTP on a distributed NoSQL engine. 
> Remember Big Data isn’t relational; it’s more of a hierarchical or record model. Think IMS or Pick (Dick Pick’s Revelation, U2, Universe, etc …)
> 
>  
>> On Oct 17, 2016, at 3:45 PM, Jörn Franke <jo...@gmail.com> wrote:
>> 
>> It has some implications because it imposes the SQL model on Hbase. Internally it translates the SQL queries into custom Hbase coprocessors. Keep also in mind that Hbase needs a proper key design and how Phoenix designs those keys to get the best performance out of it. I think for OLTP it is a workable model and I think they plan to offer Phoenix as a default interface as part of Hbase anyway.
>> For OLAP it depends. 
>> 
>> 
>> On 17 Oct 2016, at 22:34, ayan guha <gu...@gmail.com> wrote:
>> 
>>> Hi
>>> 
>>> Any reason not to recommend Phoenix? I haven't used it myself so I am curious about the pros and cons of using it.
>>> 
>>>> On 18 Oct 2016 03:17, "Michael Segel" <ms...@hotmail.com> wrote:
>>>> Guys, 
>>>> Sorry for jumping in late to the game… 
>>>> 
>>>> If memory serves (which may not be a good thing…) :
>>>> 
>>>> You can use HiveServer2 as a connection point to HBase.  
>>>> While this doesn’t perform well, it’s probably the cleanest solution.
>>>> I’m not keen on Phoenix… wouldn’t recommend it…. 
>>>> 
>>>> 
>>>> The issue is that you’re trying to make HBase, a key/value object store, a Relational Engine… it’s not.
>>>> 
>>>> There are some considerations which make HBase not ideal for all use cases and you may find better performance with Parquet files. 
>>>> 
>>>> One thing missing is the use of secondary indexing and query optimizations that you have in RDBMSs and are lacking in HBase / MapRDB / etc …  so your performance will vary. 
>>>> 
>>>> With respect to Tableau… their entire interface in to the big data world revolves around the JDBC/ODBC interface. So if you don’t have that piece as part of your solution, you’re DOA w respect to Tableau. 
>>>> 
>>>> Have you considered Drill as your JDBC connection point?  (YAAP: Yet another Apache project) 
>>>> 
>>>> 
>>>>> On Oct 9, 2016, at 12:23 PM, Benjamin Kim <bb...@gmail.com> wrote:
>>>>> 
>>>>> Thanks for all the suggestions. It would seem you guys are right about the Tableau side of things. The reports don’t need to be real-time, and they won’t be directly feeding off of the main DMP HBase data. Instead, it’ll be batched to Parquet or Kudu/Impala or even PostgreSQL.
>>>>> 
>>>>> I originally thought that we needed two-way data retrieval from the DMP HBase for ID generation, but after further investigation into the use-case and architecture, the ID generation needs to happen local to the Ad Servers where we generate a unique ID and store it in an ID linking table. Even better, many of the 3rd party services supply this ID. So, data only needs to flow in one direction. We will use Kafka as the bus for this. No JDBC required. This also goes for the REST Endpoints. 3rd party services will hit ours to update our data with no need to read from our data. And, when we want to update their data, we will hit theirs using a triggered job.
>>>>> 
>>>>> This all boils down to just integrating with Kafka.
>>>>> 
>>>>> Once again, thanks for all the help.
>>>>> 
>>>>> Cheers,
>>>>> Ben
>>>>> 
>>>>> 
>>>>>> On Oct 9, 2016, at 3:16 AM, Jörn Franke <jo...@gmail.com> wrote:
>>>>>> 
>>>>>> Please keep also in mind that Tableau Server has the capability to store data in-memory and refresh the in-memory data only when needed. This means you can import it from any source and let your users work only on the in-memory data in Tableau Server.
>>>>>> 
>>>>>>> On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke <jo...@gmail.com> wrote:
>>>>>>> Cloudera 5.8 has a very old version of Hive without Tez, but Mich provided already a good alternative. However, you should check if it contains a recent version of Hbase and Phoenix. That being said, I just wonder what is the dataflow, data model and the analysis you plan to do. Maybe there are completely different solutions possible. Especially these single inserts, upserts etc. should be avoided as much as possible in the Big Data (analysis) world with any technology, because they do not perform well. 
>>>>>>> 
>>>>>>> Hive with LLAP will provide an in-memory cache for interactive analytics. You can put full tables in-memory with Hive using the Ignite HDFS in-memory solution. All this only makes sense if you do not use MR as an engine, and use the right input format (ORC, Parquet) and a recent Hive version.
>>>>>>> 
>>>>>>> On 8 Oct 2016, at 21:55, Benjamin Kim <bb...@gmail.com> wrote:
>>>>>>> 
>>>>>>>> Mich,
>>>>>>>> 
>>>>>>>> Unfortunately, we are moving away from Hive and unifying on Spark using CDH 5.8 as our distro. And Tableau released a Spark ODBC/JDBC driver too. I will either try the Phoenix JDBC Server for HBase or push to move faster to Kudu with Impala. We will use Impala as the JDBC in-between until the Kudu team completes Spark SQL support for JDBC.
>>>>>>>> 
>>>>>>>> Thanks for the advice.
>>>>>>>> 
>>>>>>>> Cheers,
>>>>>>>> Ben
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh <mi...@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> Sure. But essentially you are looking at batch data for analytics for your Tableau users, so Hive may be a better choice with its rich SQL and existing ODBC/JDBC connection to Tableau.
>>>>>>>>> 
>>>>>>>>> I would go for Hive, especially as the new release will have an in-memory offering as well for frequently accessed data :)
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Dr Mich Talebzadeh
>>>>>>>>>  
>>>>>>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>>>  
>>>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>>> 
>>>>>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>>>>>>>>  
>>>>>>>>> 
>>>>>>>>>> On 8 October 2016 at 20:15, Benjamin Kim <bb...@gmail.com> wrote:
>>>>>>>>>> Mich,
>>>>>>>>>> 
>>>>>>>>>> First and foremost, we have visualization servers that run Tableau for external user reports. Second, we have servers that are ad servers and REST endpoints for cookie sync and segmentation data exchange. These will use JDBC directly within the same data-center. When not colocated in the same data-center, they will connect to a remotely located database server using JDBC. Either way, using JDBC everywhere simplifies and unifies the code around the JDBC industry standard.
>>>>>>>>>> 
>>>>>>>>>> Does this make sense?
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> Ben
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On Oct 8, 2016, at 11:47 AM, Mich Talebzadeh <mi...@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Like any other design what is your presentation layer and end users?
>>>>>>>>>>> 
>>>>>>>>>>> Are they SQL centric users from Tableau background or they may use spark functional programming.
>>>>>>>>>>> 
>>>>>>>>>>> It is best to describe the use case.
>>>>>>>>>>> 
>>>>>>>>>>> HTH
>>>>>>>>>>> 
>>>>>>>>>>> Dr Mich Talebzadeh
>>>>>>>>>>>  
>>>>>>>>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>>>>>  
>>>>>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>>>>> 
>>>>>>>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>>>>>>>>>>  
>>>>>>>>>>> 
>>>>>>>>>>>> On 8 October 2016 at 19:40, Felix Cheung <fe...@hotmail.com> wrote:
>>>>>>>>>>>> I wouldn't be too surprised Spark SQL - JDBC data source - Phoenix JDBC server - HBASE would work better.
>>>>>>>>>>>> 
>>>>>>>>>>>> Without naming specifics, there are at least 4 or 5 different implementations of HBASE sources, each at varying level of development and different requirements (HBASE release version, Kerberos support etc)
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> _____________________________
>>>>>>>>>>>> From: Benjamin Kim <bb...@gmail.com>
>>>>>>>>>>>> Sent: Saturday, October 8, 2016 11:26 AM
>>>>>>>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>>>>>>>> To: Mich Talebzadeh <mi...@gmail.com>
>>>>>>>>>>>> Cc: <us...@spark.apache.org>, Felix Cheung <fe...@hotmail.com>
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Mich,
>>>>>>>>>>>> 
>>>>>>>>>>>> Are you talking about the Phoenix JDBC Server? If so, I forgot about that alternative.
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Ben
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Oct 8, 2016, at 11:21 AM, Mich Talebzadeh <mi...@gmail.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> I don't think it will work
>>>>>>>>>>>> 
>>>>>>>>>>>> you can use phoenix on top of hbase
>>>>>>>>>>>> 
>>>>>>>>>>>> hbase(main):336:0> scan 'tsco', 'LIMIT' => 1
>>>>>>>>>>>> ROW                                                       COLUMN+CELL
>>>>>>>>>>>>  TSCO-1-Apr-08                                            column=stock_daily:Date, timestamp=1475866783376, value=1-Apr-08
>>>>>>>>>>>>  TSCO-1-Apr-08                                            column=stock_daily:close, timestamp=1475866783376, value=405.25
>>>>>>>>>>>>  TSCO-1-Apr-08                                            column=stock_daily:high, timestamp=1475866783376, value=406.75
>>>>>>>>>>>>  TSCO-1-Apr-08                                            column=stock_daily:low, timestamp=1475866783376, value=379.25
>>>>>>>>>>>>  TSCO-1-Apr-08                                            column=stock_daily:open, timestamp=1475866783376, value=380.00
>>>>>>>>>>>>  TSCO-1-Apr-08                                            column=stock_daily:stock, timestamp=1475866783376, value=TESCO PLC
>>>>>>>>>>>>  TSCO-1-Apr-08                                            column=stock_daily:ticker, timestamp=1475866783376, value=TSCO
>>>>>>>>>>>>  TSCO-1-Apr-08                                            column=stock_daily:volume, timestamp=1475866783376, value=49664486
>>>>>>>>>>>> 
>>>>>>>>>>>> And the same on Phoenix on top of the Hbase table
>>>>>>>>>>>> 
>>>>>>>>>>>> 0: jdbc:phoenix:thin:url=http://rhes564:8765> select substr(to_char(to_date("Date",'dd-MMM-yy')),1,10) AS TradeDate, "close" AS "Day's close", "high" AS "Day's High", "low" AS "Day's Low", "open" AS "Day's Open", "ticker", "volume", (to_number("low")+to_number("high"))/2 AS "AverageDailyPrice" from "tsco" where to_number("volume") > 0 and "high" != '-' and to_date("Date",'dd-MMM-yy') > to_date('2015-10-06','yyyy-MM-dd') order by  to_date("Date",'dd-MMM-yy') limit 1;
>>>>>>>>>>>> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
>>>>>>>>>>>> |  TRADEDATE  | Day's close  | Day's High  | Day's Low  | Day's Open  | ticker  |  volume   | AverageDailyPrice  |
>>>>>>>>>>>> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
>>>>>>>>>>>> | 2015-10-07  | 197.00       | 198.05      | 184.84     | 192.20      | TSCO    | 30046994  | 191.445            |
>>>>>>>>>>>> 
>>>>>>>>>>>> HTH
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Dr Mich Talebzadeh
>>>>>>>>>>>>  
>>>>>>>>>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>>>>>>  
>>>>>>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>>>>>> 
>>>>>>>>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>>>>>>>>>>>  
>>>>>>>>>>>> 
>>>>>>>>>>>>> On 8 October 2016 at 19:05, Felix Cheung <fe...@hotmail.com> wrote:
>>>>>>>>>>>>> Great, then I think those packages as Spark data source should allow you to do exactly that (replace org.apache.spark.sql.jdbc with HBASE one)
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I do think it will be great to get more examples around this though. Would be great if you could share your experience with this!
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> _____________________________
>>>>>>>>>>>>> From: Benjamin Kim <bb...@gmail.com>
>>>>>>>>>>>>> Sent: Saturday, October 8, 2016 11:00 AM
>>>>>>>>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>>>>>>>>> To: Felix Cheung <fe...@hotmail.com>
>>>>>>>>>>>>> Cc: <us...@spark.apache.org>
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Felix,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> My goal is to use Spark SQL JDBC Thriftserver to access HBase tables using just SQL. I have been able to CREATE tables using this statement below in the past:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> CREATE TABLE <table-name>
>>>>>>>>>>>>> USING org.apache.spark.sql.jdbc
>>>>>>>>>>>>> OPTIONS (
>>>>>>>>>>>>>   url "jdbc:postgresql://<hostname>:<port>/dm?user=<username>&password=<password>",
>>>>>>>>>>>>>   dbtable "dim.dimension_acamp"
>>>>>>>>>>>>> );
>>>>>>>>>>>>> 
>>>>>>>>>>>>> After doing this, I can access the PostgreSQL table using Spark SQL JDBC Thriftserver using SQL statements (SELECT, UPDATE, INSERT, etc.). I want to do the same with HBase tables. We tried this using Hive and HiveServer2, but the response times are just too long.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Ben
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Oct 8, 2016, at 10:53 AM, Felix Cheung <fe...@hotmail.com> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Ben,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I'm not sure I'm following completely.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Is your goal to use Spark to create or access tables in HBASE? If so, the link below and several packages out there support that by having an HBASE data source for Spark. There are some examples of what the Spark code looks like in that link as well. On that note, you should also be able to use the HBASE data source from a pure SQL (Spark SQL) query as well, which should work in the case of the Spark SQL JDBC Thrift Server (with USING, http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> _____________________________
>>>>>>>>>>>>> From: Benjamin Kim <bb...@gmail.com>
>>>>>>>>>>>>> Sent: Saturday, October 8, 2016 10:40 AM
>>>>>>>>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>>>>>>>>> To: Felix Cheung <fe...@hotmail.com>
>>>>>>>>>>>>> Cc: <us...@spark.apache.org>
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Felix,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The only alternative way is to create a stored procedure (udf) in database terms that would run Spark scala code underneath. In this way, I can use Spark SQL JDBC Thriftserver to execute it using SQL code passing the key, values I want to UPSERT. I wonder if this is possible since I cannot CREATE a wrapper table on top of a HBase table in Spark SQL?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> What do you think? Is this the right approach?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Ben
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Oct 8, 2016, at 10:33 AM, Felix Cheung <fe...@hotmail.com> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> HBase has released support for Spark
>>>>>>>>>>>>> hbase.apache.org/book.html#spark
>>>>>>>>>>>>> 
>>>>>>>>>>>>> And if you search you should find several alternative approaches.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" <bb...@gmail.com> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Does anyone know if Spark can work with HBase tables using Spark SQL? I know in Hive we are able to create tables on top of an underlying HBase table that can be accessed using MapReduce jobs. Can the same be done using HiveContext or SQLContext? We are trying to setup a way to GET and POST data to and from the HBase table using the Spark SQL JDBC thriftserver from our RESTful API endpoints and/or HTTP web farms. If we can get this to work, then we can load balance the thriftservers. In addition, this will benefit us in giving us a way to abstract the data storage layer away from the presentation layer code. There is a chance that we will swap out the data storage technology in the future. We are currently experimenting with Kudu.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Ben
>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
> 

Re: Spark SQL Thriftserver with HBase

Posted by Michael Segel <ms...@hotmail.com>.
You really don’t want to do OLTP on a distributed NoSQL engine.
Remember Big Data isn’t relational; it’s more of a hierarchical or record model. Think IMS or Pick (Dick Pick’s Revelation, U2, Universe, etc …)


On Oct 17, 2016, at 3:45 PM, Jörn Franke <jo...@gmail.com>> wrote:

It has some implications because it imposes the SQL model on Hbase. Internally it translates the SQL queries into custom Hbase coprocessors. Keep also in mind that Hbase needs a proper key design and how Phoenix designs those keys to get the best performance out of it. I think for OLTP it is a workable model and I think they plan to offer Phoenix as a default interface as part of Hbase anyway.
For OLAP it depends.


On 17 Oct 2016, at 22:34, ayan guha <gu...@gmail.com>> wrote:


Hi

Any reason not to recommend Phoenix? I haven't used it myself so I am curious about the pros and cons of using it.

On 18 Oct 2016 03:17, "Michael Segel" <ms...@hotmail.com>> wrote:
Guys,
Sorry for jumping in late to the game…

If memory serves (which may not be a good thing…) :

You can use HiveServer2 as a connection point to HBase.
While this doesn’t perform well, it’s probably the cleanest solution.
I’m not keen on Phoenix… wouldn’t recommend it….


The issue is that you’re trying to make HBase, a key/value object store, a Relational Engine… it’s not.

There are some considerations which make HBase not ideal for all use cases and you may find better performance with Parquet files.

One thing missing is the use of secondary indexing and query optimizations that you have in RDBMSs and are lacking in HBase / MapRDB / etc …  so your performance will vary.

With respect to Tableau… their entire interface in to the big data world revolves around the JDBC/ODBC interface. So if you don’t have that piece as part of your solution, you’re DOA w respect to Tableau.

Have you considered Drill as your JDBC connection point?  (YAAP: Yet another Apache project)


On Oct 9, 2016, at 12:23 PM, Benjamin Kim <bb...@gmail.com>> wrote:

Thanks for all the suggestions. It would seem you guys are right about the Tableau side of things. The reports don’t need to be real-time, and they won’t be directly feeding off of the main DMP HBase data. Instead, it’ll be batched to Parquet or Kudu/Impala or even PostgreSQL.

I originally thought that we needed two-way data retrieval from the DMP HBase for ID generation, but after further investigation into the use-case and architecture, the ID generation needs to happen local to the Ad Servers where we generate a unique ID and store it in an ID linking table. Even better, many of the 3rd party services supply this ID. So, data only needs to flow in one direction. We will use Kafka as the bus for this. No JDBC required. This also goes for the REST Endpoints. 3rd party services will hit ours to update our data with no need to read from our data. And, when we want to update their data, we will hit theirs using a triggered job.

This all boils down to just integrating with Kafka.

Once again, thanks for all the help.

Cheers,
Ben


On Oct 9, 2016, at 3:16 AM, Jörn Franke <jo...@gmail.com>> wrote:

Please keep also in mind that Tableau Server has the capability to store data in-memory and refresh the in-memory data only when needed. This means you can import it from any source and let your users work only on the in-memory data in Tableau Server.

On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke <jo...@gmail.com>> wrote:
Cloudera 5.8 has a very old version of Hive without Tez, but Mich provided already a good alternative. However, you should check if it contains a recent version of Hbase and Phoenix. That being said, I just wonder what is the dataflow, data model and the analysis you plan to do. Maybe there are completely different solutions possible. Especially these single inserts, upserts etc. should be avoided as much as possible in the Big Data (analysis) world with any technology, because they do not perform well.

Hive with LLAP will provide an in-memory cache for interactive analytics. You can put full tables in-memory with Hive using the Ignite HDFS in-memory solution. All this only makes sense if you do not use MR as an engine, and use the right input format (ORC, Parquet) and a recent Hive version.

On 8 Oct 2016, at 21:55, Benjamin Kim <bb...@gmail.com>> wrote:

Mich,

Unfortunately, we are moving away from Hive and unifying on Spark using CDH 5.8 as our distro. And Tableau released a Spark ODBC/JDBC driver too. I will either try the Phoenix JDBC Server for HBase or push to move faster to Kudu with Impala. We will use Impala as the JDBC in-between until the Kudu team completes Spark SQL support for JDBC.

Thanks for the advice.

Cheers,
Ben


On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh <mi...@gmail.com>> wrote:

Sure. But essentially you are looking at batch data for analytics for your Tableau users, so Hive may be a better choice with its rich SQL and existing ODBC/JDBC connection to Tableau.

I would go for Hive, especially as the new release will have an in-memory offering as well for frequently accessed data :)


Dr Mich Talebzadeh



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.



On 8 October 2016 at 20:15, Benjamin Kim <bb...@gmail.com>> wrote:
Mich,

First and foremost, we have visualization servers that run Tableau for external user reports. Second, we have servers that are ad servers and REST endpoints for cookie sync and segmentation data exchange. These will use JDBC directly within the same data-center. When not colocated in the same data-center, they will connect to a colocated database server using JDBC. Either way, by using JDBC everywhere, it simplifies and unifies the code around the JDBC industry standard.

Does this make sense?

Thanks,
Ben


On Oct 8, 2016, at 11:47 AM, Mich Talebzadeh <mi...@gmail.com>> wrote:

Like any other design, what are your presentation layer and end users?

Are they SQL-centric users from a Tableau background, or might they use Spark functional programming?

It is best to describe the use case.

HTH

Dr Mich Talebzadeh



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.



On 8 October 2016 at 19:40, Felix Cheung <fe...@hotmail.com>> wrote:
I wouldn't be too surprised if Spark SQL - JDBC data source - Phoenix JDBC server - HBase would work better.

Without naming specifics, there are at least 4 or 5 different implementations of HBase sources, each at a varying level of development and with different requirements (HBase release version, Kerberos support, etc.)


_____________________________
From: Benjamin Kim <bb...@gmail.com>>
Sent: Saturday, October 8, 2016 11:26 AM
Subject: Re: Spark SQL Thriftserver with HBase
To: Mich Talebzadeh <mi...@gmail.com>>
Cc: <us...@spark.apache.org>>, Felix Cheung <fe...@hotmail.com>>



Mich,

Are you talking about the Phoenix JDBC Server? If so, I forgot about that alternative.

Thanks,
Ben


On Oct 8, 2016, at 11:21 AM, Mich Talebzadeh <mi...@gmail.com>> wrote:

I don't think it will work.

You can use Phoenix on top of HBase:

hbase(main):336:0> scan 'tsco', 'LIMIT' => 1
ROW                                                       COLUMN+CELL
 TSCO-1-Apr-08                                            column=stock_daily:Date, timestamp=1475866783376, value=1-Apr-08
 TSCO-1-Apr-08                                            column=stock_daily:close, timestamp=1475866783376, value=405.25
 TSCO-1-Apr-08                                            column=stock_daily:high, timestamp=1475866783376, value=406.75
 TSCO-1-Apr-08                                            column=stock_daily:low, timestamp=1475866783376, value=379.25
 TSCO-1-Apr-08                                            column=stock_daily:open, timestamp=1475866783376, value=380.00
 TSCO-1-Apr-08                                            column=stock_daily:stock, timestamp=1475866783376, value=TESCO PLC
 TSCO-1-Apr-08                                            column=stock_daily:ticker, timestamp=1475866783376, value=TSCO
 TSCO-1-Apr-08                                            column=stock_daily:volume, timestamp=1475866783376, value=49664486

And the same via Phoenix on top of the HBase table:

0: jdbc:phoenix:thin:url=http://rhes564:8765<http://rhes564:8765/>> select substr(to_char(to_date("Date",'dd-MMM-yy')),1,10) AS TradeDate, "close" AS "Day's close", "high" AS "Day's High", "low" AS "Day's Low", "open" AS "Day's Open", "ticker", "volume", (to_number("low")+to_number("high"))/2 AS "AverageDailyPrice" from "tsco" where to_number("volume") > 0 and "high" != '-' and to_date("Date",'dd-MMM-yy') > to_date('2015-10-06','yyyy-MM-dd') order by  to_date("Date",'dd-MMM-yy') limit 1;
+-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
|  TRADEDATE  | Day's close  | Day's High  | Day's Low  | Day's Open  | ticker  |  volume   | AverageDailyPrice  |
+-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
| 2015-10-07  | 197.00       | 198.05      | 184.84     | 192.20      | TSCO    | 30046994  | 191.445            |


HTH




Dr Mich Talebzadeh



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destructionof data or any other property which may arise from relying on this email's technical content is explicitly disclaimed.The author will in no case be liable for any monetary damages arising from suchloss, damage or destruction.



On 8 October 2016 at 19:05, Felix Cheung <fe...@hotmail.com>> wrote:
Great, then I think those packages as a Spark data source should allow you to do exactly that (replace org.apache.spark.sql.jdbc with an HBase one)

I do think it will be great to get more examples around this though. Would be great if you could share your experience with this!
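As a rough sketch of what that swap could look like: the format name and option keys below follow the hbase-spark module's data source as described in the HBase reference guide, as best I recall, and the column mapping and table name are assumptions for illustration only.

import org.apache.spark.sql.SparkSession

// Sketch: read an HBase table through the hbase-spark data source and expose it to SQL.
val spark = SparkSession.builder().appName("hbase-datasource-sketch").getOrCreate()

val df = spark.read
  .format("org.apache.hadoop.hbase.spark")
  .option("hbase.table", "tsco")
  .option("hbase.columns.mapping",
    "KEY_FIELD STRING :key, close STRING stock_daily:close, volume STRING stock_daily:volume")
  .load()

df.createOrReplaceTempView("tsco")   // registered for SQL access within this application
spark.sql("SELECT KEY_FIELD, close FROM tsco LIMIT 10").show()

In principle the same format and options could also go into a CREATE TABLE ... USING statement, mirroring the org.apache.spark.sql.jdbc example below, but check the option names against the connector version you actually deploy.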


_____________________________
From: Benjamin Kim <bb...@gmail.com>>
Sent: Saturday, October 8, 2016 11:00 AM
Subject: Re: Spark SQL Thriftserver with HBase
To: Felix Cheung <fe...@hotmail.com>>
Cc: <us...@spark.apache.org>>


Felix,

My goal is to use Spark SQL JDBC Thriftserver to access HBase tables using just SQL. I have been able to CREATE tables using this statement below in the past:

CREATE TABLE <table-name>
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:postgresql://<hostname>:<port>/dm?user=<username>&password=<password>",
  dbtable "dim.dimension_acamp"
);

After doing this, I can access the PostgreSQL table through the Spark SQL JDBC Thriftserver with SQL statements (SELECT, UPDATE, INSERT, etc.). I want to do the same with HBase tables. We tried this using Hive and HiveServer2, but the response times are just too long.
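For completeness, the client side of this is plain JDBC against the HiveServer2 protocol that the Spark Thriftserver speaks. A minimal Scala sketch follows; the host, port, credentials and table name are placeholders:

import java.sql.DriverManager

// Sketch: a REST endpoint (or any JDBC client) querying the Spark SQL Thriftserver.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://thriftserver-host:10000/default", "user", "")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("SELECT * FROM dimension_acamp LIMIT 10")
while (rs.next()) println(rs.getString(1))
rs.close(); stmt.close(); conn.close()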

Thanks,
Ben


On Oct 8, 2016, at 10:53 AM, Felix Cheung <fe...@hotmail.com>> wrote:

Ben,

I'm not sure I'm following completely.

Is your goal to use Spark to create or access tables in HBase? If so, the link below and several packages out there support that by having an HBase data source for Spark. There are some examples of what the Spark code looks like in that link as well. On that note, you should also be able to use the HBase data source from a pure SQL (Spark SQL) query as well, which should work in the case with the Spark SQL JDBC Thrift Server (with USING, http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10).


_____________________________
From: Benjamin Kim <bb...@gmail.com>>
Sent: Saturday, October 8, 2016 10:40 AM
Subject: Re: Spark SQL Thriftserver with HBase
To: Felix Cheung <fe...@hotmail.com>>
Cc: <us...@spark.apache.org>>


Felix,

The only alternative is to create what in database terms would be a stored procedure (a UDF) that runs Spark Scala code underneath. In this way, I can use the Spark SQL JDBC Thriftserver to execute it using SQL code, passing the key and values I want to UPSERT. I wonder if this is possible, since I cannot CREATE a wrapper table on top of an HBase table in Spark SQL?
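A minimal sketch of that idea, assuming a SparkSession named spark is in scope: the table, column family and qualifier are made up, connection handling is deliberately simplified (a real job would reuse connections rather than open one per call), and whether a standalone Thriftserver can see a UDF registered this way depends on how it is started, so treat this as a sketch of the mechanism rather than a drop-in solution.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

// Sketch: register a UDF that performs an HBase put, so a SQL client can trigger an upsert.
spark.udf.register("hbase_upsert", (rowKey: String, value: String) => {
  val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
  val table = conn.getTable(TableName.valueOf("dim_table"))
  val put = new Put(Bytes.toBytes(rowKey))
  put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("value"), Bytes.toBytes(value))
  table.put(put)
  table.close(); conn.close()
  "OK"
})

// A SQL client of this application could then run:  SELECT hbase_upsert('row-123', 'some-value')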

What do you think? Is this the right approach?

Thanks,
Ben

On Oct 8, 2016, at 10:33 AM, Felix Cheung <fe...@hotmail.com>> wrote:

HBase has released support for Spark
hbase.apache.org/book.html#spark<http://hbase.apache.org/book.html#spark>

And if you search you should find several alternative approaches.
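For anyone following the link above, the module's RDD-level API looks roughly like the sketch below; the class and method names follow the HBase reference guide as best I recall, and the table, column family and data are illustrative only.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: bulk-write an RDD of (rowKey, value) pairs into HBase via HBaseContext.
val sc = new SparkContext(new SparkConf().setAppName("hbase-bulkput-sketch"))
val hbaseContext = new HBaseContext(sc, HBaseConfiguration.create())

val rdd = sc.parallelize(Seq(("row1", "v1"), ("row2", "v2")))
hbaseContext.bulkPut[(String, String)](rdd, TableName.valueOf("tsco"), { case (rowKey, value) =>
  val put = new Put(Bytes.toBytes(rowKey))
  put.addColumn(Bytes.toBytes("stock_daily"), Bytes.toBytes("close"), Bytes.toBytes(value))
  put
})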





On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" <bb...@gmail.com>> wrote:

Does anyone know if Spark can work with HBase tables using Spark SQL? I know in Hive we are able to create tables on top of an underlying HBase table that can be accessed using MapReduce jobs. Can the same be done using HiveContext or SQLContext? We are trying to setup a way to GET and POST data to and from the HBase table using the Spark SQL JDBC thriftserver from our RESTful API endpoints and/or HTTP web farms. If we can get this to work, then we can load balance the thriftservers. In addition, this will benefit us in giving us a way to abstract the data storage layer away from the presentation layer code. There is a chance that we will swap out the data storage technology in the future. We are currently experimenting with Kudu.

Thanks,
Ben
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org<ma...@spark.apache.org>



Re: Spark SQL Thriftserver with HBase

Posted by Jörn Franke <jo...@gmail.com>.
It has some implications because it imposes the SQL model on HBase. Internally it translates the SQL queries into custom HBase coprocessors. Keep in mind also that HBase needs a proper key design, and consider how Phoenix designs those keys, to get the best performance out of it. I think for OLTP it is a workable model, and I think they plan to offer Phoenix as a default interface as part of HBase anyway.
For OLAP it depends.
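To make the key-design point concrete, here is a rough Scala/JDBC sketch against Phoenix; the driver class and connect string follow Phoenix's standard JDBC setup as best I recall, and the schema itself is invented for illustration.

import java.sql.DriverManager

// Sketch: the composite PRIMARY KEY below becomes the HBase row key (ticker first, then date),
// which is exactly the kind of key design Phoenix handles for you.
Class.forName("org.apache.phoenix.jdbc.PhoenixDriver")
val conn = DriverManager.getConnection("jdbc:phoenix:zk1,zk2,zk3:2181")
val stmt = conn.createStatement()

stmt.execute(
  """CREATE TABLE IF NOT EXISTS stock_daily (
    |  ticker     VARCHAR NOT NULL,
    |  trade_date DATE    NOT NULL,
    |  close      DECIMAL,
    |  volume     BIGINT
    |  CONSTRAINT pk PRIMARY KEY (ticker, trade_date))""".stripMargin)

stmt.execute("UPSERT INTO stock_daily VALUES ('TSCO', TO_DATE('2015-10-07', 'yyyy-MM-dd'), 197.00, 30046994)")
conn.commit()   // Phoenix connections do not autocommit by default
stmt.close(); conn.close()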


> On 17 Oct 2016, at 22:34, ayan guha <gu...@gmail.com> wrote:
> 
> Hi
> 
> Any reason not to recommend Phoneix? I haven't used it myself so curious about pro's and cons about the use of it.
> 
>> On 18 Oct 2016 03:17, "Michael Segel" <ms...@hotmail.com> wrote:
>> Guys, 
>> Sorry for jumping in late to the game… 
>> 
>> If memory serves (which may not be a good thing…) :
>> 
>> You can use HiveServer2 as a connection point to HBase.  
>> While this doesn’t perform well, its probably the cleanest solution. 
>> I’m not keen on Phoenix… wouldn’t recommend it…. 
>> 
>> 
>> The issue is that you’re trying to make HBase, a key/value object store, a Relational Engine… its not. 
>> 
>> There are some considerations which make HBase not ideal for all use cases and you may find better performance with Parquet files. 
>> 
>> One thing missing is the use of secondary indexing and query optimizations that you have in RDBMSs and are lacking in HBase / MapRDB / etc …  so your performance will vary. 
>> 
>> With respect to Tableau… their entire interface in to the big data world revolves around the JDBC/ODBC interface. So if you don’t have that piece as part of your solution, you’re DOA w respect to Tableau. 
>> 
>> Have you considered Drill as your JDBC connection point?  (YAAP: Yet another Apache project) 
>> 
>> 
>>> On Oct 9, 2016, at 12:23 PM, Benjamin Kim <bb...@gmail.com> wrote:
>>> 
>>> Thanks for all the suggestions. It would seem you guys are right about the Tableau side of things. The reports don’t need to be real-time, and they won’t be directly feeding off of the main DMP HBase data. Instead, it’ll be batched to Parquet or Kudu/Impala or even PostgreSQL.
>>> 
>>> I originally thought that we needed two-way data retrieval from the DMP HBase for ID generation, but after further investigation into the use-case and architecture, the ID generation needs to happen local to the Ad Servers where we generate a unique ID and store it in a ID linking table. Even better, many of the 3rd party services supply this ID. So, data only needs to flow in one direction. We will use Kafka as the bus for this. No JDBC required. This is also goes for the REST Endpoints. 3rd party services will hit ours to update our data with no need to read from our data. And, when we want to update their data, we will hit theirs to update their data using a triggered job.
>>> 
>>> This al boils down to just integrating with Kafka.
>>> 
>>> Once again, thanks for all the help.
>>> 
>>> Cheers,
>>> Ben
>>> 
>>> 
>>>> On Oct 9, 2016, at 3:16 AM, Jörn Franke <jo...@gmail.com> wrote:
>>>> 
>>>> please keep also in mind that Tableau Server has the capabilities to store data in-memory and refresh only when needed the in-memory data. This means you can import it from any source and let your users work only on the in-memory data in Tableau Server.
>>>> 
>>>>> On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke <jo...@gmail.com> wrote:
>>>>> Cloudera 5.8 has a very old version of Hive without Tez, but Mich provided already a good alternative. However, you should check if it contains a recent version of Hbase and Phoenix. That being said, I just wonder what is the dataflow, data model and the analysis you plan to do. Maybe there are completely different solutions possible. Especially these single inserts, upserts etc. should be avoided as much as possible in the Big Data (analysis) world with any technology, because they do not perform well. 
>>>>> 
>>>>> Hive with Llap will provide an in-memory cache for interactive analytics. You can put full tables in-memory with Hive using Ignite HDFS in-memory solution. All this does only make sense if you do not use MR as an engine, the right input format (ORC, parquet) and a recent Hive version.
>>>>> 
>>>>> On 8 Oct 2016, at 21:55, Benjamin Kim <bb...@gmail.com> wrote:
>>>>> 
>>>>>> Mich,
>>>>>> 
>>>>>> Unfortunately, we are moving away from Hive and unifying on Spark using CDH 5.8 as our distro. And, the Tableau released a Spark ODBC/JDBC driver too. I will either try Phoenix JDBC Server for HBase or push to move faster to Kudu with Impala. We will use Impala as the JDBC in-between until the Kudu team completes Spark SQL support for JDBC.
>>>>>> 
>>>>>> Thanks for the advice.
>>>>>> 
>>>>>> Cheers,
>>>>>> Ben
>>>>>> 
>>>>>> 
>>>>>>> On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh <mi...@gmail.com> wrote:
>>>>>>> 
>>>>>>> Sure. But essentially you are looking at batch data for analytics for your tableau users so Hive may be a better choice with its rich SQL and ODBC.JDBC connection to Tableau already.
>>>>>>> 
>>>>>>> I would go for Hive especially the new release will have an in-memory offering as well for frequently accessed data :)
>>>>>>> 
>>>>>>> 
>>>>>>> Dr Mich Talebzadeh
>>>>>>>  
>>>>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>  
>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>> 
>>>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>>>>>>  
>>>>>>> 
>>>>>>>> On 8 October 2016 at 20:15, Benjamin Kim <bb...@gmail.com> wrote:
>>>>>>>> Mich,
>>>>>>>> 
>>>>>>>> First and foremost, we have visualization servers that run Tableau for external user reports. Second, we have servers that are ad servers and REST endpoints for cookie sync and segmentation data exchange. These will use JDBC directly within the same data-center. When not colocated in the same data-center, they will connected to a located database server using JDBC. Either way, by using JDBC everywhere, it simplifies and unifies the code on the JDBC industry standard.
>>>>>>>> 
>>>>>>>> Does this make sense?
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Ben
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Oct 8, 2016, at 11:47 AM, Mich Talebzadeh <mi...@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> Like any other design what is your presentation layer and end users?
>>>>>>>>> 
>>>>>>>>> Are they SQL centric users from Tableau background or they may use spark functional programming.
>>>>>>>>> 
>>>>>>>>> It is best to describe the use case.
>>>>>>>>> 
>>>>>>>>> HTH
>>>>>>>>> 
>>>>>>>>> Dr Mich Talebzadeh
>>>>>>>>>  
>>>>>>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>>>  
>>>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>>> 
>>>>>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>>>>>>>>  
>>>>>>>>> 
>>>>>>>>>> On 8 October 2016 at 19:40, Felix Cheung <fe...@hotmail.com> wrote:
>>>>>>>>>> I wouldn't be too surprised Spark SQL - JDBC data source - Phoenix JDBC server - HBASE would work better.
>>>>>>>>>> 
>>>>>>>>>> Without naming specifics, there are at least 4 or 5 different implementations of HBASE sources, each at varying level of development and different requirements (HBASE release version, Kerberos support etc)
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> _____________________________
>>>>>>>>>> From: Benjamin Kim <bb...@gmail.com>
>>>>>>>>>> Sent: Saturday, October 8, 2016 11:26 AM
>>>>>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>>>>>> To: Mich Talebzadeh <mi...@gmail.com>
>>>>>>>>>> Cc: <us...@spark.apache.org>, Felix Cheung <fe...@hotmail.com>
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Mich,
>>>>>>>>>> 
>>>>>>>>>> Are you talking about the Phoenix JDBC Server? If so, I forgot about that alternative.
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> Ben
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Oct 8, 2016, at 11:21 AM, Mich Talebzadeh <mi...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> I don't think it will work
>>>>>>>>>> 
>>>>>>>>>> you can use phoenix on top of hbase
>>>>>>>>>> 
>>>>>>>>>> hbase(main):336:0> scan 'tsco', 'LIMIT' => 1
>>>>>>>>>> ROW                                                       COLUMN+CELL
>>>>>>>>>>  TSCO-1-Apr-08                                            column=stock_daily:Date, timestamp=1475866783376, value=1-Apr-08
>>>>>>>>>>  TSCO-1-Apr-08                                            column=stock_daily:close, timestamp=1475866783376, value=405.25
>>>>>>>>>>  TSCO-1-Apr-08                                            column=stock_daily:high, timestamp=1475866783376, value=406.75
>>>>>>>>>>  TSCO-1-Apr-08                                            column=stock_daily:low, timestamp=1475866783376, value=379.25
>>>>>>>>>>  TSCO-1-Apr-08                                            column=stock_daily:open, timestamp=1475866783376, value=380.00
>>>>>>>>>>  TSCO-1-Apr-08                                            column=stock_daily:stock, timestamp=1475866783376, value=TESCO PLC
>>>>>>>>>>  TSCO-1-Apr-08                                            column=stock_daily:ticker, timestamp=1475866783376, value=TSCO
>>>>>>>>>>  TSCO-1-Apr-08                                            column=stock_daily:volume, timestamp=1475866783376, value=49664486
>>>>>>>>>> 
>>>>>>>>>> And the same on Phoenix on top of Hvbase table
>>>>>>>>>> 
>>>>>>>>>> 0: jdbc:phoenix:thin:url=http://rhes564:8765> select substr(to_char(to_date("Date",'dd-MMM-yy')),1,10) AS TradeDate, "close" AS "Day's close", "high" AS "Day's High", "low" AS "Day's Low", "open" AS "Day's Open", "ticker", "volume", (to_number("low")+to_number("high"))/2 AS "AverageDailyPrice" from "tsco" where to_number("volume") > 0 and "high" != '-' and to_date("Date",'dd-MMM-yy') > to_date('2015-10-06','yyyy-MM-dd') order by  to_date("Date",'dd-MMM-yy') limit 1;
>>>>>>>>>> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
>>>>>>>>>> |  TRADEDATE  | Day's close  | Day's High  | Day's Low  | Day's Open  | ticker  |  volume   | AverageDailyPrice  |
>>>>>>>>>> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
>>>>>>>>>> | 2015-10-07  | 197.00       | 198.05      | 184.84     | 192.20      | TSCO    | 30046994  | 191.445            |
>>>>>>>>>> 
>>>>>>>>>> HTH
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Dr Mich Talebzadeh
>>>>>>>>>>  
>>>>>>>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>>>>  
>>>>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>>>> 
>>>>>>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destructionof data or any other property which may arise from relying on this email's technical content is explicitly disclaimed.The author will in no case be liable for any monetary damages arising from suchloss, damage or destruction.
>>>>>>>>>>  
>>>>>>>>>> 
>>>>>>>>>>> On 8 October 2016 at 19:05, Felix Cheung <fe...@hotmail.com> wrote:
>>>>>>>>>>> Great, then I think those packages as Spark data source should allow you to do exactly that (replace org.apache.spark.sql.jdbc with HBASE one)
>>>>>>>>>>> 
>>>>>>>>>>> I do think it will be great to get more examples around this though. Would be great if you could share your experience with this!
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> _____________________________
>>>>>>>>>>> From: Benjamin Kim <bb...@gmail.com>
>>>>>>>>>>> Sent: Saturday, October 8, 2016 11:00 AM
>>>>>>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>>>>>>> To: Felix Cheung <fe...@hotmail.com>
>>>>>>>>>>> Cc: <us...@spark.apache.org>
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Felix,
>>>>>>>>>>> 
>>>>>>>>>>> My goal is to use Spark SQL JDBC Thriftserver to access HBase tables using just SQL. I have been able to CREATE tables using this statement below in the past:
>>>>>>>>>>> 
>>>>>>>>>>> CREATE TABLE <table-name>
>>>>>>>>>>> USING org.apache.spark.sql.jdbc
>>>>>>>>>>> OPTIONS (
>>>>>>>>>>>   url "jdbc:postgresql://<hostname>:<port>/dm?user=<username>&password=<password>",
>>>>>>>>>>>   dbtable "dim.dimension_acamp"
>>>>>>>>>>> );
>>>>>>>>>>> 
>>>>>>>>>>> After doing this, I can access the PostgreSQL table using Spark SQL JDBC Thriftserver using SQL statements (SELECT, UPDATE, INSERT, etc.). I want to do the same with HBase tables. We tried this using Hive and HiveServer2, but the response times are just too long.
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Ben
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Oct 8, 2016, at 10:53 AM, Felix Cheung <fe...@hotmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Ben,
>>>>>>>>>>> 
>>>>>>>>>>> I'm not sure I'm following completely.
>>>>>>>>>>> 
>>>>>>>>>>> Is your goal to use Spark to create or access tables in HBASE? If so the link below and several packages out there support that by having a HBASE data source for Spark. There are some examples on how the Spark code look like in that link as well. On that note, you should also be able to use the HBASE data source from pure SQL (Spark SQL) query as well, which should work in the case with the Spark SQL JDBC Thrift Server (with USING,http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10).
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> _____________________________
>>>>>>>>>>> From: Benjamin Kim <bb...@gmail.com>
>>>>>>>>>>> Sent: Saturday, October 8, 2016 10:40 AM
>>>>>>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>>>>>>> To: Felix Cheung <fe...@hotmail.com>
>>>>>>>>>>> Cc: <us...@spark.apache.org>
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Felix,
>>>>>>>>>>> 
>>>>>>>>>>> The only alternative way is to create a stored procedure (udf) in database terms that would run Spark scala code underneath. In this way, I can use Spark SQL JDBC Thriftserver to execute it using SQL code passing the key, values I want to UPSERT. I wonder if this is possible since I cannot CREATE a wrapper table on top of a HBase table in Spark SQL?
>>>>>>>>>>> 
>>>>>>>>>>> What do you think? Is this the right approach?
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Ben
>>>>>>>>>>> 
>>>>>>>>>>> On Oct 8, 2016, at 10:33 AM, Felix Cheung <fe...@hotmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> HBase has released support for Spark
>>>>>>>>>>> hbase.apache.org/book.html#spark
>>>>>>>>>>> 
>>>>>>>>>>> And if you search you should find several alternative approaches.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" <bb...@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Does anyone know if Spark can work with HBase tables using Spark SQL? I know in Hive we are able to create tables on top of an underlying HBase table that can be accessed using MapReduce jobs. Can the same be done using HiveContext or SQLContext? We are trying to setup a way to GET and POST data to and from the HBase table using the Spark SQL JDBC thriftserver from our RESTful API endpoints and/or HTTP web farms. If we can get this to work, then we can load balance the thriftservers. In addition, this will benefit us in giving us a way to abstract the data storage layer away from the presentation layer code. There is a chance that we will swap out the data storage technology in the future. We are currently experimenting with Kudu.
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Ben
>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>> 
>>> 
>> 

Re: Spark SQL Thriftserver with HBase

Posted by ayan guha <gu...@gmail.com>.
Hi

Any reason not to recommend Phoenix? I haven't used it myself, so I am curious
about the pros and cons of using it.
On 18 Oct 2016 03:17, "Michael Segel" <ms...@hotmail.com> wrote:

> Guys,
> Sorry for jumping in late to the game…
>
> If memory serves (which may not be a good thing…) :
>
> You can use HiveServer2 as a connection point to HBase.
> While this doesn’t perform well, its probably the cleanest solution.
> I’m not keen on Phoenix… wouldn’t recommend it….
>
>
> The issue is that you’re trying to make HBase, a key/value object store, a
> Relational Engine… its not.
>
> There are some considerations which make HBase not ideal for all use cases
> and you may find better performance with Parquet files.
>
> One thing missing is the use of secondary indexing and query optimizations
> that you have in RDBMSs and are lacking in HBase / MapRDB / etc …  so your
> performance will vary.
>
> With respect to Tableau… their entire interface in to the big data world
> revolves around the JDBC/ODBC interface. So if you don’t have that piece as
> part of your solution, you’re DOA w respect to Tableau.
>
> Have you considered Drill as your JDBC connection point?  (YAAP: Yet
> another Apache project)
>
>
> On Oct 9, 2016, at 12:23 PM, Benjamin Kim <bb...@gmail.com> wrote:
>
> Thanks for all the suggestions. It would seem you guys are right about the
> Tableau side of things. The reports don’t need to be real-time, and they
> won’t be directly feeding off of the main DMP HBase data. Instead, it’ll be
> batched to Parquet or Kudu/Impala or even PostgreSQL.
>
> I originally thought that we needed two-way data retrieval from the DMP
> HBase for ID generation, but after further investigation into the use-case
> and architecture, the ID generation needs to happen local to the Ad Servers
> where we generate a unique ID and store it in a ID linking table. Even
> better, many of the 3rd party services supply this ID. So, data only needs
> to flow in one direction. We will use Kafka as the bus for this. No JDBC
> required. This is also goes for the REST Endpoints. 3rd party services will
> hit ours to update our data with no need to read from our data. And, when
> we want to update their data, we will hit theirs to update their data using
> a triggered job.
>
> This al boils down to just integrating with Kafka.
>
> Once again, thanks for all the help.
>
> Cheers,
> Ben
>
>
> On Oct 9, 2016, at 3:16 AM, Jörn Franke <jo...@gmail.com> wrote:
>
> please keep also in mind that Tableau Server has the capabilities to store
> data in-memory and refresh only when needed the in-memory data. This means
> you can import it from any source and let your users work only on the
> in-memory data in Tableau Server.
>
> On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke <jo...@gmail.com> wrote:
>
>> Cloudera 5.8 has a very old version of Hive without Tez, but Mich
>> provided already a good alternative. However, you should check if it
>> contains a recent version of Hbase and Phoenix. That being said, I just
>> wonder what is the dataflow, data model and the analysis you plan to do.
>> Maybe there are completely different solutions possible. Especially these
>> single inserts, upserts etc. should be avoided as much as possible in the
>> Big Data (analysis) world with any technology, because they do not perform
>> well.
>>
>> Hive with Llap will provide an in-memory cache for interactive analytics.
>> You can put full tables in-memory with Hive using Ignite HDFS in-memory
>> solution. All this does only make sense if you do not use MR as an engine,
>> the right input format (ORC, parquet) and a recent Hive version.
>>
>> On 8 Oct 2016, at 21:55, Benjamin Kim <bb...@gmail.com> wrote:
>>
>> Mich,
>>
>> Unfortunately, we are moving away from Hive and unifying on Spark using
>> CDH 5.8 as our distro. And, the Tableau released a Spark ODBC/JDBC driver
>> too. I will either try Phoenix JDBC Server for HBase or push to move faster
>> to Kudu with Impala. We will use Impala as the JDBC in-between until the
>> Kudu team completes Spark SQL support for JDBC.
>>
>> Thanks for the advice.
>>
>> Cheers,
>> Ben
>>
>>
>> On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh <mi...@gmail.com>
>> wrote:
>>
>> Sure. But essentially you are looking at batch data for analytics for
>> your tableau users so Hive may be a better choice with its rich SQL and
>> ODBC.JDBC connection to Tableau already.
>>
>> I would go for Hive especially the new release will have an in-memory
>> offering as well for frequently accessed data :)
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 8 October 2016 at 20:15, Benjamin Kim <bb...@gmail.com> wrote:
>>
>>> Mich,
>>>
>>> First and foremost, we have visualization servers that run Tableau for
>>> external user reports. Second, we have servers that are ad servers and REST
>>> endpoints for cookie sync and segmentation data exchange. These will use
>>> JDBC directly within the same data-center. When not colocated in the same
>>> data-center, they will connected to a located database server using JDBC.
>>> Either way, by using JDBC everywhere, it simplifies and unifies the code on
>>> the JDBC industry standard.
>>>
>>> Does this make sense?
>>>
>>> Thanks,
>>> Ben
>>>
>>>
>>> On Oct 8, 2016, at 11:47 AM, Mich Talebzadeh <mi...@gmail.com>
>>> wrote:
>>>
>>> Like any other design what is your presentation layer and end users?
>>>
>>> Are they SQL centric users from Tableau background or they may use spark
>>> functional programming.
>>>
>>> It is best to describe the use case.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 8 October 2016 at 19:40, Felix Cheung <fe...@hotmail.com>
>>> wrote:
>>>
>>>> I wouldn't be too surprised Spark SQL - JDBC data source - Phoenix JDBC
>>>> server - HBASE would work better.
>>>>
>>>> Without naming specifics, there are at least 4 or 5 different
>>>> implementations of HBASE sources, each at varying level of development and
>>>> different requirements (HBASE release version, Kerberos support etc)
>>>>
>>>>
>>>> _____________________________
>>>> From: Benjamin Kim <bb...@gmail.com>
>>>> Sent: Saturday, October 8, 2016 11:26 AM
>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>> To: Mich Talebzadeh <mi...@gmail.com>
>>>> Cc: <us...@spark.apache.org>, Felix Cheung <fe...@hotmail.com>
>>>>
>>>>
>>>>
>>>> Mich,
>>>>
>>>> Are you talking about the Phoenix JDBC Server? If so, I forgot about
>>>> that alternative.
>>>>
>>>> Thanks,
>>>> Ben
>>>>
>>>>
>>>> On Oct 8, 2016, at 11:21 AM, Mich Talebzadeh <mi...@gmail.com>
>>>> wrote:
>>>>
>>>> I don't think it will work
>>>>
>>>> you can use phoenix on top of hbase
>>>>
>>>> hbase(main):336:0> scan 'tsco', 'LIMIT' => 1
>>>> ROW                                                       COLUMN+CELL
>>>>  TSCO-1-Apr-08
>>>> column=stock_daily:Date, timestamp=1475866783376, value=1-Apr-08
>>>>  TSCO-1-Apr-08
>>>> column=stock_daily:close, timestamp=1475866783376, value=405.25
>>>>  TSCO-1-Apr-08
>>>> column=stock_daily:high, timestamp=1475866783376, value=406.75
>>>>  TSCO-1-Apr-08
>>>> column=stock_daily:low, timestamp=1475866783376, value=379.25
>>>>  TSCO-1-Apr-08
>>>> column=stock_daily:open, timestamp=1475866783376, value=380.00
>>>>  TSCO-1-Apr-08
>>>> column=stock_daily:stock, timestamp=1475866783376, value=TESCO PLC
>>>>  TSCO-1-Apr-08
>>>> column=stock_daily:ticker, timestamp=1475866783376, value=TSCO
>>>>  TSCO-1-Apr-08
>>>> column=stock_daily:volume, timestamp=1475866783376, value=49664486
>>>>
>>>> And the same on Phoenix on top of Hvbase table
>>>>
>>>> 0: jdbc:phoenix:thin:url=http://rhes564:8765> select
>>>> substr(to_char(to_date("Date",'dd-MMM-yy')),1,10) AS TradeDate,
>>>> "close" AS "Day's close", "high" AS "Day's High", "low" AS "Day's Low",
>>>> "open" AS "Day's Open", "ticker", "volume", (to_number("low")+to_number("high"))/2
>>>> AS "AverageDailyPrice" from "tsco" where to_number("volume") > 0 and "high"
>>>> != '-' and to_date("Date",'dd-MMM-yy') > to_date('2015-10-06','yyyy-MM-dd')
>>>> order by  to_date("Date",'dd-MMM-yy') limit 1;
>>>> +-------------+--------------+-------------+------------+---
>>>> ----------+---------+-----------+--------------------+
>>>> |  TRADEDATE  | Day's close  | Day's High  | Day's Low  | Day's Open  |
>>>> ticker  |  volume   | AverageDailyPrice  |
>>>> +-------------+--------------+-------------+------------+---
>>>> ----------+---------+-----------+--------------------+
>>>> | 2015-10-07  | 197.00       | 198.05      | 184.84     | 192.20      |
>>>> TSCO    | 30046994  | 191.445            |
>>>>
>>>> HTH
>>>>
>>>>
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destructionof data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed.The author will in no case be liable for any monetary damages
>>>> arising from suchloss, damage or destruction.
>>>>
>>>>
>>>>
>>>> On 8 October 2016 at 19:05, Felix Cheung <fe...@hotmail.com>
>>>> wrote:
>>>>
>>>>> Great, then I think those packages as Spark data source should allow
>>>>> you to do exactly that (replace org.apache.spark.sql.jdbc with HBASE one)
>>>>>
>>>>> I do think it will be great to get more examples around this though.
>>>>> Would be great if you could share your experience with this!
>>>>>
>>>>>
>>>>> _____________________________
>>>>> From: Benjamin Kim <bb...@gmail.com>
>>>>> Sent: Saturday, October 8, 2016 11:00 AM
>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>> To: Felix Cheung <fe...@hotmail.com>
>>>>> Cc: <us...@spark.apache.org>
>>>>>
>>>>>
>>>>> Felix,
>>>>>
>>>>> My goal is to use Spark SQL JDBC Thriftserver to access HBase tables
>>>>> using just SQL. I have been able to CREATE tables using this statement
>>>>> below in the past:
>>>>>
>>>>> CREATE TABLE <table-name>
>>>>> USING org.apache.spark.sql.jdbc
>>>>> OPTIONS (
>>>>>   url "jdbc:postgresql://<hostname>:<port>/dm?user=<username>&pass
>>>>> word=<password>",
>>>>>   dbtable "dim.dimension_acamp"
>>>>> );
>>>>>
>>>>>
>>>>> After doing this, I can access the PostgreSQL table using Spark SQL
>>>>> JDBC Thriftserver using SQL statements (SELECT, UPDATE, INSERT, etc.). I
>>>>> want to do the same with HBase tables. We tried this using Hive and
>>>>> HiveServer2, but the response times are just too long.
>>>>>
>>>>> Thanks,
>>>>> Ben
>>>>>
>>>>>
>>>>> On Oct 8, 2016, at 10:53 AM, Felix Cheung <fe...@hotmail.com>
>>>>> wrote:
>>>>>
>>>>> Ben,
>>>>>
>>>>> I'm not sure I'm following completely.
>>>>>
>>>>> Is your goal to use Spark to create or access tables in HBASE? If so
>>>>> the link below and several packages out there support that by having a
>>>>> HBASE data source for Spark. There are some examples on how the Spark code
>>>>> look like in that link as well. On that note, you should also be able to
>>>>> use the HBASE data source from pure SQL (Spark SQL) query as well, which
>>>>> should work in the case with the Spark SQL JDBC Thrift Server (with USING,
>>>>> http://spark.apache.org/docs/latest/sql-programming-gu
>>>>> ide.html#tab_sql_10).
>>>>>
>>>>>
>>>>> _____________________________
>>>>> From: Benjamin Kim <bb...@gmail.com>
>>>>> Sent: Saturday, October 8, 2016 10:40 AM
>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>> To: Felix Cheung <fe...@hotmail.com>
>>>>> Cc: <us...@spark.apache.org>
>>>>>
>>>>>
>>>>> Felix,
>>>>>
>>>>> The only alternative way is to create a stored procedure (udf) in
>>>>> database terms that would run Spark scala code underneath. In this way, I
>>>>> can use Spark SQL JDBC Thriftserver to execute it using SQL code passing
>>>>> the key, values I want to UPSERT. I wonder if this is possible since I
>>>>> cannot CREATE a wrapper table on top of a HBase table in Spark SQL?
>>>>>
>>>>> What do you think? Is this the right approach?
>>>>>
>>>>> Thanks,
>>>>> Ben
>>>>>
>>>>> On Oct 8, 2016, at 10:33 AM, Felix Cheung <fe...@hotmail.com>
>>>>> wrote:
>>>>>
>>>>> HBase has released support for Spark
>>>>> hbase.apache.org/book.html#spark
>>>>>
>>>>> And if you search you should find several alternative approaches.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" <
>>>>> bbuild11@gmail.com> wrote:
>>>>>
>>>>> Does anyone know if Spark can work with HBase tables using Spark SQL?
>>>>> I know in Hive we are able to create tables on top of an underlying HBase
>>>>> table that can be accessed using MapReduce jobs. Can the same be done using
>>>>> HiveContext or SQLContext? We are trying to setup a way to GET and POST
>>>>> data to and from the HBase table using the Spark SQL JDBC thriftserver from
>>>>> our RESTful API endpoints and/or HTTP web farms. If we can get this to
>>>>> work, then we can load balance the thriftservers. In addition, this will
>>>>> benefit us in giving us a way to abstract the data storage layer away from
>>>>> the presentation layer code. There is a chance that we will swap out the
>>>>> data storage technology in the future. We are currently experimenting with
>>>>> Kudu.
>>>>>
>>>>> Thanks,
>>>>> Ben
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>
>

Re: Spark SQL Thriftserver with HBase

Posted by Michael Segel <ms...@hotmail.com>.
Guys,
Sorry for jumping in late to the game…

If memory serves (which may not be a good thing…) :

You can use HiveServer2 as a connection point to HBase.
While this doesn’t perform well, its probably the cleanest solution.
I’m not keen on Phoenix… wouldn’t recommend it….


The issue is that you’re trying to make HBase, a key/value object store, a Relational Engine… its not.

There are some considerations which make HBase not ideal for all use cases and you may find better performance with Parquet files.

One thing missing is the use of secondary indexing and query optimizations that you have in RDBMSs and are lacking in HBase / MapRDB / etc …  so your performance will vary.

With respect to Tableau… their entire interface in to the big data world revolves around the JDBC/ODBC interface. So if you don’t have that piece as part of your solution, you’re DOA w respect to Tableau.

Have you considered Drill as your JDBC connection point?  (YAAP: Yet another Apache project)


On Oct 9, 2016, at 12:23 PM, Benjamin Kim <bb...@gmail.com>> wrote:

Thanks for all the suggestions. It would seem you guys are right about the Tableau side of things. The reports don’t need to be real-time, and they won’t be directly feeding off of the main DMP HBase data. Instead, it’ll be batched to Parquet or Kudu/Impala or even PostgreSQL.

I originally thought that we needed two-way data retrieval from the DMP HBase for ID generation, but after further investigation into the use-case and architecture, the ID generation needs to happen local to the Ad Servers where we generate a unique ID and store it in a ID linking table. Even better, many of the 3rd party services supply this ID. So, data only needs to flow in one direction. We will use Kafka as the bus for this. No JDBC required. This is also goes for the REST Endpoints. 3rd party services will hit ours to update our data with no need to read from our data. And, when we want to update their data, we will hit theirs to update their data using a triggered job.

This al boils down to just integrating with Kafka.

Once again, thanks for all the help.

Cheers,
Ben


On Oct 9, 2016, at 3:16 AM, Jörn Franke <jo...@gmail.com>> wrote:

please keep also in mind that Tableau Server has the capabilities to store data in-memory and refresh only when needed the in-memory data. This means you can import it from any source and let your users work only on the in-memory data in Tableau Server.

On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke <jo...@gmail.com>> wrote:
Cloudera 5.8 has a very old version of Hive without Tez, but Mich provided already a good alternative. However, you should check if it contains a recent version of Hbase and Phoenix. That being said, I just wonder what is the dataflow, data model and the analysis you plan to do. Maybe there are completely different solutions possible. Especially these single inserts, upserts etc. should be avoided as much as possible in the Big Data (analysis) world with any technology, because they do not perform well.

Hive with Llap will provide an in-memory cache for interactive analytics. You can put full tables in-memory with Hive using Ignite HDFS in-memory solution. All this does only make sense if you do not use MR as an engine, the right input format (ORC, parquet) and a recent Hive version.

On 8 Oct 2016, at 21:55, Benjamin Kim <bb...@gmail.com>> wrote:

Mich,

Unfortunately, we are moving away from Hive and unifying on Spark using CDH 5.8 as our distro. And, the Tableau released a Spark ODBC/JDBC driver too. I will either try Phoenix JDBC Server for HBase or push to move faster to Kudu with Impala. We will use Impala as the JDBC in-between until the Kudu team completes Spark SQL support for JDBC.

Thanks for the advice.

Cheers,
Ben


On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh <mi...@gmail.com>> wrote:

Sure. But essentially you are looking at batch data for analytics for your tableau users so Hive may be a better choice with its rich SQL and ODBC.JDBC connection to Tableau already.

I would go for Hive especially the new release will have an in-memory offering as well for frequently accessed data :)


Dr Mich Talebzadeh



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.



On 8 October 2016 at 20:15, Benjamin Kim <bb...@gmail.com>> wrote:
Mich,

First and foremost, we have visualization servers that run Tableau for external user reports. Second, we have servers that are ad servers and REST endpoints for cookie sync and segmentation data exchange. These will use JDBC directly within the same data-center. When not colocated in the same data-center, they will connected to a located database server using JDBC. Either way, by using JDBC everywhere, it simplifies and unifies the code on the JDBC industry standard.

Does this make sense?

Thanks,
Ben


On Oct 8, 2016, at 11:47 AM, Mich Talebzadeh <mi...@gmail.com>> wrote:

Like any other design what is your presentation layer and end users?

Are they SQL centric users from Tableau background or they may use spark functional programming.

It is best to describe the use case.

HTH

Dr Mich Talebzadeh



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.



On 8 October 2016 at 19:40, Felix Cheung <fe...@hotmail.com>> wrote:
I wouldn't be too surprised Spark SQL - JDBC data source - Phoenix JDBC server - HBASE would work better.

Without naming specifics, there are at least 4 or 5 different implementations of HBASE sources, each at varying level of development and different requirements (HBASE release version, Kerberos support etc)


_____________________________
From: Benjamin Kim <bb...@gmail.com>>
Sent: Saturday, October 8, 2016 11:26 AM
Subject: Re: Spark SQL Thriftserver with HBase
To: Mich Talebzadeh <mi...@gmail.com>>
Cc: <us...@spark.apache.org>>, Felix Cheung <fe...@hotmail.com>>



Mich,

Are you talking about the Phoenix JDBC Server? If so, I forgot about that alternative.

Thanks,
Ben


On Oct 8, 2016, at 11:21 AM, Mich Talebzadeh <mi...@gmail.com>> wrote:

I don't think it will work

you can use phoenix on top of hbase

hbase(main):336:0> scan 'tsco', 'LIMIT' => 1
ROW                                                       COLUMN+CELL
 TSCO-1-Apr-08                                            column=stock_daily:Date, timestamp=1475866783376, value=1-Apr-08
 TSCO-1-Apr-08                                            column=stock_daily:close, timestamp=1475866783376, value=405.25
 TSCO-1-Apr-08                                            column=stock_daily:high, timestamp=1475866783376, value=406.75
 TSCO-1-Apr-08                                            column=stock_daily:low, timestamp=1475866783376, value=379.25
 TSCO-1-Apr-08                                            column=stock_daily:open, timestamp=1475866783376, value=380.00
 TSCO-1-Apr-08                                            column=stock_daily:stock, timestamp=1475866783376, value=TESCO PLC
 TSCO-1-Apr-08                                            column=stock_daily:ticker, timestamp=1475866783376, value=TSCO
 TSCO-1-Apr-08                                            column=stock_daily:volume, timestamp=1475866783376, value=49664486

And the same on Phoenix on top of Hvbase table

0: jdbc:phoenix:thin:url=http://rhes564:8765<http://rhes564:8765/>> select substr(to_char(to_date("Date",'dd-MMM-yy')),1,10) AS TradeDate, "close" AS "Day's close", "high" AS "Day's High", "low" AS "Day's Low", "open" AS "Day's Open", "ticker", "volume", (to_number("low")+to_number("high"))/2 AS "AverageDailyPrice" from "tsco" where to_number("volume") > 0 and "high" != '-' and to_date("Date",'dd-MMM-yy') > to_date('2015-10-06','yyyy-MM-dd') order by  to_date("Date",'dd-MMM-yy') limit 1;
+-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
|  TRADEDATE  | Day's close  | Day's High  | Day's Low  | Day's Open  | ticker  |  volume   | AverageDailyPrice  |
+-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
| 2015-10-07  | 197.00       | 198.05      | 184.84     | 192.20      | TSCO    | 30046994  | 191.445            |


HTH




Dr Mich Talebzadeh



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destructionof data or any other property which may arise from relying on this email's technical content is explicitly disclaimed.The author will in no case be liable for any monetary damages arising from suchloss, damage or destruction.



On 8 October 2016 at 19:05, Felix Cheung <fe...@hotmail.com>> wrote:
Great, then I think those packages as Spark data source should allow you to do exactly that (replace org.apache.spark.sql.jdbc with HBASE one)

I do think it will be great to get more examples around this though. Would be great if you could share your experience with this!


_____________________________
From: Benjamin Kim <bb...@gmail.com>>
Sent: Saturday, October 8, 2016 11:00 AM
Subject: Re: Spark SQL Thriftserver with HBase
To: Felix Cheung <fe...@hotmail.com>>
Cc: <us...@spark.apache.org>>


Felix,

My goal is to use Spark SQL JDBC Thriftserver to access HBase tables using just SQL. I have been able to CREATE tables using this statement below in the past:

CREATE TABLE <table-name>
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:postgresql://<hostname>:<port>/dm?user=<username>&password=<password>",
  dbtable "dim.dimension_acamp"
);

After doing this, I can access the PostgreSQL table using Spark SQL JDBC Thriftserver using SQL statements (SELECT, UPDATE, INSERT, etc.). I want to do the same with HBase tables. We tried this using Hive and HiveServer2, but the response times are just too long.

Thanks,
Ben


On Oct 8, 2016, at 10:53 AM, Felix Cheung <fe...@hotmail.com>> wrote:

Ben,

I'm not sure I'm following completely.

Is your goal to use Spark to create or access tables in HBASE? If so the link below and several packages out there support that by having a HBASE data source for Spark. There are some examples on how the Spark code look like in that link as well. On that note, you should also be able to use the HBASE data source from pure SQL (Spark SQL) query as well, which should work in the case with the Spark SQL JDBC Thrift Server (with USING,http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10).


_____________________________
From: Benjamin Kim <bb...@gmail.com>>
Sent: Saturday, October 8, 2016 10:40 AM
Subject: Re: Spark SQL Thriftserver with HBase
To: Felix Cheung <fe...@hotmail.com>>
Cc: <us...@spark.apache.org>>


Felix,

The only alternative way is to create a stored procedure (udf) in database terms that would run Spark scala code underneath. In this way, I can use Spark SQL JDBC Thriftserver to execute it using SQL code passing the key, values I want to UPSERT. I wonder if this is possible since I cannot CREATE a wrapper table on top of a HBase table in Spark SQL?

What do you think? Is this the right approach?

Thanks,
Ben

On Oct 8, 2016, at 10:33 AM, Felix Cheung <fe...@hotmail.com>> wrote:

HBase has released support for Spark
hbase.apache.org/book.html#spark<http://hbase.apache.org/book.html#spark>

And if you search you should find several alternative approaches.





On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" <bb...@gmail.com>> wrote:

Does anyone know if Spark can work with HBase tables using Spark SQL? I know in Hive we are able to create tables on top of an underlying HBase table that can be accessed using MapReduce jobs. Can the same be done using HiveContext or SQLContext? We are trying to setup a way to GET and POST data to and from the HBase table using the Spark SQL JDBC thriftserver from our RESTful API endpoints and/or HTTP web farms. If we can get this to work, then we can load balance the thriftservers. In addition, this will benefit us in giving us a way to abstract the data storage layer away from the presentation layer code. There is a chance that we will swap out the data storage technology in the future. We are currently experimenting with Kudu.

Thanks,
Ben
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org<ma...@spark.apache.org>


Re: Spark SQL Thriftserver with HBase

Posted by Benjamin Kim <bb...@gmail.com>.
Thanks for all the suggestions. It would seem you guys are right about the Tableau side of things. The reports don’t need to be real-time, and they won’t be directly feeding off of the main DMP HBase data. Instead, it’ll be batched to Parquet or Kudu/Impala or even PostgreSQL.

I originally thought that we needed two-way data retrieval from the DMP HBase for ID generation, but after further investigation into the use-case and architecture, the ID generation needs to happen local to the Ad Servers, where we generate a unique ID and store it in an ID linking table. Even better, many of the 3rd party services supply this ID. So, data only needs to flow in one direction. We will use Kafka as the bus for this. No JDBC required. This also goes for the REST Endpoints. 3rd party services will hit ours to update our data with no need to read from our data. And, when we want to update their data, we will hit their endpoints using a triggered job.

This all boils down to just integrating with Kafka.

Once again, thanks for all the help.

Cheers,
Ben


> On Oct 9, 2016, at 3:16 AM, Jörn Franke <jo...@gmail.com> wrote:
> 
> please keep also in mind that Tableau Server has the capabilities to store data in-memory and refresh only when needed the in-memory data. This means you can import it from any source and let your users work only on the in-memory data in Tableau Server.
> 
> On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke <jornfranke@gmail.com <ma...@gmail.com>> wrote:
> Cloudera 5.8 has a very old version of Hive without Tez, but Mich provided already a good alternative. However, you should check if it contains a recent version of Hbase and Phoenix. That being said, I just wonder what is the dataflow, data model and the analysis you plan to do. Maybe there are completely different solutions possible. Especially these single inserts, upserts etc. should be avoided as much as possible in the Big Data (analysis) world with any technology, because they do not perform well. 
> 
> Hive with Llap will provide an in-memory cache for interactive analytics. You can put full tables in-memory with Hive using Ignite HDFS in-memory solution. All this does only make sense if you do not use MR as an engine, the right input format (ORC, parquet) and a recent Hive version.
> 
> On 8 Oct 2016, at 21:55, Benjamin Kim <bbuild11@gmail.com <ma...@gmail.com>> wrote:
> 
>> Mich,
>> 
>> Unfortunately, we are moving away from Hive and unifying on Spark using CDH 5.8 as our distro. And, the Tableau released a Spark ODBC/JDBC driver too. I will either try Phoenix JDBC Server for HBase or push to move faster to Kudu with Impala. We will use Impala as the JDBC in-between until the Kudu team completes Spark SQL support for JDBC.
>> 
>> Thanks for the advice.
>> 
>> Cheers,
>> Ben
>> 
>> 
>>> On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>>> 
>>> Sure. But essentially you are looking at batch data for analytics for your tableau users so Hive may be a better choice with its rich SQL and ODBC.JDBC connection to Tableau already.
>>> 
>>> I would go for Hive especially the new release will have an in-memory offering as well for frequently accessed data :)
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>>  
>>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>> 
>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>>  
>>> 
>>> On 8 October 2016 at 20:15, Benjamin Kim <bbuild11@gmail.com <ma...@gmail.com>> wrote:
>>> Mich,
>>> 
>>> First and foremost, we have visualization servers that run Tableau for external user reports. Second, we have servers that are ad servers and REST endpoints for cookie sync and segmentation data exchange. These will use JDBC directly within the same data-center. When not colocated in the same data-center, they will connected to a located database server using JDBC. Either way, by using JDBC everywhere, it simplifies and unifies the code on the JDBC industry standard.
>>> 
>>> Does this make sense?
>>> 
>>> Thanks,
>>> Ben
>>> 
>>> 
>>>> On Oct 8, 2016, at 11:47 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>>>> 
>>>> Like any other design what is your presentation layer and end users?
>>>> 
>>>> Are they SQL centric users from Tableau background or they may use spark functional programming.
>>>> 
>>>> It is best to describe the use case.
>>>> 
>>>> HTH
>>>> 
>>>> Dr Mich Talebzadeh
>>>>  
>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>>>  
>>>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>>> 
>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>>>  
>>>> 
>>>> On 8 October 2016 at 19:40, Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>> wrote:
>>>> I wouldn't be too surprised Spark SQL - JDBC data source - Phoenix JDBC server - HBASE would work better.
>>>> 
>>>> Without naming specifics, there are at least 4 or 5 different implementations of HBASE sources, each at varying level of development and different requirements (HBASE release version, Kerberos support etc)
>>>> 
>>>> 
>>>> _____________________________
>>>> From: Benjamin Kim <bbuild11@gmail.com <ma...@gmail.com>>
>>>> Sent: Saturday, October 8, 2016 11:26 AM
>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>> To: Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>>
>>>> Cc: <user@spark.apache.org <ma...@spark.apache.org>>, Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>>
>>>> 
>>>> 
>>>> 
>>>> Mich,
>>>> 
>>>> Are you talking about the Phoenix JDBC Server? If so, I forgot about that alternative.
>>>> 
>>>> Thanks,
>>>> Ben
>>>> 
>>>> 
>>>> On Oct 8, 2016, at 11:21 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>>>> 
>>>> I don't think it will work
>>>> 
>>>> you can use phoenix on top of hbase
>>>> 
>>>> hbase(main):336:0> scan 'tsco', 'LIMIT' => 1
>>>> ROW                                                       COLUMN+CELL
>>>>  TSCO-1-Apr-08                                            column=stock_daily:Date, timestamp=1475866783376, value=1-Apr-08
>>>>  TSCO-1-Apr-08                                            column=stock_daily:close, timestamp=1475866783376, value=405.25
>>>>  TSCO-1-Apr-08                                            column=stock_daily:high, timestamp=1475866783376, value=406.75
>>>>  TSCO-1-Apr-08                                            column=stock_daily:low, timestamp=1475866783376, value=379.25
>>>>  TSCO-1-Apr-08                                            column=stock_daily:open, timestamp=1475866783376, value=380.00
>>>>  TSCO-1-Apr-08                                            column=stock_daily:stock, timestamp=1475866783376, value=TESCO PLC
>>>>  TSCO-1-Apr-08                                            column=stock_daily:ticker, timestamp=1475866783376, value=TSCO
>>>>  TSCO-1-Apr-08                                            column=stock_daily:volume, timestamp=1475866783376, value=49664486
>>>> 
>>>> And the same on Phoenix on top of the HBase table
>>>> 
>>>> 0: jdbc:phoenix:thin:url=http://rhes564:8765> select substr(to_char(to_date("Date",'dd-MMM-yy')),1,10) AS TradeDate, "close" AS "Day's close", "high" AS "Day's High", "low" AS "Day's Low", "open" AS "Day's Open", "ticker", "volume", (to_number("low")+to_number("high"))/2 AS "AverageDailyPrice" from "tsco" where to_number("volume") > 0 and "high" != '-' and to_date("Date",'dd-MMM-yy') > to_date('2015-10-06','yyyy-MM-dd') order by  to_date("Date",'dd-MMM-yy') limit 1;
>>>> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
>>>> |  TRADEDATE  | Day's close  | Day's High  | Day's Low  | Day's Open  | ticker  |  volume   | AverageDailyPrice  |
>>>> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
>>>> | 2015-10-07  | 197.00       | 198.05      | 184.84     | 192.20      | TSCO    | 30046994  | 191.445            |
>>>> 
>>>> HTH
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> Dr Mich Talebzadeh
>>>>  
>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>>>  
>>>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>>> 
>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destructionof data or any other property which may arise from relying on this email's technical content is explicitly disclaimed.The author will in no case be liable for any monetary damages arising from suchloss, damage or destruction.
>>>>  
>>>> 
>>>> On 8 October 2016 at 19:05, Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>> wrote:
>>>> Great, then I think those packages as Spark data source should allow you to do exactly that (replace org.apache.spark.sql.jdbc with HBASE one)
>>>> 
>>>> I do think it will be great to get more examples around this though. Would be great if you could share your experience with this!
>>>> 
>>>> 
>>>> _____________________________
>>>> From: Benjamin Kim <bbuild11@gmail.com <ma...@gmail.com>>
>>>> Sent: Saturday, October 8, 2016 11:00 AM
>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>> To: Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>>
>>>> Cc: <user@spark.apache.org <ma...@spark.apache.org>>
>>>> 
>>>> 
>>>> Felix,
>>>> 
>>>> My goal is to use Spark SQL JDBC Thriftserver to access HBase tables using just SQL. I have been able to CREATE tables using this statement below in the past:
>>>> 
>>>> CREATE TABLE <table-name>
>>>> USING org.apache.spark.sql.jdbc
>>>> OPTIONS (
>>>>   url "jdbc:postgresql://<hostname>:<port>/dm?user=<username>&password=<password>",
>>>>   dbtable "dim.dimension_acamp"
>>>> );
>>>> 
>>>> After doing this, I can access the PostgreSQL table using Spark SQL JDBC Thriftserver using SQL statements (SELECT, UPDATE, INSERT, etc.). I want to do the same with HBase tables. We tried this using Hive and HiveServer2, but the response times are just too long.
>>>> 
>>>> Thanks,
>>>> Ben
>>>> 
>>>> 
>>>> On Oct 8, 2016, at 10:53 AM, Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>> wrote:
>>>> 
>>>> Ben,
>>>> 
>>>> I'm not sure I'm following completely.
>>>> 
>>>> Is your goal to use Spark to create or access tables in HBASE? If so the link below and several packages out there support that by having a HBASE data source for Spark. There are some examples on how the Spark code look like in that link as well. On that note, you should also be able to use the HBASE data source from pure SQL (Spark SQL) query as well, which should work in the case with the Spark SQL JDBC Thrift Server (with USING,http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10 <http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10>).
>>>> 
>>>> 
>>>> _____________________________
>>>> From: Benjamin Kim <bbuild11@gmail.com <ma...@gmail.com>>
>>>> Sent: Saturday, October 8, 2016 10:40 AM
>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>> To: Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>>
>>>> Cc: <user@spark.apache.org <ma...@spark.apache.org>>
>>>> 
>>>> 
>>>> Felix,
>>>> 
>>>> The only alternative way is to create a stored procedure (udf) in database terms that would run Spark scala code underneath. In this way, I can use Spark SQL JDBC Thriftserver to execute it using SQL code passing the key, values I want to UPSERT. I wonder if this is possible since I cannot CREATE a wrapper table on top of a HBase table in Spark SQL?
>>>> 
>>>> What do you think? Is this the right approach?
>>>> 
>>>> Thanks,
>>>> Ben
>>>> 
>>>> On Oct 8, 2016, at 10:33 AM, Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>> wrote:
>>>> 
>>>> HBase has released support for Spark
>>>> hbase.apache.org/book.html#spark <http://hbase.apache.org/book.html#spark>
>>>> 
>>>> And if you search you should find several alternative approaches.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" <bbuild11@gmail.com <ma...@gmail.com>> wrote:
>>>> 
>>>> Does anyone know if Spark can work with HBase tables using Spark SQL? I know in Hive we are able to create tables on top of an underlying HBase table that can be accessed using MapReduce jobs. Can the same be done using HiveContext or SQLContext? We are trying to setup a way to GET and POST data to and from the HBase table using the Spark SQL JDBC thriftserver from our RESTful API endpoints and/or HTTP web farms. If we can get this to work, then we can load balance the thriftservers. In addition, this will benefit us in giving us a way to abstract the data storage layer away from the presentation layer code. There is a chance that we will swap out the data storage technology in the future. We are currently experimenting with Kudu.
>>>> 
>>>> Thanks,
>>>> Ben
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org <ma...@spark.apache.org>
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
> 


Re: Spark SQL Thriftserver with HBase

Posted by Jörn Franke <jo...@gmail.com>.
Please also keep in mind that Tableau Server has the capability to store
data in-memory and to refresh that in-memory data only when needed. This
means you can import it from any source and let your users work only on
the in-memory data in Tableau Server.

On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke <jo...@gmail.com> wrote:

> Cloudera 5.8 has a very old version of Hive without Tez, but Mich provided
> already a good alternative. However, you should check if it contains a
> recent version of Hbase and Phoenix. That being said, I just wonder what is
> the dataflow, data model and the analysis you plan to do. Maybe there are
> completely different solutions possible. Especially these single inserts,
> upserts etc. should be avoided as much as possible in the Big Data
> (analysis) world with any technology, because they do not perform well.
>
> Hive with Llap will provide an in-memory cache for interactive analytics.
> You can put full tables in-memory with Hive using Ignite HDFS in-memory
> solution. All this does only make sense if you do not use MR as an engine,
> the right input format (ORC, parquet) and a recent Hive version.
>
> On 8 Oct 2016, at 21:55, Benjamin Kim <bb...@gmail.com> wrote:
>
> Mich,
>
> Unfortunately, we are moving away from Hive and unifying on Spark using
> CDH 5.8 as our distro. And, the Tableau released a Spark ODBC/JDBC driver
> too. I will either try Phoenix JDBC Server for HBase or push to move faster
> to Kudu with Impala. We will use Impala as the JDBC in-between until the
> Kudu team completes Spark SQL support for JDBC.
>
> Thanks for the advice.
>
> Cheers,
> Ben
>
>
> On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh <mi...@gmail.com>
> wrote:
>
> Sure. But essentially you are looking at batch data for analytics for your
> tableau users so Hive may be a better choice with its rich SQL and
> ODBC.JDBC connection to Tableau already.
>
> I would go for Hive especially the new release will have an in-memory
> offering as well for frequently accessed data :)
>
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 8 October 2016 at 20:15, Benjamin Kim <bb...@gmail.com> wrote:
>
>> Mich,
>>
>> First and foremost, we have visualization servers that run Tableau for
>> external user reports. Second, we have servers that are ad servers and REST
>> endpoints for cookie sync and segmentation data exchange. These will use
>> JDBC directly within the same data-center. When not colocated in the same
>> data-center, they will connected to a located database server using JDBC.
>> Either way, by using JDBC everywhere, it simplifies and unifies the code on
>> the JDBC industry standard.
>>
>> Does this make sense?
>>
>> Thanks,
>> Ben
>>
>>
>> On Oct 8, 2016, at 11:47 AM, Mich Talebzadeh <mi...@gmail.com>
>> wrote:
>>
>> Like any other design what is your presentation layer and end users?
>>
>> Are they SQL centric users from Tableau background or they may use spark
>> functional programming.
>>
>> It is best to describe the use case.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 8 October 2016 at 19:40, Felix Cheung <fe...@hotmail.com>
>> wrote:
>>
>>> I wouldn't be too surprised Spark SQL - JDBC data source - Phoenix JDBC
>>> server - HBASE would work better.
>>>
>>> Without naming specifics, there are at least 4 or 5 different
>>> implementations of HBASE sources, each at varying level of development and
>>> different requirements (HBASE release version, Kerberos support etc)
>>>
>>>
>>> _____________________________
>>> From: Benjamin Kim <bb...@gmail.com>
>>> Sent: Saturday, October 8, 2016 11:26 AM
>>> Subject: Re: Spark SQL Thriftserver with HBase
>>> To: Mich Talebzadeh <mi...@gmail.com>
>>> Cc: <us...@spark.apache.org>, Felix Cheung <fe...@hotmail.com>
>>>
>>>
>>>
>>> Mich,
>>>
>>> Are you talking about the Phoenix JDBC Server? If so, I forgot about
>>> that alternative.
>>>
>>> Thanks,
>>> Ben
>>>
>>>
>>> On Oct 8, 2016, at 11:21 AM, Mich Talebzadeh <mi...@gmail.com>
>>> wrote:
>>>
>>> I don't think it will work
>>>
>>> you can use phoenix on top of hbase
>>>
>>> hbase(main):336:0> scan 'tsco', 'LIMIT' => 1
>>> ROW                                                       COLUMN+CELL
>>>  TSCO-1-Apr-08
>>> column=stock_daily:Date, timestamp=1475866783376, value=1-Apr-08
>>>  TSCO-1-Apr-08
>>> column=stock_daily:close, timestamp=1475866783376, value=405.25
>>>  TSCO-1-Apr-08
>>> column=stock_daily:high, timestamp=1475866783376, value=406.75
>>>  TSCO-1-Apr-08
>>> column=stock_daily:low, timestamp=1475866783376, value=379.25
>>>  TSCO-1-Apr-08
>>> column=stock_daily:open, timestamp=1475866783376, value=380.00
>>>  TSCO-1-Apr-08
>>> column=stock_daily:stock, timestamp=1475866783376, value=TESCO PLC
>>>  TSCO-1-Apr-08
>>> column=stock_daily:ticker, timestamp=1475866783376, value=TSCO
>>>  TSCO-1-Apr-08
>>> column=stock_daily:volume, timestamp=1475866783376, value=49664486
>>>
>>> And the same on Phoenix on top of Hvbase table
>>>
>>> 0: jdbc:phoenix:thin:url=http://rhes564:8765> select
>>> substr(to_char(to_date("Date",'dd-MMM-yy')),1,10) AS TradeDate, "close"
>>> AS "Day's close", "high" AS "Day's High", "low" AS "Day's Low", "open" AS
>>> "Day's Open", "ticker", "volume", (to_number("low")+to_number("high"))/2
>>> AS "AverageDailyPrice" from "tsco" where to_number("volume") > 0 and "high"
>>> != '-' and to_date("Date",'dd-MMM-yy') > to_date('2015-10-06','yyyy-MM-dd')
>>> order by  to_date("Date",'dd-MMM-yy') limit 1;
>>> +-------------+--------------+-------------+------------+---
>>> ----------+---------+-----------+--------------------+
>>> |  TRADEDATE  | Day's close  | Day's High  | Day's Low  | Day's Open  |
>>> ticker  |  volume   | AverageDailyPrice  |
>>> +-------------+--------------+-------------+------------+---
>>> ----------+---------+-----------+--------------------+
>>> | 2015-10-07  | 197.00       | 198.05      | 184.84     | 192.20      |
>>> TSCO    | 30046994  | 191.445            |
>>>
>>> HTH
>>>
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destructionof data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed.The author will in no case be liable for any monetary damages
>>> arising from suchloss, damage or destruction.
>>>
>>>
>>>
>>> On 8 October 2016 at 19:05, Felix Cheung <fe...@hotmail.com>
>>> wrote:
>>>
>>>> Great, then I think those packages as Spark data source should allow
>>>> you to do exactly that (replace org.apache.spark.sql.jdbc with HBASE one)
>>>>
>>>> I do think it will be great to get more examples around this though.
>>>> Would be great if you could share your experience with this!
>>>>
>>>>
>>>> _____________________________
>>>> From: Benjamin Kim <bb...@gmail.com>
>>>> Sent: Saturday, October 8, 2016 11:00 AM
>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>> To: Felix Cheung <fe...@hotmail.com>
>>>> Cc: <us...@spark.apache.org>
>>>>
>>>>
>>>> Felix,
>>>>
>>>> My goal is to use Spark SQL JDBC Thriftserver to access HBase tables
>>>> using just SQL. I have been able to CREATE tables using this statement
>>>> below in the past:
>>>>
>>>> CREATE TABLE <table-name>
>>>> USING org.apache.spark.sql.jdbc
>>>> OPTIONS (
>>>>   url "jdbc:postgresql://<hostname>:<port>/dm?user=<username>&pass
>>>> word=<password>",
>>>>   dbtable "dim.dimension_acamp"
>>>> );
>>>>
>>>>
>>>> After doing this, I can access the PostgreSQL table using Spark SQL
>>>> JDBC Thriftserver using SQL statements (SELECT, UPDATE, INSERT, etc.). I
>>>> want to do the same with HBase tables. We tried this using Hive and
>>>> HiveServer2, but the response times are just too long.
>>>>
>>>> Thanks,
>>>> Ben
>>>>
>>>>
>>>> On Oct 8, 2016, at 10:53 AM, Felix Cheung <fe...@hotmail.com>
>>>> wrote:
>>>>
>>>> Ben,
>>>>
>>>> I'm not sure I'm following completely.
>>>>
>>>> Is your goal to use Spark to create or access tables in HBASE? If so
>>>> the link below and several packages out there support that by having a
>>>> HBASE data source for Spark. There are some examples on how the Spark code
>>>> look like in that link as well. On that note, you should also be able to
>>>> use the HBASE data source from pure SQL (Spark SQL) query as well, which
>>>> should work in the case with the Spark SQL JDBC Thrift Server (with USING,
>>>> http://spark.apache.org/docs/latest/sql-programming-gu
>>>> ide.html#tab_sql_10).
>>>>
>>>>
>>>> _____________________________
>>>> From: Benjamin Kim <bb...@gmail.com>
>>>> Sent: Saturday, October 8, 2016 10:40 AM
>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>> To: Felix Cheung <fe...@hotmail.com>
>>>> Cc: <us...@spark.apache.org>
>>>>
>>>>
>>>> Felix,
>>>>
>>>> The only alternative way is to create a stored procedure (udf) in
>>>> database terms that would run Spark scala code underneath. In this way, I
>>>> can use Spark SQL JDBC Thriftserver to execute it using SQL code passing
>>>> the key, values I want to UPSERT. I wonder if this is possible since I
>>>> cannot CREATE a wrapper table on top of a HBase table in Spark SQL?
>>>>
>>>> What do you think? Is this the right approach?
>>>>
>>>> Thanks,
>>>> Ben
>>>>
>>>> On Oct 8, 2016, at 10:33 AM, Felix Cheung <fe...@hotmail.com>
>>>> wrote:
>>>>
>>>> HBase has released support for Spark
>>>> hbase.apache.org/book.html#spark
>>>>
>>>> And if you search you should find several alternative approaches.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" <
>>>> bbuild11@gmail.com> wrote:
>>>>
>>>> Does anyone know if Spark can work with HBase tables using Spark SQL? I
>>>> know in Hive we are able to create tables on top of an underlying HBase
>>>> table that can be accessed using MapReduce jobs. Can the same be done using
>>>> HiveContext or SQLContext? We are trying to setup a way to GET and POST
>>>> data to and from the HBase table using the Spark SQL JDBC thriftserver from
>>>> our RESTful API endpoints and/or HTTP web farms. If we can get this to
>>>> work, then we can load balance the thriftservers. In addition, this will
>>>> benefit us in giving us a way to abstract the data storage layer away from
>>>> the presentation layer code. There is a chance that we will swap out the
>>>> data storage technology in the future. We are currently experimenting with
>>>> Kudu.
>>>>
>>>> Thanks,
>>>> Ben
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>
>

Re: Spark SQL Thriftserver with HBase

Posted by Jörn Franke <jo...@gmail.com>.
Cloudera 5.8 has a very old version of Hive without Tez, but Mich already provided a good alternative. However, you should check whether it contains a recent version of HBase and Phoenix. That being said, I just wonder what the dataflow, data model and planned analysis look like. Maybe completely different solutions are possible. In particular, single inserts, upserts etc. should be avoided as much as possible in the Big Data (analysis) world with any technology, because they do not perform well.

Hive with LLAP will provide an in-memory cache for interactive analytics. You can put full tables in-memory with Hive using the Ignite HDFS in-memory solution. All of this only makes sense if you do not use MR as the engine, do use the right input format (ORC, Parquet), and run a recent Hive version.
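
To make the columnar-format point concrete, a minimal sketch from the spark-shell (where sqlContext is a HiveContext; the source path and table name are only placeholders) -- batch-load once into ORC or Parquet instead of issuing single inserts:

import org.apache.spark.sql.SaveMode

// read the staged data from any source, then land it once in a columnar Hive table
val events = sqlContext.read.json("hdfs:///staging/events")

events.write
  .format("orc")                 // or "parquet"
  .mode(SaveMode.Overwrite)
  .saveAsTable("analytics.events")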

> On 8 Oct 2016, at 21:55, Benjamin Kim <bb...@gmail.com> wrote:
> 
> Mich,
> 
> Unfortunately, we are moving away from Hive and unifying on Spark using CDH 5.8 as our distro. And, the Tableau released a Spark ODBC/JDBC driver too. I will either try Phoenix JDBC Server for HBase or push to move faster to Kudu with Impala. We will use Impala as the JDBC in-between until the Kudu team completes Spark SQL support for JDBC.
> 
> Thanks for the advice.
> 
> Cheers,
> Ben
> 
> 
>> On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh <mi...@gmail.com> wrote:
>> 
>> Sure. But essentially you are looking at batch data for analytics for your tableau users so Hive may be a better choice with its rich SQL and ODBC.JDBC connection to Tableau already.
>> 
>> I would go for Hive especially the new release will have an in-memory offering as well for frequently accessed data :)
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>  
>> 
>>> On 8 October 2016 at 20:15, Benjamin Kim <bb...@gmail.com> wrote:
>>> Mich,
>>> 
>>> First and foremost, we have visualization servers that run Tableau for external user reports. Second, we have servers that are ad servers and REST endpoints for cookie sync and segmentation data exchange. These will use JDBC directly within the same data-center. When not colocated in the same data-center, they will connected to a located database server using JDBC. Either way, by using JDBC everywhere, it simplifies and unifies the code on the JDBC industry standard.
>>> 
>>> Does this make sense?
>>> 
>>> Thanks,
>>> Ben
>>> 
>>> 
>>>> On Oct 8, 2016, at 11:47 AM, Mich Talebzadeh <mi...@gmail.com> wrote:
>>>> 
>>>> Like any other design what is your presentation layer and end users?
>>>> 
>>>> Are they SQL centric users from Tableau background or they may use spark functional programming.
>>>> 
>>>> It is best to describe the use case.
>>>> 
>>>> HTH
>>>> 
>>>> Dr Mich Talebzadeh
>>>>  
>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>  
>>>> http://talebzadehmich.wordpress.com
>>>> 
>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>>>  
>>>> 
>>>>> On 8 October 2016 at 19:40, Felix Cheung <fe...@hotmail.com> wrote:
>>>>> I wouldn't be too surprised Spark SQL - JDBC data source - Phoenix JDBC server - HBASE would work better.
>>>>> 
>>>>> Without naming specifics, there are at least 4 or 5 different implementations of HBASE sources, each at varying level of development and different requirements (HBASE release version, Kerberos support etc)
>>>>> 
>>>>> 
>>>>> _____________________________
>>>>> From: Benjamin Kim <bb...@gmail.com>
>>>>> Sent: Saturday, October 8, 2016 11:26 AM
>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>> To: Mich Talebzadeh <mi...@gmail.com>
>>>>> Cc: <us...@spark.apache.org>, Felix Cheung <fe...@hotmail.com>
>>>>> 
>>>>> 
>>>>> 
>>>>> Mich,
>>>>> 
>>>>> Are you talking about the Phoenix JDBC Server? If so, I forgot about that alternative.
>>>>> 
>>>>> Thanks,
>>>>> Ben
>>>>> 
>>>>> 
>>>>> On Oct 8, 2016, at 11:21 AM, Mich Talebzadeh <mi...@gmail.com> wrote:
>>>>> 
>>>>> I don't think it will work
>>>>> 
>>>>> you can use phoenix on top of hbase
>>>>> 
>>>>> hbase(main):336:0> scan 'tsco', 'LIMIT' => 1
>>>>> ROW                                                       COLUMN+CELL
>>>>>  TSCO-1-Apr-08                                            column=stock_daily:Date, timestamp=1475866783376, value=1-Apr-08
>>>>>  TSCO-1-Apr-08                                            column=stock_daily:close, timestamp=1475866783376, value=405.25
>>>>>  TSCO-1-Apr-08                                            column=stock_daily:high, timestamp=1475866783376, value=406.75
>>>>>  TSCO-1-Apr-08                                            column=stock_daily:low, timestamp=1475866783376, value=379.25
>>>>>  TSCO-1-Apr-08                                            column=stock_daily:open, timestamp=1475866783376, value=380.00
>>>>>  TSCO-1-Apr-08                                            column=stock_daily:stock, timestamp=1475866783376, value=TESCO PLC
>>>>>  TSCO-1-Apr-08                                            column=stock_daily:ticker, timestamp=1475866783376, value=TSCO
>>>>>  TSCO-1-Apr-08                                            column=stock_daily:volume, timestamp=1475866783376, value=49664486
>>>>> 
>>>>> And the same on Phoenix on top of Hvbase table
>>>>> 
>>>>> 0: jdbc:phoenix:thin:url=http://rhes564:8765> select substr(to_char(to_date("Date",'dd-MMM-yy')),1,10) AS TradeDate, "close" AS "Day's close", "high" AS "Day's High", "low" AS "Day's Low", "open" AS "Day's Open", "ticker", "volume", (to_number("low")+to_number("high"))/2 AS "AverageDailyPrice" from "tsco" where to_number("volume") > 0 and "high" != '-' and to_date("Date",'dd-MMM-yy') > to_date('2015-10-06','yyyy-MM-dd') order by  to_date("Date",'dd-MMM-yy') limit 1;
>>>>> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
>>>>> |  TRADEDATE  | Day's close  | Day's High  | Day's Low  | Day's Open  | ticker  |  volume   | AverageDailyPrice  |
>>>>> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
>>>>> | 2015-10-07  | 197.00       | 198.05      | 184.84     | 192.20      | TSCO    | 30046994  | 191.445            |
>>>>> 
>>>>> HTH
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> Dr Mich Talebzadeh
>>>>>  
>>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>  
>>>>> http://talebzadehmich.wordpress.com
>>>>> 
>>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destructionof data or any other property which may arise from relying on this email's technical content is explicitly disclaimed.The author will in no case be liable for any monetary damages arising from suchloss, damage or destruction.
>>>>>  
>>>>> 
>>>>>> On 8 October 2016 at 19:05, Felix Cheung <fe...@hotmail.com> wrote:
>>>>>> Great, then I think those packages as Spark data source should allow you to do exactly that (replace org.apache.spark.sql.jdbc with HBASE one)
>>>>>> 
>>>>>> I do think it will be great to get more examples around this though. Would be great if you could share your experience with this!
>>>>>> 
>>>>>> 
>>>>>> _____________________________
>>>>>> From: Benjamin Kim <bb...@gmail.com>
>>>>>> Sent: Saturday, October 8, 2016 11:00 AM
>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>> To: Felix Cheung <fe...@hotmail.com>
>>>>>> Cc: <us...@spark.apache.org>
>>>>>> 
>>>>>> 
>>>>>> Felix,
>>>>>> 
>>>>>> My goal is to use Spark SQL JDBC Thriftserver to access HBase tables using just SQL. I have been able to CREATE tables using this statement below in the past:
>>>>>> 
>>>>>> CREATE TABLE <table-name>
>>>>>> USING org.apache.spark.sql.jdbc
>>>>>> OPTIONS (
>>>>>>   url "jdbc:postgresql://<hostname>:<port>/dm?user=<username>&password=<password>",
>>>>>>   dbtable "dim.dimension_acamp"
>>>>>> );
>>>>>> 
>>>>>> After doing this, I can access the PostgreSQL table using Spark SQL JDBC Thriftserver using SQL statements (SELECT, UPDATE, INSERT, etc.). I want to do the same with HBase tables. We tried this using Hive and HiveServer2, but the response times are just too long.
>>>>>> 
>>>>>> Thanks,
>>>>>> Ben
>>>>>> 
>>>>>> 
>>>>>> On Oct 8, 2016, at 10:53 AM, Felix Cheung <fe...@hotmail.com> wrote:
>>>>>> 
>>>>>> Ben,
>>>>>> 
>>>>>> I'm not sure I'm following completely.
>>>>>> 
>>>>>> Is your goal to use Spark to create or access tables in HBASE? If so the link below and several packages out there support that by having a HBASE data source for Spark. There are some examples on how the Spark code look like in that link as well. On that note, you should also be able to use the HBASE data source from pure SQL (Spark SQL) query as well, which should work in the case with the Spark SQL JDBC Thrift Server (with USING,http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10).
>>>>>> 
>>>>>> 
>>>>>> _____________________________
>>>>>> From: Benjamin Kim <bb...@gmail.com>
>>>>>> Sent: Saturday, October 8, 2016 10:40 AM
>>>>>> Subject: Re: Spark SQL Thriftserver with HBase
>>>>>> To: Felix Cheung <fe...@hotmail.com>
>>>>>> Cc: <us...@spark.apache.org>
>>>>>> 
>>>>>> 
>>>>>> Felix,
>>>>>> 
>>>>>> The only alternative way is to create a stored procedure (udf) in database terms that would run Spark scala code underneath. In this way, I can use Spark SQL JDBC Thriftserver to execute it using SQL code passing the key, values I want to UPSERT. I wonder if this is possible since I cannot CREATE a wrapper table on top of a HBase table in Spark SQL?
>>>>>> 
>>>>>> What do you think? Is this the right approach?
>>>>>> 
>>>>>> Thanks,
>>>>>> Ben
>>>>>> 
>>>>>> On Oct 8, 2016, at 10:33 AM, Felix Cheung <fe...@hotmail.com> wrote:
>>>>>> 
>>>>>> HBase has released support for Spark
>>>>>> hbase.apache.org/book.html#spark
>>>>>> 
>>>>>> And if you search you should find several alternative approaches.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" <bb...@gmail.com> wrote:
>>>>>> 
>>>>>> Does anyone know if Spark can work with HBase tables using Spark SQL? I know in Hive we are able to create tables on top of an underlying HBase table that can be accessed using MapReduce jobs. Can the same be done using HiveContext or SQLContext? We are trying to setup a way to GET and POST data to and from the HBase table using the Spark SQL JDBC thriftserver from our RESTful API endpoints and/or HTTP web farms. If we can get this to work, then we can load balance the thriftservers. In addition, this will benefit us in giving us a way to abstract the data storage layer away from the presentation layer code. There is a chance that we will swap out the data storage technology in the future. We are currently experimenting with Kudu.
>>>>>> 
>>>>>> Thanks,
>>>>>> Ben
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 

Re: Spark SQL Thriftserver with HBase

Posted by Benjamin Kim <bb...@gmail.com>.
Mich,

Unfortunately, we are moving away from Hive and unifying on Spark using CDH 5.8 as our distro. And Tableau released a Spark ODBC/JDBC driver too. I will either try the Phoenix JDBC Server for HBase or push to move faster to Kudu with Impala. We will use Impala as the JDBC in-between until the Kudu team completes Spark SQL support for JDBC.
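
If the Phoenix route works out, the idea is to map tables through Spark's plain JDBC data source, much like the PostgreSQL example earlier in this thread. A rough sketch (host, port, and table name are placeholders, and it assumes the Phoenix Query Server's thin JDBC driver is on the classpath):

val tsco = sqlContext.read
  .format("jdbc")
  .options(Map(
    "url"     -> "jdbc:phoenix:thin:url=http://rhes564:8765",
    "driver"  -> "org.apache.phoenix.queryserver.client.Driver",
    "dbtable" -> "tsco"))
  .load()

// queryable with Spark SQL from here; a thriftserver sharing this context could expose it
tsco.registerTempTable("tsco")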

Thanks for the advice.

Cheers,
Ben


> On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh <mi...@gmail.com> wrote:
> 
> Sure. But essentially you are looking at batch data for analytics for your tableau users so Hive may be a better choice with its rich SQL and ODBC.JDBC connection to Tableau already.
> 
> I would go for Hive especially the new release will have an in-memory offering as well for frequently accessed data :)
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>  
> 
> On 8 October 2016 at 20:15, Benjamin Kim <bbuild11@gmail.com <ma...@gmail.com>> wrote:
> Mich,
> 
> First and foremost, we have visualization servers that run Tableau for external user reports. Second, we have servers that are ad servers and REST endpoints for cookie sync and segmentation data exchange. These will use JDBC directly within the same data-center. When not colocated in the same data-center, they will connected to a located database server using JDBC. Either way, by using JDBC everywhere, it simplifies and unifies the code on the JDBC industry standard.
> 
> Does this make sense?
> 
> Thanks,
> Ben
> 
> 
>> On Oct 8, 2016, at 11:47 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>> 
>> Like any other design what is your presentation layer and end users?
>> 
>> Are they SQL centric users from Tableau background or they may use spark functional programming.
>> 
>> It is best to describe the use case.
>> 
>> HTH
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>  
>> 
>> On 8 October 2016 at 19:40, Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>> wrote:
>> I wouldn't be too surprised Spark SQL - JDBC data source - Phoenix JDBC server - HBASE would work better.
>> 
>> Without naming specifics, there are at least 4 or 5 different implementations of HBASE sources, each at varying level of development and different requirements (HBASE release version, Kerberos support etc)
>> 
>> 
>> _____________________________
>> From: Benjamin Kim <bbuild11@gmail.com <ma...@gmail.com>>
>> Sent: Saturday, October 8, 2016 11:26 AM
>> Subject: Re: Spark SQL Thriftserver with HBase
>> To: Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>>
>> Cc: <user@spark.apache.org <ma...@spark.apache.org>>, Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>>
>> 
>> 
>> 
>> Mich,
>> 
>> Are you talking about the Phoenix JDBC Server? If so, I forgot about that alternative.
>> 
>> Thanks,
>> Ben
>> 
>> 
>> On Oct 8, 2016, at 11:21 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>> 
>> I don't think it will work
>> 
>> you can use phoenix on top of hbase
>> 
>> hbase(main):336:0> scan 'tsco', 'LIMIT' => 1
>> ROW                                                       COLUMN+CELL
>>  TSCO-1-Apr-08                                            column=stock_daily:Date, timestamp=1475866783376, value=1-Apr-08
>>  TSCO-1-Apr-08                                            column=stock_daily:close, timestamp=1475866783376, value=405.25
>>  TSCO-1-Apr-08                                            column=stock_daily:high, timestamp=1475866783376, value=406.75
>>  TSCO-1-Apr-08                                            column=stock_daily:low, timestamp=1475866783376, value=379.25
>>  TSCO-1-Apr-08                                            column=stock_daily:open, timestamp=1475866783376, value=380.00
>>  TSCO-1-Apr-08                                            column=stock_daily:stock, timestamp=1475866783376, value=TESCO PLC
>>  TSCO-1-Apr-08                                            column=stock_daily:ticker, timestamp=1475866783376, value=TSCO
>>  TSCO-1-Apr-08                                            column=stock_daily:volume, timestamp=1475866783376, value=49664486
>> 
>> And the same on Phoenix on top of Hvbase table
>> 
>> 0: jdbc:phoenix:thin:url=http://rhes564:8765 <http://rhes564:8765/>> select substr(to_char(to_date("Date",'dd-MMM-yy')),1,10) AS TradeDate, "close" AS "Day's close", "high" AS "Day's High", "low" AS "Day's Low", "open" AS "Day's Open", "ticker", "volume", (to_number("low")+to_number("high"))/2 AS "AverageDailyPrice" from "tsco" where to_number("volume") > 0 and "high" != '-' and to_date("Date",'dd-MMM-yy') > to_date('2015-10-06','yyyy-MM-dd') order by  to_date("Date",'dd-MMM-yy') limit 1;
>> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
>> |  TRADEDATE  | Day's close  | Day's High  | Day's Low  | Day's Open  | ticker  |  volume   | AverageDailyPrice  |
>> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
>> | 2015-10-07  | 197.00       | 198.05      | 184.84     | 192.20      | TSCO    | 30046994  | 191.445            |
>> 
>> HTH
>> 
>> 
>> 
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destructionof data or any other property which may arise from relying on this email's technical content is explicitly disclaimed.The author will in no case be liable for any monetary damages arising from suchloss, damage or destruction.
>>  
>> 
>> On 8 October 2016 at 19:05, Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>> wrote:
>> Great, then I think those packages as Spark data source should allow you to do exactly that (replace org.apache.spark.sql.jdbc with HBASE one)
>> 
>> I do think it will be great to get more examples around this though. Would be great if you could share your experience with this!
>> 
>> 
>> _____________________________
>> From: Benjamin Kim <bbuild11@gmail.com <ma...@gmail.com>>
>> Sent: Saturday, October 8, 2016 11:00 AM
>> Subject: Re: Spark SQL Thriftserver with HBase
>> To: Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>>
>> Cc: <user@spark.apache.org <ma...@spark.apache.org>>
>> 
>> 
>> Felix,
>> 
>> My goal is to use Spark SQL JDBC Thriftserver to access HBase tables using just SQL. I have been able to CREATE tables using this statement below in the past:
>> 
>> CREATE TABLE <table-name>
>> USING org.apache.spark.sql.jdbc
>> OPTIONS (
>>   url "jdbc:postgresql://<hostname>:<port>/dm?user=<username>&password=<password>",
>>   dbtable "dim.dimension_acamp"
>> );
>> 
>> After doing this, I can access the PostgreSQL table using Spark SQL JDBC Thriftserver using SQL statements (SELECT, UPDATE, INSERT, etc.). I want to do the same with HBase tables. We tried this using Hive and HiveServer2, but the response times are just too long.
>> 
>> Thanks,
>> Ben
>> 
>> 
>> On Oct 8, 2016, at 10:53 AM, Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>> wrote:
>> 
>> Ben,
>> 
>> I'm not sure I'm following completely.
>> 
>> Is your goal to use Spark to create or access tables in HBASE? If so the link below and several packages out there support that by having a HBASE data source for Spark. There are some examples on how the Spark code look like in that link as well. On that note, you should also be able to use the HBASE data source from pure SQL (Spark SQL) query as well, which should work in the case with the Spark SQL JDBC Thrift Server (with USING,http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10 <http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10>).
>> 
>> 
>> _____________________________
>> From: Benjamin Kim <bbuild11@gmail.com <ma...@gmail.com>>
>> Sent: Saturday, October 8, 2016 10:40 AM
>> Subject: Re: Spark SQL Thriftserver with HBase
>> To: Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>>
>> Cc: <user@spark.apache.org <ma...@spark.apache.org>>
>> 
>> 
>> Felix,
>> 
>> The only alternative way is to create a stored procedure (udf) in database terms that would run Spark scala code underneath. In this way, I can use Spark SQL JDBC Thriftserver to execute it using SQL code passing the key, values I want to UPSERT. I wonder if this is possible since I cannot CREATE a wrapper table on top of a HBase table in Spark SQL?
>> 
>> What do you think? Is this the right approach?
>> 
>> Thanks,
>> Ben
>> 
>> On Oct 8, 2016, at 10:33 AM, Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>> wrote:
>> 
>> HBase has released support for Spark
>> hbase.apache.org/book.html#spark <http://hbase.apache.org/book.html#spark>
>> 
>> And if you search you should find several alternative approaches.
>> 
>> 
>> 
>> 
>> 
>> On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" <bbuild11@gmail.com <ma...@gmail.com>> wrote:
>> 
>> Does anyone know if Spark can work with HBase tables using Spark SQL? I know in Hive we are able to create tables on top of an underlying HBase table that can be accessed using MapReduce jobs. Can the same be done using HiveContext or SQLContext? We are trying to setup a way to GET and POST data to and from the HBase table using the Spark SQL JDBC thriftserver from our RESTful API endpoints and/or HTTP web farms. If we can get this to work, then we can load balance the thriftservers. In addition, this will benefit us in giving us a way to abstract the data storage layer away from the presentation layer code. There is a chance that we will swap out the data storage technology in the future. We are currently experimenting with Kudu.
>> 
>> Thanks,
>> Ben
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org <ma...@spark.apache.org>
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
> 
> 


Re: Spark SQL Thriftserver with HBase

Posted by Mich Talebzadeh <mi...@gmail.com>.
Sure. But essentially you are looking at batch data for analytics for your
Tableau users, so Hive may be a better choice with its rich SQL and its
existing ODBC/JDBC connection to Tableau.

I would go for Hive, especially as the new release will have an in-memory
offering as well for frequently accessed data :)


Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 8 October 2016 at 20:15, Benjamin Kim <bb...@gmail.com> wrote:

> Mich,
>
> First and foremost, we have visualization servers that run Tableau for
> external user reports. Second, we have servers that are ad servers and REST
> endpoints for cookie sync and segmentation data exchange. These will use
> JDBC directly within the same data-center. When not colocated in the same
> data-center, they will connected to a located database server using JDBC.
> Either way, by using JDBC everywhere, it simplifies and unifies the code on
> the JDBC industry standard.
>
> Does this make sense?
>
> Thanks,
> Ben
>
>
> On Oct 8, 2016, at 11:47 AM, Mich Talebzadeh <mi...@gmail.com>
> wrote:
>
> Like any other design what is your presentation layer and end users?
>
> Are they SQL centric users from Tableau background or they may use spark
> functional programming.
>
> It is best to describe the use case.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 8 October 2016 at 19:40, Felix Cheung <fe...@hotmail.com>
> wrote:
>
>> I wouldn't be too surprised Spark SQL - JDBC data source - Phoenix JDBC
>> server - HBASE would work better.
>>
>> Without naming specifics, there are at least 4 or 5 different
>> implementations of HBASE sources, each at varying level of development and
>> different requirements (HBASE release version, Kerberos support etc)
>>
>>
>> _____________________________
>> From: Benjamin Kim <bb...@gmail.com>
>> Sent: Saturday, October 8, 2016 11:26 AM
>> Subject: Re: Spark SQL Thriftserver with HBase
>> To: Mich Talebzadeh <mi...@gmail.com>
>> Cc: <us...@spark.apache.org>, Felix Cheung <fe...@hotmail.com>
>>
>>
>>
>> Mich,
>>
>> Are you talking about the Phoenix JDBC Server? If so, I forgot about that
>> alternative.
>>
>> Thanks,
>> Ben
>>
>>
>> On Oct 8, 2016, at 11:21 AM, Mich Talebzadeh <mi...@gmail.com>
>> wrote:
>>
>> I don't think it will work
>>
>> you can use phoenix on top of hbase
>>
>> hbase(main):336:0> scan 'tsco', 'LIMIT' => 1
>> ROW                                                       COLUMN+CELL
>>  TSCO-1-Apr-08
>> column=stock_daily:Date, timestamp=1475866783376, value=1-Apr-08
>>  TSCO-1-Apr-08
>> column=stock_daily:close, timestamp=1475866783376, value=405.25
>>  TSCO-1-Apr-08
>> column=stock_daily:high, timestamp=1475866783376, value=406.75
>>  TSCO-1-Apr-08
>> column=stock_daily:low, timestamp=1475866783376, value=379.25
>>  TSCO-1-Apr-08
>> column=stock_daily:open, timestamp=1475866783376, value=380.00
>>  TSCO-1-Apr-08
>> column=stock_daily:stock, timestamp=1475866783376, value=TESCO PLC
>>  TSCO-1-Apr-08
>> column=stock_daily:ticker, timestamp=1475866783376, value=TSCO
>>  TSCO-1-Apr-08
>> column=stock_daily:volume, timestamp=1475866783376, value=49664486
>>
>> And the same on Phoenix on top of Hvbase table
>>
>> 0: jdbc:phoenix:thin:url=http://rhes564:8765> select
>> substr(to_char(to_date("Date",'dd-MMM-yy')),1,10) AS TradeDate, "close"
>> AS "Day's close", "high" AS "Day's High", "low" AS "Day's Low", "open" AS
>> "Day's Open", "ticker", "volume", (to_number("low")+to_number("high"))/2
>> AS "AverageDailyPrice" from "tsco" where to_number("volume") > 0 and "high"
>> != '-' and to_date("Date",'dd-MMM-yy') > to_date('2015-10-06','yyyy-MM-dd')
>> order by  to_date("Date",'dd-MMM-yy') limit 1;
>> +-------------+--------------+-------------+------------+---
>> ----------+---------+-----------+--------------------+
>> |  TRADEDATE  | Day's close  | Day's High  | Day's Low  | Day's Open  |
>> ticker  |  volume   | AverageDailyPrice  |
>> +-------------+--------------+-------------+------------+---
>> ----------+---------+-----------+--------------------+
>> | 2015-10-07  | 197.00       | 198.05      | 184.84     | 192.20      |
>> TSCO    | 30046994  | 191.445            |
>>
>> HTH
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destructionof data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed.The author will in no case be liable for any monetary damages
>> arising from suchloss, damage or destruction.
>>
>>
>>
>> On 8 October 2016 at 19:05, Felix Cheung <fe...@hotmail.com>
>> wrote:
>>
>>> Great, then I think those packages as Spark data source should allow you
>>> to do exactly that (replace org.apache.spark.sql.jdbc with HBASE one)
>>>
>>> I do think it will be great to get more examples around this though.
>>> Would be great if you could share your experience with this!
>>>
>>>
>>> _____________________________
>>> From: Benjamin Kim <bb...@gmail.com>
>>> Sent: Saturday, October 8, 2016 11:00 AM
>>> Subject: Re: Spark SQL Thriftserver with HBase
>>> To: Felix Cheung <fe...@hotmail.com>
>>> Cc: <us...@spark.apache.org>
>>>
>>>
>>> Felix,
>>>
>>> My goal is to use Spark SQL JDBC Thriftserver to access HBase tables
>>> using just SQL. I have been able to CREATE tables using this statement
>>> below in the past:
>>>
>>> CREATE TABLE <table-name>
>>> USING org.apache.spark.sql.jdbc
>>> OPTIONS (
>>>   url "jdbc:postgresql://<hostname>:<port>/dm?user=<username>&pass
>>> word=<password>",
>>>   dbtable "dim.dimension_acamp"
>>> );
>>>
>>>
>>> After doing this, I can access the PostgreSQL table using Spark SQL JDBC
>>> Thriftserver using SQL statements (SELECT, UPDATE, INSERT, etc.). I want to
>>> do the same with HBase tables. We tried this using Hive and HiveServer2,
>>> but the response times are just too long.
>>>
>>> Thanks,
>>> Ben
>>>
>>>
>>> On Oct 8, 2016, at 10:53 AM, Felix Cheung <fe...@hotmail.com>
>>> wrote:
>>>
>>> Ben,
>>>
>>> I'm not sure I'm following completely.
>>>
>>> Is your goal to use Spark to create or access tables in HBASE? If so the
>>> link below and several packages out there support that by having a HBASE
>>> data source for Spark. There are some examples on how the Spark code look
>>> like in that link as well. On that note, you should also be able to use the
>>> HBASE data source from pure SQL (Spark SQL) query as well, which should
>>> work in the case with the Spark SQL JDBC Thrift Server (with USING,
>>> http://spark.apache.org/docs/latest/sql-programming-gu
>>> ide.html#tab_sql_10).
>>>
>>>
>>> _____________________________
>>> From: Benjamin Kim <bb...@gmail.com>
>>> Sent: Saturday, October 8, 2016 10:40 AM
>>> Subject: Re: Spark SQL Thriftserver with HBase
>>> To: Felix Cheung <fe...@hotmail.com>
>>> Cc: <us...@spark.apache.org>
>>>
>>>
>>> Felix,
>>>
>>> The only alternative way is to create a stored procedure (udf) in
>>> database terms that would run Spark scala code underneath. In this way, I
>>> can use Spark SQL JDBC Thriftserver to execute it using SQL code passing
>>> the key, values I want to UPSERT. I wonder if this is possible since I
>>> cannot CREATE a wrapper table on top of a HBase table in Spark SQL?
>>>
>>> What do you think? Is this the right approach?
>>>
>>> Thanks,
>>> Ben
>>>
>>> On Oct 8, 2016, at 10:33 AM, Felix Cheung <fe...@hotmail.com>
>>> wrote:
>>>
>>> HBase has released support for Spark
>>> hbase.apache.org/book.html#spark
>>>
>>> And if you search you should find several alternative approaches.
>>>
>>>
>>>
>>>
>>>
>>> On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" <bbuild11@gmail.com
>>> > wrote:
>>>
>>> Does anyone know if Spark can work with HBase tables using Spark SQL? I
>>> know in Hive we are able to create tables on top of an underlying HBase
>>> table that can be accessed using MapReduce jobs. Can the same be done using
>>> HiveContext or SQLContext? We are trying to setup a way to GET and POST
>>> data to and from the HBase table using the Spark SQL JDBC thriftserver from
>>> our RESTful API endpoints and/or HTTP web farms. If we can get this to
>>> work, then we can load balance the thriftservers. In addition, this will
>>> benefit us in giving us a way to abstract the data storage layer away from
>>> the presentation layer code. There is a chance that we will swap out the
>>> data storage technology in the future. We are currently experimenting with
>>> Kudu.
>>>
>>> Thanks,
>>> Ben
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>
>

Re: Spark SQL Thriftserver with HBase

Posted by Benjamin Kim <bb...@gmail.com>.
Mich,

First and foremost, we have visualization servers that run Tableau for external user reports. Second, we have servers that are ad servers and REST endpoints for cookie sync and segmentation data exchange. These will use JDBC directly within the same data-center. When not colocated in the same data-center, they will connect to a local database server using JDBC. Either way, using JDBC everywhere simplifies and unifies the code around the JDBC industry standard.
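
To illustrate what I mean by JDBC everywhere, the client code on the ad servers or REST endpoints would just be a standard JDBC call against the Spark SQL thriftserver, which speaks the HiveServer2 protocol (the host, port, and query below are placeholders):

import java.sql.DriverManager

// the thriftserver is HiveServer2-compatible, so the Hive JDBC driver is used
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://thriftserver-host:10000/default", "user", "")
val stmt = conn.createStatement()
val rs   = stmt.executeQuery("SELECT count(*) FROM some_table")
while (rs.next()) {
  println(rs.getLong(1))
}
rs.close(); stmt.close(); conn.close()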

Does this make sense?

Thanks,
Ben

> On Oct 8, 2016, at 11:47 AM, Mich Talebzadeh <mi...@gmail.com> wrote:
> 
> Like any other design, what are your presentation layer and end users?
> 
> Are they SQL-centric users from a Tableau background, or will they use Spark functional programming?
> 
> It is best to describe the use case.
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>  
> 
> On 8 October 2016 at 19:40, Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>> wrote:
> I wouldn't be too surprised Spark SQL - JDBC data source - Phoenix JDBC server - HBASE would work better.
> 
> Without naming specifics, there are at least 4 or 5 different implementations of HBASE sources, each at varying level of development and different requirements (HBASE release version, Kerberos support etc)
> 
> 
> _____________________________
> From: Benjamin Kim <bbuild11@gmail.com <ma...@gmail.com>>
> Sent: Saturday, October 8, 2016 11:26 AM
> Subject: Re: Spark SQL Thriftserver with HBase
> To: Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>>
> Cc: <user@spark.apache.org <ma...@spark.apache.org>>, Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>>
> 
> 
> 
> Mich,
> 
> Are you talking about the Phoenix JDBC Server? If so, I forgot about that alternative.
> 
> Thanks,
> Ben
> 
> 
> On Oct 8, 2016, at 11:21 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
> 
> I don't think it will work
> 
> you can use phoenix on top of hbase
> 
> hbase(main):336:0> scan 'tsco', 'LIMIT' => 1
> ROW                                                       COLUMN+CELL
>  TSCO-1-Apr-08                                            column=stock_daily:Date, timestamp=1475866783376, value=1-Apr-08
>  TSCO-1-Apr-08                                            column=stock_daily:close, timestamp=1475866783376, value=405.25
>  TSCO-1-Apr-08                                            column=stock_daily:high, timestamp=1475866783376, value=406.75
>  TSCO-1-Apr-08                                            column=stock_daily:low, timestamp=1475866783376, value=379.25
>  TSCO-1-Apr-08                                            column=stock_daily:open, timestamp=1475866783376, value=380.00
>  TSCO-1-Apr-08                                            column=stock_daily:stock, timestamp=1475866783376, value=TESCO PLC
>  TSCO-1-Apr-08                                            column=stock_daily:ticker, timestamp=1475866783376, value=TSCO
>  TSCO-1-Apr-08                                            column=stock_daily:volume, timestamp=1475866783376, value=49664486
> 
> And the same on Phoenix on top of the HBase table
> 
> 0: jdbc:phoenix:thin:url=http://rhes564:8765 <http://rhes564:8765/>> select substr(to_char(to_date("Date",'dd-MMM-yy')),1,10) AS TradeDate, "close" AS "Day's close", "high" AS "Day's High", "low" AS "Day's Low", "open" AS "Day's Open", "ticker", "volume", (to_number("low")+to_number("high"))/2 AS "AverageDailyPrice" from "tsco" where to_number("volume") > 0 and "high" != '-' and to_date("Date",'dd-MMM-yy') > to_date('2015-10-06','yyyy-MM-dd') order by  to_date("Date",'dd-MMM-yy') limit 1;
> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
> |  TRADEDATE  | Day's close  | Day's High  | Day's Low  | Day's Open  | ticker  |  volume   | AverageDailyPrice  |
> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
> | 2015-10-07  | 197.00       | 198.05      | 184.84     | 192.20      | TSCO    | 30046994  | 191.445            |
> 
> HTH
> 
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>  
> 
> On 8 October 2016 at 19:05, Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>> wrote:
> Great, then I think those packages as Spark data source should allow you to do exactly that (replace org.apache.spark.sql.jdbc with HBASE one)
> 
> I do think it will be great to get more examples around this though. Would be great if you could share your experience with this!
> 
> 
> _____________________________
> From: Benjamin Kim <bbuild11@gmail.com <ma...@gmail.com>>
> Sent: Saturday, October 8, 2016 11:00 AM
> Subject: Re: Spark SQL Thriftserver with HBase
> To: Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>>
> Cc: <user@spark.apache.org <ma...@spark.apache.org>>
> 
> 
> Felix,
> 
> My goal is to use Spark SQL JDBC Thriftserver to access HBase tables using just SQL. I have been able to CREATE tables using this statement below in the past:
> 
> CREATE TABLE <table-name>
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   url "jdbc:postgresql://<hostname>:<port>/dm?user=<username>&password=<password>",
>   dbtable "dim.dimension_acamp"
> );
> 
> After doing this, I can access the PostgreSQL table using Spark SQL JDBC Thriftserver using SQL statements (SELECT, UPDATE, INSERT, etc.). I want to do the same with HBase tables. We tried this using Hive and HiveServer2, but the response times are just too long.
> 
> Thanks,
> Ben
> 
> 
> On Oct 8, 2016, at 10:53 AM, Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>> wrote:
> 
> Ben,
> 
> I'm not sure I'm following completely.
> 
> Is your goal to use Spark to create or access tables in HBASE? If so the link below and several packages out there support that by having a HBASE data source for Spark. There are some examples on how the Spark code look like in that link as well. On that note, you should also be able to use the HBASE data source from pure SQL (Spark SQL) query as well, which should work in the case with the Spark SQL JDBC Thrift Server (with USING,http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10 <http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10>).
> 
> 
> _____________________________
> From: Benjamin Kim <bbuild11@gmail.com <ma...@gmail.com>>
> Sent: Saturday, October 8, 2016 10:40 AM
> Subject: Re: Spark SQL Thriftserver with HBase
> To: Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>>
> Cc: <user@spark.apache.org <ma...@spark.apache.org>>
> 
> 
> Felix,
> 
> The only alternative way is to create a stored procedure (udf) in database terms that would run Spark scala code underneath. In this way, I can use Spark SQL JDBC Thriftserver to execute it using SQL code passing the key, values I want to UPSERT. I wonder if this is possible since I cannot CREATE a wrapper table on top of a HBase table in Spark SQL?
> 
> What do you think? Is this the right approach?
> 
> Thanks,
> Ben
> 
> On Oct 8, 2016, at 10:33 AM, Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>> wrote:
> 
> HBase has released support for Spark
> hbase.apache.org/book.html#spark <http://hbase.apache.org/book.html#spark>
> 
> And if you search you should find several alternative approaches.
> 
> 
> 
> 
> 
> On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" <bbuild11@gmail.com <ma...@gmail.com>> wrote:
> 
> Does anyone know if Spark can work with HBase tables using Spark SQL? I know in Hive we are able to create tables on top of an underlying HBase table that can be accessed using MapReduce jobs. Can the same be done using HiveContext or SQLContext? We are trying to setup a way to GET and POST data to and from the HBase table using the Spark SQL JDBC thriftserver from our RESTful API endpoints and/or HTTP web farms. If we can get this to work, then we can load balance the thriftservers. In addition, this will benefit us in giving us a way to abstract the data storage layer away from the presentation layer code. There is a chance that we will swap out the data storage technology in the future. We are currently experimenting with Kudu.
> 
> Thanks,
> Ben
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org <ma...@spark.apache.org>
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 


Re: Spark SQL Thriftserver with HBase

Posted by Mich Talebzadeh <mi...@gmail.com>.
Like any other design, what are your presentation layer and end users?

Are they SQL-centric users from a Tableau background, or will they use Spark
functional programming?

It is best to describe the use case.

HTH

Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 8 October 2016 at 19:40, Felix Cheung <fe...@hotmail.com> wrote:

> I wouldn't be too surprised Spark SQL - JDBC data source - Phoenix JDBC
> server - HBASE would work better.
>
> Without naming specifics, there are at least 4 or 5 different
> implementations of HBASE sources, each at varying level of development and
> different requirements (HBASE release version, Kerberos support etc)
>
>
> _____________________________
> From: Benjamin Kim <bb...@gmail.com>
> Sent: Saturday, October 8, 2016 11:26 AM
> Subject: Re: Spark SQL Thriftserver with HBase
> To: Mich Talebzadeh <mi...@gmail.com>
> Cc: <us...@spark.apache.org>, Felix Cheung <fe...@hotmail.com>
>
>
>
> Mich,
>
> Are you talking about the Phoenix JDBC Server? If so, I forgot about that
> alternative.
>
> Thanks,
> Ben
>
>
> On Oct 8, 2016, at 11:21 AM, Mich Talebzadeh <mi...@gmail.com>
> wrote:
>
> I don't think it will work
>
> you can use phoenix on top of hbase
>
> hbase(main):336:0> scan 'tsco', 'LIMIT' => 1
> ROW                                                       COLUMN+CELL
>  TSCO-1-Apr-08                                            column=stock_daily:Date, timestamp=1475866783376, value=1-Apr-08
>  TSCO-1-Apr-08                                            column=stock_daily:close, timestamp=1475866783376, value=405.25
>  TSCO-1-Apr-08                                            column=stock_daily:high, timestamp=1475866783376, value=406.75
>  TSCO-1-Apr-08                                            column=stock_daily:low, timestamp=1475866783376, value=379.25
>  TSCO-1-Apr-08                                            column=stock_daily:open, timestamp=1475866783376, value=380.00
>  TSCO-1-Apr-08                                            column=stock_daily:stock, timestamp=1475866783376, value=TESCO PLC
>  TSCO-1-Apr-08                                            column=stock_daily:ticker, timestamp=1475866783376, value=TSCO
>  TSCO-1-Apr-08                                            column=stock_daily:volume, timestamp=1475866783376, value=49664486
>
> And the same on Phoenix on top of the HBase table
>
> 0: jdbc:phoenix:thin:url=http://rhes564:8765> select substr(to_char(to_date("Date",'dd-MMM-yy')),1,10) AS TradeDate, "close" AS "Day's close", "high" AS "Day's High", "low" AS "Day's Low", "open" AS "Day's Open", "ticker", "volume", (to_number("low")+to_number("high"))/2 AS "AverageDailyPrice" from "tsco" where to_number("volume") > 0 and "high" != '-' and to_date("Date",'dd-MMM-yy') > to_date('2015-10-06','yyyy-MM-dd') order by  to_date("Date",'dd-MMM-yy') limit 1;
> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
> |  TRADEDATE  | Day's close  | Day's High  | Day's Low  | Day's Open  | ticker  |  volume   | AverageDailyPrice  |
> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
> | 2015-10-07  | 197.00       | 198.05      | 184.84     | 192.20      | TSCO    | 30046994  | 191.445            |
>
> HTH
>
>
>
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed. The
> author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 8 October 2016 at 19:05, Felix Cheung <fe...@hotmail.com>
> wrote:
>
>> Great, then I think those packages as Spark data source should allow you
>> to do exactly that (replace org.apache.spark.sql.jdbc with HBASE one)
>>
>> I do think it will be great to get more examples around this though.
>> Would be great if you could share your experience with this!
>>
>>
>> _____________________________
>> From: Benjamin Kim <bb...@gmail.com>
>> Sent: Saturday, October 8, 2016 11:00 AM
>> Subject: Re: Spark SQL Thriftserver with HBase
>> To: Felix Cheung <fe...@hotmail.com>
>> Cc: <us...@spark.apache.org>
>>
>>
>> Felix,
>>
>> My goal is to use Spark SQL JDBC Thriftserver to access HBase tables
>> using just SQL. I have been able to CREATE tables using this statement
>> below in the past:
>>
>> CREATE TABLE <table-name>
>> USING org.apache.spark.sql.jdbc
>> OPTIONS (
>>   url "jdbc:postgresql://<hostname>:<port>/dm?user=<username>&pass
>> word=<password>",
>>   dbtable "dim.dimension_acamp"
>> );
>>
>>
>> After doing this, I can access the PostgreSQL table using Spark SQL JDBC
>> Thriftserver using SQL statements (SELECT, UPDATE, INSERT, etc.). I want to
>> do the same with HBase tables. We tried this using Hive and HiveServer2,
>> but the response times are just too long.
>>
>> Thanks,
>> Ben
>>
>>
>> On Oct 8, 2016, at 10:53 AM, Felix Cheung <fe...@hotmail.com>
>> wrote:
>>
>> Ben,
>>
>> I'm not sure I'm following completely.
>>
>> Is your goal to use Spark to create or access tables in HBASE? If so the
>> link below and several packages out there support that by having a HBASE
>> data source for Spark. There are some examples on how the Spark code look
>> like in that link as well. On that note, you should also be able to use the
>> HBASE data source from pure SQL (Spark SQL) query as well, which should
>> work in the case with the Spark SQL JDBC Thrift Server (with USING,
>> http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10
>> ).
>>
>>
>> _____________________________
>> From: Benjamin Kim <bb...@gmail.com>
>> Sent: Saturday, October 8, 2016 10:40 AM
>> Subject: Re: Spark SQL Thriftserver with HBase
>> To: Felix Cheung <fe...@hotmail.com>
>> Cc: <us...@spark.apache.org>
>>
>>
>> Felix,
>>
>> The only alternative way is to create a stored procedure (udf) in
>> database terms that would run Spark scala code underneath. In this way, I
>> can use Spark SQL JDBC Thriftserver to execute it using SQL code passing
>> the key, values I want to UPSERT. I wonder if this is possible since I
>> cannot CREATE a wrapper table on top of a HBase table in Spark SQL?
>>
>> What do you think? Is this the right approach?
>>
>> Thanks,
>> Ben
>>
>> On Oct 8, 2016, at 10:33 AM, Felix Cheung <fe...@hotmail.com>
>> wrote:
>>
>> HBase has released support for Spark
>> hbase.apache.org/book.html#spark
>>
>> And if you search you should find several alternative approaches.
>>
>>
>>
>>
>>
>> On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" <bb...@gmail.com>
>>  wrote:
>>
>> Does anyone know if Spark can work with HBase tables using Spark SQL? I
>> know in Hive we are able to create tables on top of an underlying HBase
>> table that can be accessed using MapReduce jobs. Can the same be done using
>> HiveContext or SQLContext? We are trying to setup a way to GET and POST
>> data to and from the HBase table using the Spark SQL JDBC thriftserver from
>> our RESTful API endpoints and/or HTTP web farms. If we can get this to
>> work, then we can load balance the thriftservers. In addition, this will
>> benefit us in giving us a way to abstract the data storage layer away from
>> the presentation layer code. There is a chance that we will swap out the
>> data storage technology in the future. We are currently experimenting with
>> Kudu.
>>
>> Thanks,
>> Ben
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>
>>
>>
>>
>>
>>
>>
>>
>
>
>
>

Re: Spark SQL Thriftserver with HBase

Posted by Felix Cheung <fe...@hotmail.com>.
I wouldn't be too surprised if Spark SQL - JDBC data source - Phoenix JDBC server - HBASE would work better.

Without naming specifics, there are at least 4 or 5 different implementations of HBASE sources, each at a varying level of development and with different requirements (HBASE release version, Kerberos support, etc.)
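
A rough sketch of that chain from the Spark side, registering a Phoenix table through Spark's generic JDBC data source so it becomes visible to the Spark SQL Thriftserver. This is only a sketch: the Phoenix Query Server thin-client driver class and URL format are the stock ones, but the host, port, and table name are placeholders, and nothing in this thread confirms the end-to-end path.

// Register a Phoenix table behind the Spark SQL Thriftserver via the JDBC data source.
// Assumes the Phoenix Query Server thin-client jar is on the Spark classpath.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("phoenix-over-jdbc")
  .enableHiveSupport()            // so the table lands in the shared metastore
  .getOrCreate()

spark.sql("""
  CREATE TABLE tsco_phoenix
  USING org.apache.spark.sql.jdbc
  OPTIONS (
    url "jdbc:phoenix:thin:url=http://phoenix-query-server:8765;serialization=PROTOBUF",
    driver "org.apache.phoenix.queryserver.client.Driver",
    dbtable "tsco"
  )
""")
// Any JDBC client of the Thriftserver can now SELECT from tsco_phoenix.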


_____________________________
From: Benjamin Kim <bb...@gmail.com>>
Sent: Saturday, October 8, 2016 11:26 AM
Subject: Re: Spark SQL Thriftserver with HBase
To: Mich Talebzadeh <mi...@gmail.com>>
Cc: <us...@spark.apache.org>>, Felix Cheung <fe...@hotmail.com>>


Mich,

Are you talking about the Phoenix JDBC Server? If so, I forgot about that alternative.

Thanks,
Ben


On Oct 8, 2016, at 11:21 AM, Mich Talebzadeh <mi...@gmail.com>> wrote:

I don't think it will work

you can use phoenix on top of hbase

hbase(main):336:0> scan 'tsco', 'LIMIT' => 1
ROW                                                       COLUMN+CELL
 TSCO-1-Apr-08                                            column=stock_daily:Date, timestamp=1475866783376, value=1-Apr-08
 TSCO-1-Apr-08                                            column=stock_daily:close, timestamp=1475866783376, value=405.25
 TSCO-1-Apr-08                                            column=stock_daily:high, timestamp=1475866783376, value=406.75
 TSCO-1-Apr-08                                            column=stock_daily:low, timestamp=1475866783376, value=379.25
 TSCO-1-Apr-08                                            column=stock_daily:open, timestamp=1475866783376, value=380.00
 TSCO-1-Apr-08                                            column=stock_daily:stock, timestamp=1475866783376, value=TESCO PLC
 TSCO-1-Apr-08                                            column=stock_daily:ticker, timestamp=1475866783376, value=TSCO
 TSCO-1-Apr-08                                            column=stock_daily:volume, timestamp=1475866783376, value=49664486

> And the same on Phoenix on top of the HBase table

0: jdbc:phoenix:thin:url=http://rhes564:8765<http://rhes564:8765/>> select substr(to_char(to_date("Date",'dd-MMM-yy')),1,10) AS TradeDate, "close" AS "Day's close", "high" AS "Day's High", "low" AS "Day's Low", "open" AS "Day's Open", "ticker", "volume", (to_number("low")+to_number("high"))/2 AS "AverageDailyPrice" from "tsco" where to_number("volume") > 0 and "high" != '-' and to_date("Date",'dd-MMM-yy') > to_date('2015-10-06','yyyy-MM-dd') order by  to_date("Date",'dd-MMM-yy') limit 1;
+-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
|  TRADEDATE  | Day's close  | Day's High  | Day's Low  | Day's Open  | ticker  |  volume   | AverageDailyPrice  |
+-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
| 2015-10-07  | 197.00       | 198.05      | 184.84     | 192.20      | TSCO    | 30046994  | 191.445            |


HTH




Dr Mich Talebzadeh



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.



On 8 October 2016 at 19:05, Felix Cheung <fe...@hotmail.com>> wrote:
Great, then I think those packages as Spark data source should allow you to do exactly that (replace org.apache.spark.sql.jdbc with HBASE one)

I do think it will be great to get more examples around this though. Would be great if you could share your experience with this!


_____________________________
From: Benjamin Kim <bb...@gmail.com>>
Sent: Saturday, October 8, 2016 11:00 AM
Subject: Re: Spark SQL Thriftserver with HBase
To: Felix Cheung <fe...@hotmail.com>>
Cc: <us...@spark.apache.org>>


Felix,

My goal is to use Spark SQL JDBC Thriftserver to access HBase tables using just SQL. I have been able to CREATE tables using this statement below in the past:

CREATE TABLE <table-name>
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:postgresql://<hostname>:<port>/dm?user=<username>&password=<password>",
  dbtable "dim.dimension_acamp"
);

After doing this, I can access the PostgreSQL table using Spark SQL JDBC Thriftserver using SQL statements (SELECT, UPDATE, INSERT, etc.). I want to do the same with HBase tables. We tried this using Hive and HiveServer2, but the response times are just too long.

Thanks,
Ben


On Oct 8, 2016, at 10:53 AM, Felix Cheung <fe...@hotmail.com>> wrote:

Ben,

I'm not sure I'm following completely.

Is your goal to use Spark to create or access tables in HBASE? If so the link below and several packages out there support that by having a HBASE data source for Spark. There are some examples on how the Spark code look like in that link as well. On that note, you should also be able to use the HBASE data source from pure SQL (Spark SQL) query as well, which should work in the case with the Spark SQL JDBC Thrift Server (with USING,http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10).


_____________________________
From: Benjamin Kim <bb...@gmail.com>>
Sent: Saturday, October 8, 2016 10:40 AM
Subject: Re: Spark SQL Thriftserver with HBase
To: Felix Cheung <fe...@hotmail.com>>
Cc: <us...@spark.apache.org>>


Felix,

The only alternative way is to create a stored procedure (udf) in database terms that would run Spark scala code underneath. In this way, I can use Spark SQL JDBC Thriftserver to execute it using SQL code passing the key, values I want to UPSERT. I wonder if this is possible since I cannot CREATE a wrapper table on top of a HBase table in Spark SQL?

What do you think? Is this the right approach?

Thanks,
Ben

On Oct 8, 2016, at 10:33 AM, Felix Cheung <fe...@hotmail.com>> wrote:

HBase has released support for Spark
hbase.apache.org/book.html#spark<http://hbase.apache.org/book.html#spark>

And if you search you should find several alternative approaches.





On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" <bb...@gmail.com>> wrote:

Does anyone know if Spark can work with HBase tables using Spark SQL? I know in Hive we are able to create tables on top of an underlying HBase table that can be accessed using MapReduce jobs. Can the same be done using HiveContext or SQLContext? We are trying to setup a way to GET and POST data to and from the HBase table using the Spark SQL JDBC thriftserver from our RESTful API endpoints and/or HTTP web farms. If we can get this to work, then we can load balance the thriftservers. In addition, this will benefit us in giving us a way to abstract the data storage layer away from the presentation layer code. There is a chance that we will swap out the data storage technology in the future. We are currently experimenting with Kudu.

Thanks,
Ben
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org<ma...@spark.apache.org>











Re: Spark SQL Thriftserver with HBase

Posted by Benjamin Kim <bb...@gmail.com>.
Mich,

Are you talking about the Phoenix JDBC Server? If so, I forgot about that alternative.

Thanks,
Ben


> On Oct 8, 2016, at 11:21 AM, Mich Talebzadeh <mi...@gmail.com> wrote:
> 
> I don't think it will work
> 
> you can use phoenix on top of hbase
> 
> hbase(main):336:0> scan 'tsco', 'LIMIT' => 1
> ROW                                                       COLUMN+CELL
>  TSCO-1-Apr-08                                            column=stock_daily:Date, timestamp=1475866783376, value=1-Apr-08
>  TSCO-1-Apr-08                                            column=stock_daily:close, timestamp=1475866783376, value=405.25
>  TSCO-1-Apr-08                                            column=stock_daily:high, timestamp=1475866783376, value=406.75
>  TSCO-1-Apr-08                                            column=stock_daily:low, timestamp=1475866783376, value=379.25
>  TSCO-1-Apr-08                                            column=stock_daily:open, timestamp=1475866783376, value=380.00
>  TSCO-1-Apr-08                                            column=stock_daily:stock, timestamp=1475866783376, value=TESCO PLC
>  TSCO-1-Apr-08                                            column=stock_daily:ticker, timestamp=1475866783376, value=TSCO
>  TSCO-1-Apr-08                                            column=stock_daily:volume, timestamp=1475866783376, value=49664486
> 
> And the same on Phoenix on top of the HBase table
> 
> 0: jdbc:phoenix:thin:url=http://rhes564:8765 <http://rhes564:8765/>> select substr(to_char(to_date("Date",'dd-MMM-yy')),1,10) AS TradeDate, "close" AS "Day's close", "high" AS "Day's High", "low" AS "Day's Low", "open" AS "Day's Open", "ticker", "volume", (to_number("low")+to_number("high"))/2 AS "AverageDailyPrice" from "tsco" where to_number("volume") > 0 and "high" != '-' and to_date("Date",'dd-MMM-yy') > to_date('2015-10-06','yyyy-MM-dd') order by  to_date("Date",'dd-MMM-yy') limit 1;
> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
> |  TRADEDATE  | Day's close  | Day's High  | Day's Low  | Day's Open  | ticker  |  volume   | AverageDailyPrice  |
> +-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
> | 2015-10-07  | 197.00       | 198.05      | 184.84     | 192.20      | TSCO    | 30046994  | 191.445            |
> 
> HTH
> 
> 
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>  
> 
> On 8 October 2016 at 19:05, Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>> wrote:
> Great, then I think those packages as Spark data source should allow you to do exactly that (replace org.apache.spark.sql.jdbc with HBASE one)
> 
> I do think it will be great to get more examples around this though. Would be great if you could share your experience with this!
> 
> 
> _____________________________
> From: Benjamin Kim <bbuild11@gmail.com <ma...@gmail.com>>
> Sent: Saturday, October 8, 2016 11:00 AM
> Subject: Re: Spark SQL Thriftserver with HBase
> To: Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>>
> Cc: <user@spark.apache.org <ma...@spark.apache.org>>
> 
> 
> Felix,
> 
> My goal is to use Spark SQL JDBC Thriftserver to access HBase tables using just SQL. I have been able to CREATE tables using this statement below in the past:
> 
> CREATE TABLE <table-name>
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   url "jdbc:postgresql://<hostname>:<port>/dm?user=<username>&password=<password>",
>   dbtable "dim.dimension_acamp"
> );
> 
> After doing this, I can access the PostgreSQL table using Spark SQL JDBC Thriftserver using SQL statements (SELECT, UPDATE, INSERT, etc.). I want to do the same with HBase tables. We tried this using Hive and HiveServer2, but the response times are just too long.
> 
> Thanks,
> Ben
> 
> 
> On Oct 8, 2016, at 10:53 AM, Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>> wrote:
> 
> Ben,
> 
> I'm not sure I'm following completely.
> 
> Is your goal to use Spark to create or access tables in HBASE? If so the link below and several packages out there support that by having a HBASE data source for Spark. There are some examples on how the Spark code look like in that link as well. On that note, you should also be able to use the HBASE data source from pure SQL (Spark SQL) query as well, which should work in the case with the Spark SQL JDBC Thrift Server (with USING,http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10 <http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10>).
> 
> 
> _____________________________
> From: Benjamin Kim <bbuild11@gmail.com <ma...@gmail.com>>
> Sent: Saturday, October 8, 2016 10:40 AM
> Subject: Re: Spark SQL Thriftserver with HBase
> To: Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>>
> Cc: <user@spark.apache.org <ma...@spark.apache.org>>
> 
> 
> Felix,
> 
> The only alternative way is to create a stored procedure (udf) in database terms that would run Spark scala code underneath. In this way, I can use Spark SQL JDBC Thriftserver to execute it using SQL code passing the key, values I want to UPSERT. I wonder if this is possible since I cannot CREATE a wrapper table on top of a HBase table in Spark SQL?
> 
> What do you think? Is this the right approach?
> 
> Thanks,
> Ben
> 
> On Oct 8, 2016, at 10:33 AM, Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>> wrote:
> 
> HBase has released support for Spark
> hbase.apache.org/book.html#spark <http://hbase.apache.org/book.html#spark>
> 
> And if you search you should find several alternative approaches.
> 
> 
> 
> 
> 
> On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" <bbuild11@gmail.com <ma...@gmail.com>> wrote:
> 
> Does anyone know if Spark can work with HBase tables using Spark SQL? I know in Hive we are able to create tables on top of an underlying HBase table that can be accessed using MapReduce jobs. Can the same be done using HiveContext or SQLContext? We are trying to setup a way to GET and POST data to and from the HBase table using the Spark SQL JDBC thriftserver from our RESTful API endpoints and/or HTTP web farms. If we can get this to work, then we can load balance the thriftservers. In addition, this will benefit us in giving us a way to abstract the data storage layer away from the presentation layer code. There is a chance that we will swap out the data storage technology in the future. We are currently experimenting with Kudu.
> 
> Thanks,
> Ben
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org <ma...@spark.apache.org>
> 
> 
> 
> 
> 
> 


Re: Spark SQL Thriftserver with HBase

Posted by Mich Talebzadeh <mi...@gmail.com>.
I don't think it will work.

You can use Phoenix on top of HBase:

hbase(main):336:0> scan 'tsco', 'LIMIT' => 1
ROW                                                       COLUMN+CELL
 TSCO-1-Apr-08                                            column=stock_daily:Date, timestamp=1475866783376, value=1-Apr-08
 TSCO-1-Apr-08                                            column=stock_daily:close, timestamp=1475866783376, value=405.25
 TSCO-1-Apr-08                                            column=stock_daily:high, timestamp=1475866783376, value=406.75
 TSCO-1-Apr-08                                            column=stock_daily:low, timestamp=1475866783376, value=379.25
 TSCO-1-Apr-08                                            column=stock_daily:open, timestamp=1475866783376, value=380.00
 TSCO-1-Apr-08                                            column=stock_daily:stock, timestamp=1475866783376, value=TESCO PLC
 TSCO-1-Apr-08                                            column=stock_daily:ticker, timestamp=1475866783376, value=TSCO
 TSCO-1-Apr-08                                            column=stock_daily:volume, timestamp=1475866783376, value=49664486

And the same on Phoenix on top of the HBase table

0: jdbc:phoenix:thin:url=http://rhes564:8765> select substr(to_char(to_date("Date",'dd-MMM-yy')),1,10) AS TradeDate, "close" AS "Day's close", "high" AS "Day's High", "low" AS "Day's Low", "open" AS "Day's Open", "ticker", "volume", (to_number("low")+to_number("high"))/2 AS "AverageDailyPrice" from "tsco" where to_number("volume") > 0 and "high" != '-' and to_date("Date",'dd-MMM-yy') > to_date('2015-10-06','yyyy-MM-dd') order by  to_date("Date",'dd-MMM-yy') limit 1;
+-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
|  TRADEDATE  | Day's close  | Day's High  | Day's Low  | Day's Open  | ticker  |  volume   | AverageDailyPrice  |
+-------------+--------------+-------------+------------+-------------+---------+-----------+--------------------+
| 2015-10-07  | 197.00       | 198.05      | 184.84     | 192.20      | TSCO    | 30046994  | 191.445            |

HTH
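
For the Spark side of the same idea, a minimal sketch using the phoenix-spark connector rather than raw JDBC. The format name and the "table"/"zkUrl" options follow the Phoenix documentation, but the ZooKeeper quorum and the table-name casing here are assumptions, not a verified setup.

// Read the Phoenix "tsco" table into Spark and expose it to Spark SQL.
// Assumes the phoenix-spark and Phoenix client jars are on the classpath.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("phoenix-spark-read").getOrCreate()

val tsco = spark.read
  .format("org.apache.phoenix.spark")
  .option("table", "tsco")
  .option("zkUrl", "zk-host:2181")
  .load()

tsco.createOrReplaceTempView("tsco")
spark.sql("SELECT * FROM tsco LIMIT 1").show()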




Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 8 October 2016 at 19:05, Felix Cheung <fe...@hotmail.com> wrote:

> Great, then I think those packages as Spark data source should allow you
> to do exactly that (replace org.apache.spark.sql.jdbc with HBASE one)
>
> I do think it will be great to get more examples around this though. Would
> be great if you could share your experience with this!
>
>
> _____________________________
> From: Benjamin Kim <bb...@gmail.com>
> Sent: Saturday, October 8, 2016 11:00 AM
> Subject: Re: Spark SQL Thriftserver with HBase
> To: Felix Cheung <fe...@hotmail.com>
> Cc: <us...@spark.apache.org>
>
>
> Felix,
>
> My goal is to use Spark SQL JDBC Thriftserver to access HBase tables using
> just SQL. I have been able to CREATE tables using this statement below in
> the past:
>
> CREATE TABLE <table-name>
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   url "jdbc:postgresql://<hostname>:<port>/dm?user=<username>&
> password=<password>",
>   dbtable "dim.dimension_acamp"
> );
>
>
> After doing this, I can access the PostgreSQL table using Spark SQL JDBC
> Thriftserver using SQL statements (SELECT, UPDATE, INSERT, etc.). I want to
> do the same with HBase tables. We tried this using Hive and HiveServer2,
> but the response times are just too long.
>
> Thanks,
> Ben
>
>
> On Oct 8, 2016, at 10:53 AM, Felix Cheung <fe...@hotmail.com>
> wrote:
>
> Ben,
>
> I'm not sure I'm following completely.
>
> Is your goal to use Spark to create or access tables in HBASE? If so the
> link below and several packages out there support that by having a HBASE
> data source for Spark. There are some examples on how the Spark code look
> like in that link as well. On that note, you should also be able to use the
> HBASE data source from pure SQL (Spark SQL) query as well, which should
> work in the case with the Spark SQL JDBC Thrift Server (with USING,
> http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10
> ).
>
>
> _____________________________
> From: Benjamin Kim <bb...@gmail.com>
> Sent: Saturday, October 8, 2016 10:40 AM
> Subject: Re: Spark SQL Thriftserver with HBase
> To: Felix Cheung <fe...@hotmail.com>
> Cc: <us...@spark.apache.org>
>
>
> Felix,
>
> The only alternative way is to create a stored procedure (udf) in database
> terms that would run Spark scala code underneath. In this way, I can use
> Spark SQL JDBC Thriftserver to execute it using SQL code passing the key,
> values I want to UPSERT. I wonder if this is possible since I cannot CREATE
> a wrapper table on top of a HBase table in Spark SQL?
>
> What do you think? Is this the right approach?
>
> Thanks,
> Ben
>
> On Oct 8, 2016, at 10:33 AM, Felix Cheung <fe...@hotmail.com>
> wrote:
>
> HBase has released support for Spark
> hbase.apache.org/book.html#spark
>
> And if you search you should find several alternative approaches.
>
>
>
>
>
> On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" <bb...@gmail.com>
> wrote:
>
> Does anyone know if Spark can work with HBase tables using Spark SQL? I
> know in Hive we are able to create tables on top of an underlying HBase
> table that can be accessed using MapReduce jobs. Can the same be done using
> HiveContext or SQLContext? We are trying to setup a way to GET and POST
> data to and from the HBase table using the Spark SQL JDBC thriftserver from
> our RESTful API endpoints and/or HTTP web farms. If we can get this to
> work, then we can load balance the thriftservers. In addition, this will
> benefit us in giving us a way to abstract the data storage layer away from
> the presentation layer code. There is a chance that we will swap out the
> data storage technology in the future. We are currently experimenting with
> Kudu.
>
> Thanks,
> Ben
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>
>
>
>
>
>
>

Re: Spark SQL Thriftserver with HBase

Posted by Benjamin Kim <bb...@gmail.com>.
Yes. I tried that with the hbase-spark package, but it didn’t work. We were hoping it would. If it did, we would be using it for everything from Ad Servers to REST Endpoints and even Reporting Servers. I guess we will have to wait until they fix it.


> On Oct 8, 2016, at 11:05 AM, Felix Cheung <fe...@hotmail.com> wrote:
> 
> Great, then I think those packages as Spark data source should allow you to do exactly that (replace org.apache.spark.sql.jdbc with HBASE one)
> 
> I do think it will be great to get more examples around this though. Would be great if you could share your experience with this!
> 
> 
> _____________________________
> From: Benjamin Kim <bbuild11@gmail.com <ma...@gmail.com>>
> Sent: Saturday, October 8, 2016 11:00 AM
> Subject: Re: Spark SQL Thriftserver with HBase
> To: Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>>
> Cc: <user@spark.apache.org <ma...@spark.apache.org>>
> 
> 
> Felix,
> 
> My goal is to use Spark SQL JDBC Thriftserver to access HBase tables using just SQL. I have been able to CREATE tables using this statement below in the past:
> 
> CREATE TABLE <table-name>
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   url "jdbc:postgresql://<hostname>:<port>/dm?user=<username>&password=<password>",
>   dbtable "dim.dimension_acamp"
> );
> 
> After doing this, I can access the PostgreSQL table using Spark SQL JDBC Thriftserver using SQL statements (SELECT, UPDATE, INSERT, etc.). I want to do the same with HBase tables. We tried this using Hive and HiveServer2, but the response times are just too long.
> 
> Thanks,
> Ben
> 
> 
> On Oct 8, 2016, at 10:53 AM, Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>> wrote:
> 
> Ben,
> 
> I'm not sure I'm following completely.
> 
> Is your goal to use Spark to create or access tables in HBASE? If so the link below and several packages out there support that by having a HBASE data source for Spark. There are some examples on how the Spark code look like in that link as well. On that note, you should also be able to use the HBASE data source from pure SQL (Spark SQL) query as well, which should work in the case with the Spark SQL JDBC Thrift Server (with USING,http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10 <http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10>).
> 
> 
> _____________________________
> From: Benjamin Kim <bbuild11@gmail.com <ma...@gmail.com>>
> Sent: Saturday, October 8, 2016 10:40 AM
> Subject: Re: Spark SQL Thriftserver with HBase
> To: Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>>
> Cc: <user@spark.apache.org <ma...@spark.apache.org>>
> 
> 
> Felix,
> 
> The only alternative way is to create a stored procedure (udf) in database terms that would run Spark scala code underneath. In this way, I can use Spark SQL JDBC Thriftserver to execute it using SQL code passing the key, values I want to UPSERT. I wonder if this is possible since I cannot CREATE a wrapper table on top of a HBase table in Spark SQL?
> 
> What do you think? Is this the right approach?
> 
> Thanks,
> Ben
> 
> On Oct 8, 2016, at 10:33 AM, Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>> wrote:
> 
> HBase has released support for Spark
> hbase.apache.org/book.html#spark <http://hbase.apache.org/book.html#spark>
> 
> And if you search you should find several alternative approaches.
> 
> 
> 
> 
> 
> On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" <bbuild11@gmail.com <ma...@gmail.com>> wrote:
> 
> Does anyone know if Spark can work with HBase tables using Spark SQL? I know in Hive we are able to create tables on top of an underlying HBase table that can be accessed using MapReduce jobs. Can the same be done using HiveContext or SQLContext? We are trying to setup a way to GET and POST data to and from the HBase table using the Spark SQL JDBC thriftserver from our RESTful API endpoints and/or HTTP web farms. If we can get this to work, then we can load balance the thriftservers. In addition, this will benefit us in giving us a way to abstract the data storage layer away from the presentation layer code. There is a chance that we will swap out the data storage technology in the future. We are currently experimenting with Kudu.
> 
> Thanks,
> Ben
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org <ma...@spark.apache.org>
> 
> 
> 
> 
> 


Re: Spark SQL Thriftserver with HBase

Posted by Felix Cheung <fe...@hotmail.com>.
Great, then I think those packages, as Spark data sources, should allow you to do exactly that (replace org.apache.spark.sql.jdbc with an HBASE one).

I do think it will be great to get more examples around this, though. It would be great if you could share your experience with this!
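
Concretely, that swap might look something like the DDL below, using the hbase-spark module's data source in place of org.apache.spark.sql.jdbc. This is a sketch only: the data source name and the hbase.table / hbase.columns.mapping option keys are taken from the hbase-spark documentation and differ between connector releases, the column mapping is illustrative, and Ben reports earlier in the thread that this route did not work against his cluster.

// Hypothetical HBase-backed table registered for the Spark SQL Thriftserver.
// Verify the data source name and option keys against the deployed connector.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

spark.sql("""
  CREATE TABLE tsco_hbase
  USING org.apache.hadoop.hbase.spark
  OPTIONS (
    hbase.table "tsco",
    hbase.columns.mapping "rowkey STRING :key, close STRING stock_daily:close, volume STRING stock_daily:volume"
  )
""")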


_____________________________
From: Benjamin Kim <bb...@gmail.com>>
Sent: Saturday, October 8, 2016 11:00 AM
Subject: Re: Spark SQL Thriftserver with HBase
To: Felix Cheung <fe...@hotmail.com>>
Cc: <us...@spark.apache.org>>


Felix,

My goal is to use Spark SQL JDBC Thriftserver to access HBase tables using just SQL. I have been able to CREATE tables using this statement below in the past:

CREATE TABLE <table-name>
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:postgresql://<hostname>:<port>/dm?user=<username>&password=<password>",
  dbtable "dim.dimension_acamp"
);

After doing this, I can access the PostgreSQL table using Spark SQL JDBC Thriftserver using SQL statements (SELECT, UPDATE, INSERT, etc.). I want to do the same with HBase tables. We tried this using Hive and HiveServer2, but the response times are just too long.

Thanks,
Ben


On Oct 8, 2016, at 10:53 AM, Felix Cheung <fe...@hotmail.com>> wrote:

Ben,

I'm not sure I'm following completely.

Is your goal to use Spark to create or access tables in HBASE? If so the link below and several packages out there support that by having a HBASE data source for Spark. There are some examples on how the Spark code look like in that link as well. On that note, you should also be able to use the HBASE data source from pure SQL (Spark SQL) query as well, which should work in the case with the Spark SQL JDBC Thrift Server (with USING,http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10).


_____________________________
From: Benjamin Kim <bb...@gmail.com>>
Sent: Saturday, October 8, 2016 10:40 AM
Subject: Re: Spark SQL Thriftserver with HBase
To: Felix Cheung <fe...@hotmail.com>>
Cc: <us...@spark.apache.org>>


Felix,

The only alternative way is to create a stored procedure (udf) in database terms that would run Spark scala code underneath. In this way, I can use Spark SQL JDBC Thriftserver to execute it using SQL code passing the key, values I want to UPSERT. I wonder if this is possible since I cannot CREATE a wrapper table on top of a HBase table in Spark SQL?

What do you think? Is this the right approach?

Thanks,
Ben

On Oct 8, 2016, at 10:33 AM, Felix Cheung <fe...@hotmail.com>> wrote:

HBase has released support for Spark
hbase.apache.org/book.html#spark<http://hbase.apache.org/book.html#spark>

And if you search you should find several alternative approaches.





On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" <bb...@gmail.com>> wrote:

Does anyone know if Spark can work with HBase tables using Spark SQL? I know in Hive we are able to create tables on top of an underlying HBase table that can be accessed using MapReduce jobs. Can the same be done using HiveContext or SQLContext? We are trying to setup a way to GET and POST data to and from the HBase table using the Spark SQL JDBC thriftserver from our RESTful API endpoints and/or HTTP web farms. If we can get this to work, then we can load balance the thriftservers. In addition, this will benefit us in giving us a way to abstract the data storage layer away from the presentation layer code. There is a chance that we will swap out the data storage technology in the future. We are currently experimenting with Kudu.

Thanks,
Ben
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org<ma...@spark.apache.org>







Re: Spark SQL Thriftserver with HBase

Posted by Benjamin Kim <bb...@gmail.com>.
Felix,

My goal is to use the Spark SQL JDBC Thriftserver to access HBase tables using just SQL. I have been able to CREATE tables using the statement below in the past:

CREATE TABLE <table-name>
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:postgresql://<hostname>:<port>/dm?user=<username>&password=<password>",
  dbtable "dim.dimension_acamp"
);

After doing this, I can access the PostgreSQL table through the Spark SQL JDBC Thriftserver with SQL statements (SELECT, UPDATE, INSERT, etc.). I want to do the same with HBase tables. We tried this using Hive and HiveServer2, but the response times are just too long.

Thanks,
Ben
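
Once a table like the one above is registered, the RESTful endpoints mentioned earlier in the thread would reach it through the Thriftserver over plain Hive JDBC. A minimal client sketch follows; the host, port, credentials, and the dim_acamp table name are placeholders, and it assumes the hive-jdbc driver and its dependencies are on the client classpath.

// Query a table registered in the Spark SQL Thriftserver from any JVM client.
import java.sql.DriverManager

object ThriftserverClient {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://thriftserver-host:10000/default", "user", "")
    try {
      val rs = conn.createStatement().executeQuery("SELECT * FROM dim_acamp LIMIT 10")
      while (rs.next()) println(rs.getString(1))   // print the first column of each row
    } finally {
      conn.close()
    }
  }
}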


> On Oct 8, 2016, at 10:53 AM, Felix Cheung <fe...@hotmail.com> wrote:
> 
> Ben,
> 
> I'm not sure I'm following completely.
> 
> Is your goal to use Spark to create or access tables in HBASE? If so the link below and several packages out there support that by having a HBASE data source for Spark. There are some examples on how the Spark code look like in that link as well. On that note, you should also be able to use the HBASE data source from pure SQL (Spark SQL) query as well, which should work in the case with the Spark SQL JDBC Thrift Server (with USING, http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10 <http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10>).
> 
> 
> _____________________________
> From: Benjamin Kim <bbuild11@gmail.com <ma...@gmail.com>>
> Sent: Saturday, October 8, 2016 10:40 AM
> Subject: Re: Spark SQL Thriftserver with HBase
> To: Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>>
> Cc: <user@spark.apache.org <ma...@spark.apache.org>>
> 
> 
> Felix,
> 
> The only alternative way is to create a stored procedure (udf) in database terms that would run Spark scala code underneath. In this way, I can use Spark SQL JDBC Thriftserver to execute it using SQL code passing the key, values I want to UPSERT. I wonder if this is possible since I cannot CREATE a wrapper table on top of a HBase table in Spark SQL?
> 
> What do you think? Is this the right approach?
> 
> Thanks,
> Ben
> 
> On Oct 8, 2016, at 10:33 AM, Felix Cheung <felixcheung_m@hotmail.com <ma...@hotmail.com>> wrote:
> 
> HBase has released support for Spark
> hbase.apache.org/book.html#spark <http://hbase.apache.org/book.html#spark>
> 
> And if you search you should find several alternative approaches.
> 
> 
> 
> 
> 
> On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" <bbuild11@gmail.com <ma...@gmail.com>> wrote:
> 
> Does anyone know if Spark can work with HBase tables using Spark SQL? I know in Hive we are able to create tables on top of an underlying HBase table that can be accessed using MapReduce jobs. Can the same be done using HiveContext or SQLContext? We are trying to setup a way to GET and POST data to and from the HBase table using the Spark SQL JDBC thriftserver from our RESTful API endpoints and/or HTTP web farms. If we can get this to work, then we can load balance the thriftservers. In addition, this will benefit us in giving us a way to abstract the data storage layer away from the presentation layer code. There is a chance that we will swap out the data storage technology in the future. We are currently experimenting with Kudu.
> 
> Thanks,
> Ben
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org <ma...@spark.apache.org>
> 
> 


Re: Spark SQL Thriftserver with HBase

Posted by Felix Cheung <fe...@hotmail.com>.
Ben,

I'm not sure I'm following completely.

Is your goal to use Spark to create or access tables in HBASE? If so, the link below and several packages out there support that by providing an HBASE data source for Spark. There are some examples of what the Spark code looks like at that link as well. On that note, you should also be able to use the HBASE data source from a pure SQL (Spark SQL) query as well, which should work in the case of the Spark SQL JDBC Thrift Server (with USING, http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10).
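
For reference, the DataFrame form of such an HBase data source, per the hbase-spark section of the HBase book linked below, would look roughly like the sketch here; the format name, option keys, table name, and column mapping are all assumptions tied to the connector version, not a verified setup.

// Read an HBase table as a DataFrame and register it for Spark SQL queries.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

val df = spark.read
  .format("org.apache.hadoop.hbase.spark")
  .option("hbase.table", "tsco")
  .option("hbase.columns.mapping",
    "rowkey STRING :key, close STRING stock_daily:close, high STRING stock_daily:high")
  .load()

df.createOrReplaceTempView("tsco_hbase")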


_____________________________
From: Benjamin Kim <bb...@gmail.com>>
Sent: Saturday, October 8, 2016 10:40 AM
Subject: Re: Spark SQL Thriftserver with HBase
To: Felix Cheung <fe...@hotmail.com>>
Cc: <us...@spark.apache.org>>


Felix,

The only alternative way is to create a stored procedure (udf) in database terms that would run Spark scala code underneath. In this way, I can use Spark SQL JDBC Thriftserver to execute it using SQL code passing the key, values I want to UPSERT. I wonder if this is possible since I cannot CREATE a wrapper table on top of a HBase table in Spark SQL?

What do you think? Is this the right approach?

Thanks,
Ben

On Oct 8, 2016, at 10:33 AM, Felix Cheung <fe...@hotmail.com>> wrote:

HBase has released support for Spark
hbase.apache.org/book.html#spark<http://hbase.apache.org/book.html#spark>

And if you search you should find several alternative approaches.





On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" <bb...@gmail.com>> wrote:

Does anyone know if Spark can work with HBase tables using Spark SQL? I know in Hive we are able to create tables on top of an underlying HBase table that can be accessed using MapReduce jobs. Can the same be done using HiveContext or SQLContext? We are trying to setup a way to GET and POST data to and from the HBase table using the Spark SQL JDBC thriftserver from our RESTful API endpoints and/or HTTP web farms. If we can get this to work, then we can load balance the thriftservers. In addition, this will benefit us in giving us a way to abstract the data storage layer away from the presentation layer code. There is a chance that we will swap out the data storage technology in the future. We are currently experimenting with Kudu.

Thanks,
Ben
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org<ma...@spark.apache.org>




Re: Spark SQL Thriftserver with HBase

Posted by Benjamin Kim <bb...@gmail.com>.
Felix,

The only alternative way is to create what in database terms would be a stored procedure (UDF) that runs Spark Scala code underneath. In this way, I can use the Spark SQL JDBC Thriftserver to execute it with SQL, passing the key and values I want to UPSERT. I wonder if this is possible, since I cannot CREATE a wrapper table on top of an HBase table in Spark SQL?

What do you think? Is this the right approach?

Thanks,
Ben
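
A minimal sketch of that stored-procedure idea: register a Spark UDF that writes to HBase with the standard client API, so a JDBC client of the Thriftserver can invoke it from SQL. Everything here is hypothetical (the hbase_upsert name, the tsco table, the stock_daily column family); it assumes the HBase client jars and hbase-site.xml are on the Thriftserver's classpath, the UDF would have to be registered inside the Thriftserver's own session, and connection pooling and error handling are omitted.

// "Stored procedure" style UDF: upserts one cell into HBase and returns "OK".
// A Thriftserver JDBC client could then run:
//   SELECT hbase_upsert('TSCO-1-Apr-08', 'close', '405.25');
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

spark.udf.register("hbase_upsert", (rowKey: String, qualifier: String, value: String) => {
  // One connection per call keeps the sketch simple; a real UDF would pool this.
  val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
  try {
    val table = conn.getTable(TableName.valueOf("tsco"))
    table.put(new Put(Bytes.toBytes(rowKey))
      .addColumn(Bytes.toBytes("stock_daily"), Bytes.toBytes(qualifier), Bytes.toBytes(value)))
    table.close()
    "OK"
  } finally {
    conn.close()
  }
})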

> On Oct 8, 2016, at 10:33 AM, Felix Cheung <fe...@hotmail.com> wrote:
> 
> HBase has released support for Spark
> hbase.apache.org/book.html#spark <http://hbase.apache.org/book.html#spark>
> 
> And if you search you should find several alternative approaches.
> 
> 
> 
> 
> 
> On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" <bbuild11@gmail.com <ma...@gmail.com>> wrote:
> 
> Does anyone know if Spark can work with HBase tables using Spark SQL? I know in Hive we are able to create tables on top of an underlying HBase table that can be accessed using MapReduce jobs. Can the same be done using HiveContext or SQLContext? We are trying to setup a way to GET and POST data to and from the HBase table using the Spark SQL JDBC thriftserver from our RESTful API endpoints and/or HTTP web farms. If we can get this to work, then we can load balance the thriftservers. In addition, this will benefit us in giving us a way to abstract the data storage layer away from the presentation layer code. There is a chance that we will swap out the data storage technology in the future. We are currently experimenting with Kudu.
> 
> Thanks,
> Ben
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org <ma...@spark.apache.org>

Re: Spark SQL Thriftserver with HBase

Posted by Felix Cheung <fe...@hotmail.com>.
HBase has released support for Spark
hbase.apache.org/book.html#spark<http://hbase.apache.org/book.html#spark>

And if you search you should find several alternative approaches.





On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" <bb...@gmail.com>> wrote:

Does anyone know if Spark can work with HBase tables using Spark SQL? I know in Hive we are able to create tables on top of an underlying HBase table that can be accessed using MapReduce jobs. Can the same be done using HiveContext or SQLContext? We are trying to setup a way to GET and POST data to and from the HBase table using the Spark SQL JDBC thriftserver from our RESTful API endpoints and/or HTTP web farms. If we can get this to work, then we can load balance the thriftservers. In addition, this will benefit us in giving us a way to abstract the data storage layer away from the presentation layer code. There is a chance that we will swap out the data storage technology in the future. We are currently experimenting with Kudu.

Thanks,
Ben
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org