Posted to user@spark.apache.org by Darq Moth <da...@gmail.com> on 2014/04/25 22:30:59 UTC

Scala Spark / Shark: How to access existing Hive tables in Hortonworks?

I am trying to find some docs / a description of the approach on the subject,
please help. I have Hadoop 2.2.0 from Hortonworks installed, with some
existing Hive tables I need to query. Hive SQL runs extremely and
unreasonably slowly, on a single node and on the cluster alike. I hope Shark
will work faster.

From the Spark/Shark docs I cannot figure out how to make Shark work with
existing Hive tables. Any ideas how to achieve this? Thanks!

Re: Scala Spark / Shark: How to access existing Hive tables in Hortonworks?

Posted by Mayur Rustagi <ma...@gmail.com>.
Shark communicates over JDBC with the Hive *metastore* server. There is no
such thing as a separate Hive data server: Hive stores all its data in
Hadoop HDFS, which is where Shark pulls it from.

Shark works on nested select queries.



Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>



On Sat, Apr 26, 2014 at 2:52 AM, Darq Moth <da...@gmail.com> wrote:

> Thanks!
> For now I use JDBC from Scala to get data from Hive. In Hive I have a
> simple table with 20 rows in the following format:
>
> user_id, movie_title, rating, date
>
> I do 3 nested select requests:
> 1) select distinct user_id
> 2) for each user_id:
>        select distinct movie_title  // select all movies that user saw
> 3) for each movie_title:
>        select distinct user_id      // select all users who saw this movie
>
> On a local Hive table with 20 rows these nested queries take 26 minutes!
>
> Questions:
> 1) Will Shark optimize nested select requests, or will it just run the
> same selects over JDBC?
> 2) What wire protocol will Shark use to communicate with the remote Hive
> server?
>
>
> On Sat, Apr 26, 2014 at 12:35 AM, Mayur Rustagi <ma...@gmail.com> wrote:
>
>> You have to configure Shark to access the Hortonworks Hive metastore
>> (HCatalog?). You will then start seeing the tables in the Shark shell, can
>> run queries as normal, and Shark will leverage Spark to process them.
>>
>> Mayur Rustagi
>> Ph: +1 (760) 203 3257
>> http://www.sigmoidanalytics.com
>>  @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>
>>
>>
>> On Sat, Apr 26, 2014 at 2:00 AM, Darq Moth <da...@gmail.com> wrote:
>>
>>> I am trying to find some docs / a description of the approach on the
>>> subject, please help. I have Hadoop 2.2.0 from Hortonworks installed, with
>>> some existing Hive tables I need to query. Hive SQL runs extremely and
>>> unreasonably slowly, on a single node and on the cluster alike. I hope
>>> Shark will work faster.
>>>
>>> From the Spark/Shark docs I cannot figure out how to make Shark work with
>>> existing Hive tables. Any ideas how to achieve this? Thanks!
>>>
>>
>>
>

Re: Scala Spark / Shark: How to access existing Hive tables in Hortonworks?

Posted by Darq Moth <da...@gmail.com>.
Thanks!
For now I use JDBC from Scala to get data from Hive. In Hive I have a
simple table with 20 rows in the following format:

user_id, movie_title, rating, date

I do 3 nested select requests:
1) select distinct user_id
2) for each user_id:
       select distinct movie_title  // select all movies that user saw
3) for each movie_title:
       select distinct user_id      // select all users who saw this movie

On a local Hive table with 20 rows these nested queries take 26 minutes!

Questions:
1) Will Shark optimize nested select requests, or will it just run the same
selects over JDBC?
2) What wire protocol will Shark use to communicate with the remote Hive
server?
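[Editor's note: a sketch of an alternative to the nested loop above. Instead of issuing one DISTINCT query per user and per movie (N + M + 1 JDBC round trips, which is why 20 rows take minutes), you can fetch all (user_id, movie_title) pairs once — e.g. SELECT DISTINCT user_id, movie_title FROM ratings, where the table name `ratings` is assumed — and build both indexes in memory:]

```scala
object RatingsIndex {
  // From one pass over (user_id, movie_title) pairs, build both lookups:
  //   user  -> all movies that user saw
  //   movie -> all users who saw that movie
  def buildIndexes(pairs: Seq[(String, String)])
      : (Map[String, Set[String]], Map[String, Set[String]]) = {
    val userToMovies =
      pairs.groupBy(_._1).map { case (u, ps) => u -> ps.map(_._2).toSet }
    val movieToUsers =
      pairs.groupBy(_._2).map { case (m, ps) => m -> ps.map(_._1).toSet }
    (userToMovies, movieToUsers)
  }
}
```

[With JDBC the `pairs` would come from iterating the single ResultSet; the per-user and per-movie DISTINCT queries disappear entirely.]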


On Sat, Apr 26, 2014 at 12:35 AM, Mayur Rustagi <ma...@gmail.com> wrote:

> You have to configure Shark to access the Hortonworks Hive metastore
> (HCatalog?). You will then start seeing the tables in the Shark shell, can
> run queries as normal, and Shark will leverage Spark to process them.
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>
>
>
> On Sat, Apr 26, 2014 at 2:00 AM, Darq Moth <da...@gmail.com> wrote:
>
>> I am trying to find some docs / a description of the approach on the
>> subject, please help. I have Hadoop 2.2.0 from Hortonworks installed, with
>> some existing Hive tables I need to query. Hive SQL runs extremely and
>> unreasonably slowly, on a single node and on the cluster alike. I hope
>> Shark will work faster.
>>
>> From the Spark/Shark docs I cannot figure out how to make Shark work with
>> existing Hive tables. Any ideas how to achieve this? Thanks!
>>
>
>

Re: Scala Spark / Shark: How to access existing Hive tables in Hortonworks?

Posted by Mayur Rustagi <ma...@gmail.com>.
You have to configure Shark to access the Hortonworks Hive metastore
(HCatalog?). You will then start seeing the tables in the Shark shell, can
run queries as normal, and Shark will leverage Spark to process them.
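[Editor's note: a minimal sketch of that configuration. Shark reads the same hive-site.xml that Hive itself uses, so pointing it at the existing metastore usually means copying (or symlinking) Hive's hive-site.xml into Shark's conf directory with the metastore URI set; the hostname and port below are hypothetical placeholders for the values from the Hortonworks setup:]

```
<!-- Shark's conf/hive-site.xml (copied from the existing Hive installation) -->
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore-host:9083</value>
  </property>
</configuration>
```

[Once Shark picks this up, SHOW TABLES in the Shark shell should list the existing Hive tables.]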

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>



On Sat, Apr 26, 2014 at 2:00 AM, Darq Moth <da...@gmail.com> wrote:

> I am trying to find some docs / a description of the approach on the
> subject, please help. I have Hadoop 2.2.0 from Hortonworks installed, with
> some existing Hive tables I need to query. Hive SQL runs extremely and
> unreasonably slowly, on a single node and on the cluster alike. I hope
> Shark will work faster.
>
> From the Spark/Shark docs I cannot figure out how to make Shark work with
> existing Hive tables. Any ideas how to achieve this? Thanks!
>