Posted to user@spark.apache.org by nancy henry <na...@gmail.com> on 2017/03/09 03:06:26 UTC

spark-sql use case beginner question

Hi Team,

All of our data is in Hive tables, and until now we have processed it with
Hive on MapReduce. Now that HiveContext can run Hive queries on Spark, we
have started running these complex Hive scripts through a
hivecontext.sql(sc.textFile(hivescript)) kind of approach, i.e. running the
existing Hive queries on Spark without writing any Scala yet. Even with just
that change, running the Hive queries on Spark already shows a big
difference in run time compared to MapReduce.
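
To be concrete, what we do today is roughly the following (a minimal sketch
only; the script path is a placeholder, and we assume statements are
separated by ";"):

// spark-shell style sketch; "sc" is the SparkContext provided by the shell
import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)

// read the script (path is a placeholder), then run it statement by
// statement; assumes statements end with ";" and no ";" appears in literals
val script = sc.textFile("/path/to/hivescript.hql").collect().mkString("\n")
script.split(";").map(_.trim).filter(_.nonEmpty).foreach { stmt =>
  hc.sql(stmt)  // USE, CREATE TEMPORARY FUNCTION, INSERT OVERWRITE ... all run here
}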

So, since we already have the Hive scripts, should we simply keep running
those complex scripts through hc.sql, given that hc.sql is able to handle
them?

Or is that not best practice? Even though Spark can run the scripts as-is,
is it still better to load the individual Hive tables into Spark, build
RDDs, and write Scala code that reproduces the same logic we have in Hive?

It is becoming difficult for us to decide whether to leave the running of
these complex scripts to hc.sql or to rewrite them in Scala. Would the
manual rewrite be worth the effort in terms of performance?

An example of one of our scripts:

use db;
create temporary function tempfunction1 as 'com.fgh.jkl.TestFunction';

-- create the destination table in Hive, then:
insert overwrite table desttable
select (big complex transformations and usage of Hive UDFs)
from table1, table2, table3
join table4 on some complex join condition
join table7 on another complex join condition
where complex filtering;

So please help: what would be the best approach, and why should I not hand
the entire script to HiveContext and let it build its own RDDs and run on
Spark, if it is able to do that?

Because all the examples I see online only show hc.sql("select * from
table1") and nothing more complex than that.

Re: spark-sql use case beginner question

Posted by Subhash Sriram <su...@gmail.com>.
We have a similar use case. We use the DataFrame API to cache data out of
Hive tables, and then run pretty complex scripts on them. You can register
your Hive UDFs to be used within Spark SQL statements if you want.

Something like this:

sqlContext.sql("CREATE TEMPORARY FUNCTION <udf_name> as '<udf class>'")

If you had a table called Prices in the Stocks Hive db, you could do this:

val pricesDf = sqlContext.table("Stocks.Prices")
pricesDf.createOrReplaceTempView("tmp_prices")

Then, you can run whatever SQL you want against that temp view:

sqlContext.sql("select udf_name(), ..... from tmp_prices")
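
Putting those pieces together, a rough end-to-end sketch (the UDF name, the
table, and the "price" column below are just placeholders for illustration):

import org.apache.spark.sql.SparkSession

// Hive-enabled session; spark.sqlContext is the SQLContext used above
val spark = SparkSession.builder()
  .appName("hive-udf-example")
  .enableHiveSupport()
  .getOrCreate()
val sqlContext = spark.sqlContext

// register the Hive UDF so it can be used inside Spark SQL
sqlContext.sql("CREATE TEMPORARY FUNCTION udf_name AS 'com.fgh.jkl.TestFunction'")

// cache a Hive table as a DataFrame and expose it as a temp view
val pricesDf = sqlContext.table("Stocks.Prices").cache()
pricesDf.createOrReplaceTempView("tmp_prices")

// query the view, Hive UDF included
val resultDf = sqlContext.sql("select udf_name(price) as adjusted_price from tmp_prices")
resultDf.show()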

There are a lot of SQL functions available:

http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
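
For example, sticking with the DataFrame API instead of a SQL string (the
"symbol" and "price" columns here are made up):

import org.apache.spark.sql.functions.{avg, col, max}

val summaryDf = pricesDf
  .groupBy(col("symbol"))
  .agg(avg(col("price")).as("avg_price"), max(col("price")).as("max_price"))

summaryDf.show()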

I hope that helps.

Thanks,
Subhash


Re: spark-sql use case beginner question

Posted by nancy henry <na...@gmail.com>.
Okay, what is the difference between setting hive.execution.engine=spark
and running the script through hivecontext.sql?


Re: spark-sql use case beginner question

Posted by ayan guha <gu...@gmail.com>.
Hi

Subject to your versions of Hive and Spark, you may want to set
hive.execution.engine=spark as a beeline command-line parameter, assuming
you are running the Hive scripts through the beeline command line (which is
the suggested practice for security reasons).
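
For example, something along these lines (the JDBC URL and script path are
placeholders):

beeline -u "jdbc:hive2://<hiveserver2-host>:10000/default" \
  --hiveconf hive.execution.engine=spark \
  -f /path/to/hivescript.hql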




-- 
Best Regards,
Ayan Guha
