Posted to user@hive.apache.org by sjayatheertha <sj...@gmail.com> on 2016/03/14 00:27:26 UTC

Hive on Spark performance

Just curious if you could share your experience with the performance of Spark at your company? How much data do you process? And what latency are you getting with the Spark engine?

Vidya

Re: Hive on Spark performance

Posted by sjayatheertha <sj...@gmail.com>.
Thanks for your response. We were evaluating Spark and were curious to know how it is used today and the lowest latency it can provide. 

> On Mar 14, 2016, at 8:37 AM, Mich Talebzadeh <mi...@gmail.com> wrote:
> 
> Hi Wlodeck,
> 
> Let us look at this.
> 
> In Oracle I have two tables channels and sales. This code works in Oracle
> 
>   1  select c.channel_id, sum(c.channel_id * (select count(1) from sales s WHERE c.channel_id = s.channel_id)) As R
>   2  from channels c
>   3* group by c.channel_id
> sh@MYDB.MICH.LOCAL> /
> CHANNEL_ID          R
> ---------- ----------
>          2     516050
>          3    1620984
>          4     473664
>          5          0
>          9      18666
> 
> I have the same tables in Hive, but the same query crashes!
> 
> hive> select c.channel_id, sum(c.channel_id * (select count(1) from sales s WHERE c.channel_id = s.channel_id)) As R
>     > from channels c
>     > group by c.channel_id
>     > ;
> NoViableAltException(232@[435:1: precedenceEqualExpression : ( ( LPAREN precedenceBitwiseOrExpression COMMA )=> precedenceEqualExpressionMutiple | precedenceEqualExpressionSingle );])
> 
> Hive's parser does not accept a correlated scalar subquery in the SELECT list, hence the error. The workaround is to use a temporary table to hold the count/group by from the sales table as an intermediate stage (temporary tables are session-specific; they are created in the session and dropped when it ends).
> 
> hive> create temporary table tmp as select channel_id, count(channel_id) as total from sales group by channel_id;
> 
> 
> Ok the rest is pretty easy
> 
> hive> select c.channel_id, c.channel_id * t.total as results
>     > from channels c, tmp t
>     > where c.channel_id = t.channel_id;
> 
> 2.0     2800432.0
> 3.0     8802300.0
> 4.0     2583552.0
> 9.0     104013.0
> 
> HTH
> 
> 
> 
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
>> On 14 March 2016 at 14:22, ws <wl...@yahoo.com> wrote:
>> Hive 1.2.1.2.3.4.0-3485
>> Spark 1.5.2
>> Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
>> 
>> ### 
>> SELECT 
>> 	f.description,
>> 	f.item_number,
>> 	sum(f.df_a * (select count(1) from e.mv_A_h_a where hb_h_name = r.h_id)) as df_a
>> FROM e.eng_fac_atl_sc_bf_qty f, wv_ATL_2_qty_df_rates r
>> where f.item_number NOT LIKE 'HR%' AND f.item_number NOT LIKE 'UG%' AND f.item_number NOT LIKE 'DEV%'
>> group by 
>> 	f.description,
>> 	f.item_number
>> ###
>> 
>> This query works fine in Oracle but not in Hive or Spark.
>> The problem is the "sum(f.df_a * (select count(1) from e.mv_A_h_a where hb_h_name = r.h_id)) as df_a" field.
>> 
>> 
>> Thanks,
>> Wlodek
>> --
>> 
>> 
>> On Sunday, March 13, 2016 7:36 PM, Mich Talebzadeh <mi...@gmail.com> wrote:
>> 
>> 
>> It depends on the version of Hive on the Spark engine.
>> 
>> As far as I am aware, the latest version of Hive that I am using (Hive 2) has improvements over the previous versions of Hive (0.14, 1.2.1) on the Spark engine.
>> 
>> As of today I have managed to use Hive 2.0 on Spark 1.3.1, so it is not the latest Spark, but it is pretty good.
>> 
>> What specific concerns do you have in mind?
>> 
>> HTH
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
>>  
>> 
>> On 13 March 2016 at 23:27, sjayatheertha <sj...@gmail.com> wrote:
>> Just curious if you could share your experience on the performance of spark in your company? How much data do you process? And what's the latency you are getting with spark engine?
>> 
>> Vidya
> 

Re: Hive on Spark performance

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi Wlodeck,

Let us look at this.

In Oracle I have two tables channels and sales. This code works in Oracle

  1  select c.channel_id, sum(c.channel_id * (select count(1) from sales s
WHERE c.channel_id = s.channel_id)) As R
  2  from channels c
  3* group by c.channel_id
sh@MYDB.MICH.LOCAL> /
CHANNEL_ID          R
---------- ----------
         2     516050
         3    1620984
         4     473664
         5          0
         9      18666

I have the same tables in Hive, but the same query crashes!

hive> select c.channel_id, sum(c.channel_id * (select count(1) from sales s
WHERE c.channel_id = s.channel_id)) As R
    > from channels c
    > group by c.channel_id
    > ;
NoViableAltException(232@[435:1: precedenceEqualExpression : ( ( LPAREN
precedenceBitwiseOrExpression COMMA )=> precedenceEqualExpressionMutiple |
precedenceEqualExpressionSingle );])

Hive's parser does not accept a correlated scalar subquery in the SELECT list,
hence the error. The workaround is to use a temporary table to hold the
count/group by from the sales table as an intermediate stage (temporary tables
are session-specific; they are created in the session and dropped when it ends).

hive> create temporary table tmp as select channel_id, count(channel_id) as
total from sales group by channel_id;


Ok the rest is pretty easy

hive> select c.channel_id, c.channel_id * t.total as results
    > from channels c, tmp t
    > where c.channel_id = t.channel_id;

2.0     2800432.0
3.0     8802300.0
4.0     2583552.0
9.0     104013.0
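
A side note: the inner join above drops any channel that has no rows in sales (channel 5, which came back as 0 in the Oracle output). A minimal single-statement sketch, assuming the same channels and sales tables, pre-aggregates the counts in a derived table and left joins it so those channels are kept at 0:

-- derived table replaces the correlated scalar subquery;
-- coalesce keeps channels with no sales at 0 instead of dropping them
select c.channel_id,
       c.channel_id * coalesce(t.total, 0) as results
from channels c
left join (select channel_id, count(channel_id) as total
           from sales
           group by channel_id) t
  on c.channel_id = t.channel_id;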

HTH







Dr Mich Talebzadeh



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 14 March 2016 at 14:22, ws <wl...@yahoo.com> wrote:

> Hive 1.2.1.2.3.4.0-3485
> Spark 1.5.2
> Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit
> Production
>
> ###
> SELECT
> f.description,
> f.item_number,
> sum(f.df_a * (select count(1) from e.mv_A_h_a where hb_h_name = r.h_id))
> as df_a
> FROM e.eng_fac_atl_sc_bf_qty f, wv_ATL_2_qty_df_rates r
> where f.item_number NOT LIKE 'HR%' AND f.item_number NOT LIKE 'UG%' AND
> f.item_number NOT LIKE 'DEV%'
> group by
> f.description,
> f.item_number
> ###
>
> This query works fine in Oracle but not in Hive or Spark.
> The problem is the "sum(f.df_a * (select count(1) from e.mv_A_h_a where
> hb_h_name = r.h_id)) as df_a" field.
>
>
> Thanks,
> Wlodek
> --
>
>
> On Sunday, March 13, 2016 7:36 PM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
>
> It depends on the version of Hive on the Spark engine.
>
> As far as I am aware, the latest version of Hive that I am using (Hive 2)
> has improvements over the previous versions of Hive (0.14, 1.2.1) on the
> Spark engine.
>
> As of today I have managed to use Hive 2.0 on Spark 1.3.1, so it is not
> the latest Spark, but it is pretty good.
>
> What specific concerns do you have in mind?
>
> HTH
>
>
> Dr Mich Talebzadeh
>
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
>
> On 13 March 2016 at 23:27, sjayatheertha <sj...@gmail.com> wrote:
>
> Just curious if you could share your experience on the performance of
> spark in your company? How much data do you process? And what's the latency
> you are getting with spark engine?
>
> Vidya
>
>
>
>
>

Re: Hive on Spark performance

Posted by ws <wl...@yahoo.com>.
Hive 1.2.1.2.3.4.0-3485
Spark 1.5.2
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production

###
SELECT
    f.description,
    f.item_number,
    sum(f.df_a * (select count(1) from e.mv_A_h_a where hb_h_name = r.h_id)) as df_a
FROM e.eng_fac_atl_sc_bf_qty f, wv_ATL_2_qty_df_rates r
where f.item_number NOT LIKE 'HR%' AND f.item_number NOT LIKE 'UG%' AND f.item_number NOT LIKE 'DEV%'
group by
    f.description,
    f.item_number
###
This query works fine in Oracle but not in Hive or Spark.
The problem is the "sum(f.df_a * (select count(1) from e.mv_A_h_a where hb_h_name = r.h_id)) as df_a" field.
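
For illustration only (the column meanings below are assumptions read off the query, not confirmed): the correlated scalar subquery can be expressed as a pre-aggregated derived table joined on r.h_id, along the same lines as the channels/sales workaround above. A hedged sketch:

-- counts per hb_h_name replace the correlated scalar subquery;
-- coalesce counts r.h_id values with no match as 0
select f.description,
       f.item_number,
       sum(f.df_a * coalesce(a.cnt, 0)) as df_a
from e.eng_fac_atl_sc_bf_qty f
cross join wv_ATL_2_qty_df_rates r
left join (select hb_h_name, count(1) as cnt
           from e.mv_A_h_a
           group by hb_h_name) a
  on a.hb_h_name = r.h_id
where f.item_number not like 'HR%'
  and f.item_number not like 'UG%'
  and f.item_number not like 'DEV%'
group by f.description, f.item_number;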

Thanks,
Wlodek
--

    On Sunday, March 13, 2016 7:36 PM, Mich Talebzadeh <mi...@gmail.com> wrote:
 

It depends on the version of Hive on the Spark engine.
As far as I am aware, the latest version of Hive that I am using (Hive 2) has improvements over the previous versions of Hive (0.14, 1.2.1) on the Spark engine.
As of today I have managed to use Hive 2.0 on Spark 1.3.1, so it is not the latest Spark, but it is pretty good.
What specific concerns do you have in mind?
HTH

Dr Mich Talebzadeh

LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com
On 13 March 2016 at 23:27, sjayatheertha <sj...@gmail.com> wrote:

Just curious if you could share your experience on the performance of spark in your company? How much data do you process? And what's the latency you are getting with spark engine?

Vidya



  

Re: Hive on Spark performance

Posted by Mich Talebzadeh <mi...@gmail.com>.
It depends on the version of Hive on the Spark engine.

As far as I am aware, the latest version of Hive that I am using (Hive 2)
has improvements over the previous versions of Hive (0.14, 1.2.1) on the
Spark engine.

As of today I have managed to use Hive 2.0 on Spark 1.3.1, so it is not the
latest Spark, but it is pretty good.
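
For anyone trying the same combination, a minimal sketch of the session-level settings involved, assuming Hive was built with Spark support and the Spark assembly jar is visible to Hive (the resource values are illustrative only):

-- switch this session's execution engine to Spark
set hive.execution.engine=spark;
-- point Hive at the cluster manager (yarn-client was the usual choice with Spark 1.x)
set spark.master=yarn-client;
-- illustrative executor sizing; tune for the actual cluster
set spark.executor.memory=4g;
set spark.executor.instances=4;
set spark.executor.cores=2;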

What specific concerns do you have in mind?

HTH


Dr Mich Talebzadeh



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 13 March 2016 at 23:27, sjayatheertha <sj...@gmail.com> wrote:

> Just curious if you could share your experience on the performance of
> spark in your company? How much data do you process? And what's the latency
> you are getting with spark engine?
>
> Vidya
