Posted to user@hive.apache.org by sudeep tokala <su...@gmail.com> on 2012/08/14 17:08:17 UTC

OPTIMIZING A HIVE QUERY

Hi all,

How can I avoid serialization and deserialization overhead in a Hive join
query? Will this improve my query performance?

Regards
sudeep

Re: OPTIMIZING A HIVE QUERY

Posted by sudeep tokala <su...@gmail.com>.
Thanks for the reply, Bertrand.


Re: OPTIMIZING A HIVE QUERY

Posted by Bertrand Dechoux <de...@gmail.com>.
> My question was: does every join in a Hive query constitute a MapReduce
> job?
In the general case, yes. BUT if one side of your join is small enough (i.e.
it can be kept entirely in memory), a hash join/map join can be performed
instead, which is much faster because no reduce phase is required.
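
For example, the map join can be requested explicitly with a hint, or left
to Hive to detect (a minimal, untested sketch; the table and column names
are invented):

    -- let Hive convert joins with a small enough side into map joins
    SET hive.auto.convert.join=true;

    -- or name the small table's alias explicitly with a hint
    SELECT /*+ MAPJOIN(d) */ f.user_id, d.country
    FROM fact f JOIN dim d ON (f.dim_id = d.id);

The small table is turned into a hash table that each mapper loads, so the
join finishes map-side and the shuffle/reduce phase is skipped entirely.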

Bejoy KS has just provided the right link.

> Store data in a smarter way? Can you please elaborate on this?
That's not Hive-specific. The same logic applies to an RDBMS. You want to
keep a normalized source of data, but sometimes denormalizing it can
greatly improve your performance. That's one of the advantages of document
stores. It is very dependent on your use case.
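
As an illustration (an untested sketch; the table names are invented), the
join can be paid once when the data is loaded instead of on every query:

    -- materialize the joined result once
    CREATE TABLE orders_denorm AS
    SELECT o.order_id, o.amount, c.country
    FROM orders o JOIN customers c ON (o.customer_id = c.id);

    -- later queries then need no join at all
    SELECT country, SUM(amount) FROM orders_denorm GROUP BY country;

The price is extra storage and having to rebuild the table when the source
data changes.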

Bertrand

-- 
Bertrand Dechoux

Re: OPTIMIZING A HIVE QUERY

Posted by sudeep tokala <su...@gmail.com>.
Thanks, Bejoy.


Re: OPTIMIZING A HIVE QUERY

Posted by Bejoy Ks <be...@yahoo.com>.
Hi Sudeep

You can also look at join optimizations like map join, bucketed map join,
sort-merge join, etc., and choose the one that fits your requirement.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
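
For instance, if both tables are bucketed and sorted on the join key, the
bucketed variants can be switched on (an untested sketch; the table layout
and bucket count are invented):

    -- both tables must be bucketed/sorted on the join key, e.g.:
    CREATE TABLE orders (order_id INT, customer_id INT, amount DOUBLE)
    CLUSTERED BY (customer_id) SORTED BY (customer_id) INTO 32 BUCKETS;

    SET hive.enforce.bucketing=true;                    -- when populating the buckets
    SET hive.optimize.bucketmapjoin=true;               -- bucketed map join
    SET hive.optimize.bucketmapjoin.sortedmerge=true;   -- sort-merge bucket join
    SET hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;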

Regards,
Bejoy KS



Re: OPTIMIZING A HIVE QUERY

Posted by sudeep tokala <su...@gmail.com>.
Hi Bertrand,

Thanks for the reply.

My question was: does every join in a Hive query constitute a MapReduce
job? A MapReduce job goes through serialization and deserialization of
objects. Isn't that an overhead?

Store data in a smarter way? Can you please elaborate on this?

Regards
Sudeep


Re: OPTIMIZING A HIVE QUERY

Posted by Bertrand Dechoux <de...@gmail.com>.
You may want to be clearer. Is your question: how can I change the
serialization strategy of Hive? (If so, I'll let other users answer, as I
am also interested in the answer.)
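
That said, one lever does exist: each table's on-disk serialization is set
by its storage format and SerDe, so a more compact binary format can at
least be tried (an untested sketch; the table definition is invented):

    -- columnar binary storage instead of plain delimited text
    CREATE TABLE events (user_id INT, ts STRING, payload STRING)
    STORED AS RCFILE;

Whether that actually helps depends on the workload, so measure before and
after.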

Otherwise the answer is simple. If you want to join data that cannot be
held in memory, you need to serialize it. The only alternative is to store
the data in a smarter way so that the join is not required. By the way, how
do you know that serialization is the bottleneck?
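
A quick way to see what a query really compiles to (how many MapReduce
stages there are, and which joins become map joins) is EXPLAIN (a sketch;
the table names are invented):

    EXPLAIN
    SELECT f.user_id, d.country
    FROM fact f JOIN dim d ON (f.dim_id = d.id);

That output, together with the job counters, is a better starting point
than guessing where the time goes.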

Bertrand



-- 
Bertrand Dechoux
