Posted to user@hive.apache.org by shan s <my...@gmail.com> on 2012/06/01 13:55:16 UTC

Multi-GroupBy-Insert optimization

I am using Multi-GroupBy-Insert. I was expecting a single map-reduce job
which would club the group-bys together.
However, it is scheduling n jobs, where n = the number of group-bys.
Could you please explain this behaviour?

From X
INSERT OVERWRITE LOCAL DIRECTORY 'output/y1'
SELECT a, b, c, count(*)
GROUP BY a, b, c
INSERT OVERWRITE LOCAL DIRECTORY 'output/y2'
SELECT a, count(*)
GROUP BY a
INSERT OVERWRITE LOCAL DIRECTORY 'output/y3'
SELECT b, count(*)
GROUP BY b
…..
…..
……
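
For reference, the extra jobs also show up as separate map-reduce stages
when the statement is prefixed with EXPLAIN, e.g. (shortened here to two
of the group-bys):

EXPLAIN
From X
INSERT OVERWRITE LOCAL DIRECTORY 'output/y1'
SELECT a, b, c, count(*)
GROUP BY a, b, c
INSERT OVERWRITE LOCAL DIRECTORY 'output/y2'
SELECT a, count(*)
GROUP BY a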

Re: Multi-GroupBy-Insert optimization

Posted by Jan Dolinár <do...@gmail.com>.
Hi Shan,

If you happen to have a lot of repeated data (in the most general
grouping), you might get some speedup by a little pre-aggregation. The
following code should produce the same results as the example in your
first post:

From (
  SELECT a, b, c, count(*) AS cnt
  FROM X
  GROUP BY a, b, c
) t
INSERT OVERWRITE LOCAL DIRECTORY 'output/y1'
SELECT a, b, c, cnt
INSERT OVERWRITE LOCAL DIRECTORY 'output/y2'
SELECT a, SUM(cnt)
GROUP BY a
INSERT OVERWRITE LOCAL DIRECTORY 'output/y3'
SELECT b, SUM(cnt)
GROUP BY b
...

The trick is that there will be one extra job that first reduces the
number of records the following jobs have to process: they will only
have to read one line for each distinct triplet (a, b, c). Note that
this will only help if the number of distinct combinations is
relatively low compared to the total amount of data. In other cases it
might make no difference or even make the calculation longer.
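
One way to check in advance whether this is likely to pay off is to
compare the total row count with the number of distinct triplets, e.g.:

SELECT count(*) AS total_rows,
       count(DISTINCT a, b, c) AS distinct_triplets
FROM X;

If distinct_triplets comes out close to total_rows, the extra job will
mostly just add overhead.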

Hope that helps... I can't think of anything else that could help you.

Jan

On 6/5/12, Jan Dolinár <do...@gmail.com> wrote:
> On 6/4/12, shan s <my...@gmail.com> wrote:
>> Thanks for the explanation, Jan.
>> If I understand correctly, the input will be read a single time and
>> preprocessed in some form, and this intermediate data is used for the
>> subsequent group-bys.
>> Not sure if my scenario will benefit from this single step, since the
>> group-bys vary across very different entities.
>
> Yes, that is correct. The very simplest use case is when you only
> scan a part of the table. But if you are interested in all the data,
> it is not going to help you much.
>
>> If I were to implement the group-bys manually, generally we could club
>> them together in a single program. Can I do better with Hive, with some
>> hints/optimizations?
>> Or is there a possibility that Pig might perform better in this case?
>> (Assuming Pig would probably handle this in a single job.)
>
> In some cases it might be possible to outsmart the Hive optimizer and
> write the mapreduce job directly in Java in such a way that it performs
> better. In most cases, though, it is probably not worth the trouble.
> You might easily end up in a situation where buying more machines is
> cheaper than developing low-level solutions that might or might not be
> slightly faster... I'm not familiar with Pig or any other tools that
> might be of use in your situation.
>
> Jan
>

Re: Multi-GroupBy-Insert optimization

Posted by Jan Dolinár <do...@gmail.com>.
On 6/4/12, shan s <my...@gmail.com> wrote:
> Thanks for the explanation, Jan.
> If I understand correctly, the input will be read a single time and
> preprocessed in some form, and this intermediate data is used for the
> subsequent group-bys.
> Not sure if my scenario will benefit from this single step, since the
> group-bys vary across very different entities.

Yes, that is correct. The very simplest use case is when you only
scan a part of the table. But if you are interested in all the data,
it is not going to help you much.
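
For example, assuming your table had a hypothetical date column dt that
every output filtered the same way, the filter would run once in the
subquery instead of once per group-by:

From (
  SELECT a, b FROM X WHERE dt = '2012-06-01'
) t
INSERT OVERWRITE LOCAL DIRECTORY 'output/y1'
SELECT a, count(*)
GROUP BY a
INSERT OVERWRITE LOCAL DIRECTORY 'output/y2'
SELECT b, count(*)
GROUP BY b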

> If I were to implement the group-bys manually, generally we could club
> them together in a single program. Can I do better with Hive, with some
> hints/optimizations?
> Or is there a possibility that Pig might perform better in this case?
> (Assuming Pig would probably handle this in a single job.)

In some cases it might be possible to outsmart the Hive optimizer and
write the mapreduce job directly in Java in such a way that it performs
better. In most cases, though, it is probably not worth the trouble.
You might easily end up in a situation where buying more machines is
cheaper than developing low-level solutions that might or might not be
slightly faster... I'm not familiar with Pig or any other tools that
might be of use in your situation.

Jan

Re: Multi-GroupBy-Insert optimization

Posted by shan s <my...@gmail.com>.
Thanks for the explanation, Jan.
If I understand correctly, the input will be read a single time and
preprocessed in some form, and this intermediate data is used for the
subsequent group-bys.
Not sure if my scenario will benefit from this single step, since the
group-bys vary across very different entities.

If I were to implement the group-bys manually, generally we could club
them together in a single program. Can I do better with Hive, with some
hints/optimizations?
Or is there a possibility that Pig might perform better in this case?
(Assuming Pig would probably handle this in a single job.)

Thank You,
Prashant

p.s.
Just in case the data below helps...
In my scenario, the data has 500,000 of entity1, 500 of entity2, and 5
of entity3.
The fact table has 250M rows (entity1 * entity2).
The current job has 22 group-bys based on various combinations of the 3
entities and fact-table record types; it produces 22M rows. It takes 3
hours on a 4-machine cluster with good configuration.


On Mon, Jun 4, 2012 at 6:52 PM, Jan Dolinár <do...@gmail.com> wrote:

>
>    On Fri, Jun 1, 2012 at 5:25 PM, shan s <my...@gmail.com> wrote:
>>
>>> I am using Multi-GroupBy-Insert. I was expecting a single map-reduce job
>>> which would club the group-bys together.
>>> However, it is scheduling n jobs, where n = the number of group-bys.
>>> Could you please explain this behaviour?
>>>
>>>
>>
> No, it will result in at least as many jobs as there are group-bys. The
> efficiency lies not in lowering the number of jobs, but in the fact that
> the first job usually reduces the amount of data that the rest need to
> go through. E.g. if the FROM clause includes a subquery, or when the
> group-bys have similar WHERE clauses, this "pre-selection" is executed
> first and the subsequent jobs operate on its results instead of the
> entire table/partition, and are therefore much faster.
>
>
> J. Dolinar
>

Re: Multi-GroupBy-Insert optimization

Posted by Jan Dolinár <do...@gmail.com>.
> On Fri, Jun 1, 2012 at 5:25 PM, shan s <my...@gmail.com> wrote:
>
>> I am using Multi-GroupBy-Insert. I was expecting a single map-reduce job
>> which would club the group-bys together.
>> However, it is scheduling n jobs, where n = the number of group-bys.
>> Could you please explain this behaviour?
>>
>>
>
No, it will result in at least as many jobs as there are group-bys. The
efficiency lies not in lowering the number of jobs, but in the fact that
the first job usually reduces the amount of data that the rest need to
go through. E.g. if the FROM clause includes a subquery, or when the
group-bys have similar WHERE clauses, this "pre-selection" is executed
first and the subsequent jobs operate on its results instead of the
entire table/partition, and are therefore much faster.
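
A minimal sketch of that, assuming hypothetical conditions (c > 0
shared by all outputs, b IS NOT NULL specific to one of them): the
shared filter is hoisted into the subquery and runs once, while each
insert keeps its own remaining filter:

From (
  SELECT a, b FROM X WHERE c > 0
) t
INSERT OVERWRITE LOCAL DIRECTORY 'output/y1'
SELECT a, count(*)
WHERE b IS NOT NULL
GROUP BY a
INSERT OVERWRITE LOCAL DIRECTORY 'output/y2'
SELECT b, count(*)
GROUP BY b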


J. Dolinar

Re: Multi-GroupBy-Insert optimization

Posted by shan s <my...@gmail.com>.
Anyone?
Thanks..

On Fri, Jun 1, 2012 at 5:25 PM, shan s <my...@gmail.com> wrote:

> I am using Multi-GroupBy-Insert. I was expecting a single map-reduce job
> which would club the group-bys together.
> However, it is scheduling n jobs, where n = the number of group-bys.
> Could you please explain this behaviour?
>
> From X
> INSERT OVERWRITE LOCAL DIRECTORY 'output/y1'
> SELECT a, b, c, count(*)
> GROUP BY a, b, c
> INSERT OVERWRITE LOCAL DIRECTORY 'output/y2'
> SELECT a, count(*)
> GROUP BY a
> INSERT OVERWRITE LOCAL DIRECTORY 'output/y3'
> SELECT b, count(*)
> GROUP BY b
> …..
> …..
> ……
>