You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by Firas Abuzaid <fa...@stanford.edu> on 2014/07/31 21:28:10 UTC

Tuning Triangle Joins on Hive

Hi,

We're running various "triangle" join queries on Hive 0.9.0, and we're
wondering if we can get any better performance. Here's the query we're
running:

SELECT count(*)
FROM table r1 JOIN table r2 ON (r1.dst = r2.src) JOIN table r3 ON (r2.dst =
r3.src AND r3.dst = r1.src)
WHERE r1.src < r2.src AND r2.src < r3.src;

We're currently passing the following tuning parameters as well:

set mapred.map.tasks=120;
set mapred.reduce.tasks=120;
set mapred.tasktracker.map.tasks.maximum=8;
set mapred.tasktracker.reduce.tasks.maximum=8;
set mapred.child.java.opts=-Xmx5120m;

The dataset we're using has 5 million nodes and 70 million edges, and most
of our time is spent on garbage collection. We have about 30 machines in
our cluster, and each machine has 45GB of RAM. Any thoughts on how we can
improve performance? Thanks in advance!

Re: Tuning Triangle Joins on Hive

Posted by Firas Abuzaid <fa...@stanford.edu>.

Thanks, that's very helpful!


On Sat, Aug 2, 2014 at 12:47 PM, Lefty Leverenz <le...@gmail.com>
wrote:

> How does indexes work in hive?
>>
>
> See the Indexes design doc
> <https://cwiki.apache.org/confluence/display/Hive/IndexDev> in the Hive
> wiki, although it hasn't been updated.
>
> -- Lefty
>
>
> On Sat, Aug 2, 2014 at 2:07 AM, chandra Reddy Bogala <
> chandra.reddy2005@gmail.com> wrote:
>
>> How does indexes work in hive? I thought file formats like ORC have
>> indexes in each block. But not a separate index that can help query
>> performance.
>> Thanks,
>> Chandra
>>
>>
>> On Fri, Aug 1, 2014 at 9:10 AM, Devopam Mittra <de...@gmail.com> wrote:
>>
>>> Please try the following approach and let me know if you are not getting
>>> better performance:
>>>
>>> 1. Ensure indexes are present on dst , rsc columns in the respective
>>> tables.
>>> 2. Create a subset first taking r2 and r2 (i.e.: r3.src > r2.src) in a
>>> physical table, and then create index on its new src column as well
>>> 3. Join this to r1
>>>
>>> If this approach works well, then try out the WITH SELECT ... using the
>>> same approach , just no physical intermediate table will be created.
>>>
>>> Hope it helps..
>>>
>>> regards
>>> Dev
>>>
>>>
>>>
>>>
>>> On Fri, Aug 1, 2014 at 12:58 AM, Firas Abuzaid <fa...@stanford.edu>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> We're running various "triangle" join queries on Hive 0.9.0, and we're
>>>> wondering if we can get any better performance. Here's the query we're
>>>> running:
>>>>
>>>> SELECT count(*)
>>>> FROM table r1 JOIN table r2 ON (r1.dst = r2.src) JOIN table r3 ON
>>>> (r2.dst = r3.src AND r3.dst = r1.src)
>>>> WHERE r1.src < r2.src AND r2.src < r3.src;
>>>>
>>>> We're currently passing the following tuning parameters as well:
>>>>
>>>> set mapred.map.tasks=120;
>>>> set mapred.reduce.tasks=120;
>>>> set mapred.tasktracker.map.tasks.maximum=8;
>>>> set mapred.tasktracker.reduce.tasks.maximum=8;
>>>> set mapred.child.java.opts=-Xmx5120m;
>>>>
>>>> The dataset we're using has 5 million nodes and 70 million edges, and
>>>> most of our time is spent on garbage collection. We have about 30 machines
>>>> in our cluster, and each machine has 45GB of RAM. Any thoughts on how we
>>>> can improve performance? Thanks in advance!
>>>>
>>>
>>>
>>>
>>> --
>>> Devopam Mittra
>>> Life and Relations are not binary
>>>
>>
>>
>

Re: Tuning Triangle Joins on Hive

Posted by Lefty Leverenz <le...@gmail.com>.

>
> How does indexes work in hive?
>

See the Indexes design doc
<https://cwiki.apache.org/confluence/display/Hive/IndexDev> in the Hive
wiki, although it hasn't been updated.

-- Lefty


On Sat, Aug 2, 2014 at 2:07 AM, chandra Reddy Bogala <
chandra.reddy2005@gmail.com> wrote:

> How does indexes work in hive? I thought file formats like ORC have
> indexes in each block. But not a separate index that can help query
> performance.
> Thanks,
> Chandra
>
>
> On Fri, Aug 1, 2014 at 9:10 AM, Devopam Mittra <de...@gmail.com> wrote:
>
>> Please try the following approach and let me know if you are not getting
>> better performance:
>>
>> 1. Ensure indexes are present on dst , rsc columns in the respective
>> tables.
>> 2. Create a subset first taking r2 and r2 (i.e.: r3.src > r2.src) in a
>> physical table, and then create index on its new src column as well
>> 3. Join this to r1
>>
>> If this approach works well, then try out the WITH SELECT ... using the
>> same approach , just no physical intermediate table will be created.
>>
>> Hope it helps..
>>
>> regards
>> Dev
>>
>>
>>
>>
>> On Fri, Aug 1, 2014 at 12:58 AM, Firas Abuzaid <fa...@stanford.edu>
>> wrote:
>>
>>> Hi,
>>>
>>> We're running various "triangle" join queries on Hive 0.9.0, and we're
>>> wondering if we can get any better performance. Here's the query we're
>>> running:
>>>
>>> SELECT count(*)
>>> FROM table r1 JOIN table r2 ON (r1.dst = r2.src) JOIN table r3 ON
>>> (r2.dst = r3.src AND r3.dst = r1.src)
>>> WHERE r1.src < r2.src AND r2.src < r3.src;
>>>
>>> We're currently passing the following tuning parameters as well:
>>>
>>> set mapred.map.tasks=120;
>>> set mapred.reduce.tasks=120;
>>> set mapred.tasktracker.map.tasks.maximum=8;
>>> set mapred.tasktracker.reduce.tasks.maximum=8;
>>> set mapred.child.java.opts=-Xmx5120m;
>>>
>>> The dataset we're using has 5 million nodes and 70 million edges, and
>>> most of our time is spent on garbage collection. We have about 30 machines
>>> in our cluster, and each machine has 45GB of RAM. Any thoughts on how we
>>> can improve performance? Thanks in advance!
>>>
>>
>>
>>
>> --
>> Devopam Mittra
>> Life and Relations are not binary
>>
>
>

Re: Tuning Triangle Joins on Hive

Posted by chandra Reddy Bogala <ch...@gmail.com>.

How does indexes work in hive? I thought file formats like ORC have indexes
in each block. But not a separate index that can help query performance.
Thanks,
Chandra


On Fri, Aug 1, 2014 at 9:10 AM, Devopam Mittra <de...@gmail.com> wrote:

> Please try the following approach and let me know if you are not getting
> better performance:
>
> 1. Ensure indexes are present on dst , rsc columns in the respective
> tables.
> 2. Create a subset first taking r2 and r2 (i.e.: r3.src > r2.src) in a
> physical table, and then create index on its new src column as well
> 3. Join this to r1
>
> If this approach works well, then try out the WITH SELECT ... using the
> same approach , just no physical intermediate table will be created.
>
> Hope it helps..
>
> regards
> Dev
>
>
>
>
> On Fri, Aug 1, 2014 at 12:58 AM, Firas Abuzaid <fa...@stanford.edu>
> wrote:
>
>> Hi,
>>
>> We're running various "triangle" join queries on Hive 0.9.0, and we're
>> wondering if we can get any better performance. Here's the query we're
>> running:
>>
>> SELECT count(*)
>> FROM table r1 JOIN table r2 ON (r1.dst = r2.src) JOIN table r3 ON (r2.dst
>> = r3.src AND r3.dst = r1.src)
>> WHERE r1.src < r2.src AND r2.src < r3.src;
>>
>> We're currently passing the following tuning parameters as well:
>>
>> set mapred.map.tasks=120;
>> set mapred.reduce.tasks=120;
>> set mapred.tasktracker.map.tasks.maximum=8;
>> set mapred.tasktracker.reduce.tasks.maximum=8;
>> set mapred.child.java.opts=-Xmx5120m;
>>
>> The dataset we're using has 5 million nodes and 70 million edges, and
>> most of our time is spent on garbage collection. We have about 30 machines
>> in our cluster, and each machine has 45GB of RAM. Any thoughts on how we
>> can improve performance? Thanks in advance!
>>
>
>
>
> --
> Devopam Mittra
> Life and Relations are not binary
>

Re: Tuning Triangle Joins on Hive

Posted by Devopam Mittra <de...@gmail.com>.

Please try the following approach and let me know if you are not getting
better performance:

1. Ensure indexes are present on dst , rsc columns in the respective tables.
2. Create a subset first taking r2 and r2 (i.e.: r3.src > r2.src) in a
physical table, and then create index on its new src column as well
3. Join this to r1

If this approach works well, then try out the WITH SELECT ... using the
same approach , just no physical intermediate table will be created.

Hope it helps..

regards
Dev

On Fri, Aug 1, 2014 at 12:58 AM, Firas Abuzaid <fa...@stanford.edu>
wrote:

> Hi,
>
> We're running various "triangle" join queries on Hive 0.9.0, and we're
> wondering if we can get any better performance. Here's the query we're
> running:
>
> SELECT count(*)
> FROM table r1 JOIN table r2 ON (r1.dst = r2.src) JOIN table r3 ON (r2.dst
> = r3.src AND r3.dst = r1.src)
> WHERE r1.src < r2.src AND r2.src < r3.src;
>
> We're currently passing the following tuning parameters as well:
>
> set mapred.map.tasks=120;
> set mapred.reduce.tasks=120;
> set mapred.tasktracker.map.tasks.maximum=8;
> set mapred.tasktracker.reduce.tasks.maximum=8;
> set mapred.child.java.opts=-Xmx5120m;
>
> The dataset we're using has 5 million nodes and 70 million edges, and most
> of our time is spent on garbage collection. We have about 30 machines in
> our cluster, and each machine has 45GB of RAM. Any thoughts on how we can
> improve performance? Thanks in advance!
>

-- 
Devopam Mittra
Life and Relations are not binary