Posted to user@pig.apache.org by Vincent Barat <vi...@gmail.com> on 2011/08/23 18:27:39 UTC

Question about request optimization

Hi,

Of all the requests I run using Pig 0.8.1, the heaviest one is the
following:

    /* load session data from HBase */
    start_sessions = load ... (start of sessions)
    end_sessions = load ... (end of sessions)
    locations = load ... (session location)
    infos = load ... (session info)

    /* join start and end of session */
    sessions = JOIN start_sessions BY sid, end_sessions BY sid;

    /* remove invalid or too long sessions */
    sessions = FILTER sessions BY end > start AND end - start < MAX_SESSION_DURATION;

    /* join session table with info table */
    sessions = JOIN sessions BY infoid, infos BY infoid;

    /* join session table with location table */
    sessions = JOIN sessions BY locid LEFT, locations BY locid;

    /* keep only required fields and format */
    sessions = FOREACH sessions GENERATE ... fields I want to keep and need to format ...;

    /* store sessions in an HDFS file */
    store sessions;

I need to optimize it and would like your advice. Here is what I
have tried and verified:

1- This request builds a plan of 3 levels.
2- I've tried to use a 'merge' join for the first JOIN (since
start_sessions and end_sessions are indexed by sid). Unfortunately,
HBaseLoader() doesn't support merge joins.
3- I've noticed that the last M/R job is not correctly balanced: it
spawns 3 reduce tasks, but only 1 actually processes any data. The
locations table is actually empty in this case (does this explain the
badly balanced reduce tasks?).
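
For reference, a minimal sketch of the merge join attempt from point 2,
reusing the aliases from the script above. USING 'merge' is standard Pig
syntax; it needs both inputs sorted on the join key and a loader that
supports it, which is where the HBase loader falls short here.

    /* sketch: sort-merge join on sid (not accepted with the HBase loader) */
    sessions = JOIN start_sessions BY sid, end_sessions BY sid USING 'merge';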

Any ideas?

Re: Question about request optimization

Posted by Vincent Barat <vi...@gmail.com>.

On 23/08/11 20:28, Dmitriy Ryaboy wrote:
> We should add merge join support to HBaseStorage, it should be able to do
> that for joins on the table key.
That would be great!
>
> Are your locids skewed? Have you tried using 'skewed' join for the last job?
> Actually, if locations are small, you can even use replicated.
Unfortunately not (our locids are MD5 hash codes).
>
> Any particular reason to store and load starts and ends of sessions
> separately? Seems like something you could put into a single HBase table
> row, or at least a single HBase table, and derive the starts and ends via
> grouping on user ids.
Historical reasons only. Yes I'm thinking about how to change this.

Actually, locations and infos are small enough to fit into memory, so
I've used replicated joins and it helps a lot (about 4 times faster in
our case). So using a merge join for the first JOIN would definitely
solve my issue.
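
As a sketch, assuming the aliases from the original script (field
references may need disambiguating after the first join), replicated
joins are written roughly like this. The last relation listed is the
one loaded into memory, so it has to be the small one:

    /* infos is small: hold it in memory on each map task */
    sessions = JOIN sessions BY infoid, infos BY infoid USING 'replicated';

    /* locations is small too; replicated also works for LEFT OUTER joins */
    sessions = JOIN sessions BY locid LEFT OUTER, locations BY locid USING 'replicated';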

Thanks for your help.


Re: Question about request optimization

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
We should add merge join support to HBaseStorage, it should be able to do
that for joins on the table key.

Are your locids skewed? Have you tried using 'skewed' join for the last job?
Actually, if locations are small, you can even use replicated.

Any particular reason to store and load starts and ends of sessions
separately? Seems like something you could put into a single HBase table
row, or at least a single HBase table, and derive the starts and ends via
grouping on user ids.

D
