You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@pig.apache.org by "Ankur (JIRA)" <ji...@apache.org> on 2009/07/03 12:19:47 UTC

[jira] Created: (PIG-871) Improve distribution of keys in reduce phase

Improve distribution of keys in reduce phase
--------------------------------------------

                 Key: PIG-871
                 URL: https://issues.apache.org/jira/browse/PIG-871
             Project: Pig
          Issue Type: Improvement
    Affects Versions: 0.3.0
            Reporter: Ankur


The default hashing scheme used to distribute keys in reduce phase sometimes results in an uneven distribution of keys resulting in 5 - 10 % of reducers being overloaded with data. This bottleneck makes the PIG jobs really slow and gives users a bad impression.

While there is no bullet proof solution to the problem in general, the hashing can certainly be improved for better distribution. The proposal here is to evaluate and incorporate other hashing schemes that give high avalanche and more even distribution. We can start by evaluating MurmurHash which is Apache 2.0 licensed and freely available here - http://www.getopt.org/murmur/MurmurHash.java


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Assigned: (PIG-871) Improve distribution of keys in reduce phase

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates reassigned PIG-871:
------------------------------

    Assignee: Thejas M Nair

> Improve distribution of keys in reduce phase
> --------------------------------------------
>
>                 Key: PIG-871
>                 URL: https://issues.apache.org/jira/browse/PIG-871
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.3.0
>            Reporter: Ankur
>            Assignee: Thejas M Nair
>
> The default hashing scheme used to distribute keys in reduce phase sometimes results in an uneven distribution of keys resulting in 5 - 10 % of reducers being overloaded with data. This bottleneck makes the PIG jobs really slow and gives users a bad impression.
> While there is no bullet proof solution to the problem in general, the hashing can certainly be improved for better distribution. The proposal here is to evaluate and incorporate other hashing schemes that give high avalanche and more even distribution. We can start by evaluating MurmurHash which is Apache 2.0 licensed and freely available here - http://www.getopt.org/murmur/MurmurHash.java

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-871) Improve distribution of keys in reduce phase

Posted by "Tom White (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726942#action_12726942 ] 

Tom White commented on PIG-871:
-------------------------------

MurmurHash (and other hashing schemes) can be found in the org.apache.hadoop.util.hash package of Hadoop Common.

> Improve distribution of keys in reduce phase
> --------------------------------------------
>
>                 Key: PIG-871
>                 URL: https://issues.apache.org/jira/browse/PIG-871
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.3.0
>            Reporter: Ankur
>
> The default hashing scheme used to distribute keys in reduce phase sometimes results in an uneven distribution of keys resulting in 5 - 10 % of reducers being overloaded with data. This bottleneck makes the PIG jobs really slow and gives users a bad impression.
> While there is no bullet proof solution to the problem in general, the hashing can certainly be improved for better distribution. The proposal here is to evaluate and incorporate other hashing schemes that give high avalanche and more even distribution. We can start by evaluating MurmurHash which is Apache 2.0 licensed and freely available here - http://www.getopt.org/murmur/MurmurHash.java

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-871) Improve distribution of keys in reduce phase

Posted by "Ankur (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731389#action_12731389 ] 

Ankur commented on PIG-871:
---------------------------

Sorry for coming back late in this. 

The problem mostly manifest itself in case of large datasets which is skewed in favour of few hundred group keys. The typical approach that user adopts in this case is to generate an additional group key ( typically an integer) in a random or round-robin fashion, group on original + generated and increase the parallelism and task heap size to see if that works. 

A use case is that user wants to separate out the data into 5 - 10 major buckets (directories) and for each major bucket he wants to have data separated out into 20 - 100 minor buckets (again directories).

An example script would look like this

data1 = LOAD 'myfile_1' Using PigStorage() as (f1, f2, f3);
data2 = LOAD 'myfile_2' Using PigStorage() as (f1, f4, f5);
filtered = FILTER data By <some condition>
joined_data = JOIN data1 By f1, data2 By f1 Parallel 100
projected_data = FOREACH joined_data generate f1, GenerateRandomIntUpto(100) as part, f2,f3,f4,f5 ;
SPLIT projected_data INTO major_bucket_1 IF <some_condition>, major_bucket_2 IF <some_condition>, major_bucket_3 IF <some_condition>;

group1= Group major_bucket_1 By (f1, part) Parallel 100;
result1 = FOREACH group1 generate FLATTEN(group),  FLATTEN($1);
STORE result1 INTO 'major_bucket_1' Using CustomStore();

group2= Group major_bucket_2 By (f2, part) Parallel 100;
result2 = FOREACH group2 generate FLATTEN(group),  FLATTEN($1);
STORE result2 INTO 'major_bucket_2' Using CustomStore();

group3= Group major_bucket_3 By (f3, part) Parallel 100;
result3 = FOREACH group3 generate FLATTEN(group),  FLATTEN($1);
STORE result3 INTO 'major_bucket_3' Using CustomStore();

The expectation here is to get the final result in a directory structure like this

major_bucket_1/minor_bucket_1/part-files
major_bucket_1/minor_bucket_2/part-files
major_bucket_1/minor_bucket_3/part-files
...

major_bucket_2/minor_bucket_1/part-files
major_bucket_2/minor_bucket_2/part-files
major_bucket_2/minor_bucket_3/part-files
..

Note:- The minor_bucket directory name is derived from the group key.

> Improve distribution of keys in reduce phase
> --------------------------------------------
>
>                 Key: PIG-871
>                 URL: https://issues.apache.org/jira/browse/PIG-871
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.3.0
>            Reporter: Ankur
>
> The default hashing scheme used to distribute keys in reduce phase sometimes results in an uneven distribution of keys resulting in 5 - 10 % of reducers being overloaded with data. This bottleneck makes the PIG jobs really slow and gives users a bad impression.
> While there is no bullet proof solution to the problem in general, the hashing can certainly be improved for better distribution. The proposal here is to evaluate and incorporate other hashing schemes that give high avalanche and more even distribution. We can start by evaluating MurmurHash which is Apache 2.0 licensed and freely available here - http://www.getopt.org/murmur/MurmurHash.java

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-871) Improve distribution of keys in reduce phase

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12727281#action_12727281 ] 

Olga Natkovich commented on PIG-871:
------------------------------------

Hi Ankur,

Thanks for logging the bug. Before we decide on the solution we need to run some tests that actually show that we do have a problem. We know we have an issue with JOIN with large keys that spill. We are almost done with an implementing a solution in this case. We also similarly addressed the case of large keys in order by.

It would be very interesting to come up with queries that show the behavior for the case of group by queries. If you have such examples, please, post them here.

> Improve distribution of keys in reduce phase
> --------------------------------------------
>
>                 Key: PIG-871
>                 URL: https://issues.apache.org/jira/browse/PIG-871
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.3.0
>            Reporter: Ankur
>
> The default hashing scheme used to distribute keys in reduce phase sometimes results in an uneven distribution of keys resulting in 5 - 10 % of reducers being overloaded with data. This bottleneck makes the PIG jobs really slow and gives users a bad impression.
> While there is no bullet proof solution to the problem in general, the hashing can certainly be improved for better distribution. The proposal here is to evaluate and incorporate other hashing schemes that give high avalanche and more even distribution. We can start by evaluating MurmurHash which is Apache 2.0 licensed and freely available here - http://www.getopt.org/murmur/MurmurHash.java

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.