You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by "Thejas M Nair (JIRA)" <ji...@apache.org> on 2011/02/08 20:16:57 UTC
[jira] Commented: (PIG-1846) optimize queries like - count distinct users for each gender

    [ https://issues.apache.org/jira/browse/PIG-1846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992119#comment-12992119 ] 

Thejas M Nair commented on PIG-1846:
------------------------------------

One way to mitigate the problem of skew in above above example query is to add another group-by statement which uses both gender and user as group-by key, and does a partial aggregation. It will introduce and additional MR job. The 2nd MR job will be effectively using only 2 reducers, but the work that needs to be done in the reduce of the 2nd MR job will be very little.

{code}
USER_DATA = load 'file' as (USER, GENDER, AGE);
USER_GROUP_GENDER_PART = group USER_DATA by (GENDER, USER) parallel 100;

-- there is only one distinct user per row since the USER column is one of group-by colums, so just project 1 as count
DIST_USER_PER_GENDER_PART = foreach USER_GROUP_GENDER_PART generate group.GENDER as GENDER, 1 as USER_COUNT; 
USER_GROUP_GENDER = group DIST_USER_PER_GENDER_PART by  GENDER;

-- map-side combiner will do most of the work in parallel, reduce will need to process few small records
DIST_USER_PER_GENDER = foreach USER_GROUP_GENDER generate GENDER, SUM(USER_GROUP_GENDER.USER_COUNT); 
{code}


> optimize queries like - count distinct users for each gender
> ------------------------------------------------------------
>
>                 Key: PIG-1846
>                 URL: https://issues.apache.org/jira/browse/PIG-1846
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>            Reporter: Thejas M Nair
>
> The pig group operation does not usually have to deal with skew on the group-by keys if the foreach statement that works on the results of group has only algebraic functions on the bags. But for some queries like the following, skew can be a problem -
> {code}
> user_data = load 'file' as (user, gender, age);
> user_group_gender = group user_data by gender parallel 100;
> dist_users_per_gender = foreach user_group_gender 
>                         { 
>                              dist_user = distinct user_data.user; 
>                              generate group as gender, COUNT(dist_user) as user_count;
>                         }
> {code}
> Since there are only 2 distinct values of the group-by key, only 2 reducers will actually get used in current implementation. ie, you can't get better performance by adding more reducers.
> Similar problem is there when the data is skewed on the group key. With current implementation, another problem is that pig and MR has to deal with records with extremely large bags that have the large number of distinct user names, which results in high memory utilization and having to spill the bags to disk.
> The query plan should be modified to handle the skew in such cases and make use of more reducers.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira