Posted to user@pig.apache.org by "Barreto, Rafael" <ra...@thebackplane.com> on 2013/03/27 23:43:50 UTC

Problem with GROUP_BY and Java heap space

Hello,

I'm running into OutOfMemoryError exceptions in this Pig script snippet:

grouped_sessions = group sessions_duration by (month, source);
stats = foreach grouped_sessions {
    bounces = filter sessions_duration by duration == 0;
    not_bounces = filter sessions_duration by duration > 0;
    generate flatten(group),
             StreamingQuartile(not_bounces.duration) as quartiles,
             AVG(not_bounces.duration) as avg,
             SQRT(VAR(not_bounces.duration)) as std,
             (double)COUNT(bounces) as n_bounces,
             (double)COUNT(sessions_duration) as n_samples;
};

where sessions_duration is a bag of tuples (month, source, duration).
The number of tuples for a given pair of (month, source) can be HUGE
for my data and it seems Pig can't handle that smoothly. Since some of
the UDFs in the generate are not algebraic (e.g. StreamingQuartile), is
there a workaround for it? Maybe there's a better way to express my
intent in Pig that I'm not aware of.
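
To make the "algebraic" distinction concrete, here is a minimal sketch in
plain Python (not Pig) of how a count, mean and standard deviation can be
built from small partial summaries that merge in any order, so no single
worker ever has to hold a whole group in memory. The function names
(partial, merge, finalize) are hypothetical and only for illustration, and
the population variance formula is an assumption, since it is not stated
which definition VAR uses. An exact quartile has no such fixed-size summary,
which is why it cannot be handled the same way.

```python
import math

def partial(durations):
    """Summarize one chunk of a group as (count, sum, sum of squares)."""
    n = 0
    s = 0.0
    ss = 0.0
    for d in durations:
        n += 1
        s += d
        ss += d * d
    return (n, s, ss)

def merge(a, b):
    """Combine two partial summaries; order and grouping do not matter."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def finalize(p):
    """Turn a merged summary into (mean, std), or (None, None) if empty."""
    n, s, ss = p
    if n == 0:
        return (None, None)
    mean = s / n
    # Population variance (assumed); clamp tiny negatives from rounding.
    var = max(ss / n - mean * mean, 0.0)
    return (mean, math.sqrt(var))

# Two chunks of the same (month, source) group, summarized separately.
merged = merge(partial([1, 2, 3]), partial([4, 5]))
mean, std = finalize(merged)
```

The sum-of-squares formula is the simplest one to show but can lose
precision on large values; it is used here only to illustrate the merge
step, not as a recommendation.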

Thank you in advance guys,
Rafael