Posted to user@pig.apache.org by "Barreto, Rafael" <ra...@thebackplane.com> on 2013/03/27 23:43:50 UTC
Problem with GROUP_BY and Java heap space
Hello,
I'm running into OutOfMemoryError exceptions in this Pig script snippet:
    grouped_sessions = group sessions_duration by (month, source);

    stats = foreach grouped_sessions {
        bounces     = filter sessions_duration by duration == 0;
        not_bounces = filter sessions_duration by duration > 0;
        generate flatten(group),
                 StreamingQuartile(not_bounces.duration) as quartiles,
                 AVG(not_bounces.duration) as avg,
                 SQRT(VAR(not_bounces.duration)) as std,
                 (double)COUNT(bounces) as n_bounces,
                 (double)COUNT(sessions_duration) as n_samples;
    };
where sessions_duration is a bag of tuples (month, source, duration).
The number of tuples for a given (month, source) pair can be HUGE
in my data, and it seems Pig can't handle that smoothly. Since some of
the UDFs in the generate are not algebraic (e.g. StreamingQuartile), is
there a workaround for this? Maybe there's a better way to express my
intent in Pig that I'm not aware of.
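One restructuring that may be worth trying, sketched here under assumptions rather than as a tested fix: filter before grouping, so that the foreach over the non-bounce sessions calls only the built-in algebraic functions AVG and COUNT, which Pig can aggregate partially in the combiner instead of materializing each group's bag. The aliases not_bounces_only, grouped_nb and nb_stats are hypothetical names. The quartile and variance UDFs are left out on purpose, since whether they can avoid holding the full bag depends on how they are implemented.

```pig
-- Sketch only; the alias names below are hypothetical.
-- Filter first so no nested filter is needed inside the foreach.
not_bounces_only = filter sessions_duration by duration > 0;
grouped_nb = group not_bounces_only by (month, source);

-- Only algebraic built-ins appear here, so partial aggregation is possible.
nb_stats = foreach grouped_nb generate
    flatten(group) as (month, source),
    AVG(not_bounces_only.duration) as avg,
    COUNT(not_bounces_only) as n_not_bounces;
```

The bounce count could presumably be computed the same way from a duration == 0 filter and joined back on (month, source).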
Thank you in advance guys,
Rafael