Posted to user@pig.apache.org by Shubham Chopra <sh...@gmail.com> on 2011/07/01 00:20:57 UTC

RAM usage

Hi,

I am using Pig 0.9 and Hadoop 0.20. Because of the memory constraints I
have, I have set io.sort.mb to 100 and mapred.map.child.java.opts to
-Xmx400m.
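
(For reference, the equivalent settings from inside the Pig script would
look roughly like this, using Pig's SET statement -- the values are just
the ones above, not recommendations:)

-- pass the Hadoop properties through the script instead of mapred-site.xml
set io.sort.mb 100;
set mapred.map.child.java.opts '-Xmx400m';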

I have a pig script that does essentially the following:

a = load 'somedata' using SomeUDF();

b = foreach a generate x1, x2, x3, x4, x5, x6... ;

c1 = filter b by x3 is not null;
d1 = group c1 by (x1, x2, x3);
e1 = foreach d1 generate flatten(group), SUM(c1.x5), SUM(c1.x6)...;

c2 = filter b by x4 is not null;
d2 = group c2 by (x1, x2, x4);
e2 = foreach d2 generate flatten(group), SUM(c2.x5), SUM(c2.x6)...;
.
.
.
e14 = foreach d14 generate flatten(group), SUM(c14.x5), SUM(c14.x6)...;

f = union e1, e2, e3, e4 ... e14;
store f into 'somefile';

The data has around 350 columns, so the schema is quite large. I have made
sure the input splits have around 2500 records each. Even then I see
significant spillage happening. The combiner does come into play, but the
spillage kills performance. Why does the multi-query optimized map require
so much RAM? The memory usage is really confusing: I see the mappers using
almost all of the memory allotted (400m). If I increase io.sort.mb, the
process dies with an OOM exception. Any ideas?!
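
In case it helps narrow things down, these are the knobs I was planning to
experiment with next (the values are just guesses on my part, nothing I
have verified yet):

-- run the script with multi-query optimization turned off, to compare
-- memory profiles:
--   pig -no_multiquery <script>
-- and/or lower the fraction of heap Pig's bags may hold before spilling:
set pig.cachedbag.memusage '0.1';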

Thanks,
Shubham.