Posted to user@pig.apache.org by Adam Silberstein <ad...@trifacta.com> on 2014/07/11 18:27:48 UTC

Bag spilling with algebraic

Hi,
I’m trying to better understand how algebraic UDFs do or do not help with bag spilling. I have an algebraic UDF and I can see the algebraic code path being invoked. However, performance is poor and I see a lot of spilling, both in heap dumps and in the final Pig counters, e.g.:
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 7
Total records proactively spilled: 258472

That 258K figure is out of 400K original input records. As I understand these counters, 7 bags were spilled, containing 258K tuples in total. So it seems that Pig is not calling the intermediate aggregation and is instead spilling large bags of singleton tuples to disk.
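
To put those counters in proportion, here is a small back-of-the-envelope sketch. It uses only the figures quoted above; the 400K input count is the rounded number from this message, and the class and method names are made up for illustration. It works out to roughly 65% of the input records being spilled, at about 37K records per spilled bag.

```java
// Back-of-the-envelope arithmetic on the counters quoted above.
// Class and method names are illustrative only, not part of Pig.
class SpillStats {
    // Fraction of the input records that ended up in proactively spilled bags.
    public static double spilledFraction(long spilledRecords, long inputRecords) {
        return (double) spilledRecords / inputRecords;
    }

    // Average number of records held by each spilled bag.
    public static double recordsPerBag(long spilledRecords, long spilledBags) {
        return (double) spilledRecords / spilledBags;
    }

    public static void main(String[] args) {
        long spilledRecords = 258472L; // "Total records proactively spilled"
        long spilledBags = 7L;         // "Total bags proactively spilled"
        long inputRecords = 400000L;   // rounded "400K" from the message

        System.out.printf("spilled fraction: %.3f%n",
                spilledFraction(spilledRecords, inputRecords));
        System.out.printf("records per spilled bag: %.1f%n",
                recordsPerBag(spilledRecords, spilledBags));
    }
}
```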

I know the stated purpose of algebraic UDFs is map-side aggregation, which avoids the network cost of shuffling so many records. But can Pig call the ‘intermediate’ aggregation more proactively on the map side, so that bags do not get so large? I see, for example, that Accumulator has ‘pig.accumulative.batchsize’. I haven’t seen an equivalent for algebraic.
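
To make concrete what I mean by calling the intermediate step more proactively, here is a minimal standalone sketch in plain Java. It does not use Pig's classes and is not how Pig actually schedules the stages. The three methods only mirror the initial, intermediate and final stages of an algebraic UDF (a SUM in this example), and the batchSize parameter is a hypothetical knob, analogous in spirit to the Accumulator batch size, that folds the buffered partials whenever the buffer fills up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Standalone illustration only: none of these names come from Pig.
class BatchedAlgebraicSketch {
    // Initial stage: turn one input value into a partial result.
    public static long initial(long value) {
        return value;
    }

    // Intermediate stage: fold a bag of partial sums into a single partial sum.
    public static long intermediate(List<Long> partials) {
        long sum = 0L;
        for (long p : partials) {
            sum += p;
        }
        return sum;
    }

    // Final stage: for SUM this is the same fold as the intermediate stage.
    public static long finalStage(List<Long> partials) {
        return intermediate(partials);
    }

    // Hypothetical driver: whenever the buffer reaches batchSize entries,
    // collapse it into one partial so it never holds more than batchSize items.
    public static long aggregate(long[] values, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        List<Long> buffer = new ArrayList<>();
        for (long v : values) {
            buffer.add(initial(v));
            if (buffer.size() >= batchSize) {
                long folded = intermediate(buffer);
                buffer.clear();
                buffer.add(folded);
            }
        }
        return finalStage(buffer);
    }

    public static void main(String[] args) {
        long[] values = new long[400000];
        Arrays.fill(values, 1L);
        // Sums 400000 ones while buffering at most 1000 partials at a time.
        System.out.println(aggregate(values, 1000));
    }
}
```

With something like this, the buffer stays bounded by the batch size instead of growing to tens of thousands of singleton tuples, which is the behavior I am hoping algebraic can provide.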

FYI, part of the reason for all the memory usage is that I am computing algebraics over tens of columns, with a few such UDFs chained together. Pig manages to run them all in the same wave of maps, which is good in principle but does cause memory pressure.

Thanks!
Adam