Posted to user@pig.apache.org by Adam Silberstein <ad...@trifacta.com> on 2014/07/16 01:53:07 UTC

Any way to get streaming semantics in algebraic?

Hey All,
I’m struggling with the performance of algebraic aggregates.  It seems that Pig always bags up tuples between the input and intermediate aggregate stages.  For my workload those bags get large and spill to disk.  The spilling itself seems to cause a lot of memory pressure, followed by GC and slowdown.

The aggregates I am computing are things like MAX, where I would really like to stream the records through the input stage and only maintain the current max.  Is this possible with algebraic, accumulator or anything else?

Thanks,
Adam

Re: Any way to get streaming semantics in algebraic?

Posted by Mark Laurent <aa...@gmail.com>.
Unsubscribe 


On Sat, Jul 19, 2014 at 7:39 PM, Cheolsoo Park <pi...@gmail.com>
wrote:

> Hi Adam,
>
> Algebraic and accumulator are mutually exclusive, so you can't use them at
> the same time.
>
> > It seems that Pig will always bag up tuples between the input and
> > intermediate aggregate stages. For my workload those bags get large and
> > spill to disk.  The spilling in and of itself seems to cause a lot of
> > memory pressure and then GC and slowdown.
>
> That said, I don't understand why you're seeing bag spilling in combiners.
> Bags are usually small on the mapper side and don't spill to disk because
> only records that are grouped by the same key are bagged together. Bag
> spilling usually happens on the reducer side. Do you have skewed keys in
> your group-by? Or can you increase the parallelism of mappers to spread
> out the load across more mappers?
>
> Thanks,
> Cheolsoo

Re: Any way to get streaming semantics in algebraic?

Posted by Cheolsoo Park <pi...@gmail.com>.
Hi Adam,

Algebraic and accumulator are mutually exclusive, so you can't use them at
the same time.
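
For illustration, accumulator-style evaluation of MAX keeps only a running
value while batches of a group's records pass through. Below is a minimal
sketch in plain Java with no Pig dependency. The RunningMax class is
hypothetical; it only mirrors the accumulate / getValue / cleanup shape of an
accumulator-style UDF, and it assumes each batch arrives as an Iterable<Long>
rather than as a Pig bag.

```java
import java.util.Arrays;

// Hypothetical stand-in, not Pig's API: it only mirrors the
// accumulate / getValue / cleanup shape of an accumulator-style UDF.
class RunningMax {
    private Long max = null; // null until the first non-null value is seen

    // Called once per batch of a group's values. Nothing is retained
    // beyond the single running maximum.
    void accumulate(Iterable<Long> batch) {
        for (Long v : batch) {
            if (v == null) continue; // skip null values
            if (max == null || v > max) max = v;
        }
    }

    // Result for the group processed so far.
    Long getValue() { return max; }

    // Reset state before the next group.
    void cleanup() { max = null; }

    public static void main(String[] args) {
        RunningMax m = new RunningMax();
        m.accumulate(Arrays.asList(3L, 9L, 4L));
        m.accumulate(Arrays.asList(7L, 12L));
        System.out.println(m.getValue()); // prints 12
        m.cleanup();
    }
}
```

The state held between batches is a single Long, which is the streaming
behaviour asked about; whether Pig actually chooses this evaluation path for
a given script is a separate question.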

> It seems that Pig will always bag up tuples between the input and
> intermediate aggregate stages. For my workload those bags get large and
> spill to disk.  The spilling in and of itself seems to cause a lot of
> memory pressure and then GC and slowdown.

That said, I don't understand why you're seeing bag spilling in combiners.
Bags are usually small on the mapper side and don't spill to disk because
only records that are grouped by the same key are bagged together. Bag
spilling usually happens on the reducer side. Do you have skewed keys in
your group-by? Or can you increase the parallelism of mappers to spread
out the load across more mappers?
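
To make the stages mentioned above concrete, here is a rough sketch of how an
algebraic MAX splits into initial, intermediate and final steps. It is plain
Java, not Pig's API: the AlgebraicMaxSketch class and its three static methods
are hypothetical stand-ins, and each list models the bag of partial results
that a stage receives for one group key.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch, not Pig's API: the three static methods stand in for
// the initial, intermediate and final stages of an algebraic MAX.
class AlgebraicMaxSketch {
    // Initial (map side): each input record becomes its own partial result.
    static long initial(long record) { return record; }

    // Intermediate (combiner): collapse a bag of partials into one partial.
    // Assumes a non-empty list; Collections.max throws on an empty one.
    static long intermediate(List<Long> partials) { return Collections.max(partials); }

    // Final (reduce side): collapse the remaining partials into the answer.
    static long finalStage(List<Long> partials) { return Collections.max(partials); }

    public static void main(String[] args) {
        // Two mappers see different records for the same group key.
        long p1 = intermediate(Arrays.asList(initial(3), initial(9), initial(4)));
        long p2 = intermediate(Arrays.asList(initial(7), initial(12)));
        System.out.println(finalStage(Arrays.asList(p1, p2))); // prints 12
    }
}
```

In this model the list passed to intermediate() holds only the records one
mapper saw for one key, so it stays small unless that key is heavily skewed.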

Thanks,
Cheolsoo


