Posted to user@pig.apache.org by Jim Donofrio <do...@gmail.com> on 2012/09/17 06:33:51 UTC

reuse same Tuple and ArrayList for every getNext call in LoadFunc?

Is it ok to reuse the same Tuple and List of inputs from RecordReader 
across all getNext calls in a LoadFunc? I notice that PigStorage creates 
a new List, mProtoTuple, for every record along with a new tuple. Since 
PigMapBase just uses newTupleNoCopy to wrap the List without copying it, 
creating a new Tuple for every getNext seems unnecessary.

Re: reuse same Tuple and ArrayList for every getNext call in LoadFunc?

Posted by Jim Donofrio <do...@gmail.com>.
Ok thanks for the clarification.

I am interested in this because I am new to Pig and am used to writing 
RecordReaders for MapReduce that reuse the same objects, so I thought the 
same logic would apply here. I have not done any performance tests.

On 09/17/2012 01:30 AM, Dmitriy Ryaboy wrote:
> Anything that builds a bag -- for example, I was just looking at the
> DefaultDataBag code (and by extension, DistinctDataBag, etc) and it
> does not do any tuple copies. We could, of course, change all the Pig
> code to respect the assumption that tuples need to be copied if you
> want to keep them across multiple getNext calls, but we'd still get
> into trouble with UDFs that other people wrote before this change.
>
> I am curious why you are interested in this particular inefficiency.
> Are you seeing severely degraded performance due to object allocation?
>
> D


Re: reuse same Tuple and ArrayList for every getNext call in LoadFunc?

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Anything that builds a bag -- for example, I was just looking at the
DefaultDataBag code (and by extension, DistinctDataBag, etc) and it
does not do any tuple copies. We could, of course, change all the Pig
code to respect the assumption that tuples need to be copied if you
want to keep them across multiple getNext calls, but we'd still get
into trouble with UDFs that other people wrote before this change.

I am curious why you are interested in this particular inefficiency.
Are you seeing severely degraded performance due to object allocation?

D
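
The point above, that a collector which wants to keep tuples across getNext calls has to copy them itself, can be sketched in plain Java. This is only an illustration under stated assumptions: reusingLoader and collectWithCopy are hypothetical stand-ins, not Pig classes or methods, and a java.util.List plays the role of both the tuple and the bag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Plain-Java stand-ins only; none of these names exist in Pig.
class CopyOnCollectDemo {

    // Hypothetical loader that returns the SAME list on every call,
    // overwriting its contents in place.
    static Supplier<List<Object>> reusingLoader() {
        List<Object> shared = new ArrayList<>();
        int[] next = {0};
        return () -> {
            shared.clear();
            shared.add("record-" + next[0]++);
            return shared;
        };
    }

    // Hypothetical collector that takes a shallow copy of each record
    // before storing it, so later overwrites do not affect stored records.
    static List<List<Object>> collectWithCopy(Supplier<List<Object>> loader, int n) {
        List<List<Object>> bag = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            bag.add(new ArrayList<>(loader.get()));
        }
        return bag;
    }

    public static void main(String[] args) {
        // Prints [[record-0], [record-1], [record-2]]
        System.out.println(collectWithCopy(reusingLoader(), 3));
    }
}
```

The shallow copy is enough here only because the fields are immutable Strings. A collector written without the copy, as existing UDFs may be, would store three references to the same list.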

On Sun, Sep 16, 2012 at 10:16 PM, Jim Donofrio <do...@gmail.com> wrote:
> Even if I make new tuples and lists, I guess that also means I cannot safely
> reuse a DataByteArray object inside a Tuple across getNext calls?
>
> Also, wouldn't the conversion to a Bag likely happen only in a reducer, which
> would not be affected by the loader, which only supplies input to the mapper?
>
> When you talk about downstream code from the loader that assumes that
> each tuple is a new Tuple, is there any code in Pig that assumes that, or are
> you just talking about UDFs and other third-party libraries that people write
> for Pig?

Re: reuse same Tuple and ArrayList for every getNext call in LoadFunc?

Posted by Jim Donofrio <do...@gmail.com>.
Even if I make new tuples and lists, I guess that also means I cannot 
safely reuse a DataByteArray object inside a Tuple across getNext calls?

Also, wouldn't the conversion to a Bag likely happen only in a reducer, 
which would not be affected by the loader, which only supplies input to 
the mapper?

When you talk about downstream code from the loader that assumes 
that each tuple is a new Tuple, is there any code in Pig that assumes 
that, or are you just talking about UDFs and other third-party libraries 
that people write for Pig?
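
The first question above, about reusing a field object inside an otherwise fresh tuple, can be illustrated with a small plain-Java sketch. MutableBytes and loadWithSharedField are hypothetical stand-ins and not Pig classes; the sketch assumes the field object is overwritten in place for each record.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Plain-Java stand-ins only; none of these names exist in Pig.
class InnerBufferReuseDemo {

    // Hypothetical mutable byte holder that is overwritten for every record.
    static final class MutableBytes {
        private byte[] data = new byte[0];

        void set(String s) {
            data = s.getBytes(StandardCharsets.UTF_8);
        }

        @Override
        public String toString() {
            return new String(data, StandardCharsets.UTF_8);
        }
    }

    // Builds a NEW list per record but puts the SAME field object in each one,
    // then keeps every record, as a non-copying collector would.
    static List<List<MutableBytes>> loadWithSharedField(String... values) {
        MutableBytes shared = new MutableBytes();
        List<List<MutableBytes>> kept = new ArrayList<>();
        for (String v : values) {
            shared.set(v);                                  // overwrite in place
            List<MutableBytes> fresh = new ArrayList<>();   // fresh container
            fresh.add(shared);                              // aliased field
            kept.add(fresh);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Prints [[c], [c], [c]]: every kept record sees the last value.
        System.out.println(loadWithSharedField("a", "b", "c"));
    }
}
```

Under these assumptions, a fresh container per record does not help if the objects inside it are shared and mutated, which is the concern raised in the question.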

On 09/17/2012 12:44 AM, Dmitriy Ryaboy wrote:
> I looked into this a while back -- trouble comes when something
> downstream from the loader tries to collect inputs into a bag, and
> doesn't do its own copies. One can easily argue that if someone wants
> to do such collection, it should be their responsibility to ensure
> they aren't just collecting the same object that keeps being
> overwritten, but at this point, I think it's too late to convert
> everyone who might be making the "each tuple is a new tuple"
> assumption.
>
> D


Re: reuse same Tuple and ArrayList for every getNext call in LoadFunc?

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
I looked into this a while back -- trouble comes when something
downstream from the loader tries to collect inputs into a bag, and
doesn't do its own copies. One can easily argue that if someone wants
to do such collection, it should be their responsibility to ensure
they aren't just collecting the same object that keeps being
overwritten, but at this point, I think it's too late to convert
everyone who might be making the "each tuple is a new tuple"
assumption.

D
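
The hazard described above can be reproduced in a few lines of plain Java. This is a minimal sketch, not Pig code: reusingLoader, allocatingLoader and collect are hypothetical stand-ins, and a java.util.List plays the role of both the tuple and the bag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Plain-Java stand-ins only; none of these names exist in Pig.
class ReuseAliasingDemo {

    // Hypothetical loader that returns the SAME list on every call,
    // overwriting its contents in place.
    static Supplier<List<Object>> reusingLoader() {
        List<Object> shared = new ArrayList<>();
        int[] next = {0};
        return () -> {
            shared.clear();
            shared.add("record-" + next[0]++);
            return shared;
        };
    }

    // Hypothetical loader that allocates a fresh list for every record.
    static Supplier<List<Object>> allocatingLoader() {
        int[] next = {0};
        return () -> {
            List<Object> fresh = new ArrayList<>();
            fresh.add("record-" + next[0]++);
            return fresh;
        };
    }

    // A "bag" that stores references and makes no copies of its own.
    static List<List<Object>> collect(Supplier<List<Object>> loader, int n) {
        List<List<Object>> bag = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            bag.add(loader.get());
        }
        return bag;
    }

    public static void main(String[] args) {
        // Prints [[record-2], [record-2], [record-2]]
        System.out.println(collect(reusingLoader(), 3));
        // Prints [[record-0], [record-1], [record-2]]
        System.out.println(collect(allocatingLoader(), 3));
    }
}
```

With the reusing loader, the bag ends up holding three references to one object, so only the last record survives.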

On Sun, Sep 16, 2012 at 9:33 PM, Jim Donofrio <do...@gmail.com> wrote:
> Is it ok to reuse the same Tuple and List of inputs from RecordReader across
> all getNext calls in a LoadFunc? I notice that PigStorage creates a new
> List, mProtoTuple, for every record along with a new tuple. Since PigMapBase
> just uses newTupleNoCopy to wrap the List without copying it, creating a new
> Tuple for every getNext seems unnecessary.