Posted to dev@pig.apache.org by Prasanth J <bu...@gmail.com> on 2012/09/06 03:08:06 UTC

Modifying databag on the fly

Hello devs

I have a specific case where I need to modify the contents of a DataBag (remove a field from each tuple), but I want to do it in place rather than create another DataBag with a new set of tuples.
Say I have the following input for a UDF:

{(111,222,3,121), (112,223,2,131), (113,224,4,141)}

I want to iterate through this bag and generate an output bag with the 3rd field of each tuple removed, to get the following output:
{(111,222,121), (112,223,131), (113,224,141)}

Since the number of tuples in this bag is expected to be large, I cannot create a new set of tuples and build a bag from them, as this will cause an OOM exception.

Also, I do not want to flatten this bag, as it will be passed to the DISTINCT operator to compute the distinct elements in the bag.
Judging from the javadocs for DataBag, there is no way to convert a bag on the fly. Is there any other way to solve this?

Thanks
-- Prasanth


Re: Modifying databag on the fly

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
FYI -- we wound up going with a much cleaner and memory-friendly
solution of returning a new databag implementation which simply
proxied all the calls to the original bag, but returned a special
Iterator which applied the necessary transformation to tuples on the
fly. That way, we don't need to have the whole thing in memory twice
and cause spillage.

D
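
[Editor's sketch] The proxy idea described above might look roughly like the following. This is a minimal illustration, not the code that was actually written: it uses plain java.util types as stand-ins for Pig's DataBag and Tuple so that it runs without Pig on the classpath, and the names FieldDroppingBag and dropIndex are hypothetical. A real version would implement the DataBag interface and forward the remaining calls to the wrapped bag.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Demo entry point: drops the 3rd field of each tuple while iterating.
class ProxyBagDemo {
    public static void main(String[] args) {
        List<List<Object>> original = Arrays.asList(
            Arrays.<Object>asList(111, 222, 3, 121),
            Arrays.<Object>asList(112, 223, 2, 131),
            Arrays.<Object>asList(113, 224, 4, 141));
        for (List<Object> tuple : new FieldDroppingBag(original, 2)) {
            System.out.println(tuple);
        }
    }
}

// Hypothetical stand-in for a proxying bag: it stores no tuples of its own
// and leaves the wrapped collection untouched.
class FieldDroppingBag implements Iterable<List<Object>> {
    private final Iterable<List<Object>> delegate;
    private final int dropIndex; // zero-based position of the field to drop

    FieldDroppingBag(Iterable<List<Object>> delegate, int dropIndex) {
        this.delegate = delegate;
        this.dropIndex = dropIndex;
    }

    @Override
    public Iterator<List<Object>> iterator() {
        final Iterator<List<Object>> it = delegate.iterator();
        return new Iterator<List<Object>>() {
            @Override
            public boolean hasNext() {
                return it.hasNext();
            }

            @Override
            public List<Object> next() {
                // Build the narrower tuple only when the consumer asks for it,
                // so at most one transformed tuple is created per step.
                List<Object> in = it.next();
                List<Object> out = new ArrayList<>(Math.max(0, in.size() - 1));
                for (int i = 0; i < in.size(); i++) {
                    if (i != dropIndex) {
                        out.add(in.get(i));
                    }
                }
                return out;
            }
        };
    }
}
```

The point of the design is that the transformation cost is paid per tuple during iteration, so no second copy of the bag is ever materialized.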

On Wed, Sep 5, 2012 at 7:38 PM, Alan Gates <ga...@hortonworks.com> wrote:
>
> On Sep 5, 2012, at 6:30 PM, Prasanth J wrote:
>
>> Ahh.. Now it makes more sense.
>>
>> I think I have the solution. I was adding tuples to a List<Tuple> and then creating a DataBag from that list at the end. Instead, I should create a bag and keep adding to it. Is that correct?
> Yes.
>
> Alan.
>
>> Thanks Alan.
>>
>> Thanks
>> -- Prasanth
>>
>> On Sep 5, 2012, at 9:24 PM, Alan Gates <ga...@hortonworks.com> wrote:
>>
>>> You cannot modify a bag once it is written.  The implementation is written around the assumption that bags are immutable after they are written.
>>>
>>> Creating a new bag should not cause an OOM exception, as bags are built to spill to disk when they grow too large.  In fact, it is this spilling feature that makes in-place modification impossible.
>>>
>>> Alan.

Re: Modifying databag on the fly

Posted by Alan Gates <ga...@hortonworks.com>.
On Sep 5, 2012, at 6:30 PM, Prasanth J wrote:

> Ahh.. Now it makes more sense.
> 
> I think I have the solution. I was adding tuples to a List<Tuple> and then creating a DataBag from that list at the end. Instead, I should create a bag and keep adding to it. Is that correct?
Yes.

Alan.



Re: Modifying databag on the fly

Posted by Prasanth J <bu...@gmail.com>.
Ah, now it makes more sense.

I think I have the solution. I was adding tuples to a List<Tuple> and then creating a DataBag from that list at the end. Instead, I should create a bag and keep adding to it. Is that correct?
Thanks Alan. 

Thanks
-- Prasanth



Re: Modifying databag on the fly

Posted by Alan Gates <ga...@hortonworks.com>.
You cannot modify a bag once it is written.  The implementation is written around the assumption that bags are immutable after they are written.  

Creating a new bag should not cause an OOM exception, as bags are built to spill to disk when they grow too large.  In fact, it is this spilling feature that makes in-place modification impossible.

Alan.
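
[Editor's sketch] To make the spilling argument concrete, here is a toy bag that moves its contents to a temporary file once a small in-memory limit is reached. It is only an illustration under loud assumptions: SpillableBag, memoryLimit and dropField are hypothetical names, tuples are modelled as comma-separated strings, and the spill is triggered by a simple count, which is not how Pig decides when to spill. It does show the two points made above: adding tuples one at a time keeps memory bounded, and tuples that have already been written out can no longer be edited in place.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Demo entry point: streams transformed tuples into a bag that spills.
class SpillDemo {
    public static void main(String[] args) {
        SpillableBag out = new SpillableBag(2); // tiny limit to force a spill
        String[] input = {"111,222,3,121", "112,223,2,131", "113,224,4,141"};
        for (String tuple : input) {
            out.add(dropField(tuple, 2)); // one tuple at a time, no List buffer
        }
        out.forEach(System.out::println);
    }

    // Removes the field at the given zero-based index from a CSV-style tuple.
    static String dropField(String tuple, int index) {
        String[] fields = tuple.split(",");
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < fields.length; i++) {
            if (i != index) {
                kept.add(fields[i]);
            }
        }
        return String.join(",", kept);
    }
}

// Hypothetical toy bag: append-only, spills to a temp file when full.
class SpillableBag {
    private final int memoryLimit;
    private final List<String> inMemory = new ArrayList<>();
    private Path spillFile; // created lazily on the first spill
    private long size;

    SpillableBag(int memoryLimit) {
        this.memoryLimit = memoryLimit;
    }

    void add(String tuple) {
        inMemory.add(tuple);
        size++;
        if (inMemory.size() >= memoryLimit) {
            spill();
        }
    }

    long size() {
        return size;
    }

    // Appends the buffered tuples to disk; after this they are out of reach
    // for any in-place edit, which is why the bag is treated as immutable.
    private void spill() {
        try {
            if (spillFile == null) {
                spillFile = Files.createTempFile("bag-spill", ".txt");
                spillFile.toFile().deleteOnExit();
            }
            try (BufferedWriter w =
                     Files.newBufferedWriter(spillFile, StandardOpenOption.APPEND)) {
                for (String t : inMemory) {
                    w.write(t);
                    w.newLine();
                }
            }
            inMemory.clear();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Visits spilled tuples first, then those still in memory (insertion order).
    void forEach(Consumer<String> action) {
        if (spillFile != null) {
            try (BufferedReader r = Files.newBufferedReader(spillFile)) {
                String line;
                while ((line = r.readLine()) != null) {
                    action.accept(line);
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        for (String t : inMemory) {
            action.accept(t);
        }
    }
}
```

Collecting everything in a List<Tuple> first defeats this mechanism, because the list itself has to fit in memory before the bag ever sees a tuple.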
