Posted to user@pig.apache.org by Dmitriy Ryaboy <dv...@gmail.com> on 2012/07/06 17:06:46 UTC

Re: pig reduce OOM

BinSedesTuple is just the tuple; changing it won't do anything about the
fact that lots of tuples are being loaded.

The snippet you provided will not load all the data for computation, since
COUNT implements the Algebraic interface (partial counts are computed in
combiners).
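
For reference, an algebraic UDF is wired up roughly like this -- a minimal
sketch in the spirit of COUNT, not its actual source; MyCount and the
details of what each stage emits are made up for illustration:

    import java.io.IOException;
    import org.apache.pig.Algebraic;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.DataBag;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;

    public class MyCount extends EvalFunc<Long> implements Algebraic {
        // Pig rewires the plan to run Initial on the map side,
        // Intermed in the combiner, and Final in the reducer.
        public String getInitial()  { return Initial.class.getName(); }
        public String getIntermed() { return Intermed.class.getName(); }
        public String getFinal()    { return Final.class.getName(); }

        // Fallback when the combiner optimization cannot be used:
        // this is the path that has to hold the whole bag.
        public Long exec(Tuple input) throws IOException {
            return ((DataBag) input.get(0)).size();
        }

        public static class Initial extends EvalFunc<Tuple> {
            public Tuple exec(Tuple input) throws IOException {
                // Emit a partial count for the tuples seen so far.
                long size = ((DataBag) input.get(0)).size();
                return TupleFactory.getInstance().newTuple(size);
            }
        }

        public static class Intermed extends EvalFunc<Tuple> {
            public Tuple exec(Tuple input) throws IOException {
                // Combine partial counts into a bigger partial count.
                return TupleFactory.getInstance().newTuple(sumPartials(input));
            }
        }

        public static class Final extends EvalFunc<Long> {
            public Long exec(Tuple input) throws IOException {
                // The reducer only ever sees a small bag of partial counts.
                return sumPartials(input);
            }
        }

        static long sumPartials(Tuple input) throws IOException {
            long total = 0;
            for (Tuple t : (DataBag) input.get(0)) {
                total += (Long) t.get(0);
            }
            return total;
        }
    }

Because the map and combine stages collapse each bag into a single partial
count, the reducer should only materialize a small bag of longs.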

Something else is causing tuples to be materialized. Are you using other
UDFs? Can you provide more details on the script? When you run "explain" on
"Result", do you see Pig using COUNT$Final, COUNT$Intermediate, etc?

You can check the "pig.alias" property in the jobconf to identify which
relations are being calculated by a given MR job; that might help narrow
things down.

-Dmitriy


On Thu, Jul 5, 2012 at 11:44 PM, Haitao Yao <ya...@gmail.com> wrote:

> hi,
> I wrote a pig script where one of the reducers always OOMs no matter how I
> change the parallelism.
>         Here's the script snippet:
> Data = group SourceData all;
> Result = foreach Data generate group, COUNT(SourceData);
> store Result into 'XX';
>   I analyzed the dumped Java heap and found that the reason is that
> the reducer loads all the data for the foreach and count.
>
> Can I re-implement BinSedesTuple so that reducers don't load all the data
> for computation?
>
> Here's the object domination tree: [image attachment not preserved in this archive]
>
> Here's the jmap result: [image attachment not preserved in this archive]
>
>
> Haitao Yao
> yao.erix@gmail.com
> weibo: @haitao_yao
> Skype:  haitao.yao.final
>
>

Re: pig reduce OOM

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Saw that right after I sent the reply -- yeah, Pig assumes larger
heaps than what you are running.
It would be a nice project for someone to document all the various
Hadoop and Pig buffers that get allocated, and the parameters that
control them, to see where memory goes in mappers and reducers.
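
As a starting point, a few of the usual suspects can be set straight from a
script -- a partial, hedged list; the property names are from the Hadoop
1.x / Pig releases of this era, and defaults differ across versions:

    -- mapper/reducer JVM heap
    SET mapred.child.java.opts '-Xmx1024m';
    -- fraction of the heap Pig's cached/spillable bags may fill before spilling
    SET pig.cachedbag.memusage '0.2';
    -- MapReduce sort buffer, in MB
    SET io.sort.mb '100';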

D

On Tue, Jul 10, 2012 at 7:49 AM, Haitao Yao <ya...@gmail.com> wrote:
> I've found the reason: it's InternalCachedBag.
> I've posted all the details in a mail titled "What is the best way to do counting in pig?"
> I'm afraid I can't give you a link yet, since the Apache mail archiver hasn't caught up with that message.

Re: pig reduce OOM

Posted by Haitao Yao <ya...@gmail.com>.
I've found the reason: it's InternalCachedBag.
I've posted all the details in a mail titled "What is the best way to do counting in pig?"
I'm afraid I can't give you a link yet, since the Apache mail archiver hasn't caught up with that message.
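
In short, for anyone else who lands here: as far as I can tell from the
source, InternalCachedBag holds tuples on the heap up to a fraction of
memory controlled by pig.cachedbag.memusage and spills the rest to disk, so
shrinking that fraction is worth a try (hedged -- I'm reading the property
name out of the Pig source of this era; check your version):

    SET pig.cachedbag.memusage '0.1';  -- smaller fraction, earlier spill to disk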


Haitao Yao
yao.erix@gmail.com
weibo: @haitao_yao
Skype:  haitao.yao.final



Re: pig reduce OOM

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Like I said earlier, if all you are doing is count, the data bag should not be growing. On the reduce side, it'll just be a bag of counts from each reducer. Something else is happening that's preventing the algebraic and accumulative optimizations from kicking in. Can you share a minimal script that reproduces the problem for you?
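
For completeness: the accumulative path means a UDF implementing
org.apache.pig.Accumulator is handed the bag in chunks rather than all at
once. A rough sketch, not COUNT's actual code -- MyAccCount is a made-up
name:

    import java.io.IOException;
    import org.apache.pig.Accumulator;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.DataBag;
    import org.apache.pig.data.Tuple;

    public class MyAccCount extends EvalFunc<Long> implements Accumulator<Long> {
        private long count = 0;

        // Called repeatedly, each call wrapping one chunk of the bag,
        // so the whole bag never has to sit in memory at once.
        public void accumulate(Tuple chunk) throws IOException {
            count += ((DataBag) chunk.get(0)).size();
        }

        public Long getValue() { return count; }

        public void cleanup() { count = 0; }

        // Non-accumulative fallback path.
        public Long exec(Tuple input) throws IOException {
            return ((DataBag) input.get(0)).size();
        }
    }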


Re: pig reduce OOM

Posted by Haitao Yao <ya...@gmail.com>.
Seems like big data bags are still a headache for Pig.
Here's a mail archive thread I found: http://mail-archives.apache.org/mod_mbox/pig-user/200806.mbox/%3C1b29507a0806102042p2ca02c6ahc23339b6edf8ac6b@mail.gmail.com%3E

I've tried all the ways I can think of, and none of them works.
I think I have to play some tricks inside the Pig source code.



Haitao Yao
yao.erix@gmail.com
weibo: @haitao_yao
Skype:  haitao.yao.final



Re: pig reduce OOM

Posted by Haitao Yao <ya...@gmail.com>.
There's another reason for the OOM: I group the data by ALL, so the parallelism is 1, and with a big data bag the single reducer OOMs.
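That's also why changing the parallelism never helped: group ... all
produces exactly one key, so a single reducer gets the whole bag no matter
what (sketch):

    Data = group SourceData all parallel 10;  -- still a single group; one reducer does all the work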

After digging into the Pig source code, I found that replacing the data bag inside BinSedesTuple is quite tricky and may cause other unknown problems…

Has anybody else encountered the same problem?


Haitao Yao
yao.erix@gmail.com
weibo: @haitao_yao
Skype:  haitao.yao.final



Re: pig reduce OOM

Posted by Haitao Yao <ya...@gmail.com>.
Sorry for the improper statement.
The problem is the DataBag. BinSedesTuple reads the full contents of the DataBag, and when COUNT is applied to that data, it causes the OOM.
The diagrams also show that most of the objects come from the ArrayList.

I want to reimplement the DataBag read by BinSedesTuple so that it just holds a reference to the data input and reads tuples one by one as the iterator walks the data.
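
Roughly what I have in mind -- just a sketch, not working Pig code; a real
DataBag has a much bigger interface to satisfy, and a stream-backed bag can
only be iterated once, which is probably where the "unknown problems" start:

    import java.io.DataInput;
    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.pig.data.BinInterSedes;
    import org.apache.pig.data.Tuple;

    // Holds a reference to the input stream instead of materialized tuples;
    // tuples are deserialized one at a time as the iterator advances.
    class LazyBagSketch implements Iterable<Tuple> {
        private final DataInput in;     // positioned at the bag's first tuple
        private final long numTuples;   // tuple count read from the bag header
        private final BinInterSedes sedes = new BinInterSedes();

        LazyBagSketch(DataInput in, long numTuples) {
            this.in = in;
            this.numTuples = numTuples;
        }

        public Iterator<Tuple> iterator() {
            return new Iterator<Tuple>() {
                private long read = 0;
                public boolean hasNext() { return read < numTuples; }
                public Tuple next() {
                    try {
                        read++;
                        return (Tuple) sedes.readDatum(in);  // lazy deserialization
                    } catch (IOException e) {
                        throw new RuntimeException(e);
                    }
                }
                public void remove() { throw new UnsupportedOperationException(); }
            };
        }
    }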

I will give it a shot.

Haitao Yao
yao.erix@gmail.com
weibo: @haitao_yao
Skype:  haitao.yao.final
