Posted to user@pig.apache.org by Zheng Ziyi <ma...@gmail.com> on 2013/03/05 23:45:34 UTC

Memory issue with datafu StreamingQuantile in apache pig

Hello,


I have a Pig script that computes 1000 quantiles for each of several
columns, and it runs out of Java heap memory. Here is the script.

SET mapred.child.java.opts ' -Xmx4096m -Dfile.encoding=UTF8 -Djava.library.path=/apollo/env/TrafficAnalyticsHadoop/lib';

define Quantile1 datafu.pig.stats.StreamingQuantile('1000');

....

-- B has 50 columns

G = GROUP B ALL;
Quants = FOREACH G GENERATE Quantile1(B.$1) AS q1, Quantile1(B.$2) AS q2;

....

The error:

[main] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2997: Unable to recreate exception from backed error: Error initializing attempt_201301282343_0526_m_000000_0:
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2882)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
    at java.lang.StringBuffer.append(StringBuffer.java:224)
    at com.sun.org.apache.xerces.internal.dom.DeferredDocumentImpl.getNodeValueString(DeferredDocumentImpl.java:1167)
    at com.sun.org.apache.xerces.internal.dom.DeferredDocumentImpl.getNodeValueString(DeferredDocumentImpl.java:1120)
    at com.sun.org.apache.xerces.internal.dom.DeferredTextImpl.synchronizeData(DeferredTextImpl.java:93)
    at com.sun.org.apache.xerces.internal.dom.CharacterDataImpl.getData(CharacterDataImpl.java:160)
    at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1231)
    at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1129)
    at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:1063)
    at org.apache.hadoop.conf.Configuration.get(Configuration.java:416)
    at org.apache.hadoop.mapred.JobConf.checkAndWarnDeprecation(JobConf.java:1910)
    at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:378)
    at org.apache.hadoop.mapred.DefaultTaskController.initializeJob(DefaultTaskController.java:186)
    at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1226)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
    at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1201)
    at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1116)
    at org.apache.hadoop.mapred.TaskTracker$5.run(TaskTracker.java:2404)
    at java.lang.Thread.run(Thread.java:662)

It works fine if I change the code to

Quants = FOREACH G GENERATE Quantile1(B.$1) AS q1;

But it is very annoying to maintain a separate Pig script for each of the
50 columns. Is that the only way to do it? Is this the correct way to use
StreamingQuantile on multiple columns? Do I really need more than 4G of
memory?

Thanks in advance!
Ziyi

Re: Memory issue with datafu StreamingQuantile in apache pig

Posted by Cheolsoo Park <ch...@cloudera.com>.
Hi,

Looking at the stack trace, the task appears to be failing during
initialization because it can't load the JobConf into memory. Pig uses
JobConf heavily. For example, it serializes the entire MR plan, stores it
in the JobConf, and passes it to the back-end. I don't see any workaround
other than either breaking the script into smaller ones or increasing the
heap size of the MR task processes.

I don't know what the rest of your script looks like, but I would try to
break it down. You might not have to call Quantile once per column in
separate scripts if you can factor other parts of the script out into
independent scripts. This is only a guess, so please treat it with caution.
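
If you do end up with one script per column, you need not write them by hand. The following is a minimal sketch, not a tested recipe: a small Python helper that stamps out one Pig script per column from a template. The helper names (render_script, write_scripts), the output file names, and the generated_pig directory are hypothetical, and the parts of the script that were elided in the original post (loading B, storing the result) are left as placeholder comments, so the generated files are not runnable until those are filled in.

```python
from pathlib import Path

# Template for a single-column job. Only the DEFINE, GROUP and FOREACH lines
# come from the original post; the two placeholder comments mark the parts
# that were elided there and must be supplied by the script's author.
PIG_TEMPLATE = """\
DEFINE Quantile1 datafu.pig.stats.StreamingQuantile('1000');

-- Placeholder: the statements that load and prepare B go here.

G = GROUP B ALL;
Quants = FOREACH G GENERATE Quantile1(B.${col}) AS q{col};

-- Placeholder: the statement that stores Quants goes here.
"""


def render_script(col):
    """Return the Pig script text for one positional column, e.g. B.$1."""
    return PIG_TEMPLATE.format(col=col)


def write_scripts(columns, out_dir):
    """Write one .pig file per column index and return the file paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for col in columns:
        path = out / f"quantile_col_{col}.pig"
        path.write_text(render_script(col))
        paths.append(path)
    return paths


if __name__ == "__main__":
    # The post shows columns $1 and $2; widen the range to cover the
    # columns you actually need.
    for script_path in write_scripts(range(1, 3), "generated_pig"):
        print(script_path)
```

Each generated script carries a single StreamingQuantile call, which is the form the original poster reported as working. Whether a middle ground (a handful of columns per script) also fits in memory would have to be found by experiment.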

You might also want to ask your question on the Datafu user group.

Thanks,
Cheolsoo


