Posted to user@pig.apache.org by wi...@thomsonreuters.com on 2015/12/10 18:23:06 UTC

python UDF invocation or memory problem

Hi Pig community,

I am running a Pig job that uses a Python UDF, and I am getting a failure that is hard to debug.  The relevant parts of the script are:

REGISTER [...]clustercentroid_udfs.py using jython as UDFS ;

[... definition of cluster_vals ...]
grouped = group cluster_vals by (clusters::cluster_id, tfidf::att,
                                 clusters::block_size);
cluster_tfidf = foreach grouped {
    generate
       group.clusters::cluster_id as cluster_id,
       group.clusters::block_size as block_size,
       group.tfidf::att as att,
       UDFS.normalize_avg_words(cluster_vals.tfidf::pairs) as centroid;
}
store cluster_tfidf into [...]

I can remove essentially all the logic from UDFS.normalize_avg_words
and still get the failure; for example, I get the failure with this
definition of normalize_avg_words():
@outputSchema('words: {wvpairs: (word: chararray, normvalue: double)}')
def normalize_avg_words(line):
     return []

The log for the failing task contains:

2015-12-09 16:18:47,510 INFO [main] org.apache.pig.data.SchemaTupleBackend: Key [pig.schematuple] was not set... will not generate code.
2015-12-09 16:18:47,534 INFO [main] org.apache.pig.scripting.jython.JythonScriptEngine: created tmp python.cachedir=/data/3/yarn/nm/usercache/sesadmin/appcache/application_1444666458457_553099/container_e17_1444666458457_553099_01_685857/tmp/pig_jython_6256288828533965407
2015-12-09 16:18:49,443 INFO [main] org.apache.pig.scripting.jython.JythonFunction: Schema 'words: {wvpairs: (word: chararray, normvalue: double)}' defined for func normalize_avg_words
2015-12-09 16:18:49,498 INFO [main] org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce: Aliases being processed per job phase (AliasName[line,offset]): M: grouped[87,10] C:  R: cluster_tfidf[99,16]
2015-12-09 16:18:49,511 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
        at java.util.ArrayList.rangeCheck(ArrayList.java:638)
        at java.util.ArrayList.get(ArrayList.java:414)
        at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:118)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPackage.getValueTuple(POPackage.java:348)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPackage.getNextTuple(POPackage.java:269)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:421)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:412)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:256)
        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
        at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)

I do not get the failure if I use just a 5GB segment of my full 300GB data set.

Also, I do not get the failure if I comment out the call to the UDF:

cluster_tfidf = foreach grouped {
    generate
       group.clusters::cluster_id as cluster_id,
       group.clusters::block_size as block_size,
       group.tfidf::att as att;
       -- UDFS.normalize_avg_words(cluster_vals.tfidf::pairs) as centroid;
}

I wonder if the failure is ultimately caused by an out-of-memory condition somewhere, but I haven't seen anything in the log that indicates that directly. (I have tried using a much larger number of reducers in the definition of grouped, but the result is the same.)  What should I look for in the log as a telltale sign of running out of memory? And how would I address it?

Since I don't get the failure when the UDF call is commented out, I wonder if the problem is in the call itself, but I don't know how to diagnose or debug that.

Any help would be much appreciated!

Apache Pig version 0.12.0-cdh5.3.3 (rexported)
Hadoop 2.5.0-cdh5.3.3

Thanks,
Will

William F Dowling
Senior Technologist
Thomson Reuters

Re: python UDF invocation or memory problem

Posted by Rohini Palaniswamy <ro...@gmail.com>.
Run it in local mode after doing

export PIG_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heapdump.hprof"

Then you should be able to look into the heap dump and see where you are leaking memory in your UDF.
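Roughly, something like this (the script name clustercentroid.pig is just a placeholder for your own script, and you can use Eclipse MAT or jvisualvm instead of jhat):

export PIG_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heapdump.hprof"
pig -x local clustercentroid.pig      # run locally against a sample of the input
jhat /tmp/heapdump.hprof              # inspect the dump once an OutOfMemoryError is hit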
