Posted to dev@pig.apache.org by "Cheolsoo Park (JIRA)" <ji...@apache.org> on 2013/03/29 17:45:16 UTC

[jira] [Commented] (PIG-3263) Resolving UDFs fails while using pig embedded code in Python when using parallel execution

    [ https://issues.apache.org/jira/browse/PIG-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13617496#comment-13617496 ] 

Cheolsoo Park commented on PIG-3263:
------------------------------------

Thank you, Jakub, for reporting the issue.

I am puzzled because packageImportList is a ThreadLocal variable, so concurrent front-end parses should not interfere with each other, and this front-end exception shouldn't be thrown:
{code}
private static ThreadLocal<ArrayList<String>> packageImportList = new ThreadLocal<ArrayList<String>>();
{code}
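For illustration, here is a minimal, self-contained sketch of the ThreadLocal semantics this relies on (not Pig source; the class name is made up): a value registered on one thread stays invisible to other threads, so concurrent front-end parses should not clobber each other's import lists, but a freshly spawned thread starts with an empty list:
{code}
import java.util.ArrayList;
import java.util.List;

public class ThreadLocalImportsDemo {
    // same pattern as packageImportList: one list per thread
    private static final ThreadLocal<List<String>> imports =
            new ThreadLocal<List<String>>() {
                @Override
                protected List<String> initialValue() {
                    return new ArrayList<String>();
                }
            };

    public static void main(String[] args) throws InterruptedException {
        imports.get().add("org.apache.pig.builtin.");
        System.out.println("main thread sees:   " + imports.get());

        Thread worker = new Thread(new Runnable() {
            public void run() {
                // prints [] -- the main thread's registration is not visible here
                System.out.println("worker thread sees: " + imports.get());
            }
        });
        worker.start();
        worker.join();
    }
}
{code}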
Looking at the stack trace, I can see both front-end and back-end errors:
{code}
Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. Could not resolve my.pig.udf.OrderQueryTokens using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.]
...
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 1070: Could not resolve my.pig.udf.OrderQueryTokens using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.]
{code}
I suppose that these errors are from different threads?

The back-end error makes sense because the LocalJobRunner in Hadoop 0.20.x and 1.0.x is *not* thread-safe. I have seen several similar issues (PIG-2852, PIG-2932, etc.) caused by this, and it is also documented [here|http://pig.apache.org/docs/r0.11.0/start.html#execution-modes].

But the front-end should be thread-safe; if it is not, that should be fixed.
                
> Resolving UDFs fails while using pig embedded code in Python when using parallel execution
> ------------------------------------------------------------------------------------------
>
>                 Key: PIG-3263
>                 URL: https://issues.apache.org/jira/browse/PIG-3263
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.10.0
>         Environment: pig-0.10.1, hadoop 0.20.2
>            Reporter: Jakub Glapa
>         Attachments: stacktrace.txt
>
>
> I started using embedded Pig in Python scripts. I needed to execute a Pig script with a slightly different set of parameters for each run.
> The jobs are quite small, so taking advantage of the cluster and running them in parallel made sense.
> Here's the Python code I used (executed as: bin/pig run.py script.pig):
> {code}
> from org.apache.pig.scripting import Pig
> import sys
>
> def main():
>     SCRIPT_NAME = sys.argv[1]
>     jobParamsSets = prepareParameterSets()
>     NUM_OF_JOBS_TO_RUN_AT_ONCE = 5
>     while len(jobParamsSets) != 0:
>         batchParamSet = jobParamsSets[:NUM_OF_JOBS_TO_RUN_AT_ONCE]
>         del jobParamsSets[:NUM_OF_JOBS_TO_RUN_AT_ONCE]
>         print 'batch to execute:', batchParamSet
>         P = Pig.compileFromFile(SCRIPT_NAME)
>         # binding a list of parameter maps makes run() execute them in parallel
>         bound = P.bind(batchParamSet)
>         stats = bound.run()
>         for s in stats:
>             print s.isSuccessful(), s.getDuration(), s.getReturnCode(), s.getErrorMessage()
>
> def prepareParameterSets():
>     # loads properties from files and creates multiple sets of parameters
>     # (placeholder body for illustration; each dict maps a Pig parameter
>     # name to its value for one run)
>     return [{'param': str(i)} for i in range(150)]
>
> if __name__ == '__main__':
>     main()
> {code}
> The {{NUM_OF_JOBS_TO_RUN_AT_ONCE}} variable controls the degree of parallelism.
> I can have up to 150 parameter sets, which means 150 Pig executions.
> Everything seemed to work just fine, but I started noticing occasional failures for individual job executions.
> It happens intermittently: 0-5 executions fail out of 150, for example, always with the same kind of error.
> {code}
> 2013-02-14 16:25:04,575 [main] ERROR org.apache.pig.scripting.BoundScript - Pig pipeline failed to complete
> java.util.concurrent.ExecutionException: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. Could not resolve my.pig.udf.OrderQueryTokens using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.]
> ...
> {code}
> The full stack trace is attached.
> I'm using many UDFs, so the name of the UDF in the exception varies.
> I suspect there is a threading issue somewhere.
> My best guess is that org.apache.pig.impl.PigContext.resolveClassName is not thread-safe, and that something goes wrong when multiple threads try to resolve a UDF class at the same time.
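> To illustrate the kind of race I suspect, here is a hypothetical, self-contained sketch (not actual Pig source; the class and field names are made up) of a resolver that walks a shared, unsynchronized import list while another thread is still registering packages:
> {code}
> import java.util.ArrayList;
> import java.util.List;
>
> public class SharedImportListRace {
>     private static final List<String> imports = new ArrayList<String>();
>
>     static Class<?> resolve(String name) throws ClassNotFoundException {
>         for (String prefix : imports) {          // unsynchronized read
>             try {
>                 return Class.forName(prefix + name);
>             } catch (ClassNotFoundException e) {
>                 // try the next prefix
>             }
>         }
>         throw new ClassNotFoundException(
>                 "Could not resolve " + name + " using imports: " + imports);
>     }
>
>     public static void main(String[] args) throws Exception {
>         Thread registrar = new Thread(new Runnable() {
>             public void run() {
>                 imports.add("java.lang.");       // racy write
>             }
>         });
>         registrar.start();
>         // Depending on scheduling this resolves java.lang.String, throws
>         // "Could not resolve ...", or hits ConcurrentModificationException.
>         System.out.println(resolve("String"));
>         registrar.join();
>     }
> }
> {code}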
> I've tried a couple of tricks hoping they might help. To my knowledge, there are 3 ways to register your UDF jars:
> # in the Pig script: REGISTER lib/*.jar;
> # in Python: Pig.registerJar("/lib/*.jar")
> # as a command-line parameter to the pig command: $PIGDIR/bin/pig -Dpig.additional.jars=lib/*.jar
> Initially I used option 1. I thought that registering the jars globally right at the beginning with option 3 might let me work around the bug. The failure rate dropped, but the problem didn't fully go away and still appears from time to time.
> The problem is that I cannot provide a reproducible test case. My process is quite complicated, and presenting it here seems infeasible. I tried to strip my scripts down to something quick and simple to present, and ran that with about 1000 parameter sets and parallelism set to 10 or 20, but sadly the failure never occurred.
> PS.
> With pig-0.10.1 I had to substitute the bundled Jython dependency with a standalone version; otherwise I wasn't able to use Python standard modules.
> I couldn't check whether this bug still exists in pig-0.11.0, as that version is incompatible with Hadoop 0.20; pig-0.11.1 has not been released yet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira