Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:35:43 UTC

[jira] [Resolved] (SPARK-16682) pyspark 1.6.0 not handling multiple level import when the necessary files are zipped

     [ https://issues.apache.org/jira/browse/SPARK-16682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-16682.
----------------------------------
    Resolution: Incomplete

> pyspark 1.6.0 not handling multiple level import when the necessary files are zipped
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-16682
>                 URL: https://issues.apache.org/jira/browse/SPARK-16682
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.6.0
>         Environment: Spark Standalone
> Red Hat Linux
> Mac
>            Reporter: Santosh Balasubramanya
>            Priority: Major
>              Labels: bulk-closed
>
> In Spark Standalone mode (1.6.0), both batch and streaming jobs fail to pick up dependencies that are packaged in zip format and added with "addPyFile".
>  The dependency Python files are modularized and placed in a hierarchical folder structure, for example:
> from workflow import di
> from workflow import cache
> Imports of the kind shown above fail, even when the imports are placed inside each of the functions that are called from map and foreach. The option given in http://stackoverflow.com/questions/27644525/pyspark-py-files-doesnt-work was also tried, without success. (A minimal packaging sketch follows the trace below.)
> Detailed error trace below:
> Job aborted due to stage failure: Task 1 in stage 6718.0 failed 4 times, most recent failure: Lost task 1.3 in stage 6718.0 (TID 7287, 10.131.66.63): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
>   File "/home/forty2/analytics/eltkhome/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py", line 98, in main
>     command = pickleSer._read_with_length(infile)
>   File "/home/forty2/analytics/eltkhome/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
>     return self.loads(obj)
>   File "/home/forty2/analytics/eltkhome/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 422, in loads
>     return pickle.loads(obj)
>   File "/home/forty2/analytics/eltkhome/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 653, in subimport
>     __import__(name)
> ImportError: ('No module named workflow.datainterface', <function subimport at 0x925c08>, ('workflow.datainterface',))
> 	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
> 	at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
> 	at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
> 	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:89)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
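
For context, the sketch below reconstructs a minimal version of the setup described in the report: a hierarchical package zipped on the driver and shipped to the executors with addPyFile. The "workflow" layout, module names, and app name are assumptions made for illustration, not the reporter's actual code. Note that on Python 2 (which Spark 1.6 uses), zipimport only resolves a dotted import such as "workflow.datainterface" if the "workflow" directory sits at the root of the zip and every package directory contains an __init__.py; a zip built at the wrong directory level is a common way to end up with the ImportError shown in the trace.

    # Minimal sketch of zipping a hierarchical package and shipping it with
    # addPyFile (hypothetical paths and module names, for illustration only).
    import os
    import zipfile

    from pyspark import SparkConf, SparkContext

    def build_dependency_zip(src_dir, zip_path):
        # src_dir is a relative path such as "workflow"; each entry is stored
        # under "workflow/...", so the package sits at the root of the zip,
        # which is where zipimport looks once the zip is on sys.path.
        with zipfile.ZipFile(zip_path, "w") as zf:
            for root, _, files in os.walk(src_dir):
                for name in files:
                    path = os.path.join(root, name)
                    zf.write(path, path)

    if __name__ == "__main__":
        # Assumed layout on the driver:
        #   workflow/__init__.py
        #   workflow/di.py
        #   workflow/cache.py
        #   workflow/datainterface.py
        build_dependency_zip("workflow", "workflow.zip")

        sc = SparkContext(conf=SparkConf().setAppName("addPyFile-sketch"))
        # Ship the zip to every executor; Spark adds it to sys.path there.
        sc.addPyFile("workflow.zip")

        def use_dependency(x):
            # Importing inside the function means the import runs on the
            # executor, after workflow.zip is already on sys.path.
            from workflow import datainterface  # hypothetical module
            return x

        print(sc.parallelize(range(4)).map(use_dependency).collect())

The same zip can also be supplied up front with spark-submit's --py-files option instead of calling addPyFile at runtime; either way, the package structure inside the zip is what determines whether "workflow.datainterface" can be imported on the executors.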



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org