Posted to user@spark.apache.org by Andy Davidson <An...@SantaCruzIntegration.com> on 2018/04/05 00:36:43 UTC

how to set up pyspark eclipse, pyDev, virtualenv? syntaxError: yield from walk(

I am having a heck of a time setting up my development environment. I used
pip to install pyspark. I also downloaded spark from apache.

My Eclipse PyDev interpreter is configured as a python3 virtualenv.

I have a simple unit test that loads a small dataframe. df.show() generates
the following error:


2018-04-04 17:13:56 ERROR Executor:91 - Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException:
Error from python worker:
  Traceback (most recent call last):
    File "/Users/a/workSpace/pythonEnv/spark-2.3.0/lib/python3.6/site.py", line 67, in <module>
      import os
    File "/Users/a/workSpace/pythonEnv/spark-2.3.0/lib/python3.6/os.py", line 409
      yield from walk(new_path, topdown, onerror, followlinks)
               ^
  SyntaxError: invalid syntax
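
Note: "yield from" is Python 3-only syntax, so a SyntaxError on that line
suggests the Python worker is being started with a Python 2 interpreter that
then picks up the virtualenv's Python 3.6 standard library. A minimal sketch
of the kind of workaround I am experimenting with, run before the
SparkContext is created (PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are
standard Spark environment variables; pointing them at the driver's own
interpreter is an assumption on my part, not a confirmed fix):

import os
import sys

# Assumption: the driver already runs the virtualenv's Python 3, so
# sys.executable is the interpreter the workers should use as well.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable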





My unittest class is derived from:



import unittest

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# module-level registry of the SparkContexts created by the tests
# (defined here so the snippet is self-contained)
sc_values = {}


class PySparkTestCase(unittest.TestCase):

    @classmethod
    def setUpClass(cls):
        conf = SparkConf().setMaster("local[2]") \
            .setAppName(cls.__name__)
        #     .set("spark.authenticate.secret", "111111")  # currently disabled
        cls.sparkContext = SparkContext(conf=conf)
        sc_values[cls.__name__] = cls.sparkContext
        cls.sqlContext = SQLContext(cls.sparkContext)
        print("aedwip:", SparkContext)

    @classmethod
    def tearDownClass(cls):
        print("....calling stop tearDownClass, the content of sc_values=", sc_values)
        sc_values.clear()
        cls.sparkContext.stop()
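
A hypothetical subclass, just to illustrate how the base class is meant to
be used (the test name and data are made up):

class SimpleDataFrameTest(PySparkTestCase):

    def test_show(self):
        # build a tiny two-row data frame and display it
        rdd = self.sparkContext.parallelize([("a", 1), ("b", 2)])
        df = self.sqlContext.createDataFrame(rdd, ["letter", "count"])
        df.show()
        self.assertEqual(2, df.count())

if __name__ == "__main__":
    unittest.main()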



This looks similar to the PySparkTestCase class in
https://github.com/apache/spark/blob/master/python/pyspark/tests.py



Any suggestions would be greatly appreciated.



Andy



My downloaded version is spark-2.3.0-bin-hadoop2.7



My virtualenv's versions are:

(spark-2.3.0) $ pip show pySpark
Name: pyspark
Version: 2.3.0
Summary: Apache Spark Python API
Home-page: https://github.com/apache/spark/tree/master/python
Author: Spark Developers
Author-email: dev@spark.apache.org
License: http://www.apache.org/licenses/LICENSE-2.0
Location: /Users/a/workSpace/pythonEnv/spark-2.3.0/lib/python3.6/site-packages
Requires: py4j

(spark-2.3.0) $ python --version
Python 3.6.1
(spark-2.3.0) $





Re: how to set up pyspark eclipse, pyDev, virtualenv? syntaxError: yield from walk(

Posted by Andy Davidson <An...@SantaCruzIntegration.com>.
Hi Hyukjin,

Thanks for the links.

At this point I have my Eclipse, PyDev, Spark, and unit-test setup mostly
working. I can run a simple unit test either from the command line or from
within Eclipse. The test creates a data frame from a text file and calls
df.show(), as sketched below.
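
Roughly, the test looks like this; a sketch, assuming the PySparkTestCase
base class from my first message (the class name and file path are made up):

class LoadTextFileTest(PySparkTestCase):

    def test_load_text_file(self):
        # "data/sample.txt" is a placeholder; read.text() loads each
        # line of the file into a single string column named "value"
        df = self.sqlContext.read.text("data/sample.txt")
        df.show()
        self.assertTrue(df.count() > 0)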

The last challenge is that pyspark.sql.functions defines some of its
functions at run time; examples are lit() and col(). This causes problems
with my IDE's static analysis:

https://issues.apache.org/jira/browse/SPARK-23878?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=16427812#comment-16427812
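
One workaround sketch is to import the module rather than the individual
names, so the IDE only has to resolve the module itself (some checkers may
still flag the dynamically generated attributes):

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as F  # lit(), col(), etc. exist at run time

sc = SparkContext(conf=SparkConf().setMaster("local[2]").setAppName("fDemo"))
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame([("a",), ("b",)], ["value"])
# F.col and F.lit resolve when executed, even though they are generated
# dynamically inside pyspark/sql/functions.py
df.select(F.col("value"), F.lit(1).alias("one")).show()
sc.stop()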

Andy

P.S. I originally started my project using Jupyter notebooks. The code base
got too big to manage in notebooks, so I am refactoring the common code into
Python modules using a standard Python IDE. In the IDE I need to be able to
import all the Spark functions and to write and run unit tests.

I chose Eclipse because I have a lot of Spark code written in Java. It's
easier for me to have one IDE for all my Java and Python code.

From:  Hyukjin Kwon <gu...@gmail.com>
Date:  Thursday, April 5, 2018 at 6:09 PM
To:  Andrew Davidson <An...@SantaCruzIntegration.com>
Cc:  "user @spark" <us...@spark.apache.org>
Subject:  Re: how to set up pyspark eclipse, pyDev, virtualenv? syntaxError:
yield from walk(

> FYI, there is a PR and JIRA for virtualEnv support in PySpark
> 
> https://issues.apache.org/jira/browse/SPARK-13587
> https://github.com/apache/spark/pull/13599


Re: how to set up pyspark eclipse, pyDev, virtualenv? syntaxError: yield from walk(

Posted by Hyukjin Kwon <gu...@gmail.com>.
FYI, there is a PR and JIRA for virtualEnv support in PySpark

https://issues.apache.org/jira/browse/SPARK-13587
https://github.com/apache/spark/pull/13599



Re: how to set up pyspark eclipse, pyDev, virtualenv? syntaxError: yield from walk(

Posted by Andy Davidson <An...@SantaCruzIntegration.com>.
FYI

http://www.learn4master.com/algorithms/pyspark-unit-test-set-up-sparkcontext
