Posted to user@spark.apache.org by weineran <an...@u.northwestern.edu> on 2016/01/07 23:39:13 UTC

SparkContext SyntaxError: invalid syntax

Hello,

When I try to submit a python job using spark-submit (using --master yarn
--deploy-mode cluster), I get the following error:

Traceback (most recent call last):
  File "loss_rate_by_probe.py", line 15, in ?
    from pyspark import SparkContext
  File
"/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/__init__.py",
line 41, in ?
  File
"/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/context.py",
line 219
    with SparkContext._lock:
                    ^
SyntaxError: invalid syntax

This is very similar to this post from 2014
<http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-lock-Error-td18233.html>,
but unlike that person I am using Python 2.7.8.

Here is what I'm using:
Spark 1.3.1
Hadoop 2.4.0.2.1.5.0-695
Python 2.7.8

Another clue:  I also installed Spark 1.6.0 and tried to submit the same
job.  I got a similar error:

Traceback (most recent call last):
  File "loss_rate_by_probe.py", line 15, in ?
    from pyspark import SparkContext
  File
"/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0119/container_1450370639491_0119_01_000001/pyspark.zip/pyspark/__init__.py",
line 61
    indent = ' ' * (min(len(m) for m in indents) if indents else 0)
                                                  ^
SyntaxError: invalid syntax
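
Both failing lines use syntax that Python 2.4 cannot parse (conditional
expressions arrived in Python 2.5 and the with statement in Python 2.6), which
hints that some older interpreter is parsing the job on the cluster.  A minimal
guard to confirm that, sketched here for illustration only; it would sit above
the pyspark import in loss_rate_by_probe.py and uses only 2.4-compatible
syntax:

import sys

# Fail fast with the interpreter version if this Python is too old to parse
# pyspark (Spark 1.x documents Python 2.6 or higher as a requirement).
if sys.version_info < (2, 6):
    raise RuntimeError("Python is too old for PySpark: %s"
                       % str(sys.version_info))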

Any thoughts?

Andrew



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-SyntaxError-invalid-syntax-tp25910.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Contrib to Docs: Re: SparkContext SyntaxError: invalid syntax

Posted by Jim Lohse <sp...@megalearningllc.com>.
I don't think you have to build the docs; you can just fork them on GitHub
and submit the pull request, right?

What I have been able to do is submit a pull request just by editing the
Markdown file; I am just confused about whether I am supposed to merge it
myself or wait for notification and/or wait for someone else to merge it.

https://github.com/jimlohse/spark/pull/1 (which I believe everyone can
see; on my end I can merge it because there are no conflicts, but should I?)


From
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-ContributingDocumentationChanges 
(which you have probably read so I am mostly posting for others thinking 
of contributing to the docs :)

"To have us add a link to an external tutorial you wrote, simply email 
the developer mailing list.
To modify the built-in documentation, edit the Markdown source files in 
Spark's docs directory, whose README file shows how to build the 
documentation locally to test your changes.

The process to propose a doc change is otherwise the same as the process 
for proposing code changes below."



On 01/18/2016 07:35 AM, Andrew Weiner wrote:
> Hi Felix,
>
> Yeah, when I try to build the docs using jekyll build, I get a 
> LoadError (cannot load such file -- pygments) and I'm having trouble 
> getting past it at the moment.
>
> From what I could tell, this does not apply to YARN in client mode.  I 
> was able to submit jobs in client mode and they would run fine without 
> using the appMasterEnv property. I even confirmed that my environment 
> variables persisted during the job when run in client mode.  There is 
> something about YARN cluster mode that uses a different environment 
> (the YARN Application Master environment) and requires the 
> appMasterEnv property for setting environment variables.
>
> On Sun, Jan 17, 2016 at 11:37 PM, Felix Cheung 
> <felixcheung_m@hotmail.com <ma...@hotmail.com>> wrote:
>
>     Do you still need help on the PR?
>     btw, does this apply to YARN client mode?
>


Re: SparkContext SyntaxError: invalid syntax

Posted by Andrew Weiner <an...@u.northwestern.edu>.
Thanks Felix.  I think I was missing gem install pygments.rb, and I also had
to roll back to Python 2.7, but I got it working.  I submitted the PR with the
added explanation in the docs.
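
For anyone who finds this thread later, here is roughly the check (adapted from
Bryan's snippet earlier in the thread) that confirms which interpreters the
driver and the executors end up with once spark.yarn.appMasterEnv.PYSPARK_PYTHON
is set; a minimal sketch only, with an illustrative app name:

from pyspark import SparkContext
import sys

sc = SparkContext(appName="python-version-check")

# Interpreter running the driver (the YARN Application Master in cluster mode).
driver_version = tuple(sys.version_info)[:3]

# Interpreters running the executors; spread a few tasks across partitions.
executor_versions = (sc.parallelize(range(8), 8)
                     .map(lambda _: tuple(__import__("sys").version_info)[:3])
                     .distinct()
                     .collect())

print("driver: %s  executors: %s" % (driver_version, executor_versions))
sc.stop()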

Andrew

On Wed, Jan 20, 2016 at 1:44 AM, Felix Cheung <fe...@hotmail.com>
wrote:

>
> I have to run this to install the pre-req to get jeykyll build to work,
> you do need the python pygments package:
>
> (I’m on ubuntu)
> sudo apt-get install ruby ruby-dev make gcc nodejs
> sudo gem install jekyll --no-rdoc --no-ri
> sudo gem install jekyll-redirect-from
> sudo apt-get install python-Pygments
> sudo apt-get install python-sphinx
> sudo gem install pygments.rb
>
>
> Hope that helps!
> If not, I can try putting together a doc change but I’d rather you could
> make progress :)
>
>
>
>
>
> On Mon, Jan 18, 2016 at 6:36 AM -0800, "Andrew Weiner" <
> andrewweiner2020@u.northwestern.edu> wrote:
>
> Hi Felix,
>
> Yeah, when I try to build the docs using jekyll build, I get a LoadError
> (cannot load such file -- pygments) and I'm having trouble getting past it
> at the moment.
>
> From what I could tell, this does not apply to YARN in client mode.  I was
> able to submit jobs in client mode and they would run fine without using
> the appMasterEnv property.  I even confirmed that my environment variables
> persisted during the job when run in client mode.  There is something about
> YARN cluster mode that uses a different environment (the YARN Application
> Master environment) and requires the appMasterEnv property for setting
> environment variables.
>
> On Sun, Jan 17, 2016 at 11:37 PM, Felix Cheung <fe...@hotmail.com>
> wrote:
>
> Do you still need help on the PR?
> btw, does this apply to YARN client mode?
>
> ------------------------------
> From: andrewweiner2020@u.northwestern.edu
> Date: Sun, 17 Jan 2016 17:00:39 -0600
> Subject: Re: SparkContext SyntaxError: invalid syntax
> To: cutlerb@gmail.com
> CC: user@spark.apache.org
>
>
> Yeah, I do think it would be worth explicitly stating this in the docs.  I
> was going to try to edit the docs myself and submit a pull request, but I'm
> having trouble building the docs from github.  If anyone else wants to do
> this, here is approximately what I would say:
>
> (To be added to
> http://spark.apache.org/docs/latest/configuration.html#environment-variables
> )
> "Note: When running Spark on YARN in cluster mode, environment variables
> need to be set using the spark.yarn.appMasterEnv.[EnvironmentVariableName]
> property in your conf/spark-defaults.conf file.  Environment variables
> that are set in spark-env.sh will not be reflected in the YARN
> Application Master process in cluster mode.  See the YARN-related Spark
> Properties
> <http://spark.apache.org/docs/latest/running-on-yarn.html#spark-properties>
> for more information."
>
> I might take another crack at building the docs myself if nobody beats me
> to this.
>
> Andrew
>
>
> On Fri, Jan 15, 2016 at 5:01 PM, Bryan Cutler <cu...@gmail.com> wrote:
>
> Glad you got it going!  It wasn't very obvious what needed to be set,
> maybe it is worth explicitly stating this in the docs since it seems to
> have come up a couple times before too.
>
> Bryan
>
> On Fri, Jan 15, 2016 at 12:33 PM, Andrew Weiner <
> andrewweiner2020@u.northwestern.edu> wrote:
>
> Actually, I just found this [
> https://issues.apache.org/jira/browse/SPARK-1680], which after a bit of
> googling and reading leads me to believe that the preferred way to change
> the yarn environment is to edit the spark-defaults.conf file by adding this
> line:
> spark.yarn.appMasterEnv.PYSPARK_PYTHON    /path/to/python
>
> While both this solution and the solution from my prior email work, I
> believe this is the preferred solution.
>
> Sorry for the flurry of emails.  Again, thanks for all the help!
>
> Andrew
>
> On Fri, Jan 15, 2016 at 1:47 PM, Andrew Weiner <
> andrewweiner2020@u.northwestern.edu> wrote:
>
> I finally got the pi.py example to run in yarn cluster mode.  This was the
> key insight:
> https://issues.apache.org/jira/browse/SPARK-9229
>
> I had to set SPARK_YARN_USER_ENV in spark-env.sh:
> export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/home/aqualab/local/bin/python"
>
> This caused the PYSPARK_PYTHON environment variable to be used in my yarn
> environment in cluster mode.
>
> Thank you for all your help!
>
> Best,
> Andrew
>
>
>
> On Fri, Jan 15, 2016 at 12:57 PM, Andrew Weiner <
> andrewweiner2020@u.northwestern.edu> wrote:
>
> I tried playing around with my environment variables, and here is an
> update.
>
> When I run in cluster mode, my environment variables do not persist
> throughout the entire job.
> For example, I tried creating a local copy of HADOOP_CONF_DIR in
> /home/<username>/local/etc/hadoop/conf, and then, in spark-env.sh I set the
> variable:
> export HADOOP_CONF_DIR=/home/<username>/local/etc/hadoop/conf
>
> Later, when we print the environment variables in the python code, I see
> this:
>
> ('HADOOP_CONF_DIR', '/etc/hadoop/conf')
>
> However, when I run in client mode, I see this:
>
> ('HADOOP_CONF_DIR', '/home/awp066/local/etc/hadoop/conf')
>
> Furthermore, if I omit that environment variable from spark-env.sh altogether, I get the expected error in both client and cluster mode:
>
> When running with master 'yarn'
> either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
>
> This suggests that my environment variables are being used when I first submit the job, but at some point during the job, my environment variables are thrown out and someone's (yarn's?) environment variables are being used.
>
> Andrew
>
>
> On Fri, Jan 15, 2016 at 11:03 AM, Andrew Weiner <
> andrewweiner2020@u.northwestern.edu> wrote:
>
> Indeed!  Here is the output when I run in cluster mode:
>
> Traceback (most recent call last):
>   File "pi.py", line 22, in ?
>     raise RuntimeError("\n"+str(sys.version_info) +"\n"+
> RuntimeError:
> (2, 4, 3, 'final', 0)
> [('PYSPARK_GATEWAY_PORT', '48079'), ('PYTHONPATH', '/scratch2/hadoop/yarn/local/usercache/<username>/filecache/116/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/<user>/spark-1.6.0-bin-hadoop2.4/python:/home/<username>/code/libs:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0239/container_1450370639491_0239_01_000001/pyspark.zip:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0239/container_1450370639491_0239_01_000001/py4j-0.9-src.zip'), ('PYTHONUNBUFFERED', 'YES')]
>
> As we suspected, it is using Python 2.4
>
> One thing that surprises me is that PYSPARK_PYTHON is not showing up in the list, even though I am setting it and exporting it in spark-submit *and* in spark-env.sh.  Is there somewhere else I need to set this variable?  Maybe in one of the hadoop conf files in my HADOOP_CONF_DIR?
>
> Andrew
>
>
>
> On Thu, Jan 14, 2016 at 1:14 PM, Bryan Cutler <cu...@gmail.com> wrote:
>
> It seems like it could be the case that some other Python version is being
> invoked.  To make sure, can you add something like this to the top of the
> .py file you are submitting to get some more info about how the application
> master is configured?
>
> import sys, os
> raise RuntimeError("\n"+str(sys.version_info) +"\n"+
>     str([(k,os.environ[k]) for k in os.environ if "PY" in k]))
>
> On Thu, Jan 14, 2016 at 8:37 AM, Andrew Weiner <
> andrewweiner2020@u.northwestern.edu> wrote:
>
> Hi Bryan,
>
> I ran "$> python --version" on every node on the cluster, and it is Python
> 2.7.8 for every single one.
>
> When I try to submit the Python example in client mode
> * ./bin/spark-submit      --master yarn     --deploy-mode client
> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
> ./examples/src/main/python/pi.py     10*
> That's when I get this error that I mentioned:
>
> 16/01/14 10:09:10 WARN scheduler.TaskSetManager: Lost task 0.0 in stage
> 0.0 (TID 0, mundonovo-priv): org.apache.spark.SparkException:
> Error from python worker:
>   python: module pyspark.daemon not found
> PYTHONPATH was:
>
> /scratch5/hadoop/yarn/local/usercache/<username>/filecache/48/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/aqualab/spark-1.6.0-bin-hadoop2.4/python:/home/jpr123/hg.pacific/python-common:/home/jp
>
> r123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/home/<username>/code/libs:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0187/container_1450370639491_0187_01_000002/pyspark.zip:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0187/container_1450370639491_0187_01_000002/py4j-0.9-src.zip
> java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>         at
> org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:164)
>         at [....]
>
> followed by several more similar errors that also say:
> Error from python worker:
>   python: module pyspark.daemon not found
>
>
> Even though the default python appeared to be correct, I just went ahead
> and explicitly set PYSPARK_PYTHON in conf/spark-env.sh to the path of the
> default python binary executable.  After making this change I was able to
> run the job successfully in client mode!  That is, this appeared to fix the
> "pyspark.daemon not found" error when running in client mode.
>
> However, when running in cluster mode, I am still getting the same syntax
> error:
>
> Traceback (most recent call last):
>   File "pi.py", line 24, in ?
>     from pyspark import SparkContext
>   File "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py", line 61
>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>                                                   ^
> SyntaxError: invalid syntax
>
> Is it possible that the PYSPARK_PYTHON environment variable is ignored when jobs are submitted in cluster mode?  It seems that Spark or Yarn is going behind my back, so to speak, and using some older version of python I didn't even know was installed.
>
> Thanks again for all your help thus far.  We are getting close....
>
> Andrew
>
>
>
> On Wed, Jan 13, 2016 at 6:13 PM, Bryan Cutler <cu...@gmail.com> wrote:
>
> Hi Andrew,
>
> There are a couple of things to check.  First, is Python 2.7 the default
> version on all nodes in the cluster or is it an alternate install? Meaning
> what is the output of this command "$>  python --version"  If it is an
> alternate install, you could set the environment variable "PYSPARK_PYTHON",
> the Python binary executable to use for PySpark in both driver and workers
> (default is python).
>
> Did you try to submit the Python example under client mode?  Otherwise,
> the command looks fine, you don't use the --class option for submitting
> python files
> * ./bin/spark-submit      --master yarn     --deploy-mode client
> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
> ./examples/src/main/python/pi.py     10*
>
> That is a good sign that local jobs and Java examples work, probably just
> a small configuration issue :)
>
> Bryan
>
> On Wed, Jan 13, 2016 at 3:51 PM, Andrew Weiner <
> andrewweiner2020@u.northwestern.edu> wrote:
>
> Thanks for your continuing help.  Here is some additional info.
>
> *OS/architecture*
> output of *cat /proc/version*:
> Linux version 2.6.18-400.1.1.el5 (mockbuild@x86-012.build.bos.redhat.com)
>
> output of *lsb_release -a*:
> LSB Version:
>  :core-4.0-amd64:core-4.0-ia32:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-ia32:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-ia32:printing-4.0-noarch
> Distributor ID: RedHatEnterpriseServer
> Description:    Red Hat Enterprise Linux Server release 5.11 (Tikanga)
> Release:        5.11
> Codename:       Tikanga
>
> *Running a local job*
> I have confirmed that I can successfully run python jobs using
> bin/spark-submit --master local[*]
> Specifically, this is the command I am using:
> *./bin/spark-submit --master local[8]
> ./examples/src/main/python/wordcount.py
> file:/home/<username>/spark-1.6.0-bin-hadoop2.4/README.md*
> And it works!
>
> *Additional info*
> I am also able to successfully run the Java SparkPi example using yarn in
> cluster mode using this command:
> * ./bin/spark-submit --class org.apache.spark.examples.SparkPi
> --master yarn     --deploy-mode cluster     --driver-memory 4g
> --executor-memory 2g     --executor-cores 1     lib/spark-examples*.jar
> 10*
> This Java job also runs successfully when I change --deploy-mode to
> client.  The fact that I can run Java jobs in cluster mode makes me think
> that everything is installed correctly--is that a valid assumption?
>
> The problem remains that I cannot submit python jobs.  Here is the command
> that I am using to try to submit python jobs:
> * ./bin/spark-submit      --master yarn     --deploy-mode cluster
> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
> ./examples/src/main/python/pi.py     10*
> Does that look like a correct command?  I wasn't sure what to put for
> --class so I omitted it.  At any rate, the result of the above command is a
> syntax error, similar to the one I posted in the original email:
>
> Traceback (most recent call last):
>   File "pi.py", line 24, in ?
>     from pyspark import SparkContext
>   File "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py", line 61
>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>                                                   ^
> SyntaxError: invalid syntax
>
>
> This really looks to me like a problem with the python version.  Python
> 2.4 would throw this syntax error but Python 2.7 would not.  And yet I am
> using Python 2.7.8.  Is there any chance that Spark or Yarn is somehow
> using an older version of Python without my knowledge?
>
> Finally, when I try to run the same command in client mode...
> * ./bin/spark-submit      --master yarn     --deploy-mode client
> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
> ./examples/src/main/python/pi.py 10*
> I get the error I mentioned in the prior email:
> Error from python worker:
>   python: module pyspark.daemon not found
>
> Any thoughts?
>
> Best,
> Andrew
>
>
> On Mon, Jan 11, 2016 at 12:25 PM, Bryan Cutler <cu...@gmail.com> wrote:
>
> This could be an environment issue, could you give more details about the
> OS/architecture that you are using?  If you are sure everything is
> installed correctly on each node following the guide on "Running Spark on
> Yarn" http://spark.apache.org/docs/latest/running-on-yarn.html and that
> the spark assembly jar is reachable, then I would check to see if you can
> submit a local job to just run on one node.
>
> On Fri, Jan 8, 2016 at 5:22 PM, Andrew Weiner <
> andrewweiner2020@u.northwestern.edu> wrote:
>
> Now for simplicity I'm testing with wordcount.py from the provided
> examples, and using Spark 1.6.0
>
> The first error I get is:
>
> 16/01/08 19:14:46 ERROR lzo.GPLNativeCodeLoader: Could not load native gpl
> library
> java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path
>         at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1864)
>         at [....]
>
> A bit lower down, I see this error:
>
> 16/01/08 19:14:48 WARN scheduler.TaskSetManager: Lost task 0.0 in stage
> 0.0 (TID 0, mundonovo-priv): org.apache.spark.SparkException:
> Error from python worker:
>   python: module pyspark.daemon not found
> PYTHONPATH was:
>
> /scratch5/hadoop/yarn/local/usercache/<username>/filecache/22/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/jpr123/hg.pacific/python-common:/home/jpr123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/pyspark.zip:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/py4j-0.9-src.zip
> java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>         at [....]
>
> And then a few more similar pyspark.daemon not found errors...
>
> Andrew
>
>
>
> On Fri, Jan 8, 2016 at 2:31 PM, Bryan Cutler <cu...@gmail.com> wrote:
>
> Hi Andrew,
>
> I know that older versions of Spark could not run PySpark on YARN in
> cluster mode.  I'm not sure if that is fixed in 1.6.0 though.  Can you try
> setting deploy-mode option to "client" when calling spark-submit?
>
> Bryan
>
> On Thu, Jan 7, 2016 at 2:39 PM, weineran <
> andrewweiner2020@u.northwestern.edu> wrote:
>
> Hello,
>
> When I try to submit a python job using spark-submit (using --master yarn
> --deploy-mode cluster), I get the following error:
>
> /Traceback (most recent call last):
>   File "loss_rate_by_probe.py", line 15, in ?
>     from pyspark import SparkContext
>   File
>
> "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/__init__.py",
> line 41, in ?
>   File
>
> "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/context.py",
> line 219
>     with SparkContext._lock:
>                     ^
> SyntaxError: invalid syntax/
>
> This is very similar to  this post from 2014
> <
> http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-lock-Error-td18233.html
> >
> , but unlike that person I am using Python 2.7.8.
>
> Here is what I'm using:
> Spark 1.3.1
> Hadoop 2.4.0.2.1.5.0-695
> Python 2.7.8
>
> Another clue:  I also installed Spark 1.6.0 and tried to submit the same
> job.  I got a similar error:
>
> /Traceback (most recent call last):
>   File "loss_rate_by_probe.py", line 15, in ?
>     from pyspark import SparkContext
>   File
>
> "/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0119/container_1450370639491_0119_01_000001/pyspark.zip/pyspark/__init__.py",
> line 61
>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>                                                   ^
> SyntaxError: invalid syntax/
>
> Any thoughts?
>
> Andrew
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-SyntaxError-invalid-syntax-tp25910.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>

Re: SparkContext SyntaxError: invalid syntax

Posted by Felix Cheung <fe...@hotmail.com>.
I have to run this to install the prerequisites to get the jekyll build to work; you do need the python pygments package.
(I'm on Ubuntu)
sudo apt-get install ruby ruby-dev make gcc nodejs
sudo gem install jekyll --no-rdoc --no-ri
sudo gem install jekyll-redirect-from
sudo apt-get install python-Pygments
sudo apt-get install python-sphinx
sudo gem install pygments.rb
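Once those packages are in place, the docs are typically built from the docs directory with jekyll; a minimal sketch, assuming the Spark source layout of that era (SKIP_API=1 just skips generating the API docs):

cd spark/docs
SKIP_API=1 jekyll build
# or, to preview the site locally while editing:
SKIP_API=1 jekyll serve --watch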

Hope that helps! If not, I can try putting together the doc change, but I’d rather you could make progress :)





On Mon, Jan 18, 2016 at 6:36 AM -0800, "Andrew Weiner" <an...@u.northwestern.edu> wrote:





Hi Felix,

Yeah, when I try to build the docs using jekyll build, I get a LoadError
(cannot load such file -- pygments) and I'm having trouble getting past it
at the moment.

From what I could tell, this does not apply to YARN in client mode.  I was
able to submit jobs in client mode and they would run fine without using
the appMasterEnv property.  I even confirmed that my environment variables
persisted during the job when run in client mode.  There is something about
YARN cluster mode that uses a different environment (the YARN Application
Master environment) and requires the appMasterEnv property for setting
environment variables.
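A minimal sketch of that workaround, assuming a Spark 1.6 layout (the python path below is only an illustration; point it at whatever interpreter is actually installed on the cluster nodes):

# in conf/spark-defaults.conf
spark.yarn.appMasterEnv.PYSPARK_PYTHON   /usr/local/bin/python2.7

# then submit as before
./bin/spark-submit --master yarn --deploy-mode cluster ./examples/src/main/python/pi.py 10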

On Sun, Jan 17, 2016 at 11:37 PM, Felix Cheung <fe...@hotmail.com>
wrote:

> Do you still need help on the PR?
> btw, does this apply to YARN client mode?
>
> ------------------------------
> From: andrewweiner2020@u.northwestern.edu
> Date: Sun, 17 Jan 2016 17:00:39 -0600
> Subject: Re: SparkContext SyntaxError: invalid syntax
> To: cutlerb@gmail.com
> CC: user@spark.apache.org
>
>
> Yeah, I do think it would be worth explicitly stating this in the docs.  I
> was going to try to edit the docs myself and submit a pull request, but I'm
> having trouble building the docs from github.  If anyone else wants to do
> this, here is approximately what I would say:
>
> (To be added to
> http://spark.apache.org/docs/latest/configuration.html#environment-variables
> )
> "Note: When running Spark on YARN in cluster mode, environment variables
> need to be set using the spark.yarn.appMasterEnv.[EnvironmentVariableName]
> property in your conf/spark-defaults.conf file.  Environment variables
> that are set in spark-env.sh will not be reflected in the YARN
> Application Master process in cluster mode.  See the YARN-related Spark
> Properties
> <http://spark.apache.org/docs/latest/running-on-yarn.html#spark-properties>
> for more information."
>
> I might take another crack at building the docs myself if nobody beats me
> to this.
>
> Andrew
>
>
> On Fri, Jan 15, 2016 at 5:01 PM, Bryan Cutler <cu...@gmail.com> wrote:
>
> Glad you got it going!  It wasn't very obvious what needed to be set,
> maybe it is worth explicitly stating this in the docs since it seems to
> have come up a couple times before too.
>
> Bryan
>
> On Fri, Jan 15, 2016 at 12:33 PM, Andrew Weiner <
> andrewweiner2020@u.northwestern.edu> wrote:
>
> Actually, I just found this [
> https://issues.apache.org/jira/browse/SPARK-1680], which after a bit of
> googling and reading leads me to believe that the preferred way to change
> the yarn environment is to edit the spark-defaults.conf file by adding this
> line:
> spark.yarn.appMasterEnv.PYSPARK_PYTHON    /path/to/python
>
> While both this solution and the solution from my prior email work, I
> believe this is the preferred solution.
>
> Sorry for the flurry of emails.  Again, thanks for all the help!
>
> Andrew
>
> On Fri, Jan 15, 2016 at 1:47 PM, Andrew Weiner <
> andrewweiner2020@u.northwestern.edu> wrote:
>
> I finally got the pi.py example to run in yarn cluster mode.  This was the
> key insight:
> https://issues.apache.org/jira/browse/SPARK-9229
>
> I had to set SPARK_YARN_USER_ENV in spark-env.sh:
> export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/home/aqualab/local/bin/python"
>
> This caused the PYSPARK_PYTHON environment variable to be used in my yarn
> environment in cluster mode.
>
> Thank you for all your help!
>
> Best,
> Andrew
>
>
>
> On Fri, Jan 15, 2016 at 12:57 PM, Andrew Weiner <
> andrewweiner2020@u.northwestern.edu> wrote:
>
> I tried playing around with my environment variables, and here is an
> update.
>
> When I run in cluster mode, my environment variables do not persist
> throughout the entire job.
> For example, I tried creating a local copy of HADOOP_CONF_DIR in
> /home/<username>/local/etc/hadoop/conf, and then, in spark-env.sh I set the
> variable:
> export HADOOP_CONF_DIR=/home/<username>/local/etc/hadoop/conf
>
> Later, when we print the environment variables in the python code, I see
> this:
>
> ('HADOOP_CONF_DIR', '/etc/hadoop/conf')
>
> However, when I run in client mode, I see this:
>
> ('HADOOP_CONF_DIR', '/home/awp066/local/etc/hadoop/conf')
>
> Furthermore, if I omit that environment variable from spark-env.sh altogether, I get the expected error in both client and cluster mode:
>
> When running with master 'yarn'
> either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
>
> This suggests that my environment variables are being used when I first submit the job, but at some point during the job, my environment variables are thrown out and someone's (yarn's?) environment variables are being used.
>
> Andrew
>
>
> On Fri, Jan 15, 2016 at 11:03 AM, Andrew Weiner <
> andrewweiner2020@u.northwestern.edu> wrote:
>
> Indeed!  Here is the output when I run in cluster mode:
>
> Traceback (most recent call last):
>   File "pi.py", line 22, in ?
>     raise RuntimeError("\n"+str(sys.version_info) +"\n"+
> RuntimeError:
> (2, 4, 3, 'final', 0)
> [('PYSPARK_GATEWAY_PORT', '48079'), ('PYTHONPATH', '/scratch2/hadoop/yarn/local/usercache/<username>/filecache/116/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/<user>/spark-1.6.0-bin-hadoop2.4/python:/home/<username>/code/libs:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0239/container_1450370639491_0239_01_000001/pyspark.zip:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0239/container_1450370639491_0239_01_000001/py4j-0.9-src.zip'), ('PYTHONUNBUFFERED', 'YES')]
>
> As we suspected, it is using Python 2.4
>
> One thing that surprises me is that PYSPARK_PYTHON is not showing up in the list, even though I am setting it and exporting it in spark-submit *and* in spark-env.sh.  Is there somewhere else I need to set this variable?  Maybe in one of the hadoop conf files in my HADOOP_CONF_DIR?
>
> Andrew
>
>
>
> On Thu, Jan 14, 2016 at 1:14 PM, Bryan Cutler <cu...@gmail.com> wrote:
>
> It seems like it could be the case that some other Python version is being
> invoked.  To make sure, can you add something like this to the top of the
> .py file you are submitting to get some more info about how the application
> master is configured?
>
> import sys, os
> raise RuntimeError("\n"+str(sys.version_info) +"\n"+
>     str([(k,os.environ[k]) for k in os.environ if "PY" in k]))
>
> On Thu, Jan 14, 2016 at 8:37 AM, Andrew Weiner <
> andrewweiner2020@u.northwestern.edu> wrote:
>
> Hi Bryan,
>
> I ran "$> python --version" on every node on the cluster, and it is Python
> 2.7.8 for every single one.
>
> When I try to submit the Python example in client mode
> * ./bin/spark-submit      --master yarn     --deploy-mode client
> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
> ./examples/src/main/python/pi.py     10*
> That's when I get this error that I mentioned:
>
> 16/01/14 10:09:10 WARN scheduler.TaskSetManager: Lost task 0.0 in stage
> 0.0 (TID 0, mundonovo-priv): org.apache.spark.SparkException:
> Error from python worker:
>   python: module pyspark.daemon not found
> PYTHONPATH was:
>
> /scratch5/hadoop/yarn/local/usercache/<username>/filecache/48/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/aqualab/spark-1.6.0-bin-hadoop2.4/python:/home/jpr123/hg.pacific/python-common:/home/jp
>
> r123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/home/<username>/code/libs:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0187/container_1450370639491_0187_01_000002/pyspark.zip:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0187/container_1450370639491_0187_01_000002/py4j-0.9-src.zip
> java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>         at
> org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:164)
>         at [....]
>
> followed by several more similar errors that also say:
> Error from python worker:
>   python: module pyspark.daemon not found
>
>
> Even though the default python appeared to be correct, I just went ahead
> and explicitly set PYSPARK_PYTHON in conf/spark-env.sh to the path of the
> default python binary executable.  After making this change I was able to
> run the job successfully in client!  That is, this appeared to fix the
> "pyspark.daemon not found" error when running in client mode.
>
> However, when running in cluster mode, I am still getting the same syntax
> error:
>
> Traceback (most recent call last):
>   File "pi.py", line 24, in ?
>     from pyspark import SparkContext
>   File "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py", line 61
>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>                                                   ^
> SyntaxError: invalid syntax
>
> Is it possible that the PYSPARK_PYTHON environment variable is ignored when jobs are submitted in cluster mode?  It seems that Spark or Yarn is going behind my back, so to speak, and using some older version of python I didn't even know was installed.
>
> Thanks again for all your help thus far.  We are getting close....
>
> Andrew
>
>
>
> On Wed, Jan 13, 2016 at 6:13 PM, Bryan Cutler <cu...@gmail.com> wrote:
>
> Hi Andrew,
>
> There are a couple of things to check.  First, is Python 2.7 the default
> version on all nodes in the cluster or is it an alternate install? Meaning
> what is the output of this command "$>  python --version"  If it is an
> alternate install, you could set the environment variable "PYSPARK_PYTHON"
> Python binary executable to use for PySpark in both driver and workers
> (default is python).
>
> Did you try to submit the Python example under client mode?  Otherwise,
> the command looks fine, you don't use the --class option for submitting
> python files
> * ./bin/spark-submit      --master yarn     --deploy-mode client
> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
> ./examples/src/main/python/pi.py     10*
>
> That is a good sign that local jobs and Java examples work, probably just
> a small configuration issue :)
>
> Bryan
>
> On Wed, Jan 13, 2016 at 3:51 PM, Andrew Weiner <
> andrewweiner2020@u.northwestern.edu> wrote:
>
> Thanks for your continuing help.  Here is some additional info.
>
> *OS/architecture*
> output of *cat /proc/version*:
> Linux version 2.6.18-400.1.1.el5 (mockbuild@x86-012.build.bos.redhat.com)
>
> output of *lsb_release -a*:
> LSB Version:
>  :core-4.0-amd64:core-4.0-ia32:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-ia32:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-ia32:printing-4.0-noarch
> Distributor ID: RedHatEnterpriseServer
> Description:    Red Hat Enterprise Linux Server release 5.11 (Tikanga)
> Release:        5.11
> Codename:       Tikanga
>
> *Running a local job*
> I have confirmed that I can successfully run python jobs using
> bin/spark-submit --master local[*]
> Specifically, this is the command I am using:
> *./bin/spark-submit --master local[8]
> ./examples/src/main/python/wordcount.py
> file:/home/<username>/spark-1.6.0-bin-hadoop2.4/README.md*
> And it works!
>
> *Additional info*
> I am also able to successfully run the Java SparkPi example using yarn in
> cluster mode using this command:
> * ./bin/spark-submit --class org.apache.spark.examples.SparkPi
> --master yarn     --deploy-mode cluster     --driver-memory 4g
> --executor-memory 2g     --executor-cores 1     lib/spark-examples*.jar
> 10*
> This Java job also runs successfully when I change --deploy-mode to
> client.  The fact that I can run Java jobs in cluster mode makes me think
> that everything is installed correctly--is that a valid assumption?
>
> The problem remains that I cannot submit python jobs.  Here is the command
> that I am using to try to submit python jobs:
> * ./bin/spark-submit      --master yarn     --deploy-mode cluster
> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
> ./examples/src/main/python/pi.py     10*
> Does that look like a correct command?  I wasn't sure what to put for
> --class so I omitted it.  At any rate, the result of the above command is a
> syntax error, similar to the one I posted in the original email:
>
> Traceback (most recent call last):
>   File "pi.py", line 24, in ?
>     from pyspark import SparkContext
>   File "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py", line 61
>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>                                                   ^
> SyntaxError: invalid syntax
>
>
> This really looks to me like a problem with the python version.  Python
> 2.4 would throw this syntax error but Python 2.7 would not.  And yet I am
> using Python 2.7.8.  Is there any chance that Spark or Yarn is somehow
> using an older version of Python without my knowledge?
>
> Finally, when I try to run the same command in client mode...
> * ./bin/spark-submit      --master yarn     --deploy-mode client
> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
> ./examples/src/main/python/pi.py 10*
> I get the error I mentioned in the prior email:
> Error from python worker:
>   python: module pyspark.daemon not found
>
> Any thoughts?
>
> Best,
> Andrew
>
>
> On Mon, Jan 11, 2016 at 12:25 PM, Bryan Cutler <cu...@gmail.com> wrote:
>
> This could be an environment issue, could you give more details about the
> OS/architecture that you are using?  If you are sure everything is
> installed correctly on each node following the guide on "Running Spark on
> Yarn" http://spark.apache.org/docs/latest/running-on-yarn.html and that
> the spark assembly jar is reachable, then I would check to see if you can
> submit a local job to just run on one node.
>
> On Fri, Jan 8, 2016 at 5:22 PM, Andrew Weiner <
> andrewweiner2020@u.northwestern.edu> wrote:
>
> Now for simplicity I'm testing with wordcount.py from the provided
> examples, and using Spark 1.6.0
>
> The first error I get is:
>
> 16/01/08 19:14:46 ERROR lzo.GPLNativeCodeLoader: Could not load native gpl
> library
> java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path
>         at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1864)
>         at [....]
>
> A bit lower down, I see this error:
>
> 16/01/08 19:14:48 WARN scheduler.TaskSetManager: Lost task 0.0 in stage
> 0.0 (TID 0, mundonovo-priv): org.apache.spark.SparkException:
> Error from python worker:
>   python: module pyspark.daemon not found
> PYTHONPATH was:
>
> /scratch5/hadoop/yarn/local/usercache/<username>/filecache/22/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/jpr123/hg.pacific/python-common:/home/jpr123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/pyspark.zip:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/py4j-0.9-src.zip
> java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>         at [....]
>
> And then a few more similar pyspark.daemon not found errors...
>
> Andrew
>
>
>
> On Fri, Jan 8, 2016 at 2:31 PM, Bryan Cutler <cu...@gmail.com> wrote:
>
> Hi Andrew,
>
> I know that older versions of Spark could not run PySpark on YARN in
> cluster mode.  I'm not sure if that is fixed in 1.6.0 though.  Can you try
> setting deploy-mode option to "client" when calling spark-submit?
>
> Bryan
>
> On Thu, Jan 7, 2016 at 2:39 PM, weineran <
> andrewweiner2020@u.northwestern.edu> wrote:
>
> Hello,
>
> When I try to submit a python job using spark-submit (using --master yarn
> --deploy-mode cluster), I get the following error:
>
> /Traceback (most recent call last):
>   File "loss_rate_by_probe.py", line 15, in ?
>     from pyspark import SparkContext
>   File
>
> "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/__init__.py",
> line 41, in ?
>   File
>
> "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/context.py",
> line 219
>     with SparkContext._lock:
>                     ^
> SyntaxError: invalid syntax/
>
> This is very similar to  this post from 2014
> <
> http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-lock-Error-td18233.html
> >
> , but unlike that person I am using Python 2.7.8.
>
> Here is what I'm using:
> Spark 1.3.1
> Hadoop 2.4.0.2.1.5.0-695
> Python 2.7.8
>
> Another clue:  I also installed Spark 1.6.0 and tried to submit the same
> job.  I got a similar error:
>
> /Traceback (most recent call last):
>   File "loss_rate_by_probe.py", line 15, in ?
>     from pyspark import SparkContext
>   File
>
> "/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0119/container_1450370639491_0119_01_000001/pyspark.zip/pyspark/__init__.py",
> line 61
>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>                                                   ^
> SyntaxError: invalid syntax/
>
> Any thoughts?
>
> Andrew
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-SyntaxError-invalid-syntax-tp25910.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>

Re: SparkContext SyntaxError: invalid syntax

Posted by Andrew Weiner <an...@u.northwestern.edu>.
Hi Felix,

Yeah, when I try to build the docs using jekyll build, I get a LoadError
(cannot load such file -- pygments) and I'm having trouble getting past it
at the moment.

From what I could tell, this does not apply to YARN in client mode.  I was
able to submit jobs in client mode and they would run fine without using
the appMasterEnv property.  I even confirmed that my environment variables
persisted during the job when run in client mode.  There is something about
YARN cluster mode that uses a different environment (the YARN Application
Master environment) and requires the appMasterEnv property for setting
environment variables.

On Sun, Jan 17, 2016 at 11:37 PM, Felix Cheung <fe...@hotmail.com>
wrote:

> Do you still need help on the PR?
> btw, does this apply to YARN client mode?
>
> ------------------------------
> From: andrewweiner2020@u.northwestern.edu
> Date: Sun, 17 Jan 2016 17:00:39 -0600
> Subject: Re: SparkContext SyntaxError: invalid syntax
> To: cutlerb@gmail.com
> CC: user@spark.apache.org
>
>
> Yeah, I do think it would be worth explicitly stating this in the docs.  I
> was going to try to edit the docs myself and submit a pull request, but I'm
> having trouble building the docs from github.  If anyone else wants to do
> this, here is approximately what I would say:
>
> (To be added to
> http://spark.apache.org/docs/latest/configuration.html#environment-variables
> )
> "Note: When running Spark on YARN in cluster mode, environment variables
> need to be set using the spark.yarn.appMasterEnv.[EnvironmentVariableName]
> property in your conf/spark-defaults.conf file.  Environment variables
> that are set in spark-env.sh will not be reflected in the YARN
> Application Master process in cluster mode.  See the YARN-related Spark
> Properties
> <http://spark.apache.org/docs/latest/running-on-yarn.html#spark-properties>
> for more information."
>
> I might take another crack at building the docs myself if nobody beats me
> to this.
>
> Andrew
>
>
> On Fri, Jan 15, 2016 at 5:01 PM, Bryan Cutler <cu...@gmail.com> wrote:
>
> Glad you got it going!  It wasn't very obvious what needed to be set,
> maybe it is worth explicitly stating this in the docs since it seems to
> have come up a couple times before too.
>
> Bryan
>
> On Fri, Jan 15, 2016 at 12:33 PM, Andrew Weiner <
> andrewweiner2020@u.northwestern.edu> wrote:
>
> Actually, I just found this [
> https://issues.apache.org/jira/browse/SPARK-1680], which after a bit of
> googling and reading leads me to believe that the preferred way to change
> the yarn environment is to edit the spark-defaults.conf file by adding this
> line:
> spark.yarn.appMasterEnv.PYSPARK_PYTHON    /path/to/python
>
> While both this solution and the solution from my prior email work, I
> believe this is the preferred solution.
>
> Sorry for the flurry of emails.  Again, thanks for all the help!
>
> Andrew
>
> On Fri, Jan 15, 2016 at 1:47 PM, Andrew Weiner <
> andrewweiner2020@u.northwestern.edu> wrote:
>
> I finally got the pi.py example to run in yarn cluster mode.  This was the
> key insight:
> https://issues.apache.org/jira/browse/SPARK-9229
>
> I had to set SPARK_YARN_USER_ENV in spark-env.sh:
> export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/home/aqualab/local/bin/python"
>
> This caused the PYSPARK_PYTHON environment variable to be used in my yarn
> environment in cluster mode.
>
> Thank you for all your help!
>
> Best,
> Andrew
>
>
>
> On Fri, Jan 15, 2016 at 12:57 PM, Andrew Weiner <
> andrewweiner2020@u.northwestern.edu> wrote:
>
> I tried playing around with my environment variables, and here is an
> update.
>
> When I run in cluster mode, my environment variables do not persist
> throughout the entire job.
> For example, I tried creating a local copy of HADOOP_CONF_DIR in
> /home/<username>/local/etc/hadoop/conf, and then, in spark-env.sh I set the
> variable:
> export HADOOP_CONF_DIR=/home/<username>/local/etc/hadoop/conf
>
> Later, when we print the environment variables in the python code, I see
> this:
>
> ('HADOOP_CONF_DIR', '/etc/hadoop/conf')
>
> However, when I run in client mode, I see this:
>
> ('HADOOP_CONF_DIR', '/home/awp066/local/etc/hadoop/conf')
>
> Furthermore, if I omit that environment variable from spark-env.sh altogether, I get the expected error in both client and cluster mode:
>
> When running with master 'yarn'
> either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
>
> This suggests that my environment variables are being used when I first submit the job, but at some point during the job, my environment variables are thrown out and someone's (yarn's?) environment variables are being used.
>
> Andrew
>
>
> On Fri, Jan 15, 2016 at 11:03 AM, Andrew Weiner <
> andrewweiner2020@u.northwestern.edu> wrote:
>
> Indeed!  Here is the output when I run in cluster mode:
>
> Traceback (most recent call last):
>   File "pi.py", line 22, in ?
>     raise RuntimeError("\n"+str(sys.version_info) +"\n"+
> RuntimeError:
> (2, 4, 3, 'final', 0)
> [('PYSPARK_GATEWAY_PORT', '48079'), ('PYTHONPATH', '/scratch2/hadoop/yarn/local/usercache/<username>/filecache/116/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/<user>/spark-1.6.0-bin-hadoop2.4/python:/home/<username>/code/libs:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0239/container_1450370639491_0239_01_000001/pyspark.zip:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0239/container_1450370639491_0239_01_000001/py4j-0.9-src.zip'), ('PYTHONUNBUFFERED', 'YES')]
>
> As we suspected, it is using Python 2.4
>
> One thing that surprises me is that PYSPARK_PYTHON is not showing up in the list, even though I am setting it and exporting it in spark-submit *and* in spark-env.sh.  Is there somewhere else I need to set this variable?  Maybe in one of the hadoop conf files in my HADOOP_CONF_DIR?
>
> Andrew
>
>
>
> On Thu, Jan 14, 2016 at 1:14 PM, Bryan Cutler <cu...@gmail.com> wrote:
>
> It seems like it could be the case that some other Python version is being
> invoked.  To make sure, can you add something like this to the top of the
> .py file you are submitting to get some more info about how the application
> master is configured?
>
> import sys, os
> raise RuntimeError("\n"+str(sys.version_info) +"\n"+
>     str([(k,os.environ[k]) for k in os.environ if "PY" in k]))
>
> On Thu, Jan 14, 2016 at 8:37 AM, Andrew Weiner <
> andrewweiner2020@u.northwestern.edu> wrote:
>
> Hi Bryan,
>
> I ran "$> python --version" on every node on the cluster, and it is Python
> 2.7.8 for every single one.
>
> When I try to submit the Python example in client mode
> * ./bin/spark-submit      --master yarn     --deploy-mode client
> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
> ./examples/src/main/python/pi.py     10*
> That's when I get this error that I mentioned:
>
> 16/01/14 10:09:10 WARN scheduler.TaskSetManager: Lost task 0.0 in stage
> 0.0 (TID 0, mundonovo-priv): org.apache.spark.SparkException:
> Error from python worker:
>   python: module pyspark.daemon not found
> PYTHONPATH was:
>
> /scratch5/hadoop/yarn/local/usercache/<username>/filecache/48/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/aqualab/spark-1.6.0-bin-hadoop2.4/python:/home/jpr123/hg.pacific/python-common:/home/jp
>
> r123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/home/<username>/code/libs:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0187/container_1450370639491_0187_01_000002/pyspark.zip:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0187/container_1450370639491_0187_01_000002/py4j-0.9-src.zip
> java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>         at
> org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:164)
>         at [....]
>
> followed by several more similar errors that also say:
> Error from python worker:
>   python: module pyspark.daemon not found
>
>
> Even though the default python appeared to be correct, I just went ahead
> and explicitly set PYSPARK_PYTHON in conf/spark-env.sh to the path of the
> default python binary executable.  After making this change I was able to
> run the job successfully in client!  That is, this appeared to fix the
> "pyspark.daemon not found" error when running in client mode.
>
> However, when running in cluster mode, I am still getting the same syntax
> error:
>
> Traceback (most recent call last):
>   File "pi.py", line 24, in ?
>     from pyspark import SparkContext
>   File "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py", line 61
>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>                                                   ^
> SyntaxError: invalid syntax
>
> Is it possible that the PYSPARK_PYTHON environment variable is ignored when jobs are submitted in cluster mode?  It seems that Spark or Yarn is going behind my back, so to speak, and using some older version of python I didn't even know was installed.
>
> Thanks again for all your help thus far.  We are getting close....
>
> Andrew
>
>
>
> On Wed, Jan 13, 2016 at 6:13 PM, Bryan Cutler <cu...@gmail.com> wrote:
>
> Hi Andrew,
>
> There are a couple of things to check.  First, is Python 2.7 the default
> version on all nodes in the cluster or is it an alternate install? Meaning
> what is the output of this command "$>  python --version"  If it is an
> alternate install, you could set the environment variable "PYSPARK_PYTHON"
> Python binary executable to use for PySpark in both driver and workers
> (default is python).
>
> Did you try to submit the Python example under client mode?  Otherwise,
> the command looks fine, you don't use the --class option for submitting
> python files
> * ./bin/spark-submit      --master yarn     --deploy-mode client
> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
> ./examples/src/main/python/pi.py     10*
>
> That is a good sign that local jobs and Java examples work, probably just
> a small configuration issue :)
>
> Bryan
>
> On Wed, Jan 13, 2016 at 3:51 PM, Andrew Weiner <
> andrewweiner2020@u.northwestern.edu> wrote:
>
> Thanks for your continuing help.  Here is some additional info.
>
> *OS/architecture*
> output of *cat /proc/version*:
> Linux version 2.6.18-400.1.1.el5 (mockbuild@x86-012.build.bos.redhat.com)
>
> output of *lsb_release -a*:
> LSB Version:
>  :core-4.0-amd64:core-4.0-ia32:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-ia32:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-ia32:printing-4.0-noarch
> Distributor ID: RedHatEnterpriseServer
> Description:    Red Hat Enterprise Linux Server release 5.11 (Tikanga)
> Release:        5.11
> Codename:       Tikanga
>
> *Running a local job*
> I have confirmed that I can successfully run python jobs using
> bin/spark-submit --master local[*]
> Specifically, this is the command I am using:
> *./bin/spark-submit --master local[8]
> ./examples/src/main/python/wordcount.py
> file:/home/<username>/spark-1.6.0-bin-hadoop2.4/README.md*
> And it works!
>
> *Additional info*
> I am also able to successfully run the Java SparkPi example using yarn in
> cluster mode using this command:
> * ./bin/spark-submit --class org.apache.spark.examples.SparkPi
> --master yarn     --deploy-mode cluster     --driver-memory 4g
> --executor-memory 2g     --executor-cores 1     lib/spark-examples*.jar
> 10*
> This Java job also runs successfully when I change --deploy-mode to
> client.  The fact that I can run Java jobs in cluster mode makes me think
> that everything is installed correctly--is that a valid assumption?
>
> The problem remains that I cannot submit python jobs.  Here is the command
> that I am using to try to submit python jobs:
> * ./bin/spark-submit      --master yarn     --deploy-mode cluster
> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
> ./examples/src/main/python/pi.py     10*
> Does that look like a correct command?  I wasn't sure what to put for
> --class so I omitted it.  At any rate, the result of the above command is a
> syntax error, similar to the one I posted in the original email:
>
> Traceback (most recent call last):
>   File "pi.py", line 24, in ?
>     from pyspark import SparkContext
>   File "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py", line 61
>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>                                                   ^
> SyntaxError: invalid syntax
>
>
> This really looks to me like a problem with the python version.  Python
> 2.4 would throw this syntax error but Python 2.7 would not.  And yet I am
> using Python 2.7.8.  Is there any chance that Spark or Yarn is somehow
> using an older version of Python without my knowledge?
>
> Finally, when I try to run the same command in client mode...
> * ./bin/spark-submit      --master yarn     --deploy-mode client
> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
> ./examples/src/main/python/pi.py 10*
> I get the error I mentioned in the prior email:
> Error from python worker:
>   python: module pyspark.daemon not found
>
> Any thoughts?
>
> Best,
> Andrew
>
>
> On Mon, Jan 11, 2016 at 12:25 PM, Bryan Cutler <cu...@gmail.com> wrote:
>
> This could be an environment issue, could you give more details about the
> OS/architecture that you are using?  If you are sure everything is
> installed correctly on each node following the guide on "Running Spark on
> Yarn" http://spark.apache.org/docs/latest/running-on-yarn.html and that
> the spark assembly jar is reachable, then I would check to see if you can
> submit a local job to just run on one node.
>
> On Fri, Jan 8, 2016 at 5:22 PM, Andrew Weiner <
> andrewweiner2020@u.northwestern.edu> wrote:
>
> Now for simplicity I'm testing with wordcount.py from the provided
> examples, and using Spark 1.6.0
>
> The first error I get is:
>
> 16/01/08 19:14:46 ERROR lzo.GPLNativeCodeLoader: Could not load native gpl
> library
> java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path
>         at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1864)
>         at [....]
>
> A bit lower down, I see this error:
>
> 16/01/08 19:14:48 WARN scheduler.TaskSetManager: Lost task 0.0 in stage
> 0.0 (TID 0, mundonovo-priv): org.apache.spark.SparkException:
> Error from python worker:
>   python: module pyspark.daemon not found
> PYTHONPATH was:
>
> /scratch5/hadoop/yarn/local/usercache/<username>/filecache/22/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/jpr123/hg.pacific/python-common:/home/jpr123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/pyspark.zip:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/py4j-0.9-src.zip
> java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>         at [....]
>
> And then a few more similar pyspark.daemon not found errors...
>
> Andrew
>
>
>
> On Fri, Jan 8, 2016 at 2:31 PM, Bryan Cutler <cu...@gmail.com> wrote:
>
> Hi Andrew,
>
> I know that older versions of Spark could not run PySpark on YARN in
> cluster mode.  I'm not sure if that is fixed in 1.6.0 though.  Can you try
> setting deploy-mode option to "client" when calling spark-submit?
>
> Bryan
>
> On Thu, Jan 7, 2016 at 2:39 PM, weineran <
> andrewweiner2020@u.northwestern.edu> wrote:
>
> Hello,
>
> When I try to submit a python job using spark-submit (using --master yarn
> --deploy-mode cluster), I get the following error:
>
> /Traceback (most recent call last):
>   File "loss_rate_by_probe.py", line 15, in ?
>     from pyspark import SparkContext
>   File
>
> "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/__init__.py",
> line 41, in ?
>   File
>
> "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/context.py",
> line 219
>     with SparkContext._lock:
>                     ^
> SyntaxError: invalid syntax/
>
> This is very similar to  this post from 2014
> <
> http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-lock-Error-td18233.html
> >
> , but unlike that person I am using Python 2.7.8.
>
> Here is what I'm using:
> Spark 1.3.1
> Hadoop 2.4.0.2.1.5.0-695
> Python 2.7.8
>
> Another clue:  I also installed Spark 1.6.0 and tried to submit the same
> job.  I got a similar error:
>
> /Traceback (most recent call last):
>   File "loss_rate_by_probe.py", line 15, in ?
>     from pyspark import SparkContext
>   File
>
> "/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0119/container_1450370639491_0119_01_000001/pyspark.zip/pyspark/__init__.py",
> line 61
>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>                                                   ^
> SyntaxError: invalid syntax/
>
> Any thoughts?
>
> Andrew
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-SyntaxError-invalid-syntax-tp25910.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>

RE: SparkContext SyntaxError: invalid syntax

Posted by Felix Cheung <fe...@hotmail.com>.
Do you still need help on the PR?
btw, does this apply to YARN client mode?
 
From: andrewweiner2020@u.northwestern.edu
Date: Sun, 17 Jan 2016 17:00:39 -0600
Subject: Re: SparkContext SyntaxError: invalid syntax
To: cutlerb@gmail.com
CC: user@spark.apache.org

Yeah, I do think it would be worth explicitly stating this in the docs.  I was going to try to edit the docs myself and submit a pull request, but I'm having trouble building the docs from github.  If anyone else wants to do this, here is approximately what I would say:
(To be added to http://spark.apache.org/docs/latest/configuration.html#environment-variables)
"Note: When running Spark on YARN in cluster mode, environment variables need to be set using the spark.yarn.appMasterEnv.[EnvironmentVariableName] property in your conf/spark-defaults.conf file.  Environment variables that are set in spark-env.sh will not be reflected in the YARN Application Master process in cluster mode.  See the YARN-related Spark Properties for more information."
I might take another crack at building the docs myself if nobody beats me to this.
Andrew

On Fri, Jan 15, 2016 at 5:01 PM, Bryan Cutler <cu...@gmail.com> wrote:
Glad you got it going!  It wasn't very obvious what needed to be set, maybe it is worth explicitly stating this in the docs since it seems to have come up a couple times before too.
Bryan
On Fri, Jan 15, 2016 at 12:33 PM, Andrew Weiner <an...@u.northwestern.edu> wrote:
Actually, I just found this [https://issues.apache.org/jira/browse/SPARK-1680], which after a bit of googling and reading leads me to believe that the preferred way to change the yarn environment is to edit the spark-defaults.conf file by adding this line:
spark.yarn.appMasterEnv.PYSPARK_PYTHON    /path/to/python

While both this solution and the solution from my prior email work, I believe this is the preferred solution.
Sorry for the flurry of emails.  Again, thanks for all the help!
Andrew
On Fri, Jan 15, 2016 at 1:47 PM, Andrew Weiner <an...@u.northwestern.edu> wrote:
I finally got the pi.py example to run in yarn cluster mode.  This was the key insight:
https://issues.apache.org/jira/browse/SPARK-9229

I had to set SPARK_YARN_USER_ENV in spark-env.sh:
export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/home/aqualab/local/bin/python"
This caused the PYSPARK_PYTHON environment variable to be used in my yarn environment in cluster mode.
Thank you for all your help!
Best,
Andrew


On Fri, Jan 15, 2016 at 12:57 PM, Andrew Weiner <an...@u.northwestern.edu> wrote:
I tried playing around with my environment variables, and here is an update.
When I run in cluster mode, my environment variables do not persist throughout the entire job.
For example, I tried creating a local copy of HADOOP_CONF_DIR in /home/<username>/local/etc/hadoop/conf, and then, in spark-env.sh I set the variable:
export HADOOP_CONF_DIR=/home/<username>/local/etc/hadoop/conf
Later, when we print the environment variables in the python code, I see this:
('HADOOP_CONF_DIR', '/etc/hadoop/conf')
However, when I run in client mode, I see this:
('HADOOP_CONF_DIR', '/home/awp066/local/etc/hadoop/conf')
Furthermore, if I omit that environment variable from spark-env.sh altogether, I get the expected error in both client and cluster mode:
When running with master 'yarn'
either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
This suggests that my environment variables are being used when I first submit the job, but at some point during the job, my environment variables are thrown out and someone's (yarn's?) environment variables are being used.
Andrew
On Fri, Jan 15, 2016 at 11:03 AM, Andrew Weiner <an...@u.northwestern.edu> wrote:
Indeed!  Here is the output when I run in cluster mode:
Traceback (most recent call last):
  File "pi.py", line 22, in ?
    raise RuntimeError("\n"+str(sys.version_info) +"\n"+
RuntimeError:
(2, 4, 3, 'final', 0)
[('PYSPARK_GATEWAY_PORT', '48079'), ('PYTHONPATH', '/scratch2/hadoop/yarn/local/usercache/<username>/filecache/116/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/<user>/spark-1.6.0-bin-hadoop2.4/python:/home/<username>/code/libs:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0239/container_1450370639491_0239_01_000001/pyspark.zip:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0239/container_1450370639491_0239_01_000001/py4j-0.9-src.zip'), ('PYTHONUNBUFFERED', 'YES')]
As we suspected, it is using Python 2.4
One thing that surprises me is that PYSPARK_PYTHON is not showing up in the list, even though I am setting it and exporting it in spark-submit and in spark-env.sh.  Is there somewhere else I need to set this variable?  Maybe in one of the hadoop conf files in my HADOOP_CONF_DIR?
Andrew

On Thu, Jan 14, 2016 at 1:14 PM, Bryan Cutler <cu...@gmail.com> wrote:
It seems like it could be the case that some other Python version is being invoked.  To make sure, can you add something like this to the top of the .py file you are submitting to get some more info about how the application master is configured?

import sys, os
raise RuntimeError("\n"+str(sys.version_info) +"\n"+ 
    str([(k,os.environ[k]) for k in os.environ if "PY" in k]))

On Thu, Jan 14, 2016 at 8:37 AM, Andrew Weiner <an...@u.northwestern.edu> wrote:
Hi Bryan,
I ran "$> python --version" on every node on the cluster, and it is Python 2.7.8 for every single one.
When I try to submit the Python example in client mode
 ./bin/spark-submit      --master yarn     --deploy-mode client     --driver-memory 4g     --executor-memory 2g     --executor-cores 1     ./examples/src/main/python/pi.py     10
That's when I get this error that I mentioned:

16/01/14 10:09:10 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, mundonovo-priv): org.apache.spark.SparkException:
Error from python worker:
  python: module pyspark.daemon not found
PYTHONPATH was:
  /scratch5/hadoop/yarn/local/usercache/<username>/filecache/48/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/aqualab/spark-1.6.0-bin-hadoop2.4/python:/home/jpr123/hg.pacific/python-common:/home/jpr123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/home/<username>/code/libs:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0187/container_1450370639491_0187_01_000002/pyspark.zip:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0187/container_1450370639491_0187_01_000002/py4j-0.9-src.zip
java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:164)
        at [....]
followed by several more similar errors that also say:
Error from python worker:
  python: module pyspark.daemon not found

Even though the default python appeared to be correct, I just went ahead and explicitly set PYSPARK_PYTHON in conf/spark-env.sh to the path of the default python binary executable.  After making this change I was able to run the job successfully in client!  That is, this appeared to fix the "pyspark.daemon not found" error when running in client mode.
However, when running in cluster mode, I am still getting the same syntax error:
Traceback (most recent call last):
  File "pi.py", line 24, in ?
    from pyspark import SparkContext
  File "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py", line 61
    indent = ' ' * (min(len(m) for m in indents) if indents else 0)
                                                  ^
SyntaxError: invalid syntax
Is it possible that the PYSPARK_PYTHON environment variable is ignored when jobs are submitted in cluster mode?  It seems that Spark or Yarn is going behind my back, so to speak, and using some older version of python I didn't even know was installed.
Thanks again for all your help thus far.  We are getting close....
Andrew


On Wed, Jan 13, 2016 at 6:13 PM, Bryan Cutler <cu...@gmail.com> wrote:
Hi Andrew,

There are a couple of things to check.  First, is Python 2.7 the default version on all nodes in the cluster or is it an alternate install? Meaning what is the output of this command "$>  python --version"  If it is an alternate install, you could set the environment variable "PYSPARK_PYTHON"
    Python binary executable to use for PySpark in both driver and workers (default is python).

Did you try to submit the Python example under client mode?  Otherwise, the command looks fine, you don't use the --class option for submitting python files
 ./bin/spark-submit      --master yarn     --deploy-mode client     
--driver-memory 4g     --executor-memory 2g     --executor-cores 1     
./examples/src/main/python/pi.py     10

That is a good sign that local jobs and Java examples work, probably just a small configuration issue :)

Bryan

On Wed, Jan 13, 2016 at 3:51 PM, Andrew Weiner <an...@u.northwestern.edu> wrote:
Thanks for your continuing help.  Here is some additional info.
OS/architecture
output of cat /proc/version:
Linux version 2.6.18-400.1.1.el5 (mockbuild@x86-012.build.bos.redhat.com)
output of lsb_release -a:
LSB Version:    :core-4.0-amd64:core-4.0-ia32:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-ia32:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-ia32:printing-4.0-noarch
Distributor ID: RedHatEnterpriseServer
Description:    Red Hat Enterprise Linux Server release 5.11 (Tikanga)
Release:        5.11
Codename:       Tikanga
Running a local job
I have confirmed that I can successfully run python jobs using bin/spark-submit --master local[*]
Specifically, this is the command I am using:
./bin/spark-submit --master local[8] ./examples/src/main/python/wordcount.py file:/home/<username>/spark-1.6.0-bin-hadoop2.4/README.md
And it works!
Additional info
I am also able to successfully run the Java SparkPi example using yarn in cluster mode using this command:
 ./bin/spark-submit --class org.apache.spark.examples.SparkPi     --master yarn     --deploy-mode cluster     --driver-memory 4g     --executor-memory 2g     --executor-cores 1     lib/spark-examples*.jar     10
This Java job also runs successfully when I change --deploy-mode to client.  The fact that I can run Java jobs in cluster mode makes me think that everything is installed correctly--is that a valid assumption?
The problem remains that I cannot submit python jobs.  Here is the command that I am using to try to submit python jobs:
 ./bin/spark-submit      --master yarn     --deploy-mode cluster     --driver-memory 4g     --executor-memory 2g     --executor-cores 1     ./examples/src/main/python/pi.py     10
Does that look like a correct command?  I wasn't sure what to put for --class so I omitted it.  At any rate, the result of the above command is a syntax error, similar to the one I posted in the original email:
Traceback (most recent call last):
  File "pi.py", line 24, in ?
    from pyspark import SparkContext
  File "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py", line 61
    indent = ' ' * (min(len(m) for m in indents) if indents else 0)
                                                  ^
SyntaxError: invalid syntax
This really looks to me like a problem with the python version.  Python 2.4 would throw this syntax error but Python 2.7 would not.  And yet I am using Python 2.7.8.  Is there any chance that Spark or Yarn is somehow using an older version of Python without my knowledge?
Finally, when I try to run the same command in client mode...
 ./bin/spark-submit      --master yarn     --deploy-mode client     --driver-memory 4g     --executor-memory 2g     --executor-cores 1     ./examples/src/main/python/pi.py 10
I get the error I mentioned in the prior email:
Error from python worker:
  python: module pyspark.daemon not found
Any thoughts?
Best,
Andrew

On Mon, Jan 11, 2016 at 12:25 PM, Bryan Cutler <cu...@gmail.com> wrote:
This could be an environment issue, could you give more details about the OS/architecture that you are using?  If you are sure everything is installed correctly on each node following the guide on "Running Spark on Yarn" http://spark.apache.org/docs/latest/running-on-yarn.html  and that the spark assembly jar is reachable, then I would check to see if you can submit a local job to just run on one node.

On Fri, Jan 8, 2016 at 5:22 PM, Andrew Weiner <an...@u.northwestern.edu> wrote:
Now for simplicity I'm testing with wordcount.py from the provided examples, and using Spark 1.6.0
The first error I get is:
16/01/08 19:14:46 ERROR lzo.GPLNativeCodeLoader: Could not load native gpl library
java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path
        at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1864)
        at [....]
A bit lower down, I see this error:
16/01/08 19:14:48 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, mundonovo-priv): org.apache.spark.SparkException:
Error from python worker:
  python: module pyspark.daemon not found
PYTHONPATH was:
  /scratch5/hadoop/yarn/local/usercache/<username>/filecache/22/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/jpr123/hg.pacific/python-common:/home/jpr123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/pyspark.zip:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/py4j-0.9-src.zip
java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at [....]
And then a few more similar pyspark.daemon not found errors...
Andrew


On Fri, Jan 8, 2016 at 2:31 PM, Bryan Cutler <cu...@gmail.com> wrote:
Hi Andrew,

I know that older versions of Spark could not run PySpark on YARN in cluster mode.  I'm not sure if that is fixed in 1.6.0 though.  Can you try setting deploy-mode option to "client" when calling spark-submit?

Bryan

On Thu, Jan 7, 2016 at 2:39 PM, weineran <an...@u.northwestern.edu> wrote:
Hello,



When I try to submit a python job using spark-submit (using --master yarn

--deploy-mode cluster), I get the following error:



/Traceback (most recent call last):

  File "loss_rate_by_probe.py", line 15, in ?

    from pyspark import SparkContext

  File

"/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/__init__.py",

line 41, in ?

  File

"/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/context.py",

line 219

    with SparkContext._lock:

                    ^

SyntaxError: invalid syntax/



This is very similar to  this post from 2014

<http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-lock-Error-td18233.html>

, but unlike that person I am using Python 2.7.8.



Here is what I'm using:

Spark 1.3.1

Hadoop 2.4.0.2.1.5.0-695

Python 2.7.8



Another clue:  I also installed Spark 1.6.0 and tried to submit the same

job.  I got a similar error:



/Traceback (most recent call last):

  File "loss_rate_by_probe.py", line 15, in ?

    from pyspark import SparkContext

  File

"/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0119/container_1450370639491_0119_01_000001/pyspark.zip/pyspark/__init__.py",

line 61

    indent = ' ' * (min(len(m) for m in indents) if indents else 0)

                                                  ^

SyntaxError: invalid syntax/



Any thoughts?



Andrew







--

View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-SyntaxError-invalid-syntax-tp25910.html

Sent from the Apache Spark User List mailing list archive at Nabble.com.



---------------------------------------------------------------------

To unsubscribe, e-mail: user-unsubscribe@spark.apache.org

For additional commands, e-mail: user-help@spark.apache.org


Re: SparkContext SyntaxError: invalid syntax

Posted by Andrew Weiner <an...@u.northwestern.edu>.
Yeah, I do think it would be worth explicitly stating this in the docs.  I
was going to try to edit the docs myself and submit a pull request, but I'm
having trouble building the docs from github.  If anyone else wants to do
this, here is approximately what I would say:

(To be added to
http://spark.apache.org/docs/latest/configuration.html#environment-variables
)
"Note: When running Spark on YARN in cluster mode, environment variables
need to be set using the spark.yarn.appMasterEnv.[EnvironmentVariableName]
property in your conf/spark-defaults.conf file.  Environment variables that
are set in spark-env.sh will not be reflected in the YARN Application
Master process in cluster mode.  See the YARN-related Spark Properties
<http://spark.apache.org/docs/latest/running-on-yarn.html#spark-properties>
for more information."

I might take another crack at building the docs myself if nobody beats me
to this.
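In case it is useful to anyone trying the same thing, this is roughly the toolchain I believe the docs build needs (package names taken from docs/README.md in the Spark source; treat the exact commands as an assumption rather than a recipe):

gem install jekyll
pip install Pygments
cd docs
jekyll build    # generated site ends up in docs/_site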

Andrew


On Fri, Jan 15, 2016 at 5:01 PM, Bryan Cutler <cu...@gmail.com> wrote:

> Glad you got it going!  It wasn't very obvious what needed to be set,
> maybe it is worth explicitly stating this in the docs since it seems to
> have come up a couple times before too.
>
> Bryan
>
> On Fri, Jan 15, 2016 at 12:33 PM, Andrew Weiner <
> andrewweiner2020@u.northwestern.edu> wrote:
>
>> Actually, I just found this [
>> https://issues.apache.org/jira/browse/SPARK-1680], which after a bit of
>> googling and reading leads me to believe that the preferred way to change
>> the yarn environment is to edit the spark-defaults.conf file by adding this
>> line:
>> spark.yarn.appMasterEnv.PYSPARK_PYTHON    /path/to/python
>>
>> While both this solution and the solution from my prior email work, I
>> believe this is the preferred solution.
>>
>> Sorry for the flurry of emails.  Again, thanks for all the help!
>>
>> Andrew
>>
>> On Fri, Jan 15, 2016 at 1:47 PM, Andrew Weiner <
>> andrewweiner2020@u.northwestern.edu> wrote:
>>
>>> I finally got the pi.py example to run in yarn cluster mode.  This was
>>> the key insight:
>>> https://issues.apache.org/jira/browse/SPARK-9229
>>>
>>> I had to set SPARK_YARN_USER_ENV in spark-env.sh:
>>> export
>>> SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/home/aqualab/local/bin/python"
>>>
>>> This caused the PYSPARK_PYTHON environment variable to be used in my
>>> yarn environment in cluster mode.
>>>
>>> Thank you for all your help!
>>>
>>> Best,
>>> Andrew
>>>
>>>
>>>
>>> On Fri, Jan 15, 2016 at 12:57 PM, Andrew Weiner <
>>> andrewweiner2020@u.northwestern.edu> wrote:
>>>
>>>> I tried playing around with my environment variables, and here is an
>>>> update.
>>>>
>>>> When I run in cluster mode, my environment variables do not persist
>>>> throughout the entire job.
>>>> For example, I tried creating a local copy of HADOOP_CONF_DIR in
>>>> /home/<username>/local/etc/hadoop/conf, and then, in spark-env.sh I set the
>>>> variable:
>>>> export HADOOP_CONF_DIR=/home/<username>/local/etc/hadoop/conf
>>>>
>>>> Later, when we print the environment variables in the python code, I
>>>> see this:
>>>>
>>>> ('HADOOP_CONF_DIR', '/etc/hadoop/conf')
>>>>
>>>> However, when I run in client mode, I see this:
>>>>
>>>> ('HADOOP_CONF_DIR', '/home/awp066/local/etc/hadoop/conf')
>>>>
>>>> Furthermore, if I omit that environment variable from spark-env.sh altogether, I get the expected error in both client and cluster mode:
>>>>
>>>> When running with master 'yarn'
>>>> either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
>>>>
>>>> This suggests that my environment variables are being used when I first submit the job, but at some point during the job, my environment variables are thrown out and someone's (yarn's?) environment variables are being used.
>>>>
>>>> Andrew
>>>>
>>>>
>>>> On Fri, Jan 15, 2016 at 11:03 AM, Andrew Weiner <
>>>> andrewweiner2020@u.northwestern.edu> wrote:
>>>>
>>>>> Indeed!  Here is the output when I run in cluster mode:
>>>>>
>>>>> Traceback (most recent call last):
>>>>>   File "pi.py", line 22, in ?
>>>>>     raise RuntimeError("\n"+str(sys.version_info) +"\n"+
>>>>> RuntimeError:
>>>>> (2, 4, 3, 'final', 0)
>>>>> [('PYSPARK_GATEWAY_PORT', '48079'), ('PYTHONPATH', '/scratch2/hadoop/yarn/local/usercache/<username>/filecache/116/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/<user>/spark-1.6.0-bin-hadoop2.4/python:/home/<username>/code/libs:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0239/container_1450370639491_0239_01_000001/pyspark.zip:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0239/container_1450370639491_0239_01_000001/py4j-0.9-src.zip'), ('PYTHONUNBUFFERED', 'YES')]
>>>>>
>>>>> As we suspected, it is using Python 2.4
>>>>>
>>>>> One thing that surprises me is that PYSPARK_PYTHON is not showing up in the list, even though I am setting it and exporting it in spark-submit *and* in spark-env.sh.  Is there somewhere else I need to set this variable?  Maybe in one of the hadoop conf files in my HADOOP_CONF_DIR?
>>>>>
>>>>> Andrew
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Jan 14, 2016 at 1:14 PM, Bryan Cutler <cu...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> It seems like it could be the case that some other Python version is
>>>>>> being invoked.  To make sure, can you add something like this to the top of
>>>>>> the .py file you are submitting to get some more info about how the
>>>>>> application master is configured?
>>>>>>
>>>>>> import sys, os
>>>>>> raise RuntimeError("\n"+str(sys.version_info) +"\n"+
>>>>>>     str([(k,os.environ[k]) for k in os.environ if "PY" in k]))
>>>>>>
>>>>>> On Thu, Jan 14, 2016 at 8:37 AM, Andrew Weiner <
>>>>>> andrewweiner2020@u.northwestern.edu> wrote:
>>>>>>
>>>>>>> Hi Bryan,
>>>>>>>
>>>>>>> I ran "$> python --version" on every node on the cluster, and it is
>>>>>>> Python 2.7.8 for every single one.
>>>>>>>
>>>>>>> When I try to submit the Python example in client mode
>>>>>>> * ./bin/spark-submit      --master yarn     --deploy-mode client
>>>>>>> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
>>>>>>> ./examples/src/main/python/pi.py     10*
>>>>>>> That's when I get this error that I mentioned:
>>>>>>>
>>>>>>> 16/01/14 10:09:10 WARN scheduler.TaskSetManager: Lost task 0.0 in
>>>>>>> stage 0.0 (TID 0, mundonovo-priv): org.apache.spark.SparkException:
>>>>>>> Error from python worker:
>>>>>>>   python: module pyspark.daemon not found
>>>>>>> PYTHONPATH was:
>>>>>>>
>>>>>>> /scratch5/hadoop/yarn/local/usercache/<username>/filecache/48/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/aqualab/spark-1.6.0-bin-hadoop2.4/python:/home/jpr123/hg.pacific/python-common:/home/jp
>>>>>>>
>>>>>>> r123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/home/<username>/code/libs:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0187/container_1450370639491_0187_01_000002/pyspark.zip:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0187/container_1450370639491_0187_01_000002/py4j-0.9-src.zip
>>>>>>> java.io.EOFException
>>>>>>>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>>>>>>>         at
>>>>>>> org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:164)
>>>>>>>         at [....]
>>>>>>>
>>>>>>> followed by several more similar errors that also say:
>>>>>>> Error from python worker:
>>>>>>>   python: module pyspark.daemon not found
>>>>>>>
>>>>>>>
>>>>>>> Even though the default python appeared to be correct, I just went
>>>>>>> ahead and explicitly set PYSPARK_PYTHON in conf/spark-env.sh to the path of
>>>>>>> the default python binary executable.  After making this change I was able
>>>>>>> to run the job successfully in client mode!  That is, this appeared to fix the
>>>>>>> "pyspark.daemon not found" error when running in client mode.
>>>>>>>
>>>>>>> However, when running in cluster mode, I am still getting the same
>>>>>>> syntax error:
>>>>>>>
>>>>>>> Traceback (most recent call last):
>>>>>>>   File "pi.py", line 24, in ?
>>>>>>>     from pyspark import SparkContext
>>>>>>>   File "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py", line 61
>>>>>>>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>>>>>>>                                                   ^
>>>>>>> SyntaxError: invalid syntax
>>>>>>>
>>>>>>> Is it possible that the PYSPARK_PYTHON environment variable is ignored when jobs are submitted in cluster mode?  It seems that Spark or Yarn is going behind my back, so to speak, and using some older version of python I didn't even know was installed.
>>>>>>>
>>>>>>> Thanks again for all your help thus far.  We are getting close....
>>>>>>>
>>>>>>> Andrew
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jan 13, 2016 at 6:13 PM, Bryan Cutler <cu...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Andrew,
>>>>>>>>
>>>>>>>> There are a couple of things to check.  First, is Python 2.7 the
>>>>>>>> default version on all nodes in the cluster or is it an alternate install?
>>>>>>>> Meaning, what is the output of the command "$> python --version"?  If it is
>>>>>>>> an alternate install, you could set the environment variable
>>>>>>>> "PYSPARK_PYTHON" to the Python binary executable to use for PySpark in
>>>>>>>> both driver and workers (default is python).
>>>>>>>>
>>>>>>>> Did you try to submit the Python example under client mode?
>>>>>>>> Otherwise, the command looks fine; you don't use the --class option when
>>>>>>>> submitting python files
>>>>>>>> * ./bin/spark-submit      --master yarn     --deploy-mode client
>>>>>>>>   --driver-memory 4g     --executor-memory 2g     --executor-cores 1
>>>>>>>> ./examples/src/main/python/pi.py     10*
>>>>>>>>
>>>>>>>> That is a good sign that local jobs and Java examples work,
>>>>>>>> probably just a small configuration issue :)
>>>>>>>>
>>>>>>>> Bryan
>>>>>>>>
>>>>>>>> On Wed, Jan 13, 2016 at 3:51 PM, Andrew Weiner <
>>>>>>>> andrewweiner2020@u.northwestern.edu> wrote:
>>>>>>>>
>>>>>>>>> Thanks for your continuing help.  Here is some additional info.
>>>>>>>>>
>>>>>>>>> *OS/architecture*
>>>>>>>>> output of *cat /proc/version*:
>>>>>>>>> Linux version 2.6.18-400.1.1.el5 (
>>>>>>>>> mockbuild@x86-012.build.bos.redhat.com)
>>>>>>>>>
>>>>>>>>> output of *lsb_release -a*:
>>>>>>>>> LSB Version:
>>>>>>>>>  :core-4.0-amd64:core-4.0-ia32:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-ia32:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-ia32:printing-4.0-noarch
>>>>>>>>> Distributor ID: RedHatEnterpriseServer
>>>>>>>>> Description:    Red Hat Enterprise Linux Server release 5.11
>>>>>>>>> (Tikanga)
>>>>>>>>> Release:        5.11
>>>>>>>>> Codename:       Tikanga
>>>>>>>>>
>>>>>>>>> *Running a local job*
>>>>>>>>> I have confirmed that I can successfully run python jobs using
>>>>>>>>> bin/spark-submit --master local[*]
>>>>>>>>> Specifically, this is the command I am using:
>>>>>>>>> *./bin/spark-submit --master local[8]
>>>>>>>>> ./examples/src/main/python/wordcount.py
>>>>>>>>> file:/home/<username>/spark-1.6.0-bin-hadoop2.4/README.md*
>>>>>>>>> And it works!
>>>>>>>>>
>>>>>>>>> *Additional info*
>>>>>>>>> I am also able to successfully run the Java SparkPi example using
>>>>>>>>> yarn in cluster mode using this command:
>>>>>>>>> * ./bin/spark-submit --class org.apache.spark.examples.SparkPi
>>>>>>>>> --master yarn     --deploy-mode cluster     --driver-memory 4g
>>>>>>>>> --executor-memory 2g     --executor-cores 1     lib/spark-examples*.jar
>>>>>>>>> 10*
>>>>>>>>> This Java job also runs successfully when I change --deploy-mode
>>>>>>>>> to client.  The fact that I can run Java jobs in cluster mode makes me
>>>>>>>>> thing that everything is installed correctly--is that a valid assumption?
>>>>>>>>>
>>>>>>>>> The problem remains that I cannot submit python jobs.  Here is the
>>>>>>>>> command that I am using to try to submit python jobs:
>>>>>>>>> * ./bin/spark-submit      --master yarn     --deploy-mode cluster
>>>>>>>>>     --driver-memory 4g     --executor-memory 2g     --executor-cores 1
>>>>>>>>> ./examples/src/main/python/pi.py     10*
>>>>>>>>> Does that look like a correct command?  I wasn't sure what to put
>>>>>>>>> for --class so I omitted it.  At any rate, the result of the above command
>>>>>>>>> is a syntax error, similar to the one I posted in the original email:
>>>>>>>>>
>>>>>>>>> Traceback (most recent call last):
>>>>>>>>>   File "pi.py", line 24, in ?
>>>>>>>>>     from pyspark import SparkContext
>>>>>>>>>   File "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py", line 61
>>>>>>>>>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>>>>>>>>>                                                   ^
>>>>>>>>> SyntaxError: invalid syntax
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> This really looks to me like a problem with the python version.
>>>>>>>>> Python 2.4 would throw this syntax error but Python 2.7 would not.  And yet
>>>>>>>>> I am using Python 2.7.8.  Is there any chance that Spark or Yarn is somehow
>>>>>>>>> using an older version of Python without my knowledge?
>>>>>>>>>
>>>>>>>>> Finally, when I try to run the same command in client mode...
>>>>>>>>> * ./bin/spark-submit      --master yarn     --deploy-mode client
>>>>>>>>>   --driver-memory 4g     --executor-memory 2g     --executor-cores 1
>>>>>>>>> ./examples/src/main/python/pi.py 10*
>>>>>>>>> I get the error I mentioned in the prior email:
>>>>>>>>> Error from python worker:
>>>>>>>>>   python: module pyspark.daemon not found
>>>>>>>>>
>>>>>>>>> Any thoughts?
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Andrew
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Jan 11, 2016 at 12:25 PM, Bryan Cutler <cu...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> This could be an environment issue, could you give more details
>>>>>>>>>> about the OS/architecture that you are using?  If you are sure everything
>>>>>>>>>> is installed correctly on each node following the guide on "Running Spark
>>>>>>>>>> on Yarn" http://spark.apache.org/docs/latest/running-on-yarn.html
>>>>>>>>>> and that the spark assembly jar is reachable, then I would check to see if
>>>>>>>>>> you can submit a local job to just run on one node.
>>>>>>>>>>
>>>>>>>>>> On Fri, Jan 8, 2016 at 5:22 PM, Andrew Weiner <
>>>>>>>>>> andrewweiner2020@u.northwestern.edu> wrote:
>>>>>>>>>>
>>>>>>>>>>> Now for simplicity I'm testing with wordcount.py from the
>>>>>>>>>>> provided examples, and using Spark 1.6.0
>>>>>>>>>>>
>>>>>>>>>>> The first error I get is:
>>>>>>>>>>>
>>>>>>>>>>> 16/01/08 19:14:46 ERROR lzo.GPLNativeCodeLoader: Could not load
>>>>>>>>>>> native gpl library
>>>>>>>>>>> java.lang.UnsatisfiedLinkError: no gplcompression in
>>>>>>>>>>> java.library.path
>>>>>>>>>>>         at
>>>>>>>>>>> java.lang.ClassLoader.loadLibrary(ClassLoader.java:1864)
>>>>>>>>>>>         at [....]
>>>>>>>>>>>
>>>>>>>>>>> A bit lower down, I see this error:
>>>>>>>>>>>
>>>>>>>>>>> 16/01/08 19:14:48 WARN scheduler.TaskSetManager: Lost task 0.0
>>>>>>>>>>> in stage 0.0 (TID 0, mundonovo-priv): org.apache.spark.SparkException:
>>>>>>>>>>> Error from python worker:
>>>>>>>>>>>   python: module pyspark.daemon not found
>>>>>>>>>>> PYTHONPATH was:
>>>>>>>>>>>
>>>>>>>>>>> /scratch5/hadoop/yarn/local/usercache/<username>/filecache/22/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/jpr123/hg.pacific/python-common:/home/jpr123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/pyspark.zip:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/py4j-0.9-src.zip
>>>>>>>>>>> java.io.EOFException
>>>>>>>>>>>         at
>>>>>>>>>>> java.io.DataInputStream.readInt(DataInputStream.java:392)
>>>>>>>>>>>         at [....]
>>>>>>>>>>>
>>>>>>>>>>> And then a few more similar pyspark.daemon not found errors...
>>>>>>>>>>>
>>>>>>>>>>> Andrew
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Jan 8, 2016 at 2:31 PM, Bryan Cutler <cu...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>
>>>>>>>>>>>> I know that older versions of Spark could not run PySpark on
>>>>>>>>>>>> YARN in cluster mode.  I'm not sure if that is fixed in 1.6.0 though.  Can
>>>>>>>>>>>> you try setting deploy-mode option to "client" when calling spark-submit?
>>>>>>>>>>>>
>>>>>>>>>>>> Bryan
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jan 7, 2016 at 2:39 PM, weineran <
>>>>>>>>>>>> andrewweiner2020@u.northwestern.edu> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>
>>>>>>>>>>>>> When I try to submit a python job using spark-submit (using
>>>>>>>>>>>>> --master yarn
>>>>>>>>>>>>> --deploy-mode cluster), I get the following error:
>>>>>>>>>>>>>
>>>>>>>>>>>>> /Traceback (most recent call last):
>>>>>>>>>>>>>   File "loss_rate_by_probe.py", line 15, in ?
>>>>>>>>>>>>>     from pyspark import SparkContext
>>>>>>>>>>>>>   File
>>>>>>>>>>>>>
>>>>>>>>>>>>> "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/__init__.py",
>>>>>>>>>>>>> line 41, in ?
>>>>>>>>>>>>>   File
>>>>>>>>>>>>>
>>>>>>>>>>>>> "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/context.py",
>>>>>>>>>>>>> line 219
>>>>>>>>>>>>>     with SparkContext._lock:
>>>>>>>>>>>>>                     ^
>>>>>>>>>>>>> SyntaxError: invalid syntax/
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is very similar to  this post from 2014
>>>>>>>>>>>>> <
>>>>>>>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-lock-Error-td18233.html
>>>>>>>>>>>>> >
>>>>>>>>>>>>> , but unlike that person I am using Python 2.7.8.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here is what I'm using:
>>>>>>>>>>>>> Spark 1.3.1
>>>>>>>>>>>>> Hadoop 2.4.0.2.1.5.0-695
>>>>>>>>>>>>> Python 2.7.8
>>>>>>>>>>>>>
>>>>>>>>>>>>> Another clue:  I also installed Spark 1.6.0 and tried to
>>>>>>>>>>>>> submit the same
>>>>>>>>>>>>> job.  I got a similar error:
>>>>>>>>>>>>>
>>>>>>>>>>>>> /Traceback (most recent call last):
>>>>>>>>>>>>>   File "loss_rate_by_probe.py", line 15, in ?
>>>>>>>>>>>>>     from pyspark import SparkContext
>>>>>>>>>>>>>   File
>>>>>>>>>>>>>
>>>>>>>>>>>>> "/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0119/container_1450370639491_0119_01_000001/pyspark.zip/pyspark/__init__.py",
>>>>>>>>>>>>> line 61
>>>>>>>>>>>>>     indent = ' ' * (min(len(m) for m in indents) if indents
>>>>>>>>>>>>> else 0)
>>>>>>>>>>>>>                                                   ^
>>>>>>>>>>>>> SyntaxError: invalid syntax/
>>>>>>>>>>>>>
>>>>>>>>>>>>> Any thoughts?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Andrew
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> View this message in context:
>>>>>>>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-SyntaxError-invalid-syntax-tp25910.html
>>>>>>>>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>>>>>>>>> Nabble.com.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>>>>>>>>>>> For additional commands, e-mail: user-help@spark.apache.org
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: SparkContext SyntaxError: invalid syntax

Posted by Bryan Cutler <cu...@gmail.com>.
Glad you got it going!  It wasn't very obvious what needed to be set,
maybe it is worth explicitly stating this in the docs since it seems to
have come up a couple times before too.

Bryan

On Fri, Jan 15, 2016 at 12:33 PM, Andrew Weiner <
andrewweiner2020@u.northwestern.edu> wrote:

> Actually, I just found this [
> https://issues.apache.org/jira/browse/SPARK-1680], which after a bit of
> googling and reading leads me to believe that the preferred way to change
> the yarn environment is to edit the spark-defaults.conf file by adding this
> line:
> spark.yarn.appMasterEnv.PYSPARK_PYTHON    /path/to/python
>
> While both this solution and the solution from my prior email work, I
> believe this is the preferred solution.
>
> Sorry for the flurry of emails.  Again, thanks for all the help!
>
> Andrew
>
> On Fri, Jan 15, 2016 at 1:47 PM, Andrew Weiner <
> andrewweiner2020@u.northwestern.edu> wrote:
>
>> I finally got the pi.py example to run in yarn cluster mode.  This was
>> the key insight:
>> https://issues.apache.org/jira/browse/SPARK-9229
>>
>> I had to set SPARK_YARN_USER_ENV in spark-env.sh:
>> export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/home/aqualab/local/bin/python"
>>
>> This caused the PYSPARK_PYTHON environment variable to be used in my yarn
>> environment in cluster mode.
>>
>> Thank you for all your help!
>>
>> Best,
>> Andrew
>>
>>
>>
>> On Fri, Jan 15, 2016 at 12:57 PM, Andrew Weiner <
>> andrewweiner2020@u.northwestern.edu> wrote:
>>
>>> I tried playing around with my environment variables, and here is an
>>> update.
>>>
>>> When I run in cluster mode, my environment variables do not persist
>>> throughout the entire job.
>>> For example, I tried creating a local copy of HADOOP_CONF_DIR in
>>> /home/<username>/local/etc/hadoop/conf, and then, in spark-env.sh I set the
>>> variable:
>>> export HADOOP_CONF_DIR=/home/<username>/local/etc/hadoop/conf
>>>
>>> Later, when we print the environment variables in the python code, I see
>>> this:
>>>
>>> ('HADOOP_CONF_DIR', '/etc/hadoop/conf')
>>>
>>> However, when I run in client mode, I see this:
>>>
>>> ('HADOOP_CONF_DIR', '/home/awp066/local/etc/hadoop/conf')
>>>
>>> Furthermore, if I omit that environment variable from spark-env.sh altogether, I get the expected error in both client and cluster mode:
>>>
>>> When running with master 'yarn'
>>> either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
>>>
>>> This suggests that my environment variables are being used when I first submit the job, but at some point during the job, my environment variables are thrown out and someone's (yarn's?) environment variables are being used.
>>>
>>> Andrew
>>>
>>>
>>> On Fri, Jan 15, 2016 at 11:03 AM, Andrew Weiner <
>>> andrewweiner2020@u.northwestern.edu> wrote:
>>>
>>>> Indeed!  Here is the output when I run in cluster mode:
>>>>
>>>> Traceback (most recent call last):
>>>>   File "pi.py", line 22, in ?
>>>>     raise RuntimeError("\n"+str(sys.version_info) +"\n"+
>>>> RuntimeError:
>>>> (2, 4, 3, 'final', 0)
>>>> [('PYSPARK_GATEWAY_PORT', '48079'), ('PYTHONPATH', '/scratch2/hadoop/yarn/local/usercache/<username>/filecache/116/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/<user>/spark-1.6.0-bin-hadoop2.4/python:/home/<username>/code/libs:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0239/container_1450370639491_0239_01_000001/pyspark.zip:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0239/container_1450370639491_0239_01_000001/py4j-0.9-src.zip'), ('PYTHONUNBUFFERED', 'YES')]
>>>>
>>>> As we suspected, it is using Python 2.4
>>>>
>>>> One thing that surprises me is that PYSPARK_PYTHON is not showing up in the list, even though I am setting it and exporting it in spark-submit *and* in spark-env.sh.  Is there somewhere else I need to set this variable?  Maybe in one of the hadoop conf files in my HADOOP_CONF_DIR?
>>>>
>>>> Andrew
>>>>
>>>>
>>>>
>>>> On Thu, Jan 14, 2016 at 1:14 PM, Bryan Cutler <cu...@gmail.com>
>>>> wrote:
>>>>
>>>>> It seems like it could be the case that some other Python version is
>>>>> being invoked.  To make sure, can you add something like this to the top of
>>>>> the .py file you are submitting to get some more info about how the
>>>>> application master is configured?
>>>>>
>>>>> import sys, os
>>>>> raise RuntimeError("\n"+str(sys.version_info) +"\n"+
>>>>>     str([(k,os.environ[k]) for k in os.environ if "PY" in k]))
>>>>>
>>>>> On Thu, Jan 14, 2016 at 8:37 AM, Andrew Weiner <
>>>>> andrewweiner2020@u.northwestern.edu> wrote:
>>>>>
>>>>>> Hi Bryan,
>>>>>>
>>>>>> I ran "$> python --version" on every node on the cluster, and it is
>>>>>> Python 2.7.8 for every single one.
>>>>>>
>>>>>> When I try to submit the Python example in client mode
>>>>>> * ./bin/spark-submit      --master yarn     --deploy-mode client
>>>>>> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
>>>>>> ./examples/src/main/python/pi.py     10*
>>>>>> That's when I get this error that I mentioned:
>>>>>>
>>>>>> 16/01/14 10:09:10 WARN scheduler.TaskSetManager: Lost task 0.0 in
>>>>>> stage 0.0 (TID 0, mundonovo-priv): org.apache.spark.SparkException:
>>>>>> Error from python worker:
>>>>>>   python: module pyspark.daemon not found
>>>>>> PYTHONPATH was:
>>>>>>
>>>>>> /scratch5/hadoop/yarn/local/usercache/<username>/filecache/48/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/aqualab/spark-1.6.0-bin-hadoop2.4/python:/home/jpr123/hg.pacific/python-common:/home/jp
>>>>>>
>>>>>> r123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/home/<username>/code/libs:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0187/container_1450370639491_0187_01_000002/pyspark.zip:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0187/container_1450370639491_0187_01_000002/py4j-0.9-src.zip
>>>>>> java.io.EOFException
>>>>>>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>>>>>>         at
>>>>>> org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:164)
>>>>>>         at [....]
>>>>>>
>>>>>> followed by several more similar errors that also say:
>>>>>> Error from python worker:
>>>>>>   python: module pyspark.daemon not found
>>>>>>
>>>>>>
>>>>>> Even though the default python appeared to be correct, I just went
>>>>>> ahead and explicitly set PYSPARK_PYTHON in conf/spark-env.sh to the path of
>>>>>> the default python binary executable.  After making this change I was able
>>>>>> to run the job successfully in client mode!  That is, this appeared to fix the
>>>>>> "pyspark.daemon not found" error when running in client mode.
>>>>>>
>>>>>> However, when running in cluster mode, I am still getting the same
>>>>>> syntax error:
>>>>>>
>>>>>> Traceback (most recent call last):
>>>>>>   File "pi.py", line 24, in ?
>>>>>>     from pyspark import SparkContext
>>>>>>   File "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py", line 61
>>>>>>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>>>>>>                                                   ^
>>>>>> SyntaxError: invalid syntax
>>>>>>
>>>>>> Is it possible that the PYSPARK_PYTHON environment variable is ignored when jobs are submitted in cluster mode?  It seems that Spark or Yarn is going behind my back, so to speak, and using some older version of python I didn't even know was installed.
>>>>>>
>>>>>> Thanks again for all your help thus far.  We are getting close....
>>>>>>
>>>>>> Andrew
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Jan 13, 2016 at 6:13 PM, Bryan Cutler <cu...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Andrew,
>>>>>>>
>>>>>>> There are a couple of things to check.  First, is Python 2.7 the
>>>>>>> default version on all nodes in the cluster or is it an alternate install?
>>>>>>> Meaning, what is the output of the command "$> python --version"?  If it is
>>>>>>> an alternate install, you could set the environment variable
>>>>>>> "PYSPARK_PYTHON" to the Python binary executable to use for PySpark in both
>>>>>>> driver and workers (default is python).
>>>>>>>
>>>>>>> Did you try to submit the Python example under client mode?
>>>>>>> Otherwise, the command looks fine; you don't use the --class option when
>>>>>>> submitting python files
>>>>>>> * ./bin/spark-submit      --master yarn     --deploy-mode client
>>>>>>> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
>>>>>>> ./examples/src/main/python/pi.py     10*
>>>>>>>
>>>>>>> That is a good sign that local jobs and Java examples work, probably
>>>>>>> just a small configuration issue :)
>>>>>>>
>>>>>>> Bryan
>>>>>>>
>>>>>>> On Wed, Jan 13, 2016 at 3:51 PM, Andrew Weiner <
>>>>>>> andrewweiner2020@u.northwestern.edu> wrote:
>>>>>>>
>>>>>>>> Thanks for your continuing help.  Here is some additional info.
>>>>>>>>
>>>>>>>> *OS/architecture*
>>>>>>>> output of *cat /proc/version*:
>>>>>>>> Linux version 2.6.18-400.1.1.el5 (
>>>>>>>> mockbuild@x86-012.build.bos.redhat.com)
>>>>>>>>
>>>>>>>> output of *lsb_release -a*:
>>>>>>>> LSB Version:
>>>>>>>>  :core-4.0-amd64:core-4.0-ia32:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-ia32:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-ia32:printing-4.0-noarch
>>>>>>>> Distributor ID: RedHatEnterpriseServer
>>>>>>>> Description:    Red Hat Enterprise Linux Server release 5.11
>>>>>>>> (Tikanga)
>>>>>>>> Release:        5.11
>>>>>>>> Codename:       Tikanga
>>>>>>>>
>>>>>>>> *Running a local job*
>>>>>>>> I have confirmed that I can successfully run python jobs using
>>>>>>>> bin/spark-submit --master local[*]
>>>>>>>> Specifically, this is the command I am using:
>>>>>>>> *./bin/spark-submit --master local[8]
>>>>>>>> ./examples/src/main/python/wordcount.py
>>>>>>>> file:/home/<username>/spark-1.6.0-bin-hadoop2.4/README.md*
>>>>>>>> And it works!
>>>>>>>>
>>>>>>>> *Additional info*
>>>>>>>> I am also able to successfully run the Java SparkPi example using
>>>>>>>> yarn in cluster mode using this command:
>>>>>>>> * ./bin/spark-submit --class org.apache.spark.examples.SparkPi
>>>>>>>> --master yarn     --deploy-mode cluster     --driver-memory 4g
>>>>>>>> --executor-memory 2g     --executor-cores 1     lib/spark-examples*.jar
>>>>>>>> 10*
>>>>>>>> This Java job also runs successfully when I change --deploy-mode to
>>>>>>>> client.  The fact that I can run Java jobs in cluster mode makes me think
>>>>>>>> that everything is installed correctly--is that a valid assumption?
>>>>>>>>
>>>>>>>> The problem remains that I cannot submit python jobs.  Here is the
>>>>>>>> command that I am using to try to submit python jobs:
>>>>>>>> * ./bin/spark-submit      --master yarn     --deploy-mode cluster
>>>>>>>>   --driver-memory 4g     --executor-memory 2g     --executor-cores 1
>>>>>>>> ./examples/src/main/python/pi.py     10*
>>>>>>>> Does that look like a correct command?  I wasn't sure what to put
>>>>>>>> for --class so I omitted it.  At any rate, the result of the above command
>>>>>>>> is a syntax error, similar to the one I posted in the original email:
>>>>>>>>
>>>>>>>> Traceback (most recent call last):
>>>>>>>>   File "pi.py", line 24, in ?
>>>>>>>>     from pyspark import SparkContext
>>>>>>>>   File "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py", line 61
>>>>>>>>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>>>>>>>>                                                   ^
>>>>>>>> SyntaxError: invalid syntax
>>>>>>>>
>>>>>>>>
>>>>>>>> This really looks to me like a problem with the python version.
>>>>>>>> Python 2.4 would throw this syntax error but Python 2.7 would not.  And yet
>>>>>>>> I am using Python 2.7.8.  Is there any chance that Spark or Yarn is somehow
>>>>>>>> using an older version of Python without my knowledge?
>>>>>>>>
>>>>>>>> Finally, when I try to run the same command in client mode...
>>>>>>>> * ./bin/spark-submit      --master yarn     --deploy-mode client
>>>>>>>>   --driver-memory 4g     --executor-memory 2g     --executor-cores 1
>>>>>>>> ./examples/src/main/python/pi.py 10*
>>>>>>>> I get the error I mentioned in the prior email:
>>>>>>>> Error from python worker:
>>>>>>>>   python: module pyspark.daemon not found
>>>>>>>>
>>>>>>>> Any thoughts?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Andrew
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Jan 11, 2016 at 12:25 PM, Bryan Cutler <cu...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> This could be an environment issue, could you give more details
>>>>>>>>> about the OS/architecture that you are using?  If you are sure everything
>>>>>>>>> is installed correctly on each node following the guide on "Running Spark
>>>>>>>>> on Yarn" http://spark.apache.org/docs/latest/running-on-yarn.html
>>>>>>>>> and that the spark assembly jar is reachable, then I would check to see if
>>>>>>>>> you can submit a local job to just run on one node.
>>>>>>>>>
>>>>>>>>> On Fri, Jan 8, 2016 at 5:22 PM, Andrew Weiner <
>>>>>>>>> andrewweiner2020@u.northwestern.edu> wrote:
>>>>>>>>>
>>>>>>>>>> Now for simplicity I'm testing with wordcount.py from the
>>>>>>>>>> provided examples, and using Spark 1.6.0
>>>>>>>>>>
>>>>>>>>>> The first error I get is:
>>>>>>>>>>
>>>>>>>>>> 16/01/08 19:14:46 ERROR lzo.GPLNativeCodeLoader: Could not load
>>>>>>>>>> native gpl library
>>>>>>>>>> java.lang.UnsatisfiedLinkError: no gplcompression in
>>>>>>>>>> java.library.path
>>>>>>>>>>         at
>>>>>>>>>> java.lang.ClassLoader.loadLibrary(ClassLoader.java:1864)
>>>>>>>>>>         at [....]
>>>>>>>>>>
>>>>>>>>>> A bit lower down, I see this error:
>>>>>>>>>>
>>>>>>>>>> 16/01/08 19:14:48 WARN scheduler.TaskSetManager: Lost task 0.0 in
>>>>>>>>>> stage 0.0 (TID 0, mundonovo-priv): org.apache.spark.SparkException:
>>>>>>>>>> Error from python worker:
>>>>>>>>>>   python: module pyspark.daemon not found
>>>>>>>>>> PYTHONPATH was:
>>>>>>>>>>
>>>>>>>>>> /scratch5/hadoop/yarn/local/usercache/<username>/filecache/22/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/jpr123/hg.pacific/python-common:/home/jpr123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/pyspark.zip:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/py4j-0.9-src.zip
>>>>>>>>>> java.io.EOFException
>>>>>>>>>>         at
>>>>>>>>>> java.io.DataInputStream.readInt(DataInputStream.java:392)
>>>>>>>>>>         at [....]
>>>>>>>>>>
>>>>>>>>>> And then a few more similar pyspark.daemon not found errors...
>>>>>>>>>>
>>>>>>>>>> Andrew
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Jan 8, 2016 at 2:31 PM, Bryan Cutler <cu...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>
>>>>>>>>>>> I know that older versions of Spark could not run PySpark on
>>>>>>>>>>> YARN in cluster mode.  I'm not sure if that is fixed in 1.6.0 though.  Can
>>>>>>>>>>> you try setting deploy-mode option to "client" when calling spark-submit?
>>>>>>>>>>>
>>>>>>>>>>> Bryan
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jan 7, 2016 at 2:39 PM, weineran <
>>>>>>>>>>> andrewweiner2020@u.northwestern.edu> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello,
>>>>>>>>>>>>
>>>>>>>>>>>> When I try to submit a python job using spark-submit (using
>>>>>>>>>>>> --master yarn
>>>>>>>>>>>> --deploy-mode cluster), I get the following error:
>>>>>>>>>>>>
>>>>>>>>>>>> /Traceback (most recent call last):
>>>>>>>>>>>>   File "loss_rate_by_probe.py", line 15, in ?
>>>>>>>>>>>>     from pyspark import SparkContext
>>>>>>>>>>>>   File
>>>>>>>>>>>>
>>>>>>>>>>>> "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/__init__.py",
>>>>>>>>>>>> line 41, in ?
>>>>>>>>>>>>   File
>>>>>>>>>>>>
>>>>>>>>>>>> "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/context.py",
>>>>>>>>>>>> line 219
>>>>>>>>>>>>     with SparkContext._lock:
>>>>>>>>>>>>                     ^
>>>>>>>>>>>> SyntaxError: invalid syntax/
>>>>>>>>>>>>
>>>>>>>>>>>> This is very similar to  this post from 2014
>>>>>>>>>>>> <
>>>>>>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-lock-Error-td18233.html
>>>>>>>>>>>> >
>>>>>>>>>>>> , but unlike that person I am using Python 2.7.8.
>>>>>>>>>>>>
>>>>>>>>>>>> Here is what I'm using:
>>>>>>>>>>>> Spark 1.3.1
>>>>>>>>>>>> Hadoop 2.4.0.2.1.5.0-695
>>>>>>>>>>>> Python 2.7.8
>>>>>>>>>>>>
>>>>>>>>>>>> Another clue:  I also installed Spark 1.6.0 and tried to submit
>>>>>>>>>>>> the same
>>>>>>>>>>>> job.  I got a similar error:
>>>>>>>>>>>>
>>>>>>>>>>>> /Traceback (most recent call last):
>>>>>>>>>>>>   File "loss_rate_by_probe.py", line 15, in ?
>>>>>>>>>>>>     from pyspark import SparkContext
>>>>>>>>>>>>   File
>>>>>>>>>>>>
>>>>>>>>>>>> "/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0119/container_1450370639491_0119_01_000001/pyspark.zip/pyspark/__init__.py",
>>>>>>>>>>>> line 61
>>>>>>>>>>>>     indent = ' ' * (min(len(m) for m in indents) if indents
>>>>>>>>>>>> else 0)
>>>>>>>>>>>>                                                   ^
>>>>>>>>>>>> SyntaxError: invalid syntax/
>>>>>>>>>>>>
>>>>>>>>>>>> Any thoughts?
>>>>>>>>>>>>
>>>>>>>>>>>> Andrew
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> View this message in context:
>>>>>>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-SyntaxError-invalid-syntax-tp25910.html
>>>>>>>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>>>>>>>> Nabble.com.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>>>>>>>>>> For additional commands, e-mail: user-help@spark.apache.org
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: SparkContext SyntaxError: invalid syntax

Posted by Andrew Weiner <an...@u.northwestern.edu>.
Actually, I just found this [
https://issues.apache.org/jira/browse/SPARK-1680], which after a bit of
googling and reading leads me to believe that the preferred way to change
the yarn environment is to edit the spark-defaults.conf file by adding this
line:
spark.yarn.appMasterEnv.PYSPARK_PYTHON    /path/to/python

While both this solution and the solution from my prior email work, I
believe this is the preferred solution.
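
If it helps anyone else, the same property should also be settable per job on the spark-submit command line instead of editing spark-defaults.conf (the python path below is only illustrative):

./bin/spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/local/bin/python2.7 \
    ./examples/src/main/python/pi.py 10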

Sorry for the flurry of emails.  Again, thanks for all the help!

Andrew

On Fri, Jan 15, 2016 at 1:47 PM, Andrew Weiner <
andrewweiner2020@u.northwestern.edu> wrote:

> I finally got the pi.py example to run in yarn cluster mode.  This was the
> key insight:
> https://issues.apache.org/jira/browse/SPARK-9229
>
> I had to set SPARK_YARN_USER_ENV in spark-env.sh:
> export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/home/aqualab/local/bin/python"
>
> This caused the PYSPARK_PYTHON environment variable to be used in my yarn
> environment in cluster mode.
>
> Thank you for all your help!
>
> Best,
> Andrew
>
>
>
> On Fri, Jan 15, 2016 at 12:57 PM, Andrew Weiner <
> andrewweiner2020@u.northwestern.edu> wrote:
>
>> I tried playing around with my environment variables, and here is an
>> update.
>>
>> When I run in cluster mode, my environment variables do not persist
>> throughout the entire job.
>> For example, I tried creating a local copy of HADOOP_CONF_DIR in
>> /home/<username>/local/etc/hadoop/conf, and then, in spark-env.sh I set the
>> variable:
>> export HADOOP_CONF_DIR=/home/<username>/local/etc/hadoop/conf
>>
>> Later, when we print the environment variables in the python code, I see
>> this:
>>
>> ('HADOOP_CONF_DIR', '/etc/hadoop/conf')
>>
>> However, when I run in client mode, I see this:
>>
>> ('HADOOP_CONF_DIR', '/home/awp066/local/etc/hadoop/conf')
>>
>> Furthermore, if I omit that environment variable from spark-env.sh altogether, I get the expected error in both client and cluster mode:
>>
>> When running with master 'yarn'
>> either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
>>
>> This suggests that my environment variables are being used when I first submit the job, but at some point during the job, my environment variables are thrown out and someone's (yarn's?) environment variables are being used.
>>
>> Andrew
>>
>>
>> On Fri, Jan 15, 2016 at 11:03 AM, Andrew Weiner <
>> andrewweiner2020@u.northwestern.edu> wrote:
>>
>>> Indeed!  Here is the output when I run in cluster mode:
>>>
>>> Traceback (most recent call last):
>>>   File "pi.py", line 22, in ?
>>>     raise RuntimeError("\n"+str(sys.version_info) +"\n"+
>>> RuntimeError:
>>> (2, 4, 3, 'final', 0)
>>> [('PYSPARK_GATEWAY_PORT', '48079'), ('PYTHONPATH', '/scratch2/hadoop/yarn/local/usercache/<username>/filecache/116/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/<user>/spark-1.6.0-bin-hadoop2.4/python:/home/<username>/code/libs:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0239/container_1450370639491_0239_01_000001/pyspark.zip:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0239/container_1450370639491_0239_01_000001/py4j-0.9-src.zip'), ('PYTHONUNBUFFERED', 'YES')]
>>>
>>> As we suspected, it is using Python 2.4
>>>
>>> One thing that surprises me is that PYSPARK_PYTHON is not showing up in the list, even though I am setting it and exporting it in spark-submit *and* in spark-env.sh.  Is there somewhere else I need to set this variable?  Maybe in one of the hadoop conf files in my HADOOP_CONF_DIR?
>>>
>>> Andrew
>>>
>>>
>>>
>>> On Thu, Jan 14, 2016 at 1:14 PM, Bryan Cutler <cu...@gmail.com> wrote:
>>>
>>>> It seems like it could be the case that some other Python version is
>>>> being invoked.  To make sure, can you add something like this to the top of
>>>> the .py file you are submitting to get some more info about how the
>>>> application master is configured?
>>>>
>>>> import sys, os
>>>> raise RuntimeError("\n"+str(sys.version_info) +"\n"+
>>>>     str([(k,os.environ[k]) for k in os.environ if "PY" in k]))
>>>>
>>>> On Thu, Jan 14, 2016 at 8:37 AM, Andrew Weiner <
>>>> andrewweiner2020@u.northwestern.edu> wrote:
>>>>
>>>>> Hi Bryan,
>>>>>
>>>>> I ran "$> python --version" on every node on the cluster, and it is
>>>>> Python 2.7.8 for every single one.
>>>>>
>>>>> When I try to submit the Python example in client mode
>>>>> * ./bin/spark-submit      --master yarn     --deploy-mode client
>>>>> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
>>>>> ./examples/src/main/python/pi.py     10*
>>>>> That's when I get this error that I mentioned:
>>>>>
>>>>> 16/01/14 10:09:10 WARN scheduler.TaskSetManager: Lost task 0.0 in
>>>>> stage 0.0 (TID 0, mundonovo-priv): org.apache.spark.SparkException:
>>>>> Error from python worker:
>>>>>   python: module pyspark.daemon not found
>>>>> PYTHONPATH was:
>>>>>
>>>>> /scratch5/hadoop/yarn/local/usercache/<username>/filecache/48/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/aqualab/spark-1.6.0-bin-hadoop2.4/python:/home/jpr123/hg.pacific/python-common:/home/jp
>>>>>
>>>>> r123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/home/<username>/code/libs:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0187/container_1450370639491_0187_01_000002/pyspark.zip:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0187/container_1450370639491_0187_01_000002/py4j-0.9-src.zip
>>>>> java.io.EOFException
>>>>>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>>>>>         at
>>>>> org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:164)
>>>>>         at [....]
>>>>>
>>>>> followed by several more similar errors that also say:
>>>>> Error from python worker:
>>>>>   python: module pyspark.daemon not found
>>>>>
>>>>>
>>>>> Even though the default python appeared to be correct, I just went
>>>>> ahead and explicitly set PYSPARK_PYTHON in conf/spark-env.sh to the path of
>>>>> the default python binary executable.  After making this change I was able
>>>>> to run the job successfully in client mode!  That is, this appeared to fix the
>>>>> "pyspark.daemon not found" error when running in client mode.
>>>>>
>>>>> However, when running in cluster mode, I am still getting the same
>>>>> syntax error:
>>>>>
>>>>> Traceback (most recent call last):
>>>>>   File "pi.py", line 24, in ?
>>>>>     from pyspark import SparkContext
>>>>>   File "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py", line 61
>>>>>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>>>>>                                                   ^
>>>>> SyntaxError: invalid syntax
>>>>>
>>>>> Is it possible that the PYSPARK_PYTHON environment variable is ignored when jobs are submitted in cluster mode?  It seems that Spark or Yarn is going behind my back, so to speak, and using some older version of python I didn't even know was installed.
>>>>>
>>>>> Thanks again for all your help thus far.  We are getting close....
>>>>>
>>>>> Andrew
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Jan 13, 2016 at 6:13 PM, Bryan Cutler <cu...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Andrew,
>>>>>>
>>>>>> There are a couple of things to check.  First, is Python 2.7 the
>>>>>> default version on all nodes in the cluster or is it an alternate install?
>>>>>> Meaning, what is the output of the command "$> python --version"?  If it is
>>>>>> an alternate install, you could set the environment variable
>>>>>> "PYSPARK_PYTHON" to the Python binary executable to use for PySpark in both
>>>>>> driver and workers (default is python).
>>>>>>
>>>>>> Did you try to submit the Python example under client mode?
>>>>>> Otherwise, the command looks fine; you don't use the --class option when
>>>>>> submitting python files
>>>>>> * ./bin/spark-submit      --master yarn     --deploy-mode client
>>>>>> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
>>>>>> ./examples/src/main/python/pi.py     10*
>>>>>>
>>>>>> That is a good sign that local jobs and Java examples work, probably
>>>>>> just a small configuration issue :)
>>>>>>
>>>>>> Bryan
>>>>>>
>>>>>> On Wed, Jan 13, 2016 at 3:51 PM, Andrew Weiner <
>>>>>> andrewweiner2020@u.northwestern.edu> wrote:
>>>>>>
>>>>>>> Thanks for your continuing help.  Here is some additional info.
>>>>>>>
>>>>>>> *OS/architecture*
>>>>>>> output of *cat /proc/version*:
>>>>>>> Linux version 2.6.18-400.1.1.el5 (
>>>>>>> mockbuild@x86-012.build.bos.redhat.com)
>>>>>>>
>>>>>>> output of *lsb_release -a*:
>>>>>>> LSB Version:
>>>>>>>  :core-4.0-amd64:core-4.0-ia32:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-ia32:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-ia32:printing-4.0-noarch
>>>>>>> Distributor ID: RedHatEnterpriseServer
>>>>>>> Description:    Red Hat Enterprise Linux Server release 5.11
>>>>>>> (Tikanga)
>>>>>>> Release:        5.11
>>>>>>> Codename:       Tikanga
>>>>>>>
>>>>>>> *Running a local job*
>>>>>>> I have confirmed that I can successfully run python jobs using
>>>>>>> bin/spark-submit --master local[*]
>>>>>>> Specifically, this is the command I am using:
>>>>>>> *./bin/spark-submit --master local[8]
>>>>>>> ./examples/src/main/python/wordcount.py
>>>>>>> file:/home/<username>/spark-1.6.0-bin-hadoop2.4/README.md*
>>>>>>> And it works!
>>>>>>>
>>>>>>> *Additional info*
>>>>>>> I am also able to successfully run the Java SparkPi example using
>>>>>>> yarn in cluster mode using this command:
>>>>>>> * ./bin/spark-submit --class org.apache.spark.examples.SparkPi
>>>>>>> --master yarn     --deploy-mode cluster     --driver-memory 4g
>>>>>>> --executor-memory 2g     --executor-cores 1     lib/spark-examples*.jar
>>>>>>> 10*
>>>>>>> This Java job also runs successfully when I change --deploy-mode to
>>>>>>> client.  The fact that I can run Java jobs in cluster mode makes me think
>>>>>>> that everything is installed correctly--is that a valid assumption?
>>>>>>>
>>>>>>> The problem remains that I cannot submit python jobs.  Here is the
>>>>>>> command that I am using to try to submit python jobs:
>>>>>>> * ./bin/spark-submit      --master yarn     --deploy-mode cluster
>>>>>>>   --driver-memory 4g     --executor-memory 2g     --executor-cores 1
>>>>>>> ./examples/src/main/python/pi.py     10*
>>>>>>> Does that look like a correct command?  I wasn't sure what to put
>>>>>>> for --class so I omitted it.  At any rate, the result of the above command
>>>>>>> is a syntax error, similar to the one I posted in the original email:
>>>>>>>
>>>>>>> Traceback (most recent call last):
>>>>>>>   File "pi.py", line 24, in ?
>>>>>>>     from pyspark import SparkContext
>>>>>>>   File "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py", line 61
>>>>>>>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>>>>>>>                                                   ^
>>>>>>> SyntaxError: invalid syntax
>>>>>>>
>>>>>>>
>>>>>>> This really looks to me like a problem with the python version.
>>>>>>> Python 2.4 would throw this syntax error but Python 2.7 would not.  And yet
>>>>>>> I am using Python 2.7.8.  Is there any chance that Spark or Yarn is somehow
>>>>>>> using an older version of Python without my knowledge?
>>>>>>>
>>>>>>> Finally, when I try to run the same command in client mode...
>>>>>>> * ./bin/spark-submit      --master yarn     --deploy-mode client
>>>>>>> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
>>>>>>> ./examples/src/main/python/pi.py 10*
>>>>>>> I get the error I mentioned in the prior email:
>>>>>>> Error from python worker:
>>>>>>>   python: module pyspark.daemon not found
>>>>>>>
>>>>>>> Any thoughts?
>>>>>>>
>>>>>>> Best,
>>>>>>> Andrew
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jan 11, 2016 at 12:25 PM, Bryan Cutler <cu...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> This could be an environment issue, could you give more details
>>>>>>>> about the OS/architecture that you are using?  If you are sure everything
>>>>>>>> is installed correctly on each node following the guide on "Running Spark
>>>>>>>> on Yarn" http://spark.apache.org/docs/latest/running-on-yarn.html
>>>>>>>> and that the spark assembly jar is reachable, then I would check to see if
>>>>>>>> you can submit a local job to just run on one node.
>>>>>>>>
>>>>>>>> On Fri, Jan 8, 2016 at 5:22 PM, Andrew Weiner <
>>>>>>>> andrewweiner2020@u.northwestern.edu> wrote:
>>>>>>>>
>>>>>>>>> Now for simplicity I'm testing with wordcount.py from the provided
>>>>>>>>> examples, and using Spark 1.6.0
>>>>>>>>>
>>>>>>>>> The first error I get is:
>>>>>>>>>
>>>>>>>>> 16/01/08 19:14:46 ERROR lzo.GPLNativeCodeLoader: Could not load
>>>>>>>>> native gpl library
>>>>>>>>> java.lang.UnsatisfiedLinkError: no gplcompression in
>>>>>>>>> java.library.path
>>>>>>>>>         at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1864)
>>>>>>>>>         at [....]
>>>>>>>>>
>>>>>>>>> A bit lower down, I see this error:
>>>>>>>>>
>>>>>>>>> 16/01/08 19:14:48 WARN scheduler.TaskSetManager: Lost task 0.0 in
>>>>>>>>> stage 0.0 (TID 0, mundonovo-priv): org.apache.spark.SparkException:
>>>>>>>>> Error from python worker:
>>>>>>>>>   python: module pyspark.daemon not found
>>>>>>>>> PYTHONPATH was:
>>>>>>>>>
>>>>>>>>> /scratch5/hadoop/yarn/local/usercache/<username>/filecache/22/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/jpr123/hg.pacific/python-common:/home/jpr123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/pyspark.zip:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/py4j-0.9-src.zip
>>>>>>>>> java.io.EOFException
>>>>>>>>>         at
>>>>>>>>> java.io.DataInputStream.readInt(DataInputStream.java:392)
>>>>>>>>>         at [....]
>>>>>>>>>
>>>>>>>>> And then a few more similar pyspark.daemon not found errors...
>>>>>>>>>
>>>>>>>>> Andrew
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Jan 8, 2016 at 2:31 PM, Bryan Cutler <cu...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Andrew,
>>>>>>>>>>
>>>>>>>>>> I know that older versions of Spark could not run PySpark on YARN
>>>>>>>>>> in cluster mode.  I'm not sure if that is fixed in 1.6.0 though.  Can you
>>>>>>>>>> try setting deploy-mode option to "client" when calling spark-submit?
>>>>>>>>>>
>>>>>>>>>> Bryan
>>>>>>>>>>
>>>>>>>>>> On Thu, Jan 7, 2016 at 2:39 PM, weineran <
>>>>>>>>>> andrewweiner2020@u.northwestern.edu> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> When I try to submit a python job using spark-submit (using
>>>>>>>>>>> --master yarn
>>>>>>>>>>> --deploy-mode cluster), I get the following error:
>>>>>>>>>>>
>>>>>>>>>>> /Traceback (most recent call last):
>>>>>>>>>>>   File "loss_rate_by_probe.py", line 15, in ?
>>>>>>>>>>>     from pyspark import SparkContext
>>>>>>>>>>>   File
>>>>>>>>>>>
>>>>>>>>>>> "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/__init__.py",
>>>>>>>>>>> line 41, in ?
>>>>>>>>>>>   File
>>>>>>>>>>>
>>>>>>>>>>> "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/context.py",
>>>>>>>>>>> line 219
>>>>>>>>>>>     with SparkContext._lock:
>>>>>>>>>>>                     ^
>>>>>>>>>>> SyntaxError: invalid syntax/
>>>>>>>>>>>
>>>>>>>>>>> This is very similar to  this post from 2014
>>>>>>>>>>> <
>>>>>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-lock-Error-td18233.html
>>>>>>>>>>> >
>>>>>>>>>>> , but unlike that person I am using Python 2.7.8.
>>>>>>>>>>>
>>>>>>>>>>> Here is what I'm using:
>>>>>>>>>>> Spark 1.3.1
>>>>>>>>>>> Hadoop 2.4.0.2.1.5.0-695
>>>>>>>>>>> Python 2.7.8
>>>>>>>>>>>
>>>>>>>>>>> Another clue:  I also installed Spark 1.6.0 and tried to submit
>>>>>>>>>>> the same
>>>>>>>>>>> job.  I got a similar error:
>>>>>>>>>>>
>>>>>>>>>>> /Traceback (most recent call last):
>>>>>>>>>>>   File "loss_rate_by_probe.py", line 15, in ?
>>>>>>>>>>>     from pyspark import SparkContext
>>>>>>>>>>>   File
>>>>>>>>>>>
>>>>>>>>>>> "/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0119/container_1450370639491_0119_01_000001/pyspark.zip/pyspark/__init__.py",
>>>>>>>>>>> line 61
>>>>>>>>>>>     indent = ' ' * (min(len(m) for m in indents) if indents else
>>>>>>>>>>> 0)
>>>>>>>>>>>                                                   ^
>>>>>>>>>>> SyntaxError: invalid syntax/
>>>>>>>>>>>
>>>>>>>>>>> Any thoughts?
>>>>>>>>>>>
>>>>>>>>>>> Andrew
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> View this message in context:
>>>>>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-SyntaxError-invalid-syntax-tp25910.html
>>>>>>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>>>>>>> Nabble.com.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>>>>>>>>> For additional commands, e-mail: user-help@spark.apache.org
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: SparkContext SyntaxError: invalid syntax

Posted by Andrew Weiner <an...@u.northwestern.edu>.
I finally got the pi.py example to run in yarn cluster mode.  This was the
key insight:
https://issues.apache.org/jira/browse/SPARK-9229

I had to set SPARK_YARN_USER_ENV in spark-env.sh:
export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/home/aqualab/local/bin/python"

This caused the PYSPARK_PYTHON environment variable to be used in my yarn
environment in cluster mode.
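
As a sanity check, a tiny job like the one below can confirm which interpreter the driver and the executors actually run (this is just a sketch, not one of the bundled examples; in cluster mode the driver output lands in the YARN container logs):

# check_python_version.py -- hypothetical helper script
import sys
from pyspark import SparkContext

sc = SparkContext(appName="python-version-check")
# interpreter used by the driver (the YARN Application Master in cluster mode)
print("driver:    %s" % str(sys.version_info))
# interpreters used by the executors
print("executors: %s" % str(sc.parallelize(range(4), 4)
                              .map(lambda _: str(sys.version_info))
                              .distinct()
                              .collect()))
sc.stop()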

Thank you for all your help!

Best,
Andrew



On Fri, Jan 15, 2016 at 12:57 PM, Andrew Weiner <
andrewweiner2020@u.northwestern.edu> wrote:

> I tried playing around with my environment variables, and here is an
> update.
>
> When I run in cluster mode, my environment variables do not persist
> throughout the entire job.
> For example, I tried creating a local copy of HADOOP_CONF_DIR in
> /home/<username>/local/etc/hadoop/conf, and then, in spark-env.sh I set the
> variable:
> export HADOOP_CONF_DIR=/home/<username>/local/etc/hadoop/conf
>
> Later, when we print the environment variables in the python code, I see
> this:
>
> ('HADOOP_CONF_DIR', '/etc/hadoop/conf')
>
> However, when I run in client mode, I see this:
>
> ('HADOOP_CONF_DIR', '/home/awp066/local/etc/hadoop/conf')
>
> Furthermore, if I omit that environment variable from spark-env.sh altogether, I get the expected error in both client and cluster mode:
>
> When running with master 'yarn'
> either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
>
> This suggests that my environment variables are being used when I first submit the job, but at some point during the job, my environment variables are thrown out and someone's (yarn's?) environment variables are being used.
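>
> A sketch of the kind of check I mean (the variable names below are just the
> ones relevant here, and the printout format mirrors what I pasted above):
>
> import os
> for k in ("HADOOP_CONF_DIR", "PYSPARK_PYTHON", "PYTHONPATH"):
>     print((k, os.environ.get(k)))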
>
> Andrew
>
>
> On Fri, Jan 15, 2016 at 11:03 AM, Andrew Weiner <
> andrewweiner2020@u.northwestern.edu> wrote:
>
>> Indeed!  Here is the output when I run in cluster mode:
>>
>> Traceback (most recent call last):
>>   File "pi.py", line 22, in ?
>>     raise RuntimeError("\n"+str(sys.version_info) +"\n"+
>> RuntimeError:
>> (2, 4, 3, 'final', 0)
>> [('PYSPARK_GATEWAY_PORT', '48079'), ('PYTHONPATH', '/scratch2/hadoop/yarn/local/usercache/<username>/filecache/116/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/<user>/spark-1.6.0-bin-hadoop2.4/python:/home/<username>/code/libs:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0239/container_1450370639491_0239_01_000001/pyspark.zip:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0239/container_1450370639491_0239_01_000001/py4j-0.9-src.zip'), ('PYTHONUNBUFFERED', 'YES')]
>>
>> As we suspected, it is using Python 2.4
>>
>> One thing that surprises me is that PYSPARK_PYTHON is not showing up in the list, even though I am setting it and exporting it in spark-submit *and* in spark-env.sh.  Is there somewhere else I need to set this variable?  Maybe in one of the hadoop conf files in my HADOOP_CONF_DIR?
>>
>> Andrew
>>
>>
>>
>> On Thu, Jan 14, 2016 at 1:14 PM, Bryan Cutler <cu...@gmail.com> wrote:
>>
>>> It seems like it could be the case that some other Python version is
>>> being invoked.  To make sure, can you add something like this to the top of
>>> the .py file you are submitting to get some more info about how the
>>> application master is configured?
>>>
>>> import sys, os
>>> raise RuntimeError("\n"+str(sys.version_info) +"\n"+
>>>     str([(k,os.environ[k]) for k in os.environ if "PY" in k]))
>>>
>>> On Thu, Jan 14, 2016 at 8:37 AM, Andrew Weiner <
>>> andrewweiner2020@u.northwestern.edu> wrote:
>>>
>>>> Hi Bryan,
>>>>
>>>> I ran "$> python --version" on every node on the cluster, and it is
>>>> Python 2.7.8 for every single one.
>>>>
>>>> When I try to submit the Python example in client mode
>>>> * ./bin/spark-submit      --master yarn     --deploy-mode client
>>>> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
>>>> ./examples/src/main/python/pi.py     10*
>>>> That's when I get this error that I mentioned:
>>>>
>>>> 16/01/14 10:09:10 WARN scheduler.TaskSetManager: Lost task 0.0 in stage
>>>> 0.0 (TID 0, mundonovo-priv): org.apache.spark.SparkException:
>>>> Error from python worker:
>>>>   python: module pyspark.daemon not found
>>>> PYTHONPATH was:
>>>>
>>>> /scratch5/hadoop/yarn/local/usercache/<username>/filecache/48/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/aqualab/spark-1.6.0-bin-hadoop2.4/python:/home/jpr123/hg.pacific/python-common:/home/jp
>>>>
>>>> r123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/home/<username>/code/libs:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0187/container_1450370639491_0187_01_000002/pyspark.zip:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0187/container_1450370639491_0187_01_000002/py4j-0.9-src.zip
>>>> java.io.EOFException
>>>>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>>>>         at
>>>> org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:164)
>>>>         at [....]
>>>>
>>>> followed by several more similar errors that also say:
>>>> Error from python worker:
>>>>   python: module pyspark.daemon not found
>>>>
>>>>
>>>> Even though the default python appeared to be correct, I just went
>>>> ahead and explicitly set PYSPARK_PYTHON in conf/spark-env.sh to the path of
>>>> the default python binary executable.  After making this change I was able
>>>> to run the job successfully in client!  That is, this appeared to fix the
>>>> "pyspark.daemon not found" error when running in client mode.
>>>>
>>>> However, when running in cluster mode, I am still getting the same
>>>> syntax error:
>>>>
>>>> Traceback (most recent call last):
>>>>   File "pi.py", line 24, in ?
>>>>     from pyspark import SparkContext
>>>>   File "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py", line 61
>>>>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>>>>                                                   ^
>>>> SyntaxError: invalid syntax
>>>>
>>>> Is it possible that the PYSPARK_PYTHON environment variable is ignored when jobs are submitted in cluster mode?  It seems that Spark or Yarn is going behind my back, so to speak, and using some older version of python I didn't even know was installed.
>>>>
>>>> Thanks again for all your help thus far.  We are getting close....
>>>>
>>>> Andrew
>>>>
>>>>
>>>>
>>>> On Wed, Jan 13, 2016 at 6:13 PM, Bryan Cutler <cu...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Andrew,
>>>>>
>>>>> There are a couple of things to check.  First, is Python 2.7 the
>>>>> default version on all nodes in the cluster or is it an alternate install?
>>>>> Meaning what is the output of this command "$>  python --version"  If it is
>>>>> an alternate install, you could set the environment variable "
>>>>> PYSPARK_PYTHON" Python binary executable to use for PySpark in both
>>>>> driver and workers (default is python).
>>>>>
>>>>> Did you try to submit the Python example under client mode?
>>>>> Otherwise, the command looks fine, you don't use the --class option for
>>>>> submitting python files
>>>>> * ./bin/spark-submit      --master yarn     --deploy-mode client
>>>>> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
>>>>> ./examples/src/main/python/pi.py     10*
>>>>>
>>>>> That is a good sign that local jobs and Java examples work, probably
>>>>> just a small configuration issue :)
>>>>>
>>>>> Bryan
>>>>>
>>>>> On Wed, Jan 13, 2016 at 3:51 PM, Andrew Weiner <
>>>>> andrewweiner2020@u.northwestern.edu> wrote:
>>>>>
>>>>>> Thanks for your continuing help.  Here is some additional info.
>>>>>>
>>>>>> *OS/architecture*
>>>>>> output of *cat /proc/version*:
>>>>>> Linux version 2.6.18-400.1.1.el5 (
>>>>>> mockbuild@x86-012.build.bos.redhat.com)
>>>>>>
>>>>>> output of *lsb_release -a*:
>>>>>> LSB Version:
>>>>>>  :core-4.0-amd64:core-4.0-ia32:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-ia32:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-ia32:printing-4.0-noarch
>>>>>> Distributor ID: RedHatEnterpriseServer
>>>>>> Description:    Red Hat Enterprise Linux Server release 5.11 (Tikanga)
>>>>>> Release:        5.11
>>>>>> Codename:       Tikanga
>>>>>>
>>>>>> *Running a local job*
>>>>>> I have confirmed that I can successfully run python jobs using
>>>>>> bin/spark-submit --master local[*]
>>>>>> Specifically, this is the command I am using:
>>>>>> *./bin/spark-submit --master local[8]
>>>>>> ./examples/src/main/python/wordcount.py
>>>>>> file:/home/<username>/spark-1.6.0-bin-hadoop2.4/README.md*
>>>>>> And it works!
>>>>>>
>>>>>> *Additional info*
>>>>>> I am also able to successfully run the Java SparkPi example using
>>>>>> yarn in cluster mode using this command:
>>>>>> * ./bin/spark-submit --class org.apache.spark.examples.SparkPi
>>>>>> --master yarn     --deploy-mode cluster     --driver-memory 4g
>>>>>> --executor-memory 2g     --executor-cores 1     lib/spark-examples*.jar
>>>>>> 10*
>>>>>> This Java job also runs successfully when I change --deploy-mode to
>>>>>> client.  The fact that I can run Java jobs in cluster mode makes me think
>>>>>> that everything is installed correctly--is that a valid assumption?
>>>>>>
>>>>>> The problem remains that I cannot submit python jobs.  Here is the
>>>>>> command that I am using to try to submit python jobs:
>>>>>> * ./bin/spark-submit      --master yarn     --deploy-mode cluster
>>>>>> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
>>>>>> ./examples/src/main/python/pi.py     10*
>>>>>> Does that look like a correct command?  I wasn't sure what to put for
>>>>>> --class so I omitted it.  At any rate, the result of the above command is a
>>>>>> syntax error, similar to the one I posted in the original email:
>>>>>>
>>>>>> Traceback (most recent call last):
>>>>>>   File "pi.py", line 24, in ?
>>>>>>     from pyspark import SparkContext
>>>>>>   File "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py", line 61
>>>>>>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>>>>>>                                                   ^
>>>>>> SyntaxError: invalid syntax
>>>>>>
>>>>>>
>>>>>> This really looks to me like a problem with the python version.
>>>>>> Python 2.4 would throw this syntax error but Python 2.7 would not.  And yet
>>>>>> I am using Python 2.7.8.  Is there any chance that Spark or Yarn is somehow
>>>>>> using an older version of Python without my knowledge?
>>>>>>
>>>>>> Finally, when I try to run the same command in client mode...
>>>>>> * ./bin/spark-submit      --master yarn     --deploy-mode client
>>>>>> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
>>>>>> ./examples/src/main/python/pi.py 10*
>>>>>> I get the error I mentioned in the prior email:
>>>>>> Error from python worker:
>>>>>>   python: module pyspark.daemon not found
>>>>>>
>>>>>> Any thoughts?
>>>>>>
>>>>>> Best,
>>>>>> Andrew
>>>>>>
>>>>>>
>>>>>> On Mon, Jan 11, 2016 at 12:25 PM, Bryan Cutler <cu...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> This could be an environment issue, could you give more details
>>>>>>> about the OS/architecture that you are using?  If you are sure everything
>>>>>>> is installed correctly on each node following the guide on "Running Spark
>>>>>>> on Yarn" http://spark.apache.org/docs/latest/running-on-yarn.html
>>>>>>> and that the spark assembly jar is reachable, then I would check to see if
>>>>>>> you can submit a local job to just run on one node.
>>>>>>>
>>>>>>> On Fri, Jan 8, 2016 at 5:22 PM, Andrew Weiner <
>>>>>>> andrewweiner2020@u.northwestern.edu> wrote:
>>>>>>>
>>>>>>>> Now for simplicity I'm testing with wordcount.py from the provided
>>>>>>>> examples, and using Spark 1.6.0
>>>>>>>>
>>>>>>>> The first error I get is:
>>>>>>>>
>>>>>>>> 16/01/08 19:14:46 ERROR lzo.GPLNativeCodeLoader: Could not load
>>>>>>>> native gpl library
>>>>>>>> java.lang.UnsatisfiedLinkError: no gplcompression in
>>>>>>>> java.library.path
>>>>>>>>         at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1864)
>>>>>>>>         at [....]
>>>>>>>>
>>>>>>>> A bit lower down, I see this error:
>>>>>>>>
>>>>>>>> 16/01/08 19:14:48 WARN scheduler.TaskSetManager: Lost task 0.0 in
>>>>>>>> stage 0.0 (TID 0, mundonovo-priv): org.apache.spark.SparkException:
>>>>>>>> Error from python worker:
>>>>>>>>   python: module pyspark.daemon not found
>>>>>>>> PYTHONPATH was:
>>>>>>>>
>>>>>>>> /scratch5/hadoop/yarn/local/usercache/<username>/filecache/22/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/jpr123/hg.pacific/python-common:/home/jpr123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/pyspark.zip:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/py4j-0.9-src.zip
>>>>>>>> java.io.EOFException
>>>>>>>>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>>>>>>>>         at [....]
>>>>>>>>
>>>>>>>> And then a few more similar pyspark.daemon not found errors...
>>>>>>>>
>>>>>>>> Andrew
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Jan 8, 2016 at 2:31 PM, Bryan Cutler <cu...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Andrew,
>>>>>>>>>
>>>>>>>>> I know that older versions of Spark could not run PySpark on YARN
>>>>>>>>> in cluster mode.  I'm not sure if that is fixed in 1.6.0 though.  Can you
>>>>>>>>> try setting deploy-mode option to "client" when calling spark-submit?
>>>>>>>>>
>>>>>>>>> Bryan
>>>>>>>>>
>>>>>>>>> On Thu, Jan 7, 2016 at 2:39 PM, weineran <
>>>>>>>>> andrewweiner2020@u.northwestern.edu> wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> When I try to submit a python job using spark-submit (using
>>>>>>>>>> --master yarn
>>>>>>>>>> --deploy-mode cluster), I get the following error:
>>>>>>>>>>
>>>>>>>>>> /Traceback (most recent call last):
>>>>>>>>>>   File "loss_rate_by_probe.py", line 15, in ?
>>>>>>>>>>     from pyspark import SparkContext
>>>>>>>>>>   File
>>>>>>>>>>
>>>>>>>>>> "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/__init__.py",
>>>>>>>>>> line 41, in ?
>>>>>>>>>>   File
>>>>>>>>>>
>>>>>>>>>> "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/context.py",
>>>>>>>>>> line 219
>>>>>>>>>>     with SparkContext._lock:
>>>>>>>>>>                     ^
>>>>>>>>>> SyntaxError: invalid syntax/
>>>>>>>>>>
>>>>>>>>>> This is very similar to  this post from 2014
>>>>>>>>>> <
>>>>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-lock-Error-td18233.html
>>>>>>>>>> >
>>>>>>>>>> , but unlike that person I am using Python 2.7.8.
>>>>>>>>>>
>>>>>>>>>> Here is what I'm using:
>>>>>>>>>> Spark 1.3.1
>>>>>>>>>> Hadoop 2.4.0.2.1.5.0-695
>>>>>>>>>> Python 2.7.8
>>>>>>>>>>
>>>>>>>>>> Another clue:  I also installed Spark 1.6.0 and tried to submit
>>>>>>>>>> the same
>>>>>>>>>> job.  I got a similar error:
>>>>>>>>>>
>>>>>>>>>> /Traceback (most recent call last):
>>>>>>>>>>   File "loss_rate_by_probe.py", line 15, in ?
>>>>>>>>>>     from pyspark import SparkContext
>>>>>>>>>>   File
>>>>>>>>>>
>>>>>>>>>> "/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0119/container_1450370639491_0119_01_000001/pyspark.zip/pyspark/__init__.py",
>>>>>>>>>> line 61
>>>>>>>>>>     indent = ' ' * (min(len(m) for m in indents) if indents else
>>>>>>>>>> 0)
>>>>>>>>>>                                                   ^
>>>>>>>>>> SyntaxError: invalid syntax/
>>>>>>>>>>
>>>>>>>>>> Any thoughts?
>>>>>>>>>>
>>>>>>>>>> Andrew
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> View this message in context:
>>>>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-SyntaxError-invalid-syntax-tp25910.html
>>>>>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>>>>>> Nabble.com.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>>>>>>>> For additional commands, e-mail: user-help@spark.apache.org
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: SparkContext SyntaxError: invalid syntax

Posted by Andrew Weiner <an...@u.northwestern.edu>.
I tried playing around with my environment variables, and here is an update.

When I run in cluster mode, my environment variables do not persist
throughout the entire job.
For example, I tried creating a local copy of HADOOP_CONF_DIR in
/home/<username>/local/etc/hadoop/conf, and then, in spark-env.sh I set the
variable:
export HADOOP_CONF_DIR=/home/<username>/local/etc/hadoop/conf

Later, when we print the environment variables in the python code, I see
this:

('HADOOP_CONF_DIR', '/etc/hadoop/conf')

However, when I run in client mode, I see this:

('HADOOP_CONF_DIR', '/home/awp066/local/etc/hadoop/conf')

Furthermore, if I omit that environment variable from spark-env.sh
altogether, I get the expected error in both client and cluster mode:

When running with master 'yarn'
either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.

This suggests that my environment variables are being used when I
first submit the job, but at some point during the job, my environment
variables are thrown out and someone's (yarn's?) environment variables
are being used.
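
(For reference, the check inside the python code is only a few lines --
roughly this sketch; each print comes out as a (name, value) tuple, which
is why the output above looks the way it does:)

import os
# printed from inside the running PySpark job on the cluster
for key in ("HADOOP_CONF_DIR", "PYSPARK_PYTHON", "PYTHONPATH"):
    print (key, os.environ.get(key))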

Andrew


On Fri, Jan 15, 2016 at 11:03 AM, Andrew Weiner <
andrewweiner2020@u.northwestern.edu> wrote:

> Indeed!  Here is the output when I run in cluster mode:
>
> Traceback (most recent call last):
>   File "pi.py", line 22, in ?
>     raise RuntimeError("\n"+str(sys.version_info) +"\n"+
> RuntimeError:
> (2, 4, 3, 'final', 0)
> [('PYSPARK_GATEWAY_PORT', '48079'), ('PYTHONPATH', '/scratch2/hadoop/yarn/local/usercache/<username>/filecache/116/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/<user>/spark-1.6.0-bin-hadoop2.4/python:/home/<username>/code/libs:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0239/container_1450370639491_0239_01_000001/pyspark.zip:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0239/container_1450370639491_0239_01_000001/py4j-0.9-src.zip'), ('PYTHONUNBUFFERED', 'YES')]
>
> As we suspected, it is using Python 2.4
>
> One thing that surprises me is that PYSPARK_PYTHON is not showing up in the list, even though I am setting it and exporting it in spark-submit *and* in spark-env.sh.  Is there somewhere else I need to set this variable?  Maybe in one of the hadoop conf files in my HADOOP_CONF_DIR?
>
> Andrew
>
>
>
> On Thu, Jan 14, 2016 at 1:14 PM, Bryan Cutler <cu...@gmail.com> wrote:
>
>> It seems like it could be the case that some other Python version is
>> being invoked.  To make sure, can you add something like this to the top of
>> the .py file you are submitting to get some more info about how the
>> application master is configured?
>>
>> import sys, os
>> raise RuntimeError("\n"+str(sys.version_info) +"\n"+
>>     str([(k,os.environ[k]) for k in os.environ if "PY" in k]))
>>
>> On Thu, Jan 14, 2016 at 8:37 AM, Andrew Weiner <
>> andrewweiner2020@u.northwestern.edu> wrote:
>>
>>> Hi Bryan,
>>>
>>> I ran "$> python --version" on every node on the cluster, and it is
>>> Python 2.7.8 for every single one.
>>>
>>> When I try to submit the Python example in client mode
>>> * ./bin/spark-submit      --master yarn     --deploy-mode client
>>> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
>>> ./examples/src/main/python/pi.py     10*
>>> That's when I get this error that I mentioned:
>>>
>>> 16/01/14 10:09:10 WARN scheduler.TaskSetManager: Lost task 0.0 in stage
>>> 0.0 (TID 0, mundonovo-priv): org.apache.spark.SparkException:
>>> Error from python worker:
>>>   python: module pyspark.daemon not found
>>> PYTHONPATH was:
>>>
>>> /scratch5/hadoop/yarn/local/usercache/<username>/filecache/48/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/aqualab/spark-1.6.0-bin-hadoop2.4/python:/home/jpr123/hg.pacific/python-common:/home/jp
>>>
>>> r123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/home/<username>/code/libs:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0187/container_1450370639491_0187_01_000002/pyspark.zip:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0187/container_1450370639491_0187_01_000002/py4j-0.9-src.zip
>>> java.io.EOFException
>>>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>>>         at
>>> org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:164)
>>>         at [....]
>>>
>>> followed by several more similar errors that also say:
>>> Error from python worker:
>>>   python: module pyspark.daemon not found
>>>
>>>
>>> Even though the default python appeared to be correct, I just went ahead
>>> and explicitly set PYSPARK_PYTHON in conf/spark-env.sh to the path of the
>>> default python binary executable.  After making this change I was able to
>>> run the job successfully in client!  That is, this appeared to fix the
>>> "pyspark.daemon not found" error when running in client mode.
>>>
>>> However, when running in cluster mode, I am still getting the same
>>> syntax error:
>>>
>>> Traceback (most recent call last):
>>>   File "pi.py", line 24, in ?
>>>     from pyspark import SparkContext
>>>   File "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py", line 61
>>>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>>>                                                   ^
>>> SyntaxError: invalid syntax
>>>
>>> Is it possible that the PYSPARK_PYTHON environment variable is ignored when jobs are submitted in cluster mode?  It seems that Spark or Yarn is going behind my back, so to speak, and using some older version of python I didn't even know was installed.
>>>
>>> Thanks again for all your help thus far.  We are getting close....
>>>
>>> Andrew
>>>
>>>
>>>
>>> On Wed, Jan 13, 2016 at 6:13 PM, Bryan Cutler <cu...@gmail.com> wrote:
>>>
>>>> Hi Andrew,
>>>>
>>>> There are a couple of things to check.  First, is Python 2.7 the
>>>> default version on all nodes in the cluster or is it an alternate install?
>>>> Meaning what is the output of this command "$>  python --version"  If it is
>>>> an alternate install, you could set the environment variable "
>>>> PYSPARK_PYTHON" Python binary executable to use for PySpark in both
>>>> driver and workers (default is python).
>>>>
>>>> Did you try to submit the Python example under client mode?  Otherwise,
>>>> the command looks fine, you don't use the --class option for submitting
>>>> python files
>>>> * ./bin/spark-submit      --master yarn     --deploy-mode client
>>>> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
>>>> ./examples/src/main/python/pi.py     10*
>>>>
>>>> That is a good sign that local jobs and Java examples work, probably
>>>> just a small configuration issue :)
>>>>
>>>> Bryan
>>>>
>>>> On Wed, Jan 13, 2016 at 3:51 PM, Andrew Weiner <
>>>> andrewweiner2020@u.northwestern.edu> wrote:
>>>>
>>>>> Thanks for your continuing help.  Here is some additional info.
>>>>>
>>>>> *OS/architecture*
>>>>> output of *cat /proc/version*:
>>>>> Linux version 2.6.18-400.1.1.el5 (
>>>>> mockbuild@x86-012.build.bos.redhat.com)
>>>>>
>>>>> output of *lsb_release -a*:
>>>>> LSB Version:
>>>>>  :core-4.0-amd64:core-4.0-ia32:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-ia32:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-ia32:printing-4.0-noarch
>>>>> Distributor ID: RedHatEnterpriseServer
>>>>> Description:    Red Hat Enterprise Linux Server release 5.11 (Tikanga)
>>>>> Release:        5.11
>>>>> Codename:       Tikanga
>>>>>
>>>>> *Running a local job*
>>>>> I have confirmed that I can successfully run python jobs using
>>>>> bin/spark-submit --master local[*]
>>>>> Specifically, this is the command I am using:
>>>>> *./bin/spark-submit --master local[8]
>>>>> ./examples/src/main/python/wordcount.py
>>>>> file:/home/<username>/spark-1.6.0-bin-hadoop2.4/README.md*
>>>>> And it works!
>>>>>
>>>>> *Additional info*
>>>>> I am also able to successfully run the Java SparkPi example using yarn
>>>>> in cluster mode using this command:
>>>>> * ./bin/spark-submit --class org.apache.spark.examples.SparkPi
>>>>> --master yarn     --deploy-mode cluster     --driver-memory 4g
>>>>> --executor-memory 2g     --executor-cores 1     lib/spark-examples*.jar
>>>>> 10*
>>>>> This Java job also runs successfully when I change --deploy-mode to
>>>>> client.  The fact that I can run Java jobs in cluster mode makes me think
>>>>> that everything is installed correctly--is that a valid assumption?
>>>>>
>>>>> The problem remains that I cannot submit python jobs.  Here is the
>>>>> command that I am using to try to submit python jobs:
>>>>> * ./bin/spark-submit      --master yarn     --deploy-mode cluster
>>>>> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
>>>>> ./examples/src/main/python/pi.py     10*
>>>>> Does that look like a correct command?  I wasn't sure what to put for
>>>>> --class so I omitted it.  At any rate, the result of the above command is a
>>>>> syntax error, similar to the one I posted in the original email:
>>>>>
>>>>> Traceback (most recent call last):
>>>>>   File "pi.py", line 24, in ?
>>>>>     from pyspark import SparkContext
>>>>>   File "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py", line 61
>>>>>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>>>>>                                                   ^
>>>>> SyntaxError: invalid syntax
>>>>>
>>>>>
>>>>> This really looks to me like a problem with the python version.
>>>>> Python 2.4 would throw this syntax error but Python 2.7 would not.  And yet
>>>>> I am using Python 2.7.8.  Is there any chance that Spark or Yarn is somehow
>>>>> using an older version of Python without my knowledge?
>>>>>
>>>>> Finally, when I try to run the same command in client mode...
>>>>> * ./bin/spark-submit      --master yarn     --deploy-mode client
>>>>> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
>>>>> ./examples/src/main/python/pi.py 10*
>>>>> I get the error I mentioned in the prior email:
>>>>> Error from python worker:
>>>>>   python: module pyspark.daemon not found
>>>>>
>>>>> Any thoughts?
>>>>>
>>>>> Best,
>>>>> Andrew
>>>>>
>>>>>
>>>>> On Mon, Jan 11, 2016 at 12:25 PM, Bryan Cutler <cu...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> This could be an environment issue, could you give more details about
>>>>>> the OS/architecture that you are using?  If you are sure everything is
>>>>>> installed correctly on each node following the guide on "Running Spark on
>>>>>> Yarn" http://spark.apache.org/docs/latest/running-on-yarn.html and
>>>>>> that the spark assembly jar is reachable, then I would check to see if you
>>>>>> can submit a local job to just run on one node.
>>>>>>
>>>>>> On Fri, Jan 8, 2016 at 5:22 PM, Andrew Weiner <
>>>>>> andrewweiner2020@u.northwestern.edu> wrote:
>>>>>>
>>>>>>> Now for simplicity I'm testing with wordcount.py from the provided
>>>>>>> examples, and using Spark 1.6.0
>>>>>>>
>>>>>>> The first error I get is:
>>>>>>>
>>>>>>> 16/01/08 19:14:46 ERROR lzo.GPLNativeCodeLoader: Could not load
>>>>>>> native gpl library
>>>>>>> java.lang.UnsatisfiedLinkError: no gplcompression in
>>>>>>> java.library.path
>>>>>>>         at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1864)
>>>>>>>         at [....]
>>>>>>>
>>>>>>> A bit lower down, I see this error:
>>>>>>>
>>>>>>> 16/01/08 19:14:48 WARN scheduler.TaskSetManager: Lost task 0.0 in
>>>>>>> stage 0.0 (TID 0, mundonovo-priv): org.apache.spark.SparkException:
>>>>>>> Error from python worker:
>>>>>>>   python: module pyspark.daemon not found
>>>>>>> PYTHONPATH was:
>>>>>>>
>>>>>>> /scratch5/hadoop/yarn/local/usercache/<username>/filecache/22/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/jpr123/hg.pacific/python-common:/home/jpr123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/pyspark.zip:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/py4j-0.9-src.zip
>>>>>>> java.io.EOFException
>>>>>>>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>>>>>>>         at [....]
>>>>>>>
>>>>>>> And then a few more similar pyspark.daemon not found errors...
>>>>>>>
>>>>>>> Andrew
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jan 8, 2016 at 2:31 PM, Bryan Cutler <cu...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Andrew,
>>>>>>>>
>>>>>>>> I know that older versions of Spark could not run PySpark on YARN
>>>>>>>> in cluster mode.  I'm not sure if that is fixed in 1.6.0 though.  Can you
>>>>>>>> try setting deploy-mode option to "client" when calling spark-submit?
>>>>>>>>
>>>>>>>> Bryan
>>>>>>>>
>>>>>>>> On Thu, Jan 7, 2016 at 2:39 PM, weineran <
>>>>>>>> andrewweiner2020@u.northwestern.edu> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> When I try to submit a python job using spark-submit (using
>>>>>>>>> --master yarn
>>>>>>>>> --deploy-mode cluster), I get the following error:
>>>>>>>>>
>>>>>>>>> /Traceback (most recent call last):
>>>>>>>>>   File "loss_rate_by_probe.py", line 15, in ?
>>>>>>>>>     from pyspark import SparkContext
>>>>>>>>>   File
>>>>>>>>>
>>>>>>>>> "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/__init__.py",
>>>>>>>>> line 41, in ?
>>>>>>>>>   File
>>>>>>>>>
>>>>>>>>> "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/context.py",
>>>>>>>>> line 219
>>>>>>>>>     with SparkContext._lock:
>>>>>>>>>                     ^
>>>>>>>>> SyntaxError: invalid syntax/
>>>>>>>>>
>>>>>>>>> This is very similar to  this post from 2014
>>>>>>>>> <
>>>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-lock-Error-td18233.html
>>>>>>>>> >
>>>>>>>>> , but unlike that person I am using Python 2.7.8.
>>>>>>>>>
>>>>>>>>> Here is what I'm using:
>>>>>>>>> Spark 1.3.1
>>>>>>>>> Hadoop 2.4.0.2.1.5.0-695
>>>>>>>>> Python 2.7.8
>>>>>>>>>
>>>>>>>>> Another clue:  I also installed Spark 1.6.0 and tried to submit
>>>>>>>>> the same
>>>>>>>>> job.  I got a similar error:
>>>>>>>>>
>>>>>>>>> /Traceback (most recent call last):
>>>>>>>>>   File "loss_rate_by_probe.py", line 15, in ?
>>>>>>>>>     from pyspark import SparkContext
>>>>>>>>>   File
>>>>>>>>>
>>>>>>>>> "/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0119/container_1450370639491_0119_01_000001/pyspark.zip/pyspark/__init__.py",
>>>>>>>>> line 61
>>>>>>>>>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>>>>>>>>>                                                   ^
>>>>>>>>> SyntaxError: invalid syntax/
>>>>>>>>>
>>>>>>>>> Any thoughts?
>>>>>>>>>
>>>>>>>>> Andrew
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> View this message in context:
>>>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-SyntaxError-invalid-syntax-tp25910.html
>>>>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>>>>> Nabble.com.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>>>>>>> For additional commands, e-mail: user-help@spark.apache.org
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: SparkContext SyntaxError: invalid syntax

Posted by Andrew Weiner <an...@u.northwestern.edu>.
Indeed!  Here is the output when I run in cluster mode:

Traceback (most recent call last):
  File "pi.py", line 22, in ?
    raise RuntimeError("\n"+str(sys.version_info) +"\n"+
RuntimeError:
(2, 4, 3, 'final', 0)
[('PYSPARK_GATEWAY_PORT', '48079'), ('PYTHONPATH',
'/scratch2/hadoop/yarn/local/usercache/<username>/filecache/116/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/<user>/spark-1.6.0-bin-hadoop2.4/python:/home/<username>/code/libs:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0239/container_1450370639491_0239_01_000001/pyspark.zip:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0239/container_1450370639491_0239_01_000001/py4j-0.9-src.zip'),
('PYTHONUNBUFFERED', 'YES')]

As we suspected, it is using Python 2.4
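
(As a separate sanity check on the nodes themselves, something like this
rough shell loop would show whether an older system interpreter is sitting
on the PATH alongside 2.7:)

# list every python found on the PATH with its version
# (python -V writes to stderr on 2.x, hence the 2>&1)
for p in $(which -a python); do echo -n "$p -> "; $p -V 2>&1; done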

One thing that surprises me is that PYSPARK_PYTHON is not showing up
in the list, even though I am setting it and exporting it in
spark-submit *and* in spark-env.sh.  Is there somewhere else I need to
set this variable?  Maybe in one of the hadoop conf files in my
HADOOP_CONF_DIR?

Andrew



On Thu, Jan 14, 2016 at 1:14 PM, Bryan Cutler <cu...@gmail.com> wrote:

> It seems like it could be the case that some other Python version is being
> invoked.  To make sure, can you add something like this to the top of the
> .py file you are submitting to get some more info about how the application
> master is configured?
>
> import sys, os
> raise RuntimeError("\n"+str(sys.version_info) +"\n"+
>     str([(k,os.environ[k]) for k in os.environ if "PY" in k]))
>
> On Thu, Jan 14, 2016 at 8:37 AM, Andrew Weiner <
> andrewweiner2020@u.northwestern.edu> wrote:
>
>> Hi Bryan,
>>
>> I ran "$> python --version" on every node on the cluster, and it is
>> Python 2.7.8 for every single one.
>>
>> When I try to submit the Python example in client mode
>> * ./bin/spark-submit      --master yarn     --deploy-mode client
>> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
>> ./examples/src/main/python/pi.py     10*
>> That's when I get this error that I mentioned:
>>
>> 16/01/14 10:09:10 WARN scheduler.TaskSetManager: Lost task 0.0 in stage
>> 0.0 (TID 0, mundonovo-priv): org.apache.spark.SparkException:
>> Error from python worker:
>>   python: module pyspark.daemon not found
>> PYTHONPATH was:
>>
>> /scratch5/hadoop/yarn/local/usercache/<username>/filecache/48/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/aqualab/spark-1.6.0-bin-hadoop2.4/python:/home/jpr123/hg.pacific/python-common:/home/jp
>>
>> r123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/home/<username>/code/libs:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0187/container_1450370639491_0187_01_000002/pyspark.zip:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0187/container_1450370639491_0187_01_000002/py4j-0.9-src.zip
>> java.io.EOFException
>>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>>         at
>> org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:164)
>>         at [....]
>>
>> followed by several more similar errors that also say:
>> Error from python worker:
>>   python: module pyspark.daemon not found
>>
>>
>> Even though the default python appeared to be correct, I just went ahead
>> and explicitly set PYSPARK_PYTHON in conf/spark-env.sh to the path of the
>> default python binary executable.  After making this change I was able to
>> run the job successfully in client!  That is, this appeared to fix the
>> "pyspark.daemon not found" error when running in client mode.
>>
>> However, when running in cluster mode, I am still getting the same syntax
>> error:
>>
>> Traceback (most recent call last):
>>   File "pi.py", line 24, in ?
>>     from pyspark import SparkContext
>>   File "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py", line 61
>>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>>                                                   ^
>> SyntaxError: invalid syntax
>>
>> Is it possible that the PYSPARK_PYTHON environment variable is ignored when jobs are submitted in cluster mode?  It seems that Spark or Yarn is going behind my back, so to speak, and using some older version of python I didn't even know was installed.
>>
>> Thanks again for all your help thus far.  We are getting close....
>>
>> Andrew
>>
>>
>>
>> On Wed, Jan 13, 2016 at 6:13 PM, Bryan Cutler <cu...@gmail.com> wrote:
>>
>>> Hi Andrew,
>>>
>>> There are a couple of things to check.  First, is Python 2.7 the default
>>> version on all nodes in the cluster or is it an alternate install? Meaning
>>> what is the output of this command "$>  python --version"  If it is an
>>> alternate install, you could set the environment variable "
>>> PYSPARK_PYTHON" Python binary executable to use for PySpark in both
>>> driver and workers (default is python).
>>>
>>> Did you try to submit the Python example under client mode?  Otherwise,
>>> the command looks fine, you don't use the --class option for submitting
>>> python files
>>> * ./bin/spark-submit      --master yarn     --deploy-mode client
>>> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
>>> ./examples/src/main/python/pi.py     10*
>>>
>>> That is a good sign that local jobs and Java examples work, probably
>>> just a small configuration issue :)
>>>
>>> Bryan
>>>
>>> On Wed, Jan 13, 2016 at 3:51 PM, Andrew Weiner <
>>> andrewweiner2020@u.northwestern.edu> wrote:
>>>
>>>> Thanks for your continuing help.  Here is some additional info.
>>>>
>>>> *OS/architecture*
>>>> output of *cat /proc/version*:
>>>> Linux version 2.6.18-400.1.1.el5 (
>>>> mockbuild@x86-012.build.bos.redhat.com)
>>>>
>>>> output of *lsb_release -a*:
>>>> LSB Version:
>>>>  :core-4.0-amd64:core-4.0-ia32:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-ia32:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-ia32:printing-4.0-noarch
>>>> Distributor ID: RedHatEnterpriseServer
>>>> Description:    Red Hat Enterprise Linux Server release 5.11 (Tikanga)
>>>> Release:        5.11
>>>> Codename:       Tikanga
>>>>
>>>> *Running a local job*
>>>> I have confirmed that I can successfully run python jobs using
>>>> bin/spark-submit --master local[*]
>>>> Specifically, this is the command I am using:
>>>> *./bin/spark-submit --master local[8]
>>>> ./examples/src/main/python/wordcount.py
>>>> file:/home/<username>/spark-1.6.0-bin-hadoop2.4/README.md*
>>>> And it works!
>>>>
>>>> *Additional info*
>>>> I am also able to successfully run the Java SparkPi example using yarn
>>>> in cluster mode using this command:
>>>> * ./bin/spark-submit --class org.apache.spark.examples.SparkPi
>>>> --master yarn     --deploy-mode cluster     --driver-memory 4g
>>>> --executor-memory 2g     --executor-cores 1     lib/spark-examples*.jar
>>>> 10*
>>>> This Java job also runs successfully when I change --deploy-mode to
>>>> client.  The fact that I can run Java jobs in cluster mode makes me think
>>>> that everything is installed correctly--is that a valid assumption?
>>>>
>>>> The problem remains that I cannot submit python jobs.  Here is the
>>>> command that I am using to try to submit python jobs:
>>>> * ./bin/spark-submit      --master yarn     --deploy-mode cluster
>>>> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
>>>> ./examples/src/main/python/pi.py     10*
>>>> Does that look like a correct command?  I wasn't sure what to put for
>>>> --class so I omitted it.  At any rate, the result of the above command is a
>>>> syntax error, similar to the one I posted in the original email:
>>>>
>>>> Traceback (most recent call last):
>>>>   File "pi.py", line 24, in ?
>>>>     from pyspark import SparkContext
>>>>   File "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py", line 61
>>>>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>>>>                                                   ^
>>>> SyntaxError: invalid syntax
>>>>
>>>>
>>>> This really looks to me like a problem with the python version.  Python
>>>> 2.4 would throw this syntax error but Python 2.7 would not.  And yet I am
>>>> using Python 2.7.8.  Is there any chance that Spark or Yarn is somehow
>>>> using an older version of Python without my knowledge?
>>>>
>>>> Finally, when I try to run the same command in client mode...
>>>> * ./bin/spark-submit      --master yarn     --deploy-mode client
>>>> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
>>>> ./examples/src/main/python/pi.py 10*
>>>> I get the error I mentioned in the prior email:
>>>> Error from python worker:
>>>>   python: module pyspark.daemon not found
>>>>
>>>> Any thoughts?
>>>>
>>>> Best,
>>>> Andrew
>>>>
>>>>
>>>> On Mon, Jan 11, 2016 at 12:25 PM, Bryan Cutler <cu...@gmail.com>
>>>> wrote:
>>>>
>>>>> This could be an environment issue, could you give more details about
>>>>> the OS/architecture that you are using?  If you are sure everything is
>>>>> installed correctly on each node following the guide on "Running Spark on
>>>>> Yarn" http://spark.apache.org/docs/latest/running-on-yarn.html and
>>>>> that the spark assembly jar is reachable, then I would check to see if you
>>>>> can submit a local job to just run on one node.
>>>>>
>>>>> On Fri, Jan 8, 2016 at 5:22 PM, Andrew Weiner <
>>>>> andrewweiner2020@u.northwestern.edu> wrote:
>>>>>
>>>>>> Now for simplicity I'm testing with wordcount.py from the provided
>>>>>> examples, and using Spark 1.6.0
>>>>>>
>>>>>> The first error I get is:
>>>>>>
>>>>>> 16/01/08 19:14:46 ERROR lzo.GPLNativeCodeLoader: Could not load
>>>>>> native gpl library
>>>>>> java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path
>>>>>>         at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1864)
>>>>>>         at [....]
>>>>>>
>>>>>> A bit lower down, I see this error:
>>>>>>
>>>>>> 16/01/08 19:14:48 WARN scheduler.TaskSetManager: Lost task 0.0 in
>>>>>> stage 0.0 (TID 0, mundonovo-priv): org.apache.spark.SparkException:
>>>>>> Error from python worker:
>>>>>>   python: module pyspark.daemon not found
>>>>>> PYTHONPATH was:
>>>>>>
>>>>>> /scratch5/hadoop/yarn/local/usercache/<username>/filecache/22/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/jpr123/hg.pacific/python-common:/home/jpr123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/pyspark.zip:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/py4j-0.9-src.zip
>>>>>> java.io.EOFException
>>>>>>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>>>>>>         at [....]
>>>>>>
>>>>>> And then a few more similar pyspark.daemon not found errors...
>>>>>>
>>>>>> Andrew
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Jan 8, 2016 at 2:31 PM, Bryan Cutler <cu...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Andrew,
>>>>>>>
>>>>>>> I know that older versions of Spark could not run PySpark on YARN in
>>>>>>> cluster mode.  I'm not sure if that is fixed in 1.6.0 though.  Can you try
>>>>>>> setting deploy-mode option to "client" when calling spark-submit?
>>>>>>>
>>>>>>> Bryan
>>>>>>>
>>>>>>> On Thu, Jan 7, 2016 at 2:39 PM, weineran <
>>>>>>> andrewweiner2020@u.northwestern.edu> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> When I try to submit a python job using spark-submit (using
>>>>>>>> --master yarn
>>>>>>>> --deploy-mode cluster), I get the following error:
>>>>>>>>
>>>>>>>> /Traceback (most recent call last):
>>>>>>>>   File "loss_rate_by_probe.py", line 15, in ?
>>>>>>>>     from pyspark import SparkContext
>>>>>>>>   File
>>>>>>>>
>>>>>>>> "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/__init__.py",
>>>>>>>> line 41, in ?
>>>>>>>>   File
>>>>>>>>
>>>>>>>> "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/context.py",
>>>>>>>> line 219
>>>>>>>>     with SparkContext._lock:
>>>>>>>>                     ^
>>>>>>>> SyntaxError: invalid syntax/
>>>>>>>>
>>>>>>>> This is very similar to  this post from 2014
>>>>>>>> <
>>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-lock-Error-td18233.html
>>>>>>>> >
>>>>>>>> , but unlike that person I am using Python 2.7.8.
>>>>>>>>
>>>>>>>> Here is what I'm using:
>>>>>>>> Spark 1.3.1
>>>>>>>> Hadoop 2.4.0.2.1.5.0-695
>>>>>>>> Python 2.7.8
>>>>>>>>
>>>>>>>> Another clue:  I also installed Spark 1.6.0 and tried to submit the
>>>>>>>> same
>>>>>>>> job.  I got a similar error:
>>>>>>>>
>>>>>>>> /Traceback (most recent call last):
>>>>>>>>   File "loss_rate_by_probe.py", line 15, in ?
>>>>>>>>     from pyspark import SparkContext
>>>>>>>>   File
>>>>>>>>
>>>>>>>> "/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0119/container_1450370639491_0119_01_000001/pyspark.zip/pyspark/__init__.py",
>>>>>>>> line 61
>>>>>>>>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>>>>>>>>                                                   ^
>>>>>>>> SyntaxError: invalid syntax/
>>>>>>>>
>>>>>>>> Any thoughts?
>>>>>>>>
>>>>>>>> Andrew
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> View this message in context:
>>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-SyntaxError-invalid-syntax-tp25910.html
>>>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>>>> Nabble.com.
>>>>>>>>
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>>>>>> For additional commands, e-mail: user-help@spark.apache.org
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: SparkContext SyntaxError: invalid syntax

Posted by Bryan Cutler <cu...@gmail.com>.
It seems like it could be the case that some other Python version is being
invoked.  To make sure, can you add something like this to the top of the
.py file you are submitting to get some more info about how the application
master is configured?

import sys, os
raise RuntimeError("\n"+str(sys.version_info) +"\n"+
    str([(k,os.environ[k]) for k in os.environ if "PY" in k]))
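
If it helps, a slightly expanded variant of that snippet (same idea, just
more output) also records sys.executable, which shows exactly which
interpreter binary the application master ended up running:

import sys, os
# raise on purpose so the interpreter path, version, and PY* environment
# variables all end up in the YARN application logs
raise RuntimeError("\n".join([
    sys.executable,
    str(sys.version_info),
    str([(k, os.environ[k]) for k in os.environ if "PY" in k]),
]))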

On Thu, Jan 14, 2016 at 8:37 AM, Andrew Weiner <
andrewweiner2020@u.northwestern.edu> wrote:

> Hi Bryan,
>
> I ran "$> python --version" on every node on the cluster, and it is Python
> 2.7.8 for every single one.
>
> When I try to submit the Python example in client mode
> * ./bin/spark-submit      --master yarn     --deploy-mode client
> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
> ./examples/src/main/python/pi.py     10*
> That's when I get this error that I mentioned:
>
> 16/01/14 10:09:10 WARN scheduler.TaskSetManager: Lost task 0.0 in stage
> 0.0 (TID 0, mundonovo-priv): org.apache.spark.SparkException:
> Error from python worker:
>   python: module pyspark.daemon not found
> PYTHONPATH was:
>
> /scratch5/hadoop/yarn/local/usercache/<username>/filecache/48/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/aqualab/spark-1.6.0-bin-hadoop2.4/python:/home/jpr123/hg.pacific/python-common:/home/jp
>
> r123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/home/<username>/code/libs:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0187/container_1450370639491_0187_01_000002/pyspark.zip:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0187/container_1450370639491_0187_01_000002/py4j-0.9-src.zip
> java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>         at
> org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:164)
>         at [....]
>
> followed by several more similar errors that also say:
> Error from python worker:
>   python: module pyspark.daemon not found
>
>
> Even though the default python appeared to be correct, I just went ahead
> and explicitly set PYSPARK_PYTHON in conf/spark-env.sh to the path of the
> default python binary executable.  After making this change I was able to
> run the job successfully in client!  That is, this appeared to fix the
> "pyspark.daemon not found" error when running in client mode.
>
> However, when running in cluster mode, I am still getting the same syntax
> error:
>
> Traceback (most recent call last):
>   File "pi.py", line 24, in ?
>     from pyspark import SparkContext
>   File "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py", line 61
>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>                                                   ^
> SyntaxError: invalid syntax
>
> Is it possible that the PYSPARK_PYTHON environment variable is ignored when jobs are submitted in cluster mode?  It seems that Spark or Yarn is going behind my back, so to speak, and using some older version of python I didn't even know was installed.
>
> Thanks again for all your help thus far.  We are getting close....
>
> Andrew
>
>
>
> On Wed, Jan 13, 2016 at 6:13 PM, Bryan Cutler <cu...@gmail.com> wrote:
>
>> Hi Andrew,
>>
>> There are a couple of things to check.  First, is Python 2.7 the default
>> version on all nodes in the cluster or is it an alternate install? Meaning
>> what is the output of this command "$>  python --version"  If it is an
>> alternate install, you could set the environment variable "PYSPARK_PYTHON"
>> Python binary executable to use for PySpark in both driver and workers
>> (default is python).
>>
>> Did you try to submit the Python example under client mode?  Otherwise,
>> the command looks fine, you don't use the --class option for submitting
>> python files
>> * ./bin/spark-submit      --master yarn     --deploy-mode client
>> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
>> ./examples/src/main/python/pi.py     10*
>>
>> That is a good sign that local jobs and Java examples work, probably just
>> a small configuration issue :)
>>
>> Bryan
>>
>> On Wed, Jan 13, 2016 at 3:51 PM, Andrew Weiner <
>> andrewweiner2020@u.northwestern.edu> wrote:
>>
>>> Thanks for your continuing help.  Here is some additional info.
>>>
>>> *OS/architecture*
>>> output of *cat /proc/version*:
>>> Linux version 2.6.18-400.1.1.el5 (mockbuild@x86-012.build.bos.redhat.com
>>> )
>>>
>>> output of *lsb_release -a*:
>>> LSB Version:
>>>  :core-4.0-amd64:core-4.0-ia32:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-ia32:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-ia32:printing-4.0-noarch
>>> Distributor ID: RedHatEnterpriseServer
>>> Description:    Red Hat Enterprise Linux Server release 5.11 (Tikanga)
>>> Release:        5.11
>>> Codename:       Tikanga
>>>
>>> *Running a local job*
>>> I have confirmed that I can successfully run python jobs using
>>> bin/spark-submit --master local[*]
>>> Specifically, this is the command I am using:
>>> *./bin/spark-submit --master local[8]
>>> ./examples/src/main/python/wordcount.py
>>> file:/home/<username>/spark-1.6.0-bin-hadoop2.4/README.md*
>>> And it works!
>>>
>>> *Additional info*
>>> I am also able to successfully run the Java SparkPi example using yarn
>>> in cluster mode using this command:
>>> * ./bin/spark-submit --class org.apache.spark.examples.SparkPi
>>> --master yarn     --deploy-mode cluster     --driver-memory 4g
>>> --executor-memory 2g     --executor-cores 1     lib/spark-examples*.jar
>>> 10*
>>> This Java job also runs successfully when I change --deploy-mode to
>>> client.  The fact that I can run Java jobs in cluster mode makes me think
>>> that everything is installed correctly--is that a valid assumption?
>>>
>>> The problem remains that I cannot submit python jobs.  Here is the
>>> command that I am using to try to submit python jobs:
>>> * ./bin/spark-submit      --master yarn     --deploy-mode cluster
>>> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
>>> ./examples/src/main/python/pi.py     10*
>>> Does that look like a correct command?  I wasn't sure what to put for
>>> --class so I omitted it.  At any rate, the result of the above command is a
>>> syntax error, similar to the one I posted in the original email:
>>>
>>> Traceback (most recent call last):
>>>   File "pi.py", line 24, in ?
>>>     from pyspark import SparkContext
>>>   File "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py", line 61
>>>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>>>                                                   ^
>>> SyntaxError: invalid syntax
>>>
>>>
>>> This really looks to me like a problem with the python version.  Python
>>> 2.4 would throw this syntax error but Python 2.7 would not.  And yet I am
>>> using Python 2.7.8.  Is there any chance that Spark or Yarn is somehow
>>> using an older version of Python without my knowledge?
>>>
>>> Finally, when I try to run the same command in client mode...
>>> * ./bin/spark-submit      --master yarn     --deploy-mode client
>>> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
>>> ./examples/src/main/python/pi.py 10*
>>> I get the error I mentioned in the prior email:
>>> Error from python worker:
>>>   python: module pyspark.daemon not found
>>>
>>> Any thoughts?
>>>
>>> Best,
>>> Andrew
>>>
>>>
>>> On Mon, Jan 11, 2016 at 12:25 PM, Bryan Cutler <cu...@gmail.com>
>>> wrote:
>>>
>>>> This could be an environment issue, could you give more details about
>>>> the OS/architecture that you are using?  If you are sure everything is
>>>> installed correctly on each node following the guide on "Running Spark on
>>>> Yarn" http://spark.apache.org/docs/latest/running-on-yarn.html and
>>>> that the spark assembly jar is reachable, then I would check to see if you
>>>> can submit a local job to just run on one node.
>>>>
>>>> On Fri, Jan 8, 2016 at 5:22 PM, Andrew Weiner <
>>>> andrewweiner2020@u.northwestern.edu> wrote:
>>>>
>>>>> Now for simplicity I'm testing with wordcount.py from the provided
>>>>> examples, and using Spark 1.6.0
>>>>>
>>>>> The first error I get is:
>>>>>
>>>>> 16/01/08 19:14:46 ERROR lzo.GPLNativeCodeLoader: Could not load native
>>>>> gpl library
>>>>> java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path
>>>>>         at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1864)
>>>>>         at [....]
>>>>>
>>>>> A bit lower down, I see this error:
>>>>>
>>>>> 16/01/08 19:14:48 WARN scheduler.TaskSetManager: Lost task 0.0 in
>>>>> stage 0.0 (TID 0, mundonovo-priv): org.apache.spark.SparkException:
>>>>> Error from python worker:
>>>>>   python: module pyspark.daemon not found
>>>>> PYTHONPATH was:
>>>>>
>>>>> /scratch5/hadoop/yarn/local/usercache/<username>/filecache/22/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/jpr123/hg.pacific/python-common:/home/jpr123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/pyspark.zip:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/py4j-0.9-src.zip
>>>>> java.io.EOFException
>>>>>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>>>>>         at [....]
>>>>>
>>>>> And then a few more similar pyspark.daemon not found errors...
>>>>>
>>>>> Andrew
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Jan 8, 2016 at 2:31 PM, Bryan Cutler <cu...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Andrew,
>>>>>>
>>>>>> I know that older versions of Spark could not run PySpark on YARN in
>>>>>> cluster mode.  I'm not sure if that is fixed in 1.6.0 though.  Can you try
>>>>>> setting deploy-mode option to "client" when calling spark-submit?
>>>>>>
>>>>>> Bryan
>>>>>>
>>>>>> On Thu, Jan 7, 2016 at 2:39 PM, weineran <
>>>>>> andrewweiner2020@u.northwestern.edu> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> When I try to submit a python job using spark-submit (using --master
>>>>>>> yarn
>>>>>>> --deploy-mode cluster), I get the following error:
>>>>>>>
>>>>>>> /Traceback (most recent call last):
>>>>>>>   File "loss_rate_by_probe.py", line 15, in ?
>>>>>>>     from pyspark import SparkContext
>>>>>>>   File
>>>>>>>
>>>>>>> "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/__init__.py",
>>>>>>> line 41, in ?
>>>>>>>   File
>>>>>>>
>>>>>>> "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/context.py",
>>>>>>> line 219
>>>>>>>     with SparkContext._lock:
>>>>>>>                     ^
>>>>>>> SyntaxError: invalid syntax/
>>>>>>>
>>>>>>> This is very similar to  this post from 2014
>>>>>>> <
>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-lock-Error-td18233.html
>>>>>>> >
>>>>>>> , but unlike that person I am using Python 2.7.8.
>>>>>>>
>>>>>>> Here is what I'm using:
>>>>>>> Spark 1.3.1
>>>>>>> Hadoop 2.4.0.2.1.5.0-695
>>>>>>> Python 2.7.8
>>>>>>>
>>>>>>> Another clue:  I also installed Spark 1.6.0 and tried to submit the
>>>>>>> same
>>>>>>> job.  I got a similar error:
>>>>>>>
>>>>>>> /Traceback (most recent call last):
>>>>>>>   File "loss_rate_by_probe.py", line 15, in ?
>>>>>>>     from pyspark import SparkContext
>>>>>>>   File
>>>>>>>
>>>>>>> "/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0119/container_1450370639491_0119_01_000001/pyspark.zip/pyspark/__init__.py",
>>>>>>> line 61
>>>>>>>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>>>>>>>                                                   ^
>>>>>>> SyntaxError: invalid syntax/
>>>>>>>
>>>>>>> Any thoughts?
>>>>>>>
>>>>>>> Andrew
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> View this message in context:
>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-SyntaxError-invalid-syntax-tp25910.html
>>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>>> Nabble.com.
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>>>>> For additional commands, e-mail: user-help@spark.apache.org
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: SparkContext SyntaxError: invalid syntax

Posted by Andrew Weiner <an...@u.northwestern.edu>.
Hi Bryan,

I ran "$> python --version" on every node on the cluster, and it is Python
2.7.8 for every single one.

When I try to submit the Python example in client mode
* ./bin/spark-submit      --master yarn     --deploy-mode client
--driver-memory 4g     --executor-memory 2g     --executor-cores 1
./examples/src/main/python/pi.py     10*
That's when I get this error that I mentioned:

16/01/14 10:09:10 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0
(TID 0, mundonovo-priv): org.apache.spark.SparkException:
Error from python worker:
  python: module pyspark.daemon not found
PYTHONPATH was:

/scratch5/hadoop/yarn/local/usercache/<username>/filecache/48/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/aqualab/spark-1.6.0-bin-hadoop2.4/python:/home/jpr123/hg.pacific/python-common:/home/jp
r123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/home/<username>/code/libs:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0187/container_1450370639491_0187_01_000002/pyspark.zip:/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0187/container_1450370639491_0187_01_000002/py4j-0.9-src.zip
java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at
org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:164)
        at [....]

followed by several more similar errors that also say:
Error from python worker:
  python: module pyspark.daemon not found


Even though the default python appeared to be correct, I just went ahead
and explicitly set PYSPARK_PYTHON in conf/spark-env.sh to the path of the
default python binary executable.  After making this change I was able to
run the job successfully in client mode!  That is, this appeared to fix the
"pyspark.daemon not found" error when running in client mode.

However, when running in cluster mode, I am still getting the same syntax
error:

Traceback (most recent call last):
  File "pi.py", line 24, in ?
    from pyspark import SparkContext
  File "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py",
line 61
    indent = ' ' * (min(len(m) for m in indents) if indents else 0)
                                                  ^
SyntaxError: invalid syntax

Is it possible that the PYSPARK_PYTHON environment variable is ignored
when jobs are submitted in cluster mode?  It seems that Spark or Yarn
is going behind my back, so to speak, and using some older version of
python I didn't even know was installed.
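
In case it helps with debugging, a small script like the sketch below could
be submitted to see which interpreter the executors actually launch.  The
file name and app name are made up, and it obviously assumes the tasks get
far enough to run at all:

# which_python.py -- report the interpreter used by the driver and the executors
import sys
from pyspark import SparkContext

def worker_python(_):
    import platform, sys
    return (platform.node(), sys.executable, sys.version.split()[0])

if __name__ == "__main__":
    sc = SparkContext(appName="which-python")
    print("driver : %s %s" % (sys.executable, sys.version.split()[0]))
    seen = sc.parallelize(range(8), 8).map(worker_python).distinct().collect()
    for host, exe, ver in seen:
        print("worker : %s  %s  %s" % (host, exe, ver))
    sc.stop()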

Thanks again for all your help thus far.  We are getting close....

Andrew



On Wed, Jan 13, 2016 at 6:13 PM, Bryan Cutler <cu...@gmail.com> wrote:

> Hi Andrew,
>
> There are a couple of things to check.  First, is Python 2.7 the default
> version on all nodes in the cluster, or is it an alternate install?  In
> other words, what is the output of the command "$> python --version"?  If
> it is an alternate install, you could set the environment variable
> "PYSPARK_PYTHON", the Python binary executable to use for PySpark in both
> the driver and the workers (the default is python).
>
> Did you try to submit the Python example under client mode?  Other than
> that, the command looks fine; you don't use the --class option when
> submitting Python files:
> * ./bin/spark-submit      --master yarn     --deploy-mode client
> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
> ./examples/src/main/python/pi.py     10*
>
> That is a good sign that local jobs and Java examples work, probably just
> a small configuration issue :)
>
> Bryan
>
> On Wed, Jan 13, 2016 at 3:51 PM, Andrew Weiner <
> andrewweiner2020@u.northwestern.edu> wrote:
>
>> Thanks for your continuing help.  Here is some additional info.
>>
>> *OS/architecture*
>> output of *cat /proc/version*:
>> Linux version 2.6.18-400.1.1.el5 (mockbuild@x86-012.build.bos.redhat.com)
>>
>> output of *lsb_release -a*:
>> LSB Version:
>>  :core-4.0-amd64:core-4.0-ia32:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-ia32:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-ia32:printing-4.0-noarch
>> Distributor ID: RedHatEnterpriseServer
>> Description:    Red Hat Enterprise Linux Server release 5.11 (Tikanga)
>> Release:        5.11
>> Codename:       Tikanga
>>
>> *Running a local job*
>> I have confirmed that I can successfully run python jobs using
>> bin/spark-submit --master local[*]
>> Specifically, this is the command I am using:
>> *./bin/spark-submit --master local[8]
>> ./examples/src/main/python/wordcount.py
>> file:/home/<username>/spark-1.6.0-bin-hadoop2.4/README.md*
>> And it works!
>>
>> *Additional info*
>> I am also able to successfully run the Java SparkPi example using yarn in
>> cluster mode using this command:
>> * ./bin/spark-submit --class org.apache.spark.examples.SparkPi
>> --master yarn     --deploy-mode cluster     --driver-memory 4g
>> --executor-memory 2g     --executor-cores 1     lib/spark-examples*.jar
>> 10*
>> This Java job also runs successfully when I change --deploy-mode to
>> client.  The fact that I can run Java jobs in cluster mode makes me think
>> that everything is installed correctly--is that a valid assumption?
>>
>> The problem remains that I cannot submit python jobs.  Here is the
>> command that I am using to try to submit python jobs:
>> * ./bin/spark-submit      --master yarn     --deploy-mode cluster
>> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
>> ./examples/src/main/python/pi.py     10*
>> Does that look like a correct command?  I wasn't sure what to put for
>> --class so I omitted it.  At any rate, the result of the above command is a
>> syntax error, similar to the one I posted in the original email:
>>
>> Traceback (most recent call last):
>>   File "pi.py", line 24, in ?
>>     from pyspark import SparkContext
>>   File "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py", line 61
>>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>>                                                   ^
>> SyntaxError: invalid syntax
>>
>>
>> This really looks to me like a problem with the python version.  Python
>> 2.4 would throw this syntax error but Python 2.7 would not.  And yet I am
>> using Python 2.7.8.  Is there any chance that Spark or Yarn is somehow
>> using an older version of Python without my knowledge?
>>
>> Finally, when I try to run the same command in client mode...
>> * ./bin/spark-submit      --master yarn     --deploy-mode client
>> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
>> ./examples/src/main/python/pi.py 10*
>> I get the error I mentioned in the prior email:
>> Error from python worker:
>>   python: module pyspark.daemon not found
>>
>> Any thoughts?
>>
>> Best,
>> Andrew
>>
>>
>> On Mon, Jan 11, 2016 at 12:25 PM, Bryan Cutler <cu...@gmail.com> wrote:
>>
>>> This could be an environment issue, could you give more details about
>>> the OS/architecture that you are using?  If you are sure everything is
>>> installed correctly on each node following the guide on "Running Spark on
>>> Yarn" http://spark.apache.org/docs/latest/running-on-yarn.html and that
>>> the spark assembly jar is reachable, then I would check to see if you can
>>> submit a local job to just run on one node.
>>>
>>> On Fri, Jan 8, 2016 at 5:22 PM, Andrew Weiner <
>>> andrewweiner2020@u.northwestern.edu> wrote:
>>>
>>>> Now for simplicity I'm testing with wordcount.py from the provided
>>>> examples, and using Spark 1.6.0
>>>>
>>>> The first error I get is:
>>>>
>>>> 16/01/08 19:14:46 ERROR lzo.GPLNativeCodeLoader: Could not load native
>>>> gpl library
>>>> java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path
>>>>         at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1864)
>>>>         at [....]
>>>>
>>>> A bit lower down, I see this error:
>>>>
>>>> 16/01/08 19:14:48 WARN scheduler.TaskSetManager: Lost task 0.0 in stage
>>>> 0.0 (TID 0, mundonovo-priv): org.apache.spark.SparkException:
>>>> Error from python worker:
>>>>   python: module pyspark.daemon not found
>>>> PYTHONPATH was:
>>>>
>>>> /scratch5/hadoop/yarn/local/usercache/<username>/filecache/22/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/jpr123/hg.pacific/python-common:/home/jpr123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/pyspark.zip:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/py4j-0.9-src.zip
>>>> java.io.EOFException
>>>>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>>>>         at [....]
>>>>
>>>> And then a few more similar pyspark.daemon not found errors...
>>>>
>>>> Andrew
>>>>
>>>>
>>>>
>>>> On Fri, Jan 8, 2016 at 2:31 PM, Bryan Cutler <cu...@gmail.com> wrote:
>>>>
>>>>> Hi Andrew,
>>>>>
>>>>> I know that older versions of Spark could not run PySpark on YARN in
>>>>> cluster mode.  I'm not sure if that is fixed in 1.6.0 though.  Can you try
>>>>> setting deploy-mode option to "client" when calling spark-submit?
>>>>>
>>>>> Bryan
>>>>>
>>>>> On Thu, Jan 7, 2016 at 2:39 PM, weineran <
>>>>> andrewweiner2020@u.northwestern.edu> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> When I try to submit a python job using spark-submit (using --master
>>>>>> yarn
>>>>>> --deploy-mode cluster), I get the following error:
>>>>>>
>>>>>> /Traceback (most recent call last):
>>>>>>   File "loss_rate_by_probe.py", line 15, in ?
>>>>>>     from pyspark import SparkContext
>>>>>>   File
>>>>>>
>>>>>> "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/__init__.py",
>>>>>> line 41, in ?
>>>>>>   File
>>>>>>
>>>>>> "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/context.py",
>>>>>> line 219
>>>>>>     with SparkContext._lock:
>>>>>>                     ^
>>>>>> SyntaxError: invalid syntax/
>>>>>>
>>>>>> This is very similar to  this post from 2014
>>>>>> <
>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-lock-Error-td18233.html
>>>>>> >
>>>>>> , but unlike that person I am using Python 2.7.8.
>>>>>>
>>>>>> Here is what I'm using:
>>>>>> Spark 1.3.1
>>>>>> Hadoop 2.4.0.2.1.5.0-695
>>>>>> Python 2.7.8
>>>>>>
>>>>>> Another clue:  I also installed Spark 1.6.0 and tried to submit the
>>>>>> same
>>>>>> job.  I got a similar error:
>>>>>>
>>>>>> /Traceback (most recent call last):
>>>>>>   File "loss_rate_by_probe.py", line 15, in ?
>>>>>>     from pyspark import SparkContext
>>>>>>   File
>>>>>>
>>>>>> "/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0119/container_1450370639491_0119_01_000001/pyspark.zip/pyspark/__init__.py",
>>>>>> line 61
>>>>>>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>>>>>>                                                   ^
>>>>>> SyntaxError: invalid syntax/
>>>>>>
>>>>>> Any thoughts?
>>>>>>
>>>>>> Andrew
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> View this message in context:
>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-SyntaxError-invalid-syntax-tp25910.html
>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>> Nabble.com.
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>>>> For additional commands, e-mail: user-help@spark.apache.org
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: SparkContext SyntaxError: invalid syntax

Posted by Bryan Cutler <cu...@gmail.com>.
Hi Andrew,

There are a couple of things to check.  First, is Python 2.7 the default
version on all nodes in the cluster, or is it an alternate install?  In
other words, what is the output of the command "$> python --version"?  If
it is an alternate install, you could set the environment variable
"PYSPARK_PYTHON", the Python binary executable to use for PySpark in both
the driver and the workers (the default is python).
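
If logging into every node is a pain, one rough way to run that same check
across the cluster is a throwaway PySpark job like the sketch below (my own
helper, not part of Spark, and it assumes you can get some Python job to run
at all) that asks each executor what a bare "python" resolves to:

# default_python_survey.py -- sketch: run "python --version" on each executor
from pyspark import SparkContext

def default_python(_):
    import platform, subprocess
    # "python --version" prints to stderr on Python 2, so merge the streams
    out = subprocess.Popen(["python", "--version"],
                           stdout=subprocess.PIPE,
                           stderr=subprocess.STDOUT).communicate()[0]
    return (platform.node(), out.strip())

if __name__ == "__main__":
    sc = SparkContext(appName="default-python-survey")
    results = sc.parallelize(range(16), 16).map(default_python).distinct().collect()
    for host, version in results:
        print("%s : %s" % (host, version))
    sc.stop()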

Did you try to submit the Python example under client mode?  Other than
that, the command looks fine; you don't use the --class option when
submitting Python files:
* ./bin/spark-submit      --master yarn     --deploy-mode client
--driver-memory 4g     --executor-memory 2g     --executor-cores 1
./examples/src/main/python/pi.py     10*

That is a good sign that local jobs and Java examples work, probably just a
small configuration issue :)

Bryan

On Wed, Jan 13, 2016 at 3:51 PM, Andrew Weiner <
andrewweiner2020@u.northwestern.edu> wrote:

> Thanks for your continuing help.  Here is some additional info.
>
> *OS/architecture*
> output of *cat /proc/version*:
> Linux version 2.6.18-400.1.1.el5 (mockbuild@x86-012.build.bos.redhat.com)
>
> output of *lsb_release -a*:
> LSB Version:
>  :core-4.0-amd64:core-4.0-ia32:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-ia32:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-ia32:printing-4.0-noarch
> Distributor ID: RedHatEnterpriseServer
> Description:    Red Hat Enterprise Linux Server release 5.11 (Tikanga)
> Release:        5.11
> Codename:       Tikanga
>
> *Running a local job*
> I have confirmed that I can successfully run python jobs using
> bin/spark-submit --master local[*]
> Specifically, this is the command I am using:
> *./bin/spark-submit --master local[8]
> ./examples/src/main/python/wordcount.py
> file:/home/<username>/spark-1.6.0-bin-hadoop2.4/README.md*
> And it works!
>
> *Additional info*
> I am also able to successfully run the Java SparkPi example using yarn in
> cluster mode using this command:
> * ./bin/spark-submit --class org.apache.spark.examples.SparkPi
> --master yarn     --deploy-mode cluster     --driver-memory 4g
> --executor-memory 2g     --executor-cores 1     lib/spark-examples*.jar
> 10*
> This Java job also runs successfully when I change --deploy-mode to
> client.  The fact that I can run Java jobs in cluster mode makes me think
> that everything is installed correctly--is that a valid assumption?
>
> The problem remains that I cannot submit python jobs.  Here is the command
> that I am using to try to submit python jobs:
> * ./bin/spark-submit      --master yarn     --deploy-mode cluster
> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
> ./examples/src/main/python/pi.py     10*
> Does that look like a correct command?  I wasn't sure what to put for
> --class so I omitted it.  At any rate, the result of the above command is a
> syntax error, similar to the one I posted in the original email:
>
> Traceback (most recent call last):
>   File "pi.py", line 24, in ?
>     from pyspark import SparkContext
>   File "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py", line 61
>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>                                                   ^
> SyntaxError: invalid syntax
>
>
> This really looks to me like a problem with the python version.  Python
> 2.4 would throw this syntax error but Python 2.7 would not.  And yet I am
> using Python 2.7.8.  Is there any chance that Spark or Yarn is somehow
> using an older version of Python without my knowledge?
>
> Finally, when I try to run the same command in client mode...
> * ./bin/spark-submit      --master yarn     --deploy-mode client
> --driver-memory 4g     --executor-memory 2g     --executor-cores 1
> ./examples/src/main/python/pi.py 10*
> I get the error I mentioned in the prior email:
> Error from python worker:
>   python: module pyspark.daemon not found
>
> Any thoughts?
>
> Best,
> Andrew
>
>
> On Mon, Jan 11, 2016 at 12:25 PM, Bryan Cutler <cu...@gmail.com> wrote:
>
>> This could be an environment issue, could you give more details about the
>> OS/architecture that you are using?  If you are sure everything is
>> installed correctly on each node following the guide on "Running Spark on
>> Yarn" http://spark.apache.org/docs/latest/running-on-yarn.html and that
>> the spark assembly jar is reachable, then I would check to see if you can
>> submit a local job to just run on one node.
>>
>> On Fri, Jan 8, 2016 at 5:22 PM, Andrew Weiner <
>> andrewweiner2020@u.northwestern.edu> wrote:
>>
>>> Now for simplicity I'm testing with wordcount.py from the provided
>>> examples, and using Spark 1.6.0
>>>
>>> The first error I get is:
>>>
>>> 16/01/08 19:14:46 ERROR lzo.GPLNativeCodeLoader: Could not load native
>>> gpl library
>>> java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path
>>>         at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1864)
>>>         at [....]
>>>
>>> A bit lower down, I see this error:
>>>
>>> 16/01/08 19:14:48 WARN scheduler.TaskSetManager: Lost task 0.0 in stage
>>> 0.0 (TID 0, mundonovo-priv): org.apache.spark.SparkException:
>>> Error from python worker:
>>>   python: module pyspark.daemon not found
>>> PYTHONPATH was:
>>>
>>> /scratch5/hadoop/yarn/local/usercache/<username>/filecache/22/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/jpr123/hg.pacific/python-common:/home/jpr123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/pyspark.zip:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/py4j-0.9-src.zip
>>> java.io.EOFException
>>>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>>>         at [....]
>>>
>>> And then a few more similar pyspark.daemon not found errors...
>>>
>>> Andrew
>>>
>>>
>>>
>>> On Fri, Jan 8, 2016 at 2:31 PM, Bryan Cutler <cu...@gmail.com> wrote:
>>>
>>>> Hi Andrew,
>>>>
>>>> I know that older versions of Spark could not run PySpark on YARN in
>>>> cluster mode.  I'm not sure if that is fixed in 1.6.0 though.  Can you try
>>>> setting deploy-mode option to "client" when calling spark-submit?
>>>>
>>>> Bryan
>>>>
>>>> On Thu, Jan 7, 2016 at 2:39 PM, weineran <
>>>> andrewweiner2020@u.northwestern.edu> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> When I try to submit a python job using spark-submit (using --master
>>>>> yarn
>>>>> --deploy-mode cluster), I get the following error:
>>>>>
>>>>> /Traceback (most recent call last):
>>>>>   File "loss_rate_by_probe.py", line 15, in ?
>>>>>     from pyspark import SparkContext
>>>>>   File
>>>>>
>>>>> "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/__init__.py",
>>>>> line 41, in ?
>>>>>   File
>>>>>
>>>>> "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/context.py",
>>>>> line 219
>>>>>     with SparkContext._lock:
>>>>>                     ^
>>>>> SyntaxError: invalid syntax/
>>>>>
>>>>> This is very similar to  this post from 2014
>>>>> <
>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-lock-Error-td18233.html
>>>>> >
>>>>> , but unlike that person I am using Python 2.7.8.
>>>>>
>>>>> Here is what I'm using:
>>>>> Spark 1.3.1
>>>>> Hadoop 2.4.0.2.1.5.0-695
>>>>> Python 2.7.8
>>>>>
>>>>> Another clue:  I also installed Spark 1.6.0 and tried to submit the
>>>>> same
>>>>> job.  I got a similar error:
>>>>>
>>>>> /Traceback (most recent call last):
>>>>>   File "loss_rate_by_probe.py", line 15, in ?
>>>>>     from pyspark import SparkContext
>>>>>   File
>>>>>
>>>>> "/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0119/container_1450370639491_0119_01_000001/pyspark.zip/pyspark/__init__.py",
>>>>> line 61
>>>>>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>>>>>                                                   ^
>>>>> SyntaxError: invalid syntax/
>>>>>
>>>>> Any thoughts?
>>>>>
>>>>> Andrew
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> View this message in context:
>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-SyntaxError-invalid-syntax-tp25910.html
>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>> Nabble.com.
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>>> For additional commands, e-mail: user-help@spark.apache.org
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: SparkContext SyntaxError: invalid syntax

Posted by Andrew Weiner <an...@u.northwestern.edu>.
Thanks for your continuing help.  Here is some additional info.

*OS/architecture*
output of *cat /proc/version*:
Linux version 2.6.18-400.1.1.el5 (mockbuild@x86-012.build.bos.redhat.com)

output of *lsb_release -a*:
LSB Version:
 :core-4.0-amd64:core-4.0-ia32:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-ia32:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-ia32:printing-4.0-noarch
Distributor ID: RedHatEnterpriseServer
Description:    Red Hat Enterprise Linux Server release 5.11 (Tikanga)
Release:        5.11
Codename:       Tikanga

*Running a local job*
I have confirmed that I can successfully run python jobs using
bin/spark-submit --master local[*]
Specifically, this is the command I am using:
*./bin/spark-submit --master local[8]
./examples/src/main/python/wordcount.py
file:/home/<username>/spark-1.6.0-bin-hadoop2.4/README.md*
And it works!

*Additional info*
I am also able to successfully run the Java SparkPi example using yarn in
cluster mode using this command:
* ./bin/spark-submit --class org.apache.spark.examples.SparkPi     --master
yarn     --deploy-mode cluster     --driver-memory 4g     --executor-memory
2g     --executor-cores 1     lib/spark-examples*.jar     10*
This Java job also runs successfully when I change --deploy-mode to
client.  The fact that I can run Java jobs in cluster mode makes me think
that everything is installed correctly--is that a valid assumption?

The problem remains that I cannot submit python jobs.  Here is the command
that I am using to try to submit python jobs:
* ./bin/spark-submit      --master yarn     --deploy-mode cluster
--driver-memory 4g     --executor-memory 2g     --executor-cores 1
./examples/src/main/python/pi.py     10*
Does that look like a correct command?  I wasn't sure what to put for
--class so I omitted it.  At any rate, the result of the above command is a
syntax error, similar to the one I posted in the original email:

Traceback (most recent call last):
  File "pi.py", line 24, in ?
    from pyspark import SparkContext
  File "/home/<username>/spark-1.6.0-bin-hadoop2.4/python/pyspark/__init__.py",
line 61
    indent = ' ' * (min(len(m) for m in indents) if indents else 0)
                                                  ^
SyntaxError: invalid syntax


This really looks to me like a problem with the python version.  Python 2.4
would throw this syntax error but Python 2.7 would not.  And yet I am using
Python 2.7.8.  Is there any chance that Spark or Yarn is somehow using an
older version of Python without my knowledge?
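
One way to test that theory without hunting down a Python 2.4 binary is a
tiny probe like the one below (a sketch, not from Spark).  The conditional
expression on the pyspark line was only added in Python 2.5 (PEP 308), so a
2.4 interpreter fails at compile time with exactly this kind of SyntaxError,
while anything 2.5 or newer prints OK:

# version_probe.py -- run as:  /path/to/suspect/python version_probe.py
import sys

indents = []
# Same construct as pyspark/__init__.py line 61: a conditional expression,
# which did not exist before Python 2.5.
indent = ' ' * (min(len(m) for m in indents) if indents else 0)

print("OK: %s (%s) accepts conditional expressions"
      % (sys.executable, sys.version.split()[0]))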

Finally, when I try to run the same command in client mode...
* ./bin/spark-submit      --master yarn     --deploy-mode client
--driver-memory 4g     --executor-memory 2g     --executor-cores 1
./examples/src/main/python/pi.py 10*
I get the error I mentioned in the prior email:
Error from python worker:
  python: module pyspark.daemon not found

Any thoughts?

Best,
Andrew


On Mon, Jan 11, 2016 at 12:25 PM, Bryan Cutler <cu...@gmail.com> wrote:

> This could be an environment issue, could you give more details about the
> OS/architecture that you are using?  If you are sure everything is
> installed correctly on each node following the guide on "Running Spark on
> Yarn" http://spark.apache.org/docs/latest/running-on-yarn.html and that
> the spark assembly jar is reachable, then I would check to see if you can
> submit a local job to just run on one node.
>
> On Fri, Jan 8, 2016 at 5:22 PM, Andrew Weiner <
> andrewweiner2020@u.northwestern.edu> wrote:
>
>> Now for simplicity I'm testing with wordcount.py from the provided
>> examples, and using Spark 1.6.0
>>
>> The first error I get is:
>>
>> 16/01/08 19:14:46 ERROR lzo.GPLNativeCodeLoader: Could not load native
>> gpl library
>> java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path
>>         at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1864)
>>         at [....]
>>
>> A bit lower down, I see this error:
>>
>> 16/01/08 19:14:48 WARN scheduler.TaskSetManager: Lost task 0.0 in stage
>> 0.0 (TID 0, mundonovo-priv): org.apache.spark.SparkException:
>> Error from python worker:
>>   python: module pyspark.daemon not found
>> PYTHONPATH was:
>>
>> /scratch5/hadoop/yarn/local/usercache/<username>/filecache/22/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/jpr123/hg.pacific/python-common:/home/jpr123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/<username>/lib/python2.7/site-packages:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/pyspark.zip:/scratch4/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/py4j-0.9-src.zip
>> java.io.EOFException
>>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>>         at [....]
>>
>> And then a few more similar pyspark.daemon not found errors...
>>
>> Andrew
>>
>>
>>
>> On Fri, Jan 8, 2016 at 2:31 PM, Bryan Cutler <cu...@gmail.com> wrote:
>>
>>> Hi Andrew,
>>>
>>> I know that older versions of Spark could not run PySpark on YARN in
>>> cluster mode.  I'm not sure if that is fixed in 1.6.0 though.  Can you try
>>> setting deploy-mode option to "client" when calling spark-submit?
>>>
>>> Bryan
>>>
>>> On Thu, Jan 7, 2016 at 2:39 PM, weineran <
>>> andrewweiner2020@u.northwestern.edu> wrote:
>>>
>>>> Hello,
>>>>
>>>> When I try to submit a python job using spark-submit (using --master
>>>> yarn
>>>> --deploy-mode cluster), I get the following error:
>>>>
>>>> /Traceback (most recent call last):
>>>>   File "loss_rate_by_probe.py", line 15, in ?
>>>>     from pyspark import SparkContext
>>>>   File
>>>>
>>>> "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/__init__.py",
>>>> line 41, in ?
>>>>   File
>>>>
>>>> "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/context.py",
>>>> line 219
>>>>     with SparkContext._lock:
>>>>                     ^
>>>> SyntaxError: invalid syntax/
>>>>
>>>> This is very similar to  this post from 2014
>>>> <
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-lock-Error-td18233.html
>>>> >
>>>> , but unlike that person I am using Python 2.7.8.
>>>>
>>>> Here is what I'm using:
>>>> Spark 1.3.1
>>>> Hadoop 2.4.0.2.1.5.0-695
>>>> Python 2.7.8
>>>>
>>>> Another clue:  I also installed Spark 1.6.0 and tried to submit the same
>>>> job.  I got a similar error:
>>>>
>>>> /Traceback (most recent call last):
>>>>   File "loss_rate_by_probe.py", line 15, in ?
>>>>     from pyspark import SparkContext
>>>>   File
>>>>
>>>> "/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0119/container_1450370639491_0119_01_000001/pyspark.zip/pyspark/__init__.py",
>>>> line 61
>>>>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>>>>                                                   ^
>>>> SyntaxError: invalid syntax/
>>>>
>>>> Any thoughts?
>>>>
>>>> Andrew
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-SyntaxError-invalid-syntax-tp25910.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>> For additional commands, e-mail: user-help@spark.apache.org
>>>>
>>>>
>>>
>>
>

Re: SparkContext SyntaxError: invalid syntax

Posted by Andrew Weiner <an...@u.northwestern.edu>.
Now for simplicity I'm testing with wordcount.py from the provided
examples, and using Spark 1.6.0

The first error I get is:

16/01/08 19:14:46 ERROR lzo.GPLNativeCodeLoader: Could not load native gpl
library
java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path
        at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1864)
        at [....]

A bit lower down, I see this error:

16/01/08 19:14:48 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0
(TID 0, mundonovo-priv): org.apache.spark.SparkException:
Error from python worker:
  python: module pyspark.daemon not found
PYTHONPATH was:

/scratch5/hadoop/yarn/local/usercache/awp066/filecache/22/spark-assembly-1.6.0-hadoop2.4.0.jar:/home/jpr123/hg.pacific/python-common:/home/jpr123/python-libs:/home/jpr123/lib/python2.7/site-packages:/home/zsb739/local/lib/python2.7/site-packages:/home/jpr123/mobile-cdn-analysis:/home/awp066/lib/python2.7/site-packages:/scratch4/hadoop/yarn/local/usercache/awp066/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/pyspark.zip:/scratch4/hadoop/yarn/local/usercache/awp066/appcache/application_1450370639491_0136/container_1450370639491_0136_01_000002/py4j-0.9-src.zip
java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at [....]

And then a few more similar pyspark.daemon not found errors...

Andrew



On Fri, Jan 8, 2016 at 2:31 PM, Bryan Cutler <cu...@gmail.com> wrote:

> Hi Andrew,
>
> I know that older versions of Spark could not run PySpark on YARN in
> cluster mode.  I'm not sure if that is fixed in 1.6.0 though.  Can you try
> setting deploy-mode option to "client" when calling spark-submit?
>
> Bryan
>
> On Thu, Jan 7, 2016 at 2:39 PM, weineran <
> andrewweiner2020@u.northwestern.edu> wrote:
>
>> Hello,
>>
>> When I try to submit a python job using spark-submit (using --master yarn
>> --deploy-mode cluster), I get the following error:
>>
>> /Traceback (most recent call last):
>>   File "loss_rate_by_probe.py", line 15, in ?
>>     from pyspark import SparkContext
>>   File
>>
>> "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/__init__.py",
>> line 41, in ?
>>   File
>>
>> "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/context.py",
>> line 219
>>     with SparkContext._lock:
>>                     ^
>> SyntaxError: invalid syntax/
>>
>> This is very similar to  this post from 2014
>> <
>> http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-lock-Error-td18233.html
>> >
>> , but unlike that person I am using Python 2.7.8.
>>
>> Here is what I'm using:
>> Spark 1.3.1
>> Hadoop 2.4.0.2.1.5.0-695
>> Python 2.7.8
>>
>> Another clue:  I also installed Spark 1.6.0 and tried to submit the same
>> job.  I got a similar error:
>>
>> /Traceback (most recent call last):
>>   File "loss_rate_by_probe.py", line 15, in ?
>>     from pyspark import SparkContext
>>   File
>>
>> "/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0119/container_1450370639491_0119_01_000001/pyspark.zip/pyspark/__init__.py",
>> line 61
>>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>>                                                   ^
>> SyntaxError: invalid syntax/
>>
>> Any thoughts?
>>
>> Andrew
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-SyntaxError-invalid-syntax-tp25910.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>> For additional commands, e-mail: user-help@spark.apache.org
>>
>>
>

Re: SparkContext SyntaxError: invalid syntax

Posted by Bryan Cutler <cu...@gmail.com>.
Hi Andrew,

I know that older versions of Spark could not run PySpark on YARN in
cluster mode.  I'm not sure if that is fixed in 1.6.0 though.  Can you try
setting deploy-mode option to "client" when calling spark-submit?

Bryan

On Thu, Jan 7, 2016 at 2:39 PM, weineran <
andrewweiner2020@u.northwestern.edu> wrote:

> Hello,
>
> When I try to submit a python job using spark-submit (using --master yarn
> --deploy-mode cluster), I get the following error:
>
> /Traceback (most recent call last):
>   File "loss_rate_by_probe.py", line 15, in ?
>     from pyspark import SparkContext
>   File
>
> "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/__init__.py",
> line 41, in ?
>   File
>
> "/scratch5/hadoop/yarn/local/usercache/<username>/filecache/18/spark-assembly-1.3.1-hadoop2.4.0.jar/pyspark/context.py",
> line 219
>     with SparkContext._lock:
>                     ^
> SyntaxError: invalid syntax/
>
> This is very similar to  this post from 2014
> <
> http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-lock-Error-td18233.html
> >
> , but unlike that person I am using Python 2.7.8.
>
> Here is what I'm using:
> Spark 1.3.1
> Hadoop 2.4.0.2.1.5.0-695
> Python 2.7.8
>
> Another clue:  I also installed Spark 1.6.0 and tried to submit the same
> job.  I got a similar error:
>
> /Traceback (most recent call last):
>   File "loss_rate_by_probe.py", line 15, in ?
>     from pyspark import SparkContext
>   File
>
> "/scratch5/hadoop/yarn/local/usercache/<username>/appcache/application_1450370639491_0119/container_1450370639491_0119_01_000001/pyspark.zip/pyspark/__init__.py",
> line 61
>     indent = ' ' * (min(len(m) for m in indents) if indents else 0)
>                                                   ^
> SyntaxError: invalid syntax/
>
> Any thoughts?
>
> Andrew
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-SyntaxError-invalid-syntax-tp25910.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>
>