Posted to dev@spark.apache.org by sryza <gi...@git.apache.org> on 2014/02/27 10:22:35 UTC

[GitHub] spark pull request: SPARK-1004. PySpark on YARN

GitHub user sryza opened a pull request:

    https://github.com/apache/spark/pull/30

    SPARK-1004.  PySpark on YARN

    This reopens https://github.com/apache/incubator-spark/pull/640 against the new repo

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sryza/spark sandy-spark-1004

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/30.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #30
    
----
commit e49ff667154de988a1cb58d90c9743c6c24ef5bc
Author: Josh Rosen <jo...@apache.org>
Date:   2014-01-24T18:19:58Z

    Automatically set Yarn env vars in PySpark (SPARK-1030).

commit 59ac972026a7600fded49d906ef27bbb017fc9d2
Author: Josh Rosen <jo...@apache.org>
Date:   2014-01-25T23:28:56Z

    WIP towards PySpark on YARN:
    
    - Remove reliance on SPARK_HOME on the workers.  Only the driver
      should know about SPARK_HOME.  On the workers, we ensure that the
      PySpark Python libraries are added to the PYTHONPATH.
    
    - Add a Makefile for generating a "fat zip" that contains PySpark's
      Python dependencies.  This is a bit of a hack and I'd be open to
      better packaging tools, but this doesn't require any extra Python
      libraries.  This use case doesn't seem to be well-addressed by the
      existing Python packaging tools: there are plenty of tools to package
      complete Python environments (such as pyinstaller and virtualenv) or
      to bundle *individual* libraries (e.g. distutils), but few to generate
      portable fat zips or eggs.
    
    This hasn't been tested with YARN and may not actually compile.

commit 54bd8c0aec51d5d5cb24d6453dea2fb627db05cd
Author: Josh Rosen <jo...@apache.org>
Date:   2014-02-19T06:27:21Z

    Add missing setup.py file for PySpark.

commit 514b2d0cfc8995b86186d02aebf61500d25df7db
Author: Sandy Ryza <sa...@cloudera.com>
Date:   2014-02-24T07:06:42Z

    Improvements

commit ee3cc204dcabd7d092e3d6ed205e01c5deffc7ca
Author: Sandy Ryza <sa...@cloudera.com>
Date:   2014-02-24T07:26:01Z

    Don't set SPARK_JAR

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-40286771
  
     Merged build triggered. 



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-41746962
  
    Posted a rebased version.



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-41592426
  
    @sryza We dug around a little more and realized it's a file permissions issue. On the CDH cluster, if I run the application as the user `ubuntu`, it fails; if we run it as `root`, it succeeds. This is because the jar in the file cache on the executor nodes has very restrictive permissions (`-r-x------  1 yarn`). Instead of raising an error about permissions, it simply says `No module named pyspark`.
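The failure mode above can be caught earlier with a readability pre-flight check. A minimal sketch (not Spark code; `check_readable` is a hypothetical helper, demonstrated on a stand-in temp file rather than a real localized jar):

```python
import os
import tempfile

def check_readable(path):
    """Return True if the current user can read `path`.

    A check like this would catch the case above, where YARN's file cache
    localizes the jar as `-r-x------ yarn` and the real error surfaces
    later as a misleading "No module named pyspark".
    """
    return os.path.exists(path) and os.access(path, os.R_OK)

# Demonstration with a stand-in for the localized assembly jar.
with tempfile.NamedTemporaryFile(suffix=".jar", delete=False) as f:
    jar = f.name
print(check_readable(jar))                  # True: we own the file
print(check_readable("/no/such/file.jar"))  # False
os.unlink(jar)
```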



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-38881926
  
    Can one of the admins verify this patch?



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-40139453
  
    Merged build finished. 



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-40762962
  
     Merged build triggered. 



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-41356226
  
    Forgot to mention, the line numbers are shifted because I added a few log statements in my test branch. Here's the link: https://github.com/andrewor14/spark/tree/test-pyspark-yarn



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-40155741
  
     Merged build triggered. 



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-41912646
  
    @tgravescs This is exactly what Patrick and I found in https://issues.apache.org/jira/browse/SPARK-1520. Some artifacts were removed to get it back under the size limit. Is something else coming back that makes the jar too big? See my command line there to investigate what's adding so many files.



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-41764738
  
    Also, we should definitely document how to set up PySpark on YARN, so the user doesn't have to jump through hoops to get a simple job running. The biggest thing is probably to emphasize that it only works if we build with Maven. Maybe we should also have a section that explains what to do when you run into the unhelpful `java.io.EOFException`. Or better still, throw a nicer exception message that prints out the PYTHONPATH and complains that it can't find pyspark.
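The suggestion above, failing with a message that includes the PYTHONPATH rather than an opaque exception, could look roughly like this (a sketch only; `import_or_explain` is a hypothetical helper, not actual Spark code):

```python
import importlib
import os

def import_or_explain(module_name):
    """Import `module_name`, or raise an ImportError that shows the
    PYTHONPATH so the user can see what the worker actually searched."""
    try:
        return importlib.import_module(module_name)
    except ImportError as e:
        raise ImportError(
            "Could not find module %r. PYTHONPATH is %r. If running on "
            "YARN, check that the assembly was built with Maven and that "
            "the localized jar is readable."
            % (module_name, os.environ.get("PYTHONPATH", "<unset>"))
        ) from e
```

On the worker, something like this in place of a bare `import pyspark` would turn the eventual `java.io.EOFException` into a message that points at the misconfigured path.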



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-41761129
  
    @andrewor14 if you could do a final test on this just to double check, I think it's good to go.



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-36223978
  
    Merged build started.



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-40771150
  
    Merged build finished. 



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/30#discussion_r10367892
  
    --- Diff: sbin/spark-config.sh ---
    @@ -34,3 +34,6 @@ this="$config_bin/$script"
     export SPARK_PREFIX=`dirname "$this"`/..
     export SPARK_HOME=${SPARK_PREFIX}
     export SPARK_CONF_DIR="$SPARK_HOME/conf"
    +# Add the PySpark classes to the PYTHONPATH:
    +export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
    --- End diff --
    
    So this script I think only gets called when launching the standalone daemons. Would it make more sense to put this in `spark-class`?



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-41735556
  
    Hey @sryza, when you have time can you update this to master? It seems that there are quite a few merge conflicts



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-41827629
  
    I tried it with the latest master jar with your examples and both of them work. What happens if you do the following on the submit node:
    
    ```
    PYTHONPATH=<jar> python
    ...
    >>> import pyspark
    >>> import py4j
    ```
    
    Then go to the `yarn.nodemanager.local-dirs` of each container node after setting `yarn.nodemanager.delete.debug-delay-sec` to a high value (as you previously suggested), and try the same on the spark.jar.
    
    For me, I am able to import both pyspark and py4j directly on both the submit node and the container nodes.



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by ahirreddy <gi...@git.apache.org>.
Github user ahirreddy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/30#discussion_r10403945
  
    --- Diff: python/Makefile ---
    @@ -0,0 +1,7 @@
    +assembly: clean
    +	python setup.py build --build-lib build/lib
    +	unzip lib/py4j*.zip -d build/lib
    +	cd build/lib && zip -r ../pyspark-assembly.zip .
    +
    --- End diff --
    
    Would it make more sense to try to package PySpark dependencies inside a jar, instead of requiring this extra step? Since jars are just zips, Python should be able to run against them in the same way, so most of this PR wouldn't have to be changed.
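The point that jars are just zips can be checked directly: Python's built-in zipimport mechanism imports packages from any zip archive on `sys.path`, regardless of its extension. A self-contained sketch (the archive and package names here are made up for illustration):

```python
import os
import sys
import tempfile
import zipfile

# Build a tiny "jar" that bundles a Python package, the way the assembly
# jar would bundle pyspark and py4j.
tmpdir = tempfile.mkdtemp()
archive = os.path.join(tmpdir, "fake-assembly.jar")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("demo_pkg/__init__.py", "ANSWER = 42\n")

# zipimport doesn't care about the .jar extension, only that the file
# is a valid zip archive.
sys.path.insert(0, archive)
import demo_pkg

print(demo_pkg.ANSWER)  # 42
```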



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-40773874
  
    Merged build started. 



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-41757854
  
    Updated PR moves unzipping py4j to an earlier phase so that it gets included the first time around.  Tested it out and saw it appear in the jar the first time.



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-41853864
  
    The command I used is: `mvn -Dyarn.version=2.4.0 -Dhadoop.version=2.4.0 -Pyarn -DskipTests package`
    
    I'll try doing a clean build with the 2.2.0 version to see if it makes a difference.
    
    unzip works fine (no errors). So it appears that if I unjar my assembly jar, remove the scala directory, and jar it back up, then it works fine. I'll try to investigate what in the scala directory might be causing problems.



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-40773808
  
    Jenkins, retest this please.



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-36953787
  
    Updated to 1.0.0 and removed incubating



[GitHub] spark issue #30: SPARK-1004. PySpark on YARN

Posted by databricks-jenkins <gi...@git.apache.org>.
Github user databricks-jenkins commented on the issue:

    https://github.com/apache/spark/pull/30
  
    **[Test build #8 has started](https://jenkins.test.databricks.com/job/spark-pull-request-builder/8/consoleFull)** for PR 30 at commit [`091cd1a`](https://github.com/apache/spark/commit/091cd1ad6f11d187c97223255d9f17ad13c8dd18).



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-41749230
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-36958876
  
    Merged build finished.



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-40159495
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14027/



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-40287754
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-41628627
  
    (Again, disregard my previous comment about this being a file permissions issue. We talked more offline about this and @sryza discovered that this is due to `SPARK_YARN_USER_ENV` not being shipped to executor nodes)



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-41356120
  
    @sryza I have tested this on a standalone cluster with success. However, I haven't been able to get it working on a CDH cluster. I tried building both with maven and SBT (the latter of which clearly doesn't work yet), but neither was fruitful.
    
    More specifically, I did
    
    ```
    mvn -Pyarn -Dhadoop.version=2.3.0-cdh5.0.0 -Dyarn.version=2.3.0-cdh5.0.0 -DskipTests clean package
    MASTER=yarn-client bin/pyspark
    ```
    
    and ran into
    
    ```
    14/04/25 03:16:54 INFO CoarseGrainedExecutorBackend: Got assigned task 0
    14/04/25 03:16:55 INFO Executor: Running task ID 0
    14/04/25 03:16:56 ERROR Executor: Exception in task ID 0
    java.io.EOFException
            at java.io.DataInputStream.readInt(DataInputStream.java:392)
            at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:183)
            at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:55)
            at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:42)
            at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:97)
            at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:57)
            at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
            at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
            at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
            at org.apache.spark.scheduler.Task.run(Task.scala:51)
            at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:210)
            at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:43)
            at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
            at java.security.AccessController.doPrivileged(Native Method)
            at javax.security.auth.Subject.doAs(Subject.java:415)
            at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
            at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:42)
            at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:175)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
            at java.lang.Thread.run(Thread.java:744)
    14/04/25 03:16:56 ERROR Executor: Uncaught exception in thread Thread[stderr reader for python,5,main]
    java.lang.NullPointerException
            at org.apache.spark.api.python.PythonWorkerFactory$$anon$3$$anonfun$run$3.apply$mcV$sp(PythonWorkerFactory.scala:171)
            at org.apache.spark.api.python.PythonWorkerFactory$$anon$3$$anonfun$run$3.apply(PythonWorkerFactory.scala:169)
            at org.apache.spark.api.python.PythonWorkerFactory$$anon$3$$anonfun$run$3.apply(PythonWorkerFactory.scala:169)
    ```
    
    I will spend some time digging into what the NPE is, but in the mean time do you see anything obvious that I'm missing?
    




[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-41925261
  
    So I randomly removed files and directories and got the count down to just below 65536 (it went to 65453), and the jar then worked. Note that the total including directories was actually 69373.
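The 65536 boundary is no coincidence: the classic zip end-of-central-directory record stores the entry count in an unsigned 16-bit field, so an archive with more than 65535 entries needs the Zip64 extension, which some tooling ignores. A sketch that reads the field directly (pure stdlib, small in-memory archive for illustration):

```python
import io
import struct
import zipfile

# Write a small in-memory zip and locate its end-of-central-directory
# (EOCD) record, whose signature is PK\x05\x06.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    for i in range(3):
        zf.writestr("entry_%d.txt" % i, "x")

data = buf.getvalue()
eocd = data.rfind(b"PK\x05\x06")

# Offset 10 within the EOCD holds "total number of entries" as an
# unsigned 16-bit little-endian integer -- hence the 65535 ceiling.
(total_entries,) = struct.unpack_from("<H", data, eocd + 10)
print(total_entries)  # 3
```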



[GitHub] spark issue #30: SPARK-1004. PySpark on YARN

Posted by swaapnika-guntaka <gi...@git.apache.org>.
Github user swaapnika-guntaka commented on the issue:

    https://github.com/apache/spark/pull/30
  
    I see the Java `EOFException` when I run a Python packaged jar (built with JDK 8) using Spark 2.2.
    I'm trying to run it with the command below:
    `time bash -x $SPARK_HOME/bin/spark-submit --driver-class-path .:<pathtojars>:</spark/python/lib> -v $PYTHONPATH/<packaged.jar> >& run.log` 
    ```
    Recent failure: Lost task 3.3 in stage 0.0 (TID 36, 10.15.163.25, executor 0): java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:166)
        at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89)
        at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:65)
        at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
        at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:128)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:395)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    ```


---



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-41919332
  
    @srowen thanks for the hint, that very well might be my problem, although I'm both building and running with JDK 7. I'll try cutting down the number of files rather than the size of the jar to see what happens.
    
    Looks like there are 67583 files in my assembly jar.
    
    using your command:
       1380 scala/tools/nsc/typechecker/
       1188 scala/reflect/internal/
        991 org/apache/hadoop/hdfs/protocol/proto/
        897 com/google/common/collect/
        895 tachyon/thrift/
        798 scala/tools/nsc/transform/
        750 scala/tools/nsc/interpreter/
        724 org/netlib/lapack/
        632 scala/
        531 scala/collection/
        492 org/apache/spark/storage/
        473 scala/collection/parallel/
        454 org/apache/spark/repl/
        441 scala/tools/nsc/transform/patmat/
        436 org/apache/spark/rdd/
        426 akka/actor/
        425 scala/tools/nsc/backend/icode/
        416 akka/io/
        400 org/apache/hadoop/yarn/proto/
        360 scala/collection/mutable/
        355 scala/tools/scalap/scalax/rules/scalasig/
        353 scala/collection/immutable/
        352 scala/tools/nsc/interactive/
        339 scala/tools/nsc/backend/jvm/
        327 org/apache/spark/scheduler/
        316 scala/tools/nsc/




[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-40528655
  
    SPARK_JAR is generally required by Spark on YARN, not PySpark on YARN in particular. The check in the Python code is just to warn users earlier. Do you have a patch that removes that requirement? Which one?



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-42038226
  
    I set up my Mac to work with the Maven builds, and when I build on the Mac and run the Python command there, it works. I've built on 3 different RedHat 6 boxes and run on 3 different clusters (all with JDK 7), and it doesn't work there.
    
    If I copy the assembly jar I built on my Mac over to the RHEL boxes, then it works.



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/30#discussion_r11653682
  
    --- Diff: python/pyspark/context.py ---
    @@ -130,6 +130,13 @@ def __init__(self, master=None, appName=None, sparkHome=None, pyFiles=None,
                     varName = k[len("spark.executorEnv."):]
                     self.environment[varName] = v
     
    +        # Check if we're running on YARN:
    +        if self.master == "yarn-client":
    +            if not os.environ.get("SPARK_JAR"):
    +                raise Exception("Must set SPARK_JAR when using yarn-client mode")
    +            if not os.environ.get("PYSPARK_ZIP"):
    --- End diff --
    
    Actually even better - why not just have `./bin/pyspark` go and make this jar if it's not made already. Then you wouldn't have to expose any new options to the user.



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-36958877
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13030/



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-41746881
  
    Merged build started. 



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-41756737
  
    @andrewor14 So you are saying that if I just download this from a blank build it won't work... but if I happen to build twice it will work.
    
    I wonder if the issue might be that the maven-exec-plugin isn't guaranteed to execute before the packaging of the jar itself.



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-36226245
  
    Merged build finished.



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-36951182
  
    Hey @JoshRosen, mind taking a look at this? I think @sryza has tested it on YARN, but I personally don't know enough about Python packaging to look it over with confidence.



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-41949135
  
    I've got 67000+ files (not counting directories) in my assembly jar.



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-40763512
  
    Merged build started. 



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-40152504
  
    Merged build started. 



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-42057384
  
    Looks like on both environments I'm running exactly the same versions of Java (1.7.0), Scala (2.10.4), and Maven (3.2.1). This suggests that it does have something to do with RedHat, which seems to be a potential problem only for HDP clusters (they only support RedHat / SUSE). A workaround is to manually copy the jars over, as @tgravescs and I have done.



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/30#discussion_r10410916
  
    --- Diff: sbin/spark-config.sh ---
    @@ -34,3 +34,6 @@ this="$config_bin/$script"
     export SPARK_PREFIX=`dirname "$this"`/..
     export SPARK_HOME=${SPARK_PREFIX}
     export SPARK_CONF_DIR="$SPARK_HOME/conf"
    +# Add the PySpark classes to the PYTHONPATH:
    +export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
    --- End diff --
    
    Good point; I think we should move these lines to `spark-class` to make sure that workers use the right PYTHONPATH even if they're started manually through `spark-class`.



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-36955199
  
    Merged build started.



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/30#discussion_r11653646
  
    --- Diff: python/pyspark/context.py ---
    @@ -130,6 +130,13 @@ def __init__(self, master=None, appName=None, sparkHome=None, pyFiles=None,
                     varName = k[len("spark.executorEnv."):]
                     self.environment[varName] = v
     
    +        # Check if we're running on YARN:
    +        if self.master == "yarn-client":
    +            if not os.environ.get("SPARK_JAR"):
    +                raise Exception("Must set SPARK_JAR when using yarn-client mode")
    +            if not os.environ.get("PYSPARK_ZIP"):
    --- End diff --
    
    Rather than exposing this to the user, why not just export it in the `./bin/pyspark` script, and there you can fail with a message that says you need to run `make` if the user hasn't done it already.



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on a diff in the pull request:

    https://github.com/apache/spark/pull/30#discussion_r11089191
  
    --- Diff: python/Makefile ---
    @@ -0,0 +1,7 @@
    +assembly: clean
    +	python setup.py build --build-lib build/lib
    +	unzip lib/py4j*.zip -d build/lib
    +	cd build/lib && zip -r ../pyspark-assembly.zip .
    +
    --- End diff --
    
    I could probably figure out how to do this in Maven, but have no idea how to do it with SBT.



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-40155748
  
    Merged build started. 



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-41997425
  
    Yes, 65536 is the magic number, of course: the classic zip format stores the entry count in a 16-bit field, and Java 6 can't read jars that go past it. I know the previous change to remove fastutil knocked out about 10K files and brought it under control. Excluding jruby probably helped too.
    
    I wonder why it's back over. Is it a recent new dependency, or does it only occur for the Hadoop/YARN dependency profiles?
    
    The file count isn't showing obvious culprits. My command is pretty simplistic and so may not be showing the issue. I think it's worth examining the contents in more detail to find the culprit.
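    For anyone checking their own build, here is a hedged Python sketch that flags a jar whose entry count needs ZIP64 (the jar path in the usage note is a placeholder):

```python
# Sketch: flag jars whose entry count exceeds what the classic
# (non-ZIP64) zip format can record. The end-of-central-directory
# record holds the entry count in a 16-bit field, so 65535 is the cap;
# jars beyond that need ZIP64, which Java 6's jar handling lacks.
import zipfile

MAX_CLASSIC_ENTRIES = 65535  # 16-bit entry-count field

def needs_zip64(jar_path):
    """Return (entry_count, over_limit) for the given jar."""
    with zipfile.ZipFile(jar_path) as jar:
        n = len(jar.namelist())
    return n, n > MAX_CLASSIC_ENTRIES

# Usage (hypothetical path):
# n, over = needs_zip64("spark-assembly.jar")
# print(n, "entries; requires ZIP64:", over)
```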



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-36955197
  
     Merged build triggered.



[GitHub] spark issue #30: SPARK-1004. PySpark on YARN

Posted by databricks-jenkins <gi...@git.apache.org>.
Github user databricks-jenkins commented on the issue:

    https://github.com/apache/spark/pull/30
  
    **[Test build #27 has started](https://jenkins.test.databricks.com/job/spark-pull-request-builder/27/consoleFull)** for PR 30 at commit [`091cd1a`](https://github.com/apache/spark/commit/091cd1ad6f11d187c97223255d9f17ad13c8dd18).





[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-40153246
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14023/



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-41798885
  
    Maybe I have something configured wrong, but I'm still getting a lot of EOFExceptions. Certain actions seem to work fine, but when I try to do anything that really runs on the executors I get EOFExceptions again and /usr/bin/python: No module named pyspark. I'm just using what's checked into master.
    
    # this works
    >>> words = sc.textFile("README.md")
    >>> words.filter(lambda w: w.startswith("spar")).take(5)
    >>> words.collect()
    
    # this doesn't
    >>> words = sc.textFile("README.md")
    >>> words.filter(lambda w: w.startswith("spar")).collect()
    >>> words.count()
    
    Ideas?
    
    I checked and PYTHONPATH is set on the executor to be =spark.jar, and py4j is in the assembly jar. I'm launching with MASTER=yarn-client ./bin/pyspark.



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-36223976
  
     Merged build triggered.



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-41753415
  
    Update - with the latest commits from master I am able to run PySpark on YARN successfully on both CDH and HDP clusters.
    
    There is still an issue with the Maven build, however. The first build produces a jar that does not include the `py4j/*.py` files, while the second build produces one that does include all needed files. This is because we try to include these Python files before unzipping `py4j*.zip`.
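    A quick way to tell whether a given assembly is affected is to check it for the py4j sources; below is a hedged Python sketch (the jar path in the usage note is a placeholder):

```python
# Sketch: verify that an assembly jar actually contains the py4j .py
# sources PySpark needs at runtime. Useful for spotting a jar produced
# by a build that packaged the Python files in the wrong order.
import zipfile

def missing_py4j_sources(jar_path):
    """Return True if the jar has no py4j/*.py entries."""
    with zipfile.ZipFile(jar_path) as jar:
        return not any(
            name.startswith("py4j/") and name.endswith(".py")
            for name in jar.namelist()
        )

# Usage (hypothetical path):
# if missing_py4j_sources("spark-assembly.jar"):
#     print("rebuild needed: py4j sources are absent from the assembly")
```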



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-41759555
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-40151340
  
     Merged build triggered. 



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-40762975
  
    Merged build started. 



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-40771152
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14212/



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-40527887
  
    @sryza I think this is looking good. I played around with this on a local YARN install and it worked. I have just two points. Could we ditch requiring SPARK_JAR? I'm going to merge a patch shortly that removes that requirement. Also, could we just automatically create the pyspark zip file and not expose it to the user?
    
    Eventually we'll probably bundle this inside of the Spark assembly... but in the meantime, having a thing that "just works" for users, where they don't have to e.g. set environment variables, would be nice.



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-40763499
  
     Merged build triggered. 



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-41758004
  
    Merged build started. 



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/30#discussion_r12129988
  
    --- Diff: sbin/spark-config.sh ---
    @@ -34,3 +34,6 @@ this="$config_bin/$script"
     export SPARK_PREFIX=`dirname "$this"`/..
     export SPARK_HOME=${SPARK_PREFIX}
     export SPARK_CONF_DIR="$SPARK_HOME/conf"
    +# Add the PySpark classes to the PYTHONPATH:
    +export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
    --- End diff --
    
    Looks like we never addressed this. Should we move this into `spark-submit`, now that we have that?



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-40155675
  
    Jenkins, retest this please.



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-40284983
  
    @sryza this is failing due to a Python syntax error. In general, if you wouldn't mind, it would be good to run tests locally before pushing, since spinning up the test suite on Jenkins does take time. This syntax error would be caught immediately by the test suite.



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-41758002
  
     Merged build triggered. 



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-40138022
  
    I posted a patch that addresses Josh's comments and updates the Python Programming Guide.



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-41963400
  
    Tom, are you 100% sure you are running with JDK 7? Just curious, because this is the exact problem we've observed when building with JDK 7 and running with JDK 6.



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by jyotiska <gi...@git.apache.org>.
Github user jyotiska commented on a diff in the pull request:

    https://github.com/apache/spark/pull/30#discussion_r10117264
  
    --- Diff: python/setup.py ---
    @@ -0,0 +1,30 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +from distutils.core import setup
    +
    +
    +setup(
    +    name='pyspark',
    +    version='0.9.0-incubating-SNAPSHOT',
    +    description='Python API for Spark',
    +    author='The Apache Software Foundation',
    +    author_email='user@spark.incubator.apache.org',
    +    license='Apache License 2.0',
    +    url='spark-project.org',
    +    packages=['pyspark'],
    --- End diff --
    
    I believe we should add numpy as a required dependency. Also, "incubating" should be removed from <code>version</code> and <code>author_email</code>.



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-40159493
  
    Merged build finished. 



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-40287755
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14079/



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-40770821
  
    Merged build finished. 



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-40153244
  
    Merged build finished. 


---

[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-41764569
  
    Thanks - merged!


---

[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/30


---

[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-36223993
  
     Merged build triggered.


---

[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-36226247
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/12913/


---

[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-42084965
  
    Sure. In the event that we don't figure out what the issue is by the time we cut a release candidate for 1.0, we should at the very least document the peculiarity with RedHat so that users don't have to go through the headaches that you did.
    
    But yes, I agree that we should not just give up on RedHat, since it is a common platform. We can even look into it for 1.0.1.


---

[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-41673696
  
    Not sure if it's the same error as yours, but when I tried this patch on YARN last week, the error I saw in the executor log above the EOFException was:
    
    /usr/bin/python: No module named py4j.java_gateway



---

[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-40763303
  
    Updated patch places the python files in the Spark jar itself, so no additional build or configuration steps are required.  I've only had the chance to test it on a pseudo-distributed YARN cluster so far.


---

[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-41912194
  
    Thanks for the help @andrewor14. It almost appears to be a jar file size thing: if I unjar the assembly jar, remove just a little bit of content, and jar it back up, it seems to work. I'll try to look into that some more.


---

[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-41759556
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14580/


---

[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-40773861
  
     Merged build triggered. 


---

[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-41447512
  
    It looks like the root cause of the error is that the command to start python is failing for some reason.
    
    You're running CDH5, I assume? Is it a CM-managed CDH cluster?


---

[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-41846495
  
    Hm that should work. What command did you build the jar with? I did
    
    ```mvn -Pyarn -Dhadoop.version=2.2.0 -Dyarn.version=2.2.0 -DskipTests clean package```
    
    And if you do `unzip -l <jar> | head`, does it spew any warnings or errors about the length of the jar being incorrect? I built the jar with the same command in two different environments, on the CentOS HDP cluster and on my local OS X machine, and the former gave me a corrupt jar.
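    The `unzip -l` / `zip -T` style of integrity check can also be done from Python. A minimal sketch (the function name is mine, and the jar path would be whatever assembly you built):

```python
import zipfile

def verify_archive(path):
    """CRC-check every member of a zip archive (jars are zips),
    roughly what `zip -T` verifies; returns None when the archive
    is intact, else the name of the first corrupt member."""
    with zipfile.ZipFile(path) as zf:
        return zf.testzip()
```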


---

[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/30#discussion_r10410991
  
    --- Diff: python/pyspark/java_gateway.py ---
    @@ -15,6 +15,7 @@
     # limitations under the License.
     #
     
    +from glob import glob
    --- End diff --
    
    I added this import in my original patch, but it's unused now and can be removed.


---

[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-40770822
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14211/


---

[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-41749231
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14573/


---

[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-41746874
  
     Merged build triggered. 


---

[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-40144329
  
    Mind adding a license header?


---

[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-41832999
  
    Just trying it from my build directory, it doesn't find pyspark, so perhaps my jar isn't built right, although when I look at the jar, it's in there.
    
    $ PYTHONPATH=assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.4.0.jar  python
    Python 2.6.6 (r266:84292, May 27 2013, 05:35:12) 
    [GCC 4.4.7 20120313 (Red Hat 4.4.7-3)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import pyspark
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ImportError: No module named pyspark
    
    $ jar -tvf assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.4.0.jar  | grep pyspark
         0 Wed Apr 30 14:45:44 UTC 2014 pyspark/
      8970 Wed Apr 30 14:45:44 UTC 2014 pyspark/tests.py
      4080 Wed Apr 30 14:45:44 UTC 2014 pyspark/worker.py
      ...
      ...
      4333 Wed Apr 30 14:45:44 UTC 2014 pyspark/statcounter.py


---

[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-41459274
  
    @sryza I ended up rebuilding the jar and this time it worked!
    
    For the failed jar, I was able to confirm that py4j/*.py were missing (even though py4j/*.class were present). I have no idea how that is possible, considering that I built the jars using the same maven command.


---

[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-40138858
  
     Merged build triggered. 


---

[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-42069910
  
    I'm not really fond of this workaround, as our normal build systems are RedHat. Are you proposing we document the limitation in 1.0 and investigate it more for 1.1?


---

[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-40139456
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14015/


---

[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-42050399
  
    From this it seems that we haven't been able to build a jar on a RedHat-based OS that runs PySpark on YARN. I've noticed the same thing, and I wonder if it's simply an artifact of our setup or something inherent to RedHat altogether. I will investigate the different versions of Java / Scala / Maven in the two environments to confirm.


---

[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/30#discussion_r11649970
  
    --- Diff: python/pyspark/context.py ---
    @@ -130,6 +130,13 @@ def __init__(self, master=None, appName=None, sparkHome=None, pyFiles=None,
                     varName = k[len("spark.executorEnv."):]
                     self.environment[varName] = v
     
    +        # Check if we're running on YARN:
    +        if self.master == "yarn-client":
    +            if not os.environ.get("SPARK_JAR"):
    --- End diff --
    
    @sryza do we need to require this specifically for pyspark? I've proposed some changes in #299 that allow us to detect the jar automatically, so I don't think this is always necessary.


---

[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-40777872
  
    Merged build finished. All automated tests passed.


---

[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-39881130
  
    @sryza I think it's okay to do a first cut of this that asks users to build the python assembly. Would you be able to address the outstanding comments here and also update the Python Programming Guide to explain how to use this on YARN?


---

[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-41757036
  
    @sryza I also asked @ahirreddy to look into whether we can just publish a jar to maven central that contains the py4j python side. Then we can just depend on that jar and be done with it and not count on `unzip` being installed. He seemed to think it was possible...


---

[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-38977379
  
    We could ditch make entirely and call the Python build from Maven.  I don't trust myself to get this working in SBT though - would somebody else be able to pick that up?  Or leave it to a separate patch?


---

[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/30#discussion_r10410942
  
    --- Diff: python/Makefile ---
    @@ -0,0 +1,7 @@
    +assembly: clean
    +	python setup.py build --build-lib build/lib
    +	unzip lib/py4j*.zip -d build/lib
    +	cd build/lib && zip -r ../pyspark-assembly.zip .
    +
    --- End diff --
    
    Are you envisioning including the PySpark dependencies in the Spark assembly jar?  I think that could work, since we need to build that jar anyway when running under YARN.
    
    I'm not sure how easy it will be to modify the Maven or SBT builds to include those files.


---

[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/30#discussion_r10366986
  
    --- Diff: python/setup.py ---
    @@ -0,0 +1,30 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +from distutils.core import setup
    +
    +
    +setup(
    +    name='pyspark',
    +    version='0.9.0-incubating-SNAPSHOT',
    --- End diff --
    
    Mind updating this to `1.0.0-SNAPSHOT`? Also mind removing "incubator" from below?


---

[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/30#discussion_r10410987
  
    --- Diff: python/pyspark/java_gateway.py ---
    @@ -66,3 +71,30 @@ def run(self):
         java_import(gateway.jvm, "org.apache.spark.mllib.api.python.*")
         java_import(gateway.jvm, "scala.Tuple2")
         return gateway
    +
    +def set_env_vars_for_yarn(pyspark_zip):
    +    if "SPARK_YARN_DIST_FILES" in os.environ:
    +        os.environ["SPARK_YARN_DIST_FILES"] += ("," + pyspark_zip)
    +    else:
    +        os.environ["SPARK_YARN_DIST_FILES"] = pyspark_zip
    +    
    +    # Add the pyspark zip to the python path
    +    env_map = parse_env(os.environ.get("SPARK_YARN_USER_ENV", ""))
    +    if "PYTHONPATH" in env_map:
    +        env_map["PYTHONPATH"] += (":" + os.path.basename(pyspark_zip))
    +    else:
    +        env_map["PYTHONPATH"] = os.path.basename(pyspark_zip)
    +
    +    os.environ["SPARK_YARN_USER_ENV"] = ",".join(map(lambda v: v[0] + "=" + v[1],
    --- End diff --
    
    I think you can write this a little more clearly as
    
    ```
     os.environ["SPARK_YARN_USER_ENV"] = ",".join(k + '=' + v for (k, v) in env_map.items())
    ```
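    Run standalone, the suggested generator-expression form behaves like the original `map`/`lambda` version. Note the `sorted()` call is an addition of mine (not in the suggestion above) to make the serialized order deterministic; the values are illustrative:

```python
# Serialize an env-var map to the comma-separated form used for
# SPARK_YARN_USER_ENV; sorting makes the output order stable.
env_map = {"PYTHONPATH": "pyspark.zip", "SPARK_HOME": "/opt/spark"}
serialized = ",".join(k + "=" + v for (k, v) in sorted(env_map.items()))
print(serialized)  # PYTHONPATH=pyspark.zip,SPARK_HOME=/opt/spark
```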


---

[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-36953290
  
    Hey @sryza I tested this using a local standalone cluster and it didn't seem to work. The executors failed when they were asked to launch pyspark:
    
    ```
    14/03/06 15:50:56 INFO Executor: Running task ID 1
    14/03/06 15:50:56 INFO Executor: Running task ID 2
    /usr/bin/python: No module named pyspark
    14/03/06 15:50:57 ERROR Executor: Exception in task ID 3
    java.io.EOFException
            at java.io.DataInputStream.readInt(DataInputStream.java:392)
            at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:177)
            at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:55)
            at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:42)
            at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:88)
            at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:53)
            at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:239)
            at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
            at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
    ```
    
    Here is what I ran:
    
        ./bin/spark-class org.apache.spark.deploy.master.Master
        ./bin/spark-class org.apache.spark.deploy.worker.Worker <master url>
        MASTER=<master url> ./bin/pyspark
        >>> sc.parallelize(range(1000), 10).count()



---

[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-41735835
  
    @tgravescs the error above my EOFException on the executors was `/usr/bin/python: No module named pyspark`. I think both your error and mine are caused by the fact that the jar is not on the `PYTHONPATH`, because `SPARK_YARN_USER_ENV` is not shipped to the executors. I believe this is fixed in #586.
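    The fix direction described above can be sketched in a few lines. This is a sketch under the thread's assumptions, not the actual code of #586 (which may differ), and the function name is mine:

```python
def ship_pythonpath_to_executors(jar_name, user_env=""):
    """Fold a PYTHONPATH entry into the SPARK_YARN_USER_ENV string
    (comma-separated KEY=VALUE pairs) so that YARN exports it on the
    executors, putting the assembly jar on their Python path."""
    entry = "PYTHONPATH=" + jar_name
    return user_env + "," + entry if user_env else entry
```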


---

[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/30#discussion_r10410971
  
    --- Diff: python/pyspark/java_gateway.py ---
    @@ -66,3 +71,30 @@ def run(self):
         java_import(gateway.jvm, "org.apache.spark.mllib.api.python.*")
         java_import(gateway.jvm, "scala.Tuple2")
         return gateway
    +
    +def set_env_vars_for_yarn(pyspark_zip):
    +    if "SPARK_YARN_DIST_FILES" in os.environ:
    +        os.environ["SPARK_YARN_DIST_FILES"] += ("," + pyspark_zip)
    +    else:
    +        os.environ["SPARK_YARN_DIST_FILES"] = pyspark_zip
    +    
    +    # Add the pyspark zip to the python path
    +    env_map = parse_env(os.environ.get("SPARK_YARN_USER_ENV", ""))
    +    if "PYTHONPATH" in env_map:
    +        env_map["PYTHONPATH"] += (":" + os.path.basename(pyspark_zip))
    +    else:
    +        env_map["PYTHONPATH"] = os.path.basename(pyspark_zip)
    +
    +    os.environ["SPARK_YARN_USER_ENV"] = ",".join(map(lambda v: v[0] + "=" + v[1],
    +        env_map.items()))
    +
    +def parse_env(env_str):
    +    # Turns a comma-separated list of env settings into a dict that maps env vars to
    +    # their values.
    +    env = {}
    +    for var_str in env_str.split(","):
    +        parts = var_str.split("=")
    +        if len(parts) == 2:
    --- End diff --
    
    Do you think it would be worth it to crash or throw an error when passed an invalid env string?
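    One way the stricter behavior could look, as a sketch rather than the patch's code (the function name and error message are mine):

```python
def parse_env_strict(env_str):
    """Parse a comma-separated KEY=VALUE string into a dict,
    raising on malformed entries instead of silently dropping them."""
    env = {}
    if not env_str:
        return env
    for var_str in env_str.split(","):
        parts = var_str.split("=", 1)  # split once so values may contain '='
        if len(parts) != 2 or not parts[0]:
            raise ValueError("Malformed env entry: %r" % var_str)
        env[parts[0]] = parts[1]
    return env
```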


---

[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-40138919
  
    Merged build started. 


---

[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-41949042
  
    @andrewor14  I'm curious, can you check to see how many files and directories are in your assembly jar?



[GitHub] spark issue #30: SPARK-1004. PySpark on YARN

Posted by databricks-jenkins <gi...@git.apache.org>.
Github user databricks-jenkins commented on the issue:

    https://github.com/apache/spark/pull/30
  
    **[Test build #13 has finished](https://jenkins.test.databricks.com/job/spark-pull-request-builder/13/consoleFull)** for PR 30 at commit [`091cd1a`](https://github.com/apache/spark/commit/091cd1ad6f11d187c97223255d9f17ad13c8dd18).
     * This patch **fails some tests**.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-41983996
  
    Mine definitely has more than 67000 files too



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/30#discussion_r11654985
  
    --- Diff: docs/python-programming-guide.md ---
    @@ -63,6 +63,11 @@ All of PySpark's library dependencies, including [Py4J](http://py4j.sourceforge.
     Standalone PySpark applications should be run using the `bin/pyspark` script, which automatically configures the Java and Python environment using the settings in `conf/spark-env.sh` or `.cmd`.
     The script automatically adds the `bin/pyspark` package to the `PYTHONPATH`.
     
    +# Running PySpark on YARN
    +
    +Running PySpark on a YARN-managed cluster requires a few extra steps. The client must reference a ZIP file containing PySpark and its dependencies. To create this file, run "make" inside the `python/` directory in the Spark source. This will generate `pyspark-assembly.zip` under `python/build/`. Then, set the `PYSPARK_ZIP` environment variable to point to the location of this file. Lastly, set `MASTER=yarn-client`.
    --- End diff --
    
    If you make the proposed changes this could be simplified to just saying that you can run it in yarn-client mode.



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-40286775
  
    Merged build started. 



[GitHub] spark issue #30: SPARK-1004. PySpark on YARN

Posted by databricks-jenkins <gi...@git.apache.org>.
Github user databricks-jenkins commented on the issue:

    https://github.com/apache/spark/pull/30
  
    **[Test build #13 has started](https://jenkins.test.databricks.com/job/spark-pull-request-builder/13/consoleFull)** for PR 30 at commit [`091cd1a`](https://github.com/apache/spark/commit/091cd1ad6f11d187c97223255d9f17ad13c8dd18).





[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-41764337
  
    Just confirmed that this works on a CDH cluster. This should be ready for merge.



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-41756809
  
    Yeah, that's the issue.  Looking into the best way to fix it.



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-36953590
  
    You need to run make inside the python directory first.  Did you do that?  (This obviously needs to be documented).



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-40286819
  
    My bad.  I made a change after running tests and should have re-run them.  Posted a patch that fixes the syntax error.



[GitHub] spark issue #30: SPARK-1004. PySpark on YARN

Posted by databricks-jenkins <gi...@git.apache.org>.
Github user databricks-jenkins commented on the issue:

    https://github.com/apache/spark/pull/30
  
    **[Test build #8 has finished](https://jenkins.test.databricks.com/job/spark-pull-request-builder/8/consoleFull)** for PR 30 at commit [`091cd1a`](https://github.com/apache/spark/commit/091cd1ad6f11d187c97223255d9f17ad13c8dd18).
     * This patch **fails some tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.





[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-41459452
  
    (Please disregard the previous comment that I just deleted. I still can't get PySpark to work on the CDH cluster)



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-40528959
  
    #299 will remove this as a requirement - there isn't any reason we can't just look at the currently present Spark jar and ship that as the default.



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-36957701
  
    Ah okay works fine when I do that. Sorry about that.



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-41453892
  
    Yes, this is CDH5 managed by CM



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-40777875
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14216/



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-37109537
  
    @sryza another thing here is, whatever the make target ends up being we should add it to the `make_release` script and the `make-distribution` script (those two need to be merged soon but for now they both exist).



[GitHub] spark pull request: SPARK-1004. PySpark on YARN

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/30#issuecomment-37108992
  
    I left a few minor comments in the diff, but overall this looks good to me.
    
    It might be worth adding build/run instructions in either the PySpark Programming Guide or YARN guide.
    
    It also occurred to me that the Makefile-based build for the PySpark fat zip might be a problem for Windows users; Scala/Java Spark works fine under Cygwin, but PySpark only works in cmd.exe / powershell (the main difficulty is that in some cases the Java and Python halves of the PySpark driver expect different types of paths, so we'd have to replicate parts of the cygpath logic in Java and Python).  I suppose we could use the Python [`zipfile`](http://docs.python.org/2/library/zipfile) library and implement the build script in Python.  Or, as @ahirreddy suggested, maybe we could package the Python libraries into a JAR.
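    As a rough illustration of the `zipfile`-based alternative suggested above (the function name, paths, and output file are assumptions for the sketch, not the actual build script):

    ```python
    import os
    import zipfile

    def build_pyspark_zip(src_dir, out_path):
        """Recursively zip the contents of src_dir into out_path, storing
        entries relative to src_dir so the archive can sit on PYTHONPATH."""
        with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
            for root, _, files in os.walk(src_dir):
                for name in files:
                    full = os.path.join(root, name)
                    zf.write(full, os.path.relpath(full, src_dir))

    # Hypothetical usage, mirroring what the Makefile produces:
    # build_pyspark_zip("python", "python/build/pyspark-assembly.zip")
    ```

    Because `zipfile` is part of the Python standard library, this would run identically under cmd.exe, PowerShell, and POSIX shells, sidestepping the Cygwin path issues mentioned above.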

