Posted to reviews@spark.apache.org by JoshRosen <gi...@git.apache.org> on 2014/10/04 03:57:57 UTC

[GitHub] spark pull request: [SPARK-3772] Allow `ipython` to be used by Pys...

GitHub user JoshRosen opened a pull request:

    https://github.com/apache/spark/pull/2651

    [SPARK-3772] Allow `ipython` to be used by Pyspark workers; IPython fixes:

    This pull request addresses a few issues related to PySpark's IPython support:
    
    - Fix the remaining uses of the '-u' flag, which IPython doesn't support (see SPARK-3772).
    - Change PYSPARK_PYTHON_OPTS to PYSPARK_DRIVER_PYTHON_OPTS, so that the old name is reserved in case we ever want to allow the worker Python options to be customized (this variable was introduced in #2554 and hasn't landed in a release yet, so this doesn't break any compatibility).
    - Retain the old semantics for IPYTHON=1 and IPYTHON_OPTS (to avoid breaking existing example programs).
    
    There are more details in a large block comment in `bin/pyspark`.
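
    As a rough sketch only (not the actual script; the authoritative logic is the block comment in `bin/pyspark`), retaining the old semantics could amount to translating the deprecated variables into the newer ones before the shell starts:

    ```bash
    # Illustrative shim (placeholder variable choices): honour the deprecated
    # IPYTHON=1 / IPYTHON_OPTS settings so existing workflows keep working.
    if [[ "$IPYTHON" == "1" ]]; then
      PYSPARK_PYTHON="ipython"
      PYSPARK_DRIVER_PYTHON_OPTS="${PYSPARK_DRIVER_PYTHON_OPTS:-$IPYTHON_OPTS}"
    fi
    ```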

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark SPARK-3772

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2651.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2651
    
----
commit c4f57780ac03065ee8cecb71c3a950a68277f82f
Author: Josh Rosen <jo...@apache.org>
Date:   2014-10-03T22:43:16Z

    [SPARK-3772] Allow ipython to be used by Pyspark workers; IPython fixes:
    
    - Fix the remaining uses of the '-u' flag, which IPython doesn't support.
    - Change PYSPARK_PYTHON_OPTS to PYSPARK_DRIVER_PYTHON_OPTS, so that the old
      name is reserved in case we ever want to allow the worker Python options
      to be customized.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3772] Allow `ipython` to be used by Pys...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2651#issuecomment-58449460
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21490/




[GitHub] spark pull request: [SPARK-3772] Allow `ipython` to be used by Pys...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2651#issuecomment-57893128
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21284/consoleFull) for   PR 2651 at commit [`c4f5778`](https://github.com/apache/spark/commit/c4f57780ac03065ee8cecb71c3a950a68277f82f).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-3772] Allow `ipython` to be used by Pys...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2651#issuecomment-57891695
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21284/consoleFull) for   PR 2651 at commit [`c4f5778`](https://github.com/apache/spark/commit/c4f57780ac03065ee8cecb71c3a950a68277f82f).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3772] Allow `ipython` to be used by Pys...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2651#issuecomment-58449456
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21490/consoleFull) for   PR 2651 at commit [`7b8eb86`](https://github.com/apache/spark/commit/7b8eb86c303a83d83f99e7cd18dab020a76e001b).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-3772] Allow `ipython` to be used by Pys...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2651#issuecomment-57891560
  
    /cc @davies @cocotomo @robbles for reviews / feedback.




[GitHub] spark pull request: [SPARK-3772] Allow `ipython` to be used by Pys...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/2651




[GitHub] spark pull request: [SPARK-3772] Allow `ipython` to be used by Pys...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2651#issuecomment-58591914
  
    Thanks for the review!  I'm going to merge this.
    
    Fun discovery: I tried running some simple examples using IPython (Python 2.7) on the driver with PyPy on the workers and it seemed to work (probably because we didn't use `marshal`):
    
    ```
    PYSPARK_PYTHON=pypy PYSPARK_DRIVER_PYTHON=ipython ./bin/pyspark
    ```




[GitHub] spark pull request: [SPARK-3772] Allow `ipython` to be used by Pys...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2651#issuecomment-58276170
  
    The current patch looks good to me.
    
    One question: should we figure out which Python version IPython is running on, and then use the matching Python on the workers? In the most common case, users have IPython (on Python 2.7) on the driver but no Python 2.7 on the workers; we need to tell them how to make that work.
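
    A minimal sketch of one way that check could work (hypothetical, not part of this patch; `DRIVER_CMD` is a placeholder, and the driver command is assumed to accept a CPython-style `-c` flag, which both `python` and `ipython` do):

    ```bash
    # Ask the driver interpreter which Python version it runs on, then prefer a
    # matching pythonX.Y for the workers when one is on the PATH.
    DRIVER_CMD="ipython"   # placeholder for whatever command launches the driver
    DRIVER_PY_VERSION=$("$DRIVER_CMD" -c 'import sys; sys.stdout.write("%d.%d" % sys.version_info[:2])')
    if [ -z "$PYSPARK_PYTHON" ] && command -v "python$DRIVER_PY_VERSION" > /dev/null; then
      export PYSPARK_PYTHON="python$DRIVER_PY_VERSION"
    fi
    ```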




[GitHub] spark pull request: [SPARK-3772] Allow `ipython` to be used by Pys...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2651#issuecomment-57893129
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21284/




[GitHub] spark pull request: [SPARK-3772] Allow `ipython` to be used by Pys...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2651#issuecomment-58444611
  
    Updated based on offline discussion.  The key changes:
    
    - Introduce PYSPARK_DRIVER_PYTHON to allow the driver's Python executable to be different than the workers'.
    - Attempt to use `python2.7` as the default Python version and fall back to `python` if it's not available.
    - Refuse to launch IPython without `python2.7`, unless PYSPARK_PYTHON is set.
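
    A rough sketch of how that selection logic might look (illustrative only; the real shell code is in the updated `bin/pyspark`):

    ```bash
    # Illustrative only: pick a default worker Python, preferring python2.7.
    if [ -z "$PYSPARK_PYTHON" ]; then
      if [[ "$PYSPARK_DRIVER_PYTHON" == *ipython* ]] && ! command -v python2.7 > /dev/null; then
        # IPython 2.0+ itself requires Python 2.7, so without an explicit
        # PYSPARK_PYTHON there is no compatible worker interpreter to fall back to.
        echo "IPython requires Python 2.7+; please install it or set PYSPARK_PYTHON" 1>&2
        exit 1
      fi
      if command -v python2.7 > /dev/null; then
        PYSPARK_PYTHON="python2.7"
      else
        PYSPARK_PYTHON="python"
      fi
    fi
    export PYSPARK_PYTHON
    ```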




[GitHub] spark pull request: [SPARK-3772] Allow `ipython` to be used by Pys...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2651#issuecomment-57927210
  
    Before the 1.2 release, maybe it's time to rethink how the PySpark shell and scripts are run; using bin/pyspark or spark-submit is not very friendly for users, and maybe we could simplify it.
    
    Most of what bin/pyspark does is set up SPARK_HOME and PYTHONPATH, so I did this in my .bashrc:
    ```
    export SPARK_HOME=xxxx
    export PYTHONPATH=${PYTHONPATH}:${SPARK_HOME}/python:${SPARK_HOME}/python/lib/py4j-0.8.2.1-src.zip
    ``` 
    Then I can run any Python script that uses PySpark (most of them for testing), and I can easily choose which Python to run it with, such as ipython:
    ```
    ipython python/pyspark/tests.py
    ```
    If the Python version I use for the driver is not binary compatible with the default one, then I need to set PYSPARK_PYTHON:
    ```
    PYSPARK_PYTHON=pypy pypy rdd.py
    ```
    I think we could figure out the correct version to use for the workers automatically, so PYSPARK_PYTHON would not be needed in most cases. For example, if PYSPARK_PYTHON is not set, we could default the workers to the same Python executable used by the driver. We could also add a few special cases: for ipython, we could use python2.7 on the workers, since IPython itself runs on Python 2.7.
    
    Also, we could set SPARK_HOME and PYTHONPATH for the user when Spark is installed.
    
    bin/pyspark could become a sparkshell.py, so users could easily run it with whatever version of Python they choose.
    
    bin/spark-submit is still useful for submitting jobs to a cluster or adding files. At the same time, maybe we could introduce some default command-line arguments for general PySpark scripts, such as:
    ```bash
    daviesliu@dm:~/work/spark$ ipython wc.py -h
    Usage: wc.py [options] [args]
    
    Options:
      -h, --help            show this help message and exit
      -q, --quiet
      -v, --verbose
    
      PySpark Options:
        -m MASTER, --master=MASTER
        -p PARALLEL, --parallel=PARALLEL number of processes
        -c CPUS, --cpus=CPUS cpus used per task
        -M MEM, --mem=MEM  memory used per task
        --conf=CONF         path for configuration file
        --profile                  do profiling
    ```





[GitHub] spark pull request: [SPARK-3772] Allow `ipython` to be used by Pys...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2651#issuecomment-58429854
  
    We don't support IPython 1.0 anymore, so it seems reasonable to make Python 2.7 the default when using IPython (since IPython 2.0 requires at least Python 2.7).  It seems like there's a growing list of use-cases that we'd like to support:
    
    1. The same custom Python version (non IPython) used everywhere (e.g. I want to use PyPy).
    2. Stock IPython (`ipython`) with a different Python version on the workers.
    3. Custom IPython (#2167) with a different Python version on the workers.
    4. IPython (custom or stock) everywhere, including the workers.
    
    In Spark 1.1, we support 1 and 2 (via `IPYTHON=1`), but not 3 or 4.  In #2167, we tried to add support for 3 (Custom IPython) but ended up actually providing 4 (which was broken until #2554 and this PR).  
    
    For 3, we need a way to specify the driver's Python executable independently from the worker's executable.  Currently, `PYSPARK_PYTHON` affects all Python processes.  What if we added a `PYSPARK_DRIVER_PYTHON` option that only affected the driver Python (and which defaults to `PYSPARK_PYTHON` if unset)?  I think this would provide enough mechanism to support all four use-cases.
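
    In shell terms the default could be expressed roughly as follows (a sketch, not the final script):

    ```bash
    # Driver uses its own interpreter if one is requested; otherwise it falls
    # back to whatever the workers use.
    PYSPARK_DRIVER_PYTHON="${PYSPARK_DRIVER_PYTHON:-$PYSPARK_PYTHON}"
    ```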
    
    As far as defaults are concerned, maybe we could try `which python2.7` to check whether Python 2.7 is installed, and fall back to `python` if it's not available (this would only be used if PYSPARK_PYTHON wasn't set).  I've noticed that certain programs run _way_ faster under 2.7 (programs that use `json`, for example), so we should probably try to make `python2.7` the default if it's installed.
    
    Does this sound like a reasonable approach?




[GitHub] spark pull request: [SPARK-3772] Allow `ipython` to be used by Pys...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2651#issuecomment-58273093
  
    @davies Good points; we should definitely discuss this before 1.2.  I guess that the script also loads `spark-env.sh`, but this is less of an issue since we moved most of the configuration to SparkConf.
    
    Since we still need to support `spark-submit` and maintain backwards-compatibility with existing uses of the `pyspark` script, what do you think about this PR's approach?  At a minimum, we need to preserve the old behavior of the `IPYTHON=1` setting.  Does `PYSPARK_DRIVER_PYTHON_OPTS` seem like a reasonable thing to add to this script?
    
    I'd still like to discuss the rest of your proposal, but I want to get the fixes here merged first, because the IPython instructions on current master are broken and we need to re-introduce backwards-compatibility.




[GitHub] spark pull request: [SPARK-3772] Allow `ipython` to be used by Pys...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2651#issuecomment-58445558
  
    LGTM, thanks!




[GitHub] spark pull request: [SPARK-3772] Allow `ipython` to be used by Pys...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2651#issuecomment-57891571
  
    /cc @davies @cocoatomo @robbles for reviews / feedback.




[GitHub] spark pull request: [SPARK-3772] Allow `ipython` to be used by Pys...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2651#issuecomment-58445007
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21490/consoleFull) for   PR 2651 at commit [`7b8eb86`](https://github.com/apache/spark/commit/7b8eb86c303a83d83f99e7cd18dab020a76e001b).
     * This patch merges cleanly.

