Posted to reviews@spark.apache.org by Stibbons <gi...@git.apache.org> on 2016/07/13 13:28:56 UTC

[GitHub] spark pull request #14180: Wheelhouse and VirtualEnv support

GitHub user Stibbons opened a pull request:

    https://github.com/apache/spark/pull/14180

    Wheelhouse and VirtualEnv support

    ## What changes were proposed in this pull request?
    
    Support virtualenv and wheel in PySpark, based on SPARK-13587. 
    Full description in [SPARK-16367](https://issues.apache.org/jira/browse/SPARK-16367)
    
    
    ## How was this patch tested?
    
    Manually tested on Ubuntu, Spark Standalone

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Stibbons/spark wheelhouse_support

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14180.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14180
    
----
commit b3b4aabb369959f9f668c4f163aeb940e6d979da
Author: Jeff Zhang <zj...@apache.org>
Date:   2016-01-29T03:43:24Z

    temp save

commit 61a5ae2f55bf103c3e7385f726d801cf90f5fc97
Author: Jeff Zhang <zj...@apache.org>
Date:   2016-02-01T07:54:54Z

    change it to java 7 stule

commit e1a204bfef44d8513cb7129e2ce150ce9c777d4a
Author: Jeff Zhang <zj...@apache.org>
Date:   2016-02-01T08:14:03Z

    minor fix

commit 2ba31d42c93d820d0120c0e6d8b7890763ba88cc
Author: Jeff Zhang <zj...@apache.org>
Date:   2016-02-02T01:42:53Z

    fix shebang line limitation

commit 50a0047cdc574560b3728a6b49780d819a75205d
Author: Jeff Zhang <zj...@apache.org>
Date:   2016-02-02T03:55:38Z

    minor refactoring

commit a2382121932891f57db08c6666368fb6101a506c
Author: Jeff Zhang <zj...@apache.org>
Date:   2016-02-03T07:22:26Z

    fix cache_dir issue

commit 915f442d1b0de793df8dd6099c295b5e811f808c
Author: Jeff Zhang <zj...@apache.org>
Date:   2016-06-10T11:39:18Z

    Revert "[SPARK-15803][PYSPARK] Support with statement syntax for SparkSession"
    
    This reverts commit 2ab64b41137374b935f939d919fec7cb2f56cd63.

commit 92816bb0ee2947bb6972f10e93767ad2259d198b
Author: Gaetan Semet <ga...@xeberon.net>
Date:   2016-07-01T14:31:53Z

    reorg python imports statements

commit 791d2bc284008b1cd25115a03990fa4d1bb9251e
Author: Gaetan Semet <ga...@xeberon.net>
Date:   2016-07-11T15:39:58Z

    [SPARK-16367][PYSPARK] Add wheelhouse support
    
    - Merge of #13599 ("virtualenv in pyspark", Bug SPARK-13587)
    - and #5408 ("wheel package support for Pyspark", bug SPARK-6764)
    - Documentation updated
    
    Signed-off-by: Gaetan Semet <ga...@xeberon.net>

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14180: Wheelhouse and VirtualEnv support

Posted by Stibbons <gi...@git.apache.org>.
Github user Stibbons commented on the issue:

    https://github.com/apache/spark/pull/14180
  
    Opened #14567 with PEP 8 fixes, import reorganisation, and editorconfig.




[GitHub] spark issue #14180: [SPARK-16367][PYSPARK] Support for deploying Anaconda an...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14180
  
    Can one of the admins verify this patch?




[GitHub] spark issue #14180: [SPARK-16367][PYSPARK] Wheelhouse and VirtualEnv support

Posted by Stibbons <gi...@git.apache.org>.
Github user Stibbons commented on the issue:

    https://github.com/apache/spark/pull/14180
  
    We are implementing Mesos support here (it may take a while). While not so many people use it, on paper it looks great ;)
    
    Please mail me at gaetan[a t]xeberon.net if that is easier for you (it is for me); this patch does not do the job completely for the moment :(




[GitHub] spark issue #14180: [SPARK-16367][PYSPARK] Support for deploying Anaconda an...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/14180
  
    gentle ping @ueshin 




[GitHub] spark pull request #14180: Wheelhouse and VirtualEnv support

Posted by Stibbons <gi...@git.apache.org>.
Github user Stibbons commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14180#discussion_r73142048
  
    --- Diff: python/pyspark/worker.py ---
    @@ -19,18 +19,27 @@
     Worker that receives input from Piped RDD.
     """
     from __future__ import print_function
    +
     import os
    +import socket
     import sys
     import time
    -import socket
     import traceback
     
    +from pyspark import shuffle
     from pyspark.accumulators import _accumulatorRegistry
    -from pyspark.broadcast import Broadcast, _broadcastRegistry
    +from pyspark.broadcast import Broadcast
    +from pyspark.broadcast import _broadcastRegistry
     from pyspark.files import SparkFiles
    -from pyspark.serializers import write_with_length, write_int, read_long, \
    -    write_long, read_int, SpecialLengths, UTF8Deserializer, PickleSerializer, BatchedSerializer
    -from pyspark import shuffle
    +from pyspark.serializers import BatchedSerializer
    +from pyspark.serializers import PickleSerializer
    +from pyspark.serializers import SpecialLengths
    +from pyspark.serializers import UTF8Deserializer
    +from pyspark.serializers import read_int
    +from pyspark.serializers import read_long
    +from pyspark.serializers import write_int
    +from pyspark.serializers import write_long
    +from pyspark.serializers import write_with_length
     
    --- End diff --
    
    I am in the habit of rearranging import statements. It is more readable and maintainable, and it eases merges. I can move this to an external PR.




[GitHub] spark pull request #14180: Wheelhouse and VirtualEnv support

Posted by zjffdu <gi...@git.apache.org>.
Github user zjffdu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14180#discussion_r73109110
  
    --- Diff: .editorconfig ---
    @@ -0,0 +1,15 @@
    +root = true
    +
    +[*]
    +indent_style = space
    +indent_size = 4
    +end_of_line = lf
    +charset = utf-8
    +trim_trailing_whitespace = true
    +insert_final_newline = true
    +
    +[*.py]
    +indent_size = 4
    +
    +[*.scala]
    +indent_size = 2
    --- End diff --
    
    Unnecessary file?




[GitHub] spark issue #14180: [SPARK-16367][PYSPARK] Wheelhouse and VirtualEnv support

Posted by zjffdu <gi...@git.apache.org>.
Github user zjffdu commented on the issue:

    https://github.com/apache/spark/pull/14180
  
    I can help if you have any question regarding Spark on YARN. For Mesos, since not so many people use it, we may put it in another ticket.




[GitHub] spark pull request #14180: Wheelhouse and VirtualEnv support

Posted by zjffdu <gi...@git.apache.org>.
Github user zjffdu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14180#discussion_r73109197
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala ---
    @@ -68,6 +100,135 @@ private[spark] class PythonWorkerFactory(pythonExec: String, envVars: Map[String
         }
       }
     
    +
    +  def unzipWheelhouse(zipFile: String, outputFolder: String): Unit = {
    +
    +    val buffer = new Array[Byte](1024)
    +
    +    try {
    +
    +      // output directory
    +      val folder = new File(outputFolder);
    +      if (!folder.exists()) {
    +        folder.mkdir();
    +      }
    +
    +      // zip file content
    +      val zis: ZipInputStream = new ZipInputStream(new FileInputStream(zipFile));
    +      // get the zipped file list entry
    +      var ze: ZipEntry = zis.getNextEntry();
    +
    +      while (ze != null) {
    +        breakable {
    +
    +          if (ze.isDirectory()) {
    +            // continue
    +            break;
    +          }
    +
    +          val fileName = ze.getName();
    +          val newFile = new File(outputFolder + File.separator + fileName);
    +
    +          logDebug("file unzip : " + newFile.getAbsoluteFile());
    +
    +          // create folders
    +          new File(newFile.getParent()).mkdirs();
    +
    +          val fos = new FileOutputStream(newFile);
    +
    +          var len: Int = zis.read(buffer);
    +
    +          while (len > 0) {
    +
    +            fos.write(buffer, 0, len)
    +            len = zis.read(buffer)
    +          }
    +
    --- End diff --
    
    many unnecessary blank lines. 
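Beyond the blank lines, the extraction logic under review is much shorter in Python with the stdlib `zipfile` module. A hypothetical sketch (not part of the patch; the function name mirrors the Scala one for comparison only):

```python
import os
import zipfile


def unzip_wheelhouse(zip_file, output_folder):
    """Extract a wheelhouse archive into output_folder, skipping directory entries."""
    os.makedirs(output_folder, exist_ok=True)
    with zipfile.ZipFile(zip_file) as zf:
        for entry in zf.infolist():
            if entry.is_dir():
                continue  # parent directories are created on demand below
            target = os.path.join(output_folder, entry.filename)
            os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
            # read whole entry into memory: fine for wheel-sized files
            with zf.open(entry) as src, open(target, "wb") as dst:
                dst.write(src.read())
```

The context managers also close every stream on error, which the reviewed Scala version does not (its `fos.close()` is skipped if `zis.read` throws).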




[GitHub] spark issue #14180: [SPARK-16367][PYSPARK] Support for deploying Anaconda an...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/14180
  
    So I think we need some support for virtualenv/anaconda, but this feels a little overly complicated as a first step (and mostly untested) -- maybe start by simply supporting running from a specific virtualenv/conda env that is already set up?
    
    What are your thoughts @gatorsmile / @davies? If there is a consensus with the other people working on Python, I'm happy to do some more reviews :)




[GitHub] spark issue #14180: [SPARK-16367][PYSPARK] Wheelhouse and VirtualEnv support

Posted by Stibbons <gi...@git.apache.org>.
Github user Stibbons commented on the issue:

    https://github.com/apache/spark/pull/14180
  
    Hello. Can someone help review this PR? I find the current way Spark handles Python programs really problematic; with this proposal (based on top of #13599), job deployment becomes much easier.




[GitHub] spark issue #14180: [SPARK-16367][PYSPARK] Wheelhouse and VirtualEnv support

Posted by Stibbons <gi...@git.apache.org>.
Github user Stibbons commented on the issue:

    https://github.com/apache/spark/pull/14180
  
    Rebased.




[GitHub] spark issue #14180: [SPARK-16367][PYSPARK] Support for deploying Anaconda an...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/14180
  
    Can we move the discussion into #13599, so it is in one place? BTW, I prefer the simplest approach first. Let's go ahead unless you think we are unable to improve incrementally in the way of #13599. If that's the case, a well-formed doc with some arguments would be helpful to open up the discussion.




[GitHub] spark issue #14180: [SPARK-16367][PYSPARK] Wheelhouse and VirtualEnv support

Posted by Stibbons <gi...@git.apache.org>.
Github user Stibbons commented on the issue:

    https://github.com/apache/spark/pull/14180
  
    Status for the test "standalone install, 'client' deployment":
    - virtualenv create and pip install from a PyPI repository: OK (1 min 30 s execution)
    - wheelhouse (PyPI repository): KO, because 'cffi' refuses the built wheel. Not related to this patch, but it requires more documentation effort.
    
    The test deploys a job that depends on many popular PyPI packages, among them Pandas, Theano, Scikit-learn, ...:
    
    ```
    alabaster==0.7.9
    arrow==0.8.0
    astroid==1.4.8
    attrs==16.1.0
    autopep8==1.2.4
    Babel==2.3.4
    backports.functools-lru-cache==1.2.1
    BeautifulSoup==3.2.1
    bolt-python==0.7.1
    boto==2.42.0
    cffi==1.7.0
    click==6.6
    configparser==3.5.0
    cryptography==1.5
    cycler==0.10.0
    dask==0.11.0
    decorator==4.0.10
    docutils==0.12
    enum34==1.1.6
    findspark==1.1.0
    first==2.0.1
    flake8==3.0.4
    funcsigs==1.0.2
    futures==3.0.5
    hypothesis==3.4.2
    idna==2.1
    imagesize==0.7.1
    ipaddress==1.0.16
    isort==4.2.5
    Jinja2==2.8
    jira==1.0.3
    kerberos-sspi===0.1-intel
    lazy-object-proxy==1.2.2
    linecache2==1.0.0
    MarkupSafe==0.23
    matplotlib==1.5.2
    mccabe==0.5.2
    mock==2.0.0
    mpmath==0.19
    ndg-httpsclient==0.4.2
    networkx==1.11
    nltk==3.2.1
    nose==1.3.7
    numpy==1.11.1
    oauthlib==1.1.2
    panda==0.3.1
    pandas==0.18.1
    pathlib==1.0.1
    pbr==1.10.0
    pep8==1.7.0
    Pillow==3.3.1
    pip-tools==1.7.0
    py==1.4.31
    pyasn1==0.1.9
    pyasn1-modules==0.0.8
    PyBrain==0.3
    pycodestyle==2.0.0
    pycparser==2.14
    pycrypto==2.6.1
    pyflakes==1.2.3
    PyGithub==1.26.0
    Pygments==2.1.3
    pylint==1.6.4
    pyOpenSSL==16.1.0
    pyparsing==2.1.8
    pytest==3.0.1
    python-dateutil==2.5.3
    python-ntlm==1.1.0
    pytz==2016.6.1
    PyYAML==3.12
    -e git+ssh://gsemet@android.intel.com:29418/a/qsi/spark@1af5c148f8f2d55f6f26a067a822f722528e13b9#egg=qsi_jobs
    requests==2.11.1
    requests-aws4auth==0.9
    requests-kerberos===0.6.1-intel
    requests-oauthlib==0.6.2
    requests-toolbelt==0.7.0
    scikit-image==0.12.3
    scikit-learn==0.17.1
    scipy==0.18.0
    service-identity==16.0.0
    singledispatch==3.4.0.3
    six==1.10.0
    sklearn-pandas==1.1.0
    snowballstemmer==1.2.1
    spark-testing-base==0.0.7.post2
    Sphinx==1.4.6
    suds==0.4
    sympy==1.0
    Theano==0.8.2
    thunder-python==1.4.2
    tifffile==0.9.2
    tlslite==0.4.9
    toolz==0.8.0
    traceback2==1.4.0
    Unidecode==0.4.19
    unittest2==1.1.0
    urllib3==1.16
    wrapt==1.10.8
    -e git+ssh://internal.server.com/a/project/name@7f4a7623aa219743e9b96b228b4cd86fe9bc5595#egg=projectname
    yapf==0.11.1
    ```
    
    Execution of this installation takes 1 min on each executor, thanks to pip and the wheels being downloaded from our internal PyPI mirror:
    ```
    16/08/30 17:07:47 DEBUG PythonWorkerFactory: Running command: virtualenv_app-20160830170740-0000_0/bin/pip install -r requirements.txt --index-url https://internal.pypmirror/artifactory/api/pypi/pypi-prod/simple --trusted-host internal.pypmirror qsi-jobs-0.0.1.dev15.tar.gz
    16/08/30 17:08:58 DEBUG PythonWorkerFactory: Starting daemon with pythonExec: virtualenv_app-20160830170740-0000_0/bin/python
    ```
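The fully offline variant of the install shown in the log above can be expressed as a plain `pip` invocation pointed at the local wheelhouse. A hypothetical command builder (helper and argument names are assumptions, not from the patch):

```python
def build_pip_install_command(pip_exec, requirements, wheelhouse_dir):
    """Build an offline pip install command: no index access, wheels
    resolved only from the local wheelhouse directory."""
    return [
        pip_exec, "install",
        "-r", requirements,
        "--no-index",                    # never contact PyPI
        "--find-links", wheelhouse_dir,  # resolve wheels locally
    ]
```

Running such a command on each executor after unzipping the wheelhouse is what keeps installation fast and free of network access.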




[GitHub] spark issue #14180: [SPARK-16367][PYSPARK] Wheelhouse and VirtualEnv support

Posted by zjffdu <gi...@git.apache.org>.
Github user zjffdu commented on the issue:

    https://github.com/apache/spark/pull/14180
  
    @Stibbons After a second review of your PR, I have one concern: supporting both wheelhouse and virtualenv in one PR may be too big to review; it might be better to do it in 2 PRs. I will try to get some feedback on #13599 first, but will continue to look at this PR.




[GitHub] spark pull request #14180: Wheelhouse and VirtualEnv support

Posted by Stibbons <gi...@git.apache.org>.
Github user Stibbons commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14180#discussion_r73142199
  
    --- Diff: .editorconfig ---
    @@ -0,0 +1,15 @@
    +root = true
    +
    +[*]
    +indent_style = space
    +indent_size = 4
    +end_of_line = lf
    +charset = utf-8
    +trim_trailing_whitespace = true
    +insert_final_newline = true
    +
    +[*.py]
    +indent_size = 4
    +
    +[*.scala]
    +indent_size = 2
    --- End diff --
    
    I saw there are differences in the indentation of Scala and Python files; this allows most editors (Sublime, Atom, ...) to adapt automatically.




[GitHub] spark issue #14180: [SPARK-16367][PYSPARK] Wheelhouse and VirtualEnv support

Posted by Stibbons <gi...@git.apache.org>.
Github user Stibbons commented on the issue:

    https://github.com/apache/spark/pull/14180
  
    It makes sense. If you manage to get this merged, I can rebase with only my diff.
    
    Too bad we cannot stack pull requests on GitHub :(




[GitHub] spark issue #14180: Wheelhouse and VirtualEnv support

Posted by Stibbons <gi...@git.apache.org>.
Github user Stibbons commented on the issue:

    https://github.com/apache/spark/pull/14180
  
    Yes, I'd be glad to! It is not fully ready yet; I still need to figure out how the script is launched in each situation.




[GitHub] spark issue #14180: [SPARK-16367][PYSPARK] Wheelhouse and VirtualEnv support

Posted by Stibbons <gi...@git.apache.org>.
Github user Stibbons commented on the issue:

    https://github.com/apache/spark/pull/14180
  
    I have written a blog post about this pull request to explain what we can do with it: http://www.great-a-blog.co/wheel-deployment-for-pyspark/




[GitHub] spark issue #14180: Wheelhouse and VirtualEnv support

Posted by Stibbons <gi...@git.apache.org>.
Github user Stibbons commented on the issue:

    https://github.com/apache/spark/pull/14180
  
    Yes, I am back from vacation! I can work on it now :)




[GitHub] spark issue #14180: [SPARK-16367][PYSPARK] Support for deploying Anaconda an...

Posted by Stibbons <gi...@git.apache.org>.
Github user Stibbons commented on the issue:

    https://github.com/apache/spark/pull/14180
  
    Hello. It's been a long time; this probably needs a full rework. Maybe we should take a step back and have a talk between the several people interested in this feature, to see what is most suitable for the Spark project. I work a lot on Python packaging nowadays, so I have a pretty good idea of the different distribution solutions available for Python (Anaconda, pip/virtualenv, now Pipfile). The point is not just generating a Python package and throwing it into the wild; it is ensuring the package works in the targeted environment. A self-contained Python executable is also a solution, even though it is more complex, and a wheelhouse plus some tricks might also do the job for Spark. Ultimately, the goal is to have something cool and easy to use for PySpark users who want to distribute any kind of work without having to ask the IT guys to install a particular numpy version on the cluster.




[GitHub] spark issue #14180: [SPARK-16367][PYSPARK] Wheelhouse and VirtualEnv support

Posted by Stibbons <gi...@git.apache.org>.
Github user Stibbons commented on the issue:

    https://github.com/apache/spark/pull/14180
  
    This new version is meant to be rebased after #13599 is merged.
    
    Here is my current state:
    - only Standalone and YARN are supported; Mesos is not supported
    - only tested with virtualenv/pip; Conda is not tested
    - wheelhouse deployment works (i.e., all dependencies can be packaged into a single zip file and automatically, quickly installed on the workers)
    - for example, deploying a package with numpy + pandas + scikit-learn is fast once the installation has been done at least once on all workers; and if the wheelhouse provides wheels for all versions, pip installs everything without an internet connection, very fast
    
    I'd like to have the same ability to specify the entry point in Python that we have in Java/Scala with the `--class` argument of `spark-submit`.
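A `--class`-like entry point for Python could be resolved with the stdlib `importlib`, using the setuptools-style `module:function` notation. A minimal sketch of the idea (hypothetical; this is not what the patch implements):

```python
import importlib


def resolve_entry_point(spec):
    """Resolve a 'package.module:function' spec to a callable,
    analogous to spark-submit's --class for Java/Scala."""
    module_name, _, func_name = spec.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, func_name)
```

For example, `resolve_entry_point("myjob.main:run")()` would import the hypothetical `myjob.main` module from the installed wheel and invoke its `run` function.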




[GitHub] spark issue #14180: Wheelhouse and VirtualEnv support

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14180
  
    Can one of the admins verify this patch?




[GitHub] spark issue #14180: [SPARK-16367][PYSPARK] Support for deploying Anaconda an...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/14180
  
    cc @holdenk Any thought about this PR?




[GitHub] spark issue #14180: [SPARK-16367][PYSPARK] Support for deploying Anaconda an...

Posted by ueshin <gi...@git.apache.org>.
Github user ueshin commented on the issue:

    https://github.com/apache/spark/pull/14180
  
    @gatorsmile @jiangxb1987 Maybe we should review and merge #13599 first, because this PR is based on it.




[GitHub] spark issue #14180: [SPARK-16367][PYSPARK] Wheelhouse and VirtualEnv support

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/14180
  
    I was going through the old PySpark JIRAs and there is one about an unexpected pandas failure, which could be semi-related (e.g. good virtual env support with a reasonable requirements file could help avoid that) - but I still don't see the reviewer interest required to take this PR forward (I'm really sorry).




[GitHub] spark pull request #14180: [SPARK-16367][PYSPARK] Wheelhouse and VirtualEnv ...

Posted by zjffdu <gi...@git.apache.org>.
Github user zjffdu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14180#discussion_r75823127
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala ---
    @@ -68,6 +100,135 @@ private[spark] class PythonWorkerFactory(pythonExec: String, envVars: Map[String
         }
       }
     
    +
    +  def unzipWheelhouse(zipFile: String, outputFolder: String): Unit = {
    +
    +    val buffer = new Array[Byte](1024)
    +
    +    try {
    +
    +      // output directory
    +      val folder = new File(outputFolder);
    +      if (!folder.exists()) {
    +        folder.mkdir();
    +      }
    +
    +      // zip file content
    +      val zis: ZipInputStream = new ZipInputStream(new FileInputStream(zipFile));
    +      // get the zipped file list entry
    +      var ze: ZipEntry = zis.getNextEntry();
    +
    +      while (ze != null) {
    +        breakable {
    +
    +          if (ze.isDirectory()) {
    +            // continue
    +            break;
    +          }
    +
    +          val fileName = ze.getName();
    +          val newFile = new File(outputFolder + File.separator + fileName);
    +
    +          logDebug("file unzip : " + newFile.getAbsoluteFile());
    +
    +          // create folders
    +          new File(newFile.getParent()).mkdirs();
    +
    +          val fos = new FileOutputStream(newFile);
    +
    +          var len: Int = zis.read(buffer);
    +
    +          while (len > 0) {
    +
    +            fos.write(buffer, 0, len)
    +            len = zis.read(buffer)
    +          }
    +
    +          fos.close()
    +        }
    +        ze = zis.getNextEntry()
    +      }
    +
    +      zis.closeEntry()
    +      zis.close()
    +
    +    } catch {
    +      case e: IOException => logError("exception caught: " + e.getMessage)
    +    }
    +
    +  }
    +
    +  /**
    +   * Create virtualenv using native virtualenv or conda
    +   *
    +   * Native Virtualenv:
    +   *   -  Execute command: virtualenv -p pythonExec --no-site-packages virtualenvName
    +   *   -  if wheelhouse specified:
    +   *        - Execute command: python -m pip --cache-dir cache-dir install -r requirement_file.txt
    +   *      else:
    +   *        - Execute command: python -m pip --cache-dir cache-dir install --use-wheel \
    +   *                                  --no-index --find-links=wheelhouse -r requirement_file.txt
    +   *
    +   * Conda
    +   *   -  Execute command: conda create --name virtualenvName --file requirement_file.txt -y
    +   *
    +   */
    +  def setupVirtualEnv(): Unit = {
    +    logDebug("Start to setup virtualenv...")
    +    virtualEnvName = "virtualenv_" + conf.getAppId + "_" + WORKER_Id.getAndIncrement()
    +    // use the absolute path when it is local mode otherwise just use filename as it would be
    +    // fetched from FileServer
    +    val pyspark_requirements =
    +      if (Utils.isLocalMaster(conf)) {
    +        virtualRequirements
    +      } else {
    +        virtualRequirements.split("/").last
    +      }
    +
    +    val createEnvCommand =
    +      if (virtualEnvType == "native") {
    +        if (virtualEnvSystemSitePackages) {
    +          Arrays.asList(virtualEnvPath, "-p", pythonExec, "--system-site-packages", virtualEnvName)
    +        }
    +        else {
    +          Arrays.asList(virtualEnvPath, "-p", pythonExec, virtualEnvName)
    +        }
    +      } else {
    +        // Conda
    +        Arrays.asList(virtualEnvPath,
    +          "create", "--prefix", System.getProperty("user.dir") + "/" + virtualEnvName,
    +          "--file", pyspark_requirements, "-y")
    +      }
    +    execCommand(createEnvCommand)
    +    // virtualenv will be created in the working directory of Executor.
    +    virtualPythonExec = virtualEnvName + "/bin/python"
    +    if (virtualEnvType == "native") {
    +      var basePipArgs = mutable.ListBuffer[String]()
    +      basePipArgs += (virtualPythonExec, "-m", "pip", "install", "-r", pyspark_requirements)
    +      if (!virtualWheelhouse.isEmpty) {
    +        unzipWheelhouse("wheelhouse.zip", "wheelhouse")
    --- End diff --
    
    wheelhouse.zip is hard-coded; it would be better to specify it through configuration rather than hard-coding it and adding it through --files.
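
A minimal sketch of what a configuration-driven lookup could look like (the property name `spark.pyspark.virtualenv.wheelhouse` is hypothetical, and a plain dict stands in for `SparkConf` here):

```python
# Hypothetical sketch: resolve the wheelhouse archive name from configuration
# rather than hard-coding "wheelhouse.zip". A plain dict stands in for SparkConf.
DEFAULT_WHEELHOUSE = "wheelhouse.zip"


def wheelhouse_archive(conf):
    """Return the configured wheelhouse file name, falling back to the default."""
    return conf.get("spark.pyspark.virtualenv.wheelhouse", DEFAULT_WHEELHOUSE)


print(wheelhouse_archive({"spark.pyspark.virtualenv.wheelhouse": "deps.zip"}))
print(wheelhouse_archive({}))
```

With this, the archive name travels with the job configuration instead of being baked into the worker-side code.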



[GitHub] spark issue #14180: [SPARK-16367][PYSPARK] Support for deploying Anaconda an...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/14180
  
    ping @ueshin Should we continue this PR?



[GitHub] spark pull request #14180: [SPARK-16367][PYSPARK] Wheelhouse and VirtualEnv ...

Posted by zjffdu <gi...@git.apache.org>.
Github user zjffdu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14180#discussion_r75821756
  
    --- Diff: python/pyspark/context.py ---
    @@ -797,21 +824,65 @@ def clearFiles(self):
     
         def addPyFile(self, path):
             """
    -        Add a .py or .zip dependency for all tasks to be executed on this
    +        Add a .py, .zip or .egg dependency for all tasks to be executed on this
             SparkContext in the future.  The C{path} passed can be either a local
             file, a file in HDFS (or other Hadoop-supported filesystems), or an
             HTTP, HTTPS or FTP URI.
    +        Note that .whl should not be handled by this method
             """
    +        if not path:
    +            return
             self.addFile(path)
    -        (dirname, filename) = os.path.split(path)  # dirname may be directory or HDFS/S3 prefix
    -        if filename[-4:].lower() in self.PACKAGE_EXTENSIONS:
    +
    +        (_dirname, filename) = os.path.split(path)  # dirname may be directory or HDFS/S3 prefix
    +        extname = os.path.splitext(path)[1].lower()
    +        if extname == '.whl':
    +            return
    +
    +        if extname in self.PACKAGE_EXTENSIONS:
                 self._python_includes.append(filename)
    -            # for tests in local mode
    -            sys.path.insert(1, os.path.join(SparkFiles.getRootDirectory(), filename))
    +            if extname != '.whl':
    +                # for tests in local mode
    +                # Prepend the python package (except for *.whl) to sys.path
    +                sys.path.insert(1, os.path.join(SparkFiles.getRootDirectory(), filename))
             if sys.version > '3':
                 import importlib
                 importlib.invalidate_caches()
     
    +    def _installWheelFiles(self, paths, quiet=True, upgrade=True, no_deps=True, no_index=True):
    +        """
    +        Install .whl files at once by pip install. We are guaranteed to have the 'pip' module
    +        available, since the presence of a whl in py-files, or in a wheelhouse, triggered the
    +        installation of a virtualenv
    +        """
    +        root_dir = SparkFiles.getRootDirectory()
    +        paths = {
    +            os.path.join(root_dir, os.path.basename(path))
    +            for path in paths
    +            if os.path.splitext(path)[1].lower() == '.whl'
    +        }
    +        if not paths:
    +            return
    +
    +        pip_args = [
    +            '--find-links', root_dir,
    +            '--target', os.path.join(root_dir, 'site-packages'),
    +        ]
    +        if quiet:
    +            pip_args.append('--quiet')
    +        if upgrade:
    +            pip_args.append('--upgrade')
    +        if no_deps:
    +            pip_args.append('--no-deps')
    +        if no_index:
    +            pip_args.append('--no-index')
    +        pip_args.extend(paths)
    +
    +        # We had this dependency here to avoid general script case, ie when not in a virtualenv,
    +        # where pip might not be installed
    +        from pip.commands.install import InstallCommand as pip_InstallCommand
    +        pip_InstallCommand().main(args=pip_args)
    +
    --- End diff --
    
    Why install wheel files here? Shouldn't this be done in `PythonWorkerFactory.scala`?
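
One note on the quoted `_installWheelFiles`: `pip.commands.install` is pip's internal API (it was later removed in pip 10), so a sketch using `python -m pip` in a subprocess is the supported route. Function names below are illustrative, not part of the patch:

```python
import os
import subprocess
import sys


def build_pip_args(wheels, root_dir, quiet=True, upgrade=True,
                   no_deps=True, no_index=True):
    """Build a `python -m pip install` command line mirroring the patch's flags."""
    args = [sys.executable, "-m", "pip", "install",
            "--find-links", root_dir,
            "--target", os.path.join(root_dir, "site-packages")]
    if quiet:
        args.append("--quiet")
    if upgrade:
        args.append("--upgrade")
    if no_deps:
        args.append("--no-deps")
    if no_index:
        args.append("--no-index")
    args.extend(sorted(wheels))
    return args


def install_wheels(wheels, root_dir):
    # Run pip in a child process so a failure surfaces as CalledProcessError
    # instead of mutating the calling interpreter's state.
    subprocess.check_call(build_pip_args(wheels, root_dir))
```

Running pip out-of-process also avoids the "pip might not be installed" caveat the docstring mentions becoming an import-time failure.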



[GitHub] spark issue #14180: [SPARK-16367][PYSPARK] Wheelhouse and VirtualEnv support

Posted by Stibbons <gi...@git.apache.org>.
Github user Stibbons commented on the issue:

    https://github.com/apache/spark/pull/14180
  
    Rebased, without import reorg and editorconfig files. Still not fully validated.



[GitHub] spark issue #14180: [SPARK-16367][PYSPARK] Support for deploying Anaconda an...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/14180
  
    cc @ueshin 



[GitHub] spark issue #14180: [SPARK-16367][PYSPARK] Support for deploying Anaconda an...

Posted by zjffdu <gi...@git.apache.org>.
Github user zjffdu commented on the issue:

    https://github.com/apache/spark/pull/14180
  
    Here's my approach in #13599 for virtualenv and conda support; comments and reviews are welcome.
    
    
    https://docs.google.com/document/d/1EGNEf4vFmpGXSd2DPOLu_HL23Xhw9aWKeUrzzxsEbQs/edit?usp=sharing




[GitHub] spark issue #14180: [SPARK-16367][PYSPARK] Wheelhouse and VirtualEnv support

Posted by zjffdu <gi...@git.apache.org>.
Github user zjffdu commented on the issue:

    https://github.com/apache/spark/pull/14180
  
    @Stibbons Do you have time to continue this work ?



[GitHub] spark pull request #14180: [SPARK-16367][PYSPARK] Wheelhouse and VirtualEnv ...

Posted by Stibbons <gi...@git.apache.org>.
Github user Stibbons commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14180#discussion_r75841147
  
    --- Diff: python/pyspark/context.py ---
    @@ -797,21 +824,65 @@ def clearFiles(self):
     
         def addPyFile(self, path):
             """
    -        Add a .py or .zip dependency for all tasks to be executed on this
    +        Add a .py, .zip or .egg dependency for all tasks to be executed on this
             SparkContext in the future.  The C{path} passed can be either a local
             file, a file in HDFS (or other Hadoop-supported filesystems), or an
             HTTP, HTTPS or FTP URI.
    +        Note that .whl should not be handled by this method
             """
    +        if not path:
    +            return
             self.addFile(path)
    -        (dirname, filename) = os.path.split(path)  # dirname may be directory or HDFS/S3 prefix
    -        if filename[-4:].lower() in self.PACKAGE_EXTENSIONS:
    +
    +        (_dirname, filename) = os.path.split(path)  # dirname may be directory or HDFS/S3 prefix
    +        extname = os.path.splitext(path)[1].lower()
    +        if extname == '.whl':
    +            return
    +
    +        if extname in self.PACKAGE_EXTENSIONS:
                 self._python_includes.append(filename)
    -            # for tests in local mode
    -            sys.path.insert(1, os.path.join(SparkFiles.getRootDirectory(), filename))
    +            if extname != '.whl':
    +                # for tests in local mode
    +                # Prepend the python package (except for *.whl) to sys.path
    +                sys.path.insert(1, os.path.join(SparkFiles.getRootDirectory(), filename))
             if sys.version > '3':
                 import importlib
                 importlib.invalidate_caches()
     
    +    def _installWheelFiles(self, paths, quiet=True, upgrade=True, no_deps=True, no_index=True):
    +        """
    +        Install .whl files at once by pip install. We are guaranteed to have the 'pip' module
    +        available, since the presence of a whl in py-files, or in a wheelhouse, triggered the
    +        installation of a virtualenv
    +        """
    +        root_dir = SparkFiles.getRootDirectory()
    +        paths = {
    +            os.path.join(root_dir, os.path.basename(path))
    +            for path in paths
    +            if os.path.splitext(path)[1].lower() == '.whl'
    +        }
    +        if not paths:
    +            return
    +
    +        pip_args = [
    +            '--find-links', root_dir,
    +            '--target', os.path.join(root_dir, 'site-packages'),
    +        ]
    +        if quiet:
    +            pip_args.append('--quiet')
    +        if upgrade:
    +            pip_args.append('--upgrade')
    +        if no_deps:
    +            pip_args.append('--no-deps')
    +        if no_index:
    +            pip_args.append('--no-index')
    +        pip_args.extend(paths)
    +
    +        # We had this dependency here to avoid general script case, ie when not in a virtualenv,
    +        # where pip might not be installed
    +        from pip.commands.install import InstallCommand as pip_InstallCommand
    +        pip_InstallCommand().main(args=pip_args)
    +
    --- End diff --
    
    I need to dig a bit further, but there is some code that isn't executed in client mode (i.e. running the driver on the developer's machine).



[GitHub] spark issue #14180: Wheelhouse and VirtualEnv support

Posted by zjffdu <gi...@git.apache.org>.
Github user zjffdu commented on the issue:

    https://github.com/apache/spark/pull/14180
  
    ping @Stibbons, any updates?



[GitHub] spark issue #14180: [SPARK-16367][PYSPARK] Wheelhouse and VirtualEnv support

Posted by Stibbons <gi...@git.apache.org>.
Github user Stibbons commented on the issue:

    https://github.com/apache/spark/pull/14180
  
    Actually I was waiting for #14567 to be reviewed and merged :(
    
    I might have some questions on how Spark deploys Python scripts on YARN or Mesos, if you know how that works.



[GitHub] spark issue #14180: Wheelhouse and VirtualEnv support

Posted by zjffdu <gi...@git.apache.org>.
Github user zjffdu commented on the issue:

    https://github.com/apache/spark/pull/14180
  
    @Stibbons, sorry for the late review. Let's work together on this PR over the next few days, if that is OK with you.



[GitHub] spark pull request #14180: [SPARK-16367][PYSPARK] Wheelhouse and VirtualEnv ...

Posted by Stibbons <gi...@git.apache.org>.
Github user Stibbons commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14180#discussion_r75841339
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala ---
    @@ -68,6 +100,135 @@ private[spark] class PythonWorkerFactory(pythonExec: String, envVars: Map[String
         }
       }
     
    +
    +  def unzipWheelhouse(zipFile: String, outputFolder: String): Unit = {
    +
    +    val buffer = new Array[Byte](1024)
    +
    +    try {
    +
    +      // output directory
    +      val folder = new File(outputFolder);
    +      if (!folder.exists()) {
    +        folder.mkdir();
    +      }
    +
    +      // zip file content
    +      val zis: ZipInputStream = new ZipInputStream(new FileInputStream(zipFile));
    +      // get the zipped file list entry
    +      var ze: ZipEntry = zis.getNextEntry();
    +
    +      while (ze != null) {
    +        breakable {
    +
    +          if (ze.isDirectory()) {
    +            // continue
    +            break;
    +          }
    +
    +          val fileName = ze.getName();
    +          val newFile = new File(outputFolder + File.separator + fileName);
    +
    +          logDebug("file unzip : " + newFile.getAbsoluteFile());
    +
    +          // create folders
    +          new File(newFile.getParent()).mkdirs();
    +
    +          val fos = new FileOutputStream(newFile);
    +
    +          var len: Int = zis.read(buffer);
    +
    +          while (len > 0) {
    +
    +            fos.write(buffer, 0, len)
    +            len = zis.read(buffer)
    +          }
    +
    +          fos.close()
    +        }
    +        ze = zis.getNextEntry()
    +      }
    +
    +      zis.closeEntry()
    +      zis.close()
    +
    +    } catch {
    +      case e: IOException => logError("exception caught: " + e.getMessage)
    +    }
    +
    +  }
    +
    +  /**
    +   * Create virtualenv using native virtualenv or conda
    +   *
    +   * Native Virtualenv:
    +   *   -  Execute command: virtualenv -p pythonExec --no-site-packages virtualenvName
    +   *   -  if wheelhouse specified:
    +   *        - Execute command: python -m pip --cache-dir cache-dir install -r requirement_file.txt
    +   *      else:
    +   *        - Execute command: python -m pip --cache-dir cache-dir install --use-wheel \
    +   *                                  --no-index --find-links=wheelhouse -r requirement_file.txt
    +   *
    +   * Conda
    +   *   -  Execute command: conda create --name virtualenvName --file requirement_file.txt -y
    +   *
    +   */
    +  def setupVirtualEnv(): Unit = {
    +    logDebug("Start to setup virtualenv...")
    +    virtualEnvName = "virtualenv_" + conf.getAppId + "_" + WORKER_Id.getAndIncrement()
    +    // use the absolute path when it is local mode otherwise just use filename as it would be
    +    // fetched from FileServer
    +    val pyspark_requirements =
    +      if (Utils.isLocalMaster(conf)) {
    +        virtualRequirements
    +      } else {
    +        virtualRequirements.split("/").last
    +      }
    +
    +    val createEnvCommand =
    +      if (virtualEnvType == "native") {
    +        if (virtualEnvSystemSitePackages) {
    +          Arrays.asList(virtualEnvPath, "-p", pythonExec, "--system-site-packages", virtualEnvName)
    +        }
    +        else {
    +          Arrays.asList(virtualEnvPath, "-p", pythonExec, virtualEnvName)
    +        }
    +      } else {
    +        // Conda
    +        Arrays.asList(virtualEnvPath,
    +          "create", "--prefix", System.getProperty("user.dir") + "/" + virtualEnvName,
    +          "--file", pyspark_requirements, "-y")
    +      }
    +    execCommand(createEnvCommand)
    +    // virtualenv will be created in the working directory of Executor.
    +    virtualPythonExec = virtualEnvName + "/bin/python"
    +    if (virtualEnvType == "native") {
    +      var basePipArgs = mutable.ListBuffer[String]()
    +      basePipArgs += (virtualPythonExec, "-m", "pip", "install", "-r", pyspark_requirements)
    +      if (!virtualWheelhouse.isEmpty) {
    +        unzipWheelhouse("wheelhouse.zip", "wheelhouse")
    --- End diff --
    
    Indeed, this is not good.
    There is a conf entry to specify the wheelhouse.zip filename.
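
As an aside on the quoted `unzipWheelhouse` helper: in Python the same extraction can be sketched with the standard `zipfile` module, plus a guard against archive entries that would escape the destination directory. The function name is illustrative, not part of the patch:

```python
import os
import zipfile


def unzip_wheelhouse(zip_path, output_folder):
    """Extract a wheelhouse archive, rejecting entries that would land
    outside the destination directory (the "zip slip" problem)."""
    out_root = os.path.realpath(output_folder)
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            dest = os.path.realpath(os.path.join(out_root, info.filename))
            if dest != out_root and not dest.startswith(out_root + os.sep):
                raise ValueError("unsafe path in archive: %s" % info.filename)
        zf.extractall(out_root)
```

Unlike the manual byte-copy loop in the diff, `extractall` handles directory entries and stream closing itself, and the `with` block releases the archive even on error.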



[GitHub] spark issue #14180: [SPARK-16367][PYSPARK] Support for deploying Anaconda an...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14180
  
    Can one of the admins verify this patch?



[GitHub] spark pull request #14180: [SPARK-16367][PYSPARK] Wheelhouse and VirtualEnv ...

Posted by Stibbons <gi...@git.apache.org>.
Github user Stibbons commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14180#discussion_r76073814
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala ---
    @@ -68,6 +100,135 @@ private[spark] class PythonWorkerFactory(pythonExec: String, envVars: Map[String
         }
       }
     
    +
    +  def unzipWheelhouse(zipFile: String, outputFolder: String): Unit = {
    +
    +    val buffer = new Array[Byte](1024)
    +
    +    try {
    +
    +      // output directory
    +      val folder = new File(outputFolder);
    +      if (!folder.exists()) {
    +        folder.mkdir();
    +      }
    +
    +      // zip file content
    +      val zis: ZipInputStream = new ZipInputStream(new FileInputStream(zipFile));
    +      // get the zipped file list entry
    +      var ze: ZipEntry = zis.getNextEntry();
    +
    +      while (ze != null) {
    +        breakable {
    +
    +          if (ze.isDirectory()) {
    +            // continue
    +            break;
    +          }
    +
    +          val fileName = ze.getName();
    +          val newFile = new File(outputFolder + File.separator + fileName);
    +
    +          logDebug("file unzip : " + newFile.getAbsoluteFile());
    +
    +          // create folders
    +          new File(newFile.getParent()).mkdirs();
    +
    +          val fos = new FileOutputStream(newFile);
    +
    +          var len: Int = zis.read(buffer);
    +
    +          while (len > 0) {
    +
    +            fos.write(buffer, 0, len)
    +            len = zis.read(buffer)
    +          }
    +
    --- End diff --
    
    Done



[GitHub] spark pull request #14180: Wheelhouse and VirtualEnv support

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14180#discussion_r73429177
  
    --- Diff: .editorconfig ---
    @@ -0,0 +1,15 @@
    +root = true
    +
    +[*]
    +indent_style = space
    +indent_size = 4
    +end_of_line = lf
    +charset = utf-8
    +trim_trailing_whitespace = true
    +insert_final_newline = true
    +
    +[*.py]
    +indent_size = 4
    +
    +[*.scala]
    +indent_size = 2
    --- End diff --
    
    This might be better to do as a separate PR, since I could foresee us having some issues with the tabs configuration and wanting to do a revert (or vice versa).

