Posted to reviews@spark.apache.org by ziky90 <gi...@git.apache.org> on 2014/10/17 11:14:02 UTC

[GitHub] spark pull request: Added possibility to directly install python p...

GitHub user ziky90 opened a pull request:

    https://github.com/apache/spark/pull/2836

    Added possibility to directly install python packages on EC2

    The goal of this PR is to simplify installing Python packages for PySpark on EC2. Via the new options --pip-install and --pip-upgrade, it installs the selected packages on all nodes of the Spark cluster when the cluster is created.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ziky90/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2836.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2836
    
----
commit 365efade80cfb13aa70a33ff5cb21fef2d53a384
Author: Jan Zikeš <zi...@gmail.com>
Date:   2014-10-17T08:57:24Z

    Added possibility to install python packages on EC2 directly upon cluster start-up.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3989]Added possibility to directly inst...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2836#issuecomment-59580682
  
    > Could you please give me an example if you see an option where there might be a problem.
    
    If you look at the documentation of the `ssh` function, it says "Run a command on a host through SSH, retrying up to five times _and then throwing an exception if ssh continues to fail_".  If a library can't be installed, I think that the current code will cause `spark-ec2` to exit rather than continuing in a best-effort manner.
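
    To make that concrete, a best-effort behavior would have to be added explicitly. A minimal sketch (the function name is illustrative, not part of spark-ec2; a real version would also back off between attempts):

    ```shell
    # Hypothetical best-effort wrapper: retry a command up to five times,
    # but log and continue on failure instead of aborting the launch.
    best_effort_run() {
        local tries=5 i
        for i in $(seq 1 "$tries"); do
            "$@" && return 0
        done
        echo "WARN: '$*' failed after $tries attempts; continuing" >&2
        return 0  # swallow the failure so the caller keeps going
    }
    ```

    Called as, e.g., `best_effort_run pip install numpy`, this would never abort the surrounding launch, unlike the existing `ssh` helper.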




[GitHub] spark pull request: [SPARK-3989]Added possibility to directly inst...

Posted by ziky90 <gi...@git.apache.org>.
Github user ziky90 commented on the pull request:

    https://github.com/apache/spark/pull/2836#issuecomment-59645599
  
    Ok, currently I'm using EMR instead of the spark-ec2 script, because it seems more convenient to me than driving an EC2 cluster from my own bash script, but you're right, that's a possible way to go, and this functionality isn't necessarily needed in spark-ec2.




[GitHub] spark pull request: [SPARK-3989]Added possibility to directly inst...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2836#issuecomment-59487823
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-3989]Added possibility to directly inst...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2836#issuecomment-59552268
  
    I like the idea behind this, but I'm worried about adding even more stuff to the `spark-ec2` script, especially since I think this use case could be addressed by a more general "post startup" hook or script.
    
    I think that you can already install pip packages using `pssh`, e.g.
    
    ```bash
    pip install numpy  # Install on master
    pssh -h /root/spark-ec2/slaves pip install numpy  # Install on workers
    ```
    
    If we run these commands inside of `spark-ec2`, then what happens if one of them fails?  What if I want to pass configuration options to pip beyond what the `spark-ec2` wrapper supports?




[GitHub] spark pull request: [SPARK-3989]Added possibility to directly inst...

Posted by ziky90 <gi...@git.apache.org>.
Github user ziky90 commented on the pull request:

    https://github.com/apache/spark/pull/2836#issuecomment-59582401
  
    Ok, thank you. Now I see it.
    
    Based on this, I also think that executing a bootstrap script in a robust way would take much more effort than I previously thought (it would probably require implementing another ssh method). 




[GitHub] spark pull request: [SPARK-3989]Added possibility to directly inst...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2836#issuecomment-59630692
  
    Can your use-case be addressed by logging into the master and running a command after it's launched?  I think you could even automate this in your own bash script that calls spark-ec2 then pipes a command to `ssh` from bash after your cluster has launched.  Is there a reason why this feature has to be in `spark-ec2`?
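
    A sketch of what such a wrapper script might look like (cluster name, key pair, and package list are placeholders; it assumes the existing `get-master` action prints the master hostname on its last line):

    ```shell
    #!/bin/bash
    # Placeholder settings -- adjust to your own cluster and key pair.
    CLUSTER=my-cluster
    KEYPAIR=mykey
    IDENTITY=~/.ssh/mykey.pem
    PACKAGES="numpy"

    # Build the command to run on the master: install locally, then fan
    # out to the workers with pssh.
    build_remote_cmd() {
        echo "pip install $1 && pssh -h /root/spark-ec2/slaves pip install $1"
    }

    launch_and_install() {
        ./spark-ec2 -k "$KEYPAIR" -i "$IDENTITY" launch "$CLUSTER"
        # Assumes get-master prints the master hostname on its last line.
        local master
        master=$(./spark-ec2 -k "$KEYPAIR" -i "$IDENTITY" get-master "$CLUSTER" | tail -n 1)
        ssh -i "$IDENTITY" "root@$master" "$(build_remote_cmd "$PACKAGES")"
    }
    ```

    This keeps the pip-specific logic in the user's own script, outside of `spark-ec2`.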




[GitHub] spark pull request: [SPARK-3989]Added possibility to directly inst...

Posted by ziky90 <gi...@git.apache.org>.
Github user ziky90 commented on the pull request:

    https://github.com/apache/spark/pull/2836#issuecomment-59570517
  
    Ok, you are probably right.
    
    What about at least something like automatic execution of a user-defined bootstrap script, EMR-style, plus maybe automatic pip installation? 
    Do you think something like this would be acceptable as a new feature for spark-ec2?
    I'd be interested in implementing it, since it would help me (and, I hope, others) in setting up Spark on EC2. Currently I just don't know if there is an easy way to scp the script over, but I can at least try it.
    
    You're right that when one of the commands fails, that Python package will not be installed, and neither will any packages that depend on it. 
    And I think that you can pass as many options as you'd like. Could you please give me an example if you see an option where there might be a problem. 
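
    For the scp part, one straightforward shape would be a small helper along these lines (key path, host, and script name are all placeholders, not part of spark-ec2):

    ```shell
    # Hypothetical helper: copy a local bootstrap script to a host via
    # scp and execute it there over ssh.
    push_and_run() {
        local key=$1 host=$2 script=$3
        scp -i "$key" "$script" "root@$host:/root/$(basename "$script")" &&
            ssh -i "$key" "root@$host" "bash /root/$(basename "$script")"
    }

    # Usage (after launch): push_and_run ~/.ssh/mykey.pem "$MASTER" bootstrap.sh
    ```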




[GitHub] spark pull request: [SPARK-3989]Added possibility to directly inst...

Posted by ziky90 <gi...@git.apache.org>.
Github user ziky90 closed the pull request at:

    https://github.com/apache/spark/pull/2836

