Posted to reviews@spark.apache.org by florianverhein <gi...@git.apache.org> on 2015/02/10 01:51:47 UTC

[GitHub] spark pull request: [SPARK-5641] Allow spark_ec2.py to copy a wide...

GitHub user florianverhein opened a pull request:

    https://github.com/apache/spark/pull/4487

    [SPARK-5641] Allow spark_ec2.py to copy a wider range of files to cluster via deploy.generic

    Currently, `deploy_files` fails on binary files. This change fixes that for files with certain extensions, allowing a wider range of files to be deployed to the cluster.
    
    This improvement is particularly useful when binary files need to be uploaded. For example, I use it to transfer rpms that install extra software at cluster deployment time (I build or fetch the rpms just beforehand, and they are not available externally, so this provides a convenient and secure way of delivering extra modules without code changes).
    
    Note that it could also be used to override what's on the image (e.g. a jar) or to deliver a tarball (e.g. if it's not on S3, say because you built Spark against a different Hadoop or Tachyon version).
    
    To use this feature, the user can simply place the files in ec2/deploy.generic/ before calling spark_ec2.py.
    
    Detecting binary files reliably is non-trivial, so instead the implementation keeps a list of file extensions that trigger a simple file copy (bypassing template substitution).
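A minimal sketch of that extension-based dispatch (hypothetical names and extension list; the actual patch may differ in details):

```python
import os
import shutil

# Hypothetical list; the actual patch defines its own set of extensions.
RAW_COPY_EXTENSIONS = {".rpm", ".jar", ".tar", ".gz", ".zip"}

def deploy_file(src_path, dest_path, variables):
    """Copy src_path to dest_path. Files whose extension marks them as
    binary are copied byte-for-byte; everything else goes through
    {{name}} template substitution first."""
    if os.path.splitext(src_path)[1] in RAW_COPY_EXTENSIONS:
        shutil.copy(src_path, dest_path)  # binary-safe, no substitution
    else:
        with open(src_path) as f:
            text = f.read()
        # spark_ec2.py's deploy templates use {{name}} placeholders
        for name, value in variables.items():
            text = text.replace("{{" + name + "}}", value)
        with open(dest_path, "w") as f:
            f.write(text)
```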

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/florianverhein/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4487.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4487
    
----
commit 90029be07c08b9428220705c9c45f455563000fb
Author: Florian Verhein <fl...@gmail.com>
Date:   2015-02-06T02:04:00Z

    [SPARK-5641] [EC2] Make deploy_files work for certain types of binary files too

commit 59c192a7b02feb9c950a6e9b69c16703690e20c6
Author: Florian Verhein <fl...@gmail.com>
Date:   2015-02-06T07:43:10Z

    [SPARK-5641] fix formating for lint

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5641] Allow spark_ec2.py to copy a wide...

Posted by florianverhein <gi...@git.apache.org>.
Github user florianverhein commented on the pull request:

    https://github.com/apache/spark/pull/4487#issuecomment-73829153
  
    Good point. I don't use any yet... However, here is a concrete case where I will soon:
    https://github.com/florianverhein/spark-ec2/tree/packer/extra
    
    Using this approach, I could supply an entire spark-ec2 type module via the deploy.generic mechanism (make it copy into /root/extra). Now, since the init.sh and setup.sh scripts in the modules often need to know variables like `masters` and `slaves`, we have an example requiring substitution. 
    
    Further, if the module required binaries (e.g. for installation), it would be cleaner to have this all together (examples: jars, packaged R libraries, etc...). Hence, this case favours handling config + binary together. 
    
    Actually, I just thought of another use for this... copying up spark application jars together with any config they might require. This would mean the spark apps could be launched automatically after the cluster gets deployed :)




[GitHub] spark pull request: [SPARK-5641] Allow spark_ec2.py to copy a wide...

Posted by florianverhein <gi...@git.apache.org>.
Github user florianverhein commented on the pull request:

    https://github.com/apache/spark/pull/4487#issuecomment-73833605
  
    ... Actually, that `extra` example could technically be made to work without substitution, since I source the `/root/extra/*.sh` scripts from `/root/spark-ec2/extra/*.sh`... so I think bash variables (e.g. `$MASTERS` rather than `{{master_list}}`) should be fine in that case ( ? )
    
    But that doesn't affect the point re: separation of binary and configuration, just whether template substitution is strictly necessary. On the latter, I'm sure there's a case for having it (or at least no strong argument against allowing it).




[GitHub] spark pull request: [SPARK-5641] Allow spark_ec2.py to copy a wide...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4487#issuecomment-73685198
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27197/
    Test FAILed.




[GitHub] spark pull request: [SPARK-5641] Allow spark_ec2.py to copy a wide...

Posted by nchammas <gi...@git.apache.org>.
Github user nchammas commented on the pull request:

    https://github.com/apache/spark/pull/4487#issuecomment-73803678
  
    cc @shivaram 




[GitHub] spark pull request: [SPARK-5641] Allow spark_ec2.py to copy a wide...

Posted by florianverhein <gi...@git.apache.org>.
Github user florianverhein commented on the pull request:

    https://github.com/apache/spark/pull/4487#issuecomment-74175437
  
    No problems @shivaram . Thanks for taking the time to consider this! 
    
    Responses in order:
    * That's true, though *is* that a problem? I think it's unlikely, but if it does turn out to be, we could implement it so that anything that does not get substitution is copied directly (more complex, e.g. rcp deploy.generic and then overwrite any files requiring substitution via the temp route). It wouldn't change the user interface.
    * Consider using this for application deployment, i.e. you have multiple long-running spark applications and want to bring up a cluster for each of them (e.g. automatically at different scheduled times, etc). Without `extra`-like functionality, you'd have to fork `spark-ec2` for every application (a module), which is unreasonable from an ops and maintenance point of view. Also, the config for those apps should live closer to the apps, not in `spark-ec2`. Additionally, in my case I wouldn't be able to do this because it would expose information about proprietary software.
    * Correct, but directly editing `ec2_variables.sh` is problematic (again, maintenance of another fork if you have things that are not generic enough to be included in the spark repo).... unless there is a mechanism to just provide your own (which can live in your own source tree elsewhere). 
    
    Do you have any alternative suggestions? 
    
    I may be able to get by with an rcp mechanism that does not have the substitution capability (see the earlier post about `extra`; I could have another `ec2_variables.sh`-type script that gets sourced by `spark-ec2/setup.sh`). But would we want to separate/split the "copying to cluster" functionality?




[GitHub] spark pull request: [SPARK-5641] Allow spark_ec2.py to copy a wide...

Posted by florianverhein <gi...@git.apache.org>.
Github user florianverhein closed the pull request at:

    https://github.com/apache/spark/pull/4487




[GitHub] spark pull request: [SPARK-5641] Allow spark_ec2.py to copy a wide...

Posted by florianverhein <gi...@git.apache.org>.
Github user florianverhein commented on the pull request:

    https://github.com/apache/spark/pull/4487#issuecomment-74201000
  
    Closed in favour of an alternative solution.




[GitHub] spark pull request: [SPARK-5641] Allow spark_ec2.py to copy a wide...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4487#issuecomment-73685190
  
      [Test build #27197 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27197/consoleFull) for   PR 4487 at commit [`59c192a`](https://github.com/apache/spark/commit/59c192a7b02feb9c950a6e9b69c16703690e20c6).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-5641] Allow spark_ec2.py to copy a wide...

Posted by florianverhein <gi...@git.apache.org>.
Github user florianverhein commented on the pull request:

    https://github.com/apache/spark/pull/4487#issuecomment-73981957
  
    The user-supplied `deploy.generic` option is also a clean way for users to provide their own version of `ec2_variables.sh` - for example to set any extra flags / variables for `spark-ec2` without having to alter `spark_ec2.py` at all.  
    Concrete example: I could use this approach to set `TEST_MODULES` in https://github.com/florianverhein/spark-ec2/blob/packer/setup.sh (which controls whether tests are run as part of the setup).
    
    What do you think @shivaram ? Would adding a `--deploy-generic` CLI option in this PR work for you?




[GitHub] spark pull request: [SPARK-5641] Allow spark_ec2.py to copy a wide...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4487#issuecomment-73813182
  
      [Test build #27238 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27238/consoleFull) for   PR 4487 at commit [`59c192a`](https://github.com/apache/spark/commit/59c192a7b02feb9c950a6e9b69c16703690e20c6).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-5641] Allow spark_ec2.py to copy a wide...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/4487#issuecomment-73802695
  
    retest this please 




[GitHub] spark pull request: [SPARK-5641] Allow spark_ec2.py to copy a wide...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on the pull request:

    https://github.com/apache/spark/pull/4487#issuecomment-74191569
  
    Yep - that sounds good. Thanks @florianverhein 




[GitHub] spark pull request: [SPARK-5641] Allow spark_ec2.py to copy a wide...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on the pull request:

    https://github.com/apache/spark/pull/4487#issuecomment-74181412
  
    Among the options we have, I think the option of adding a flag like `--deploy-dir` is pretty good. My main concerns were to make sure that (a) we didn't affect the most common use case of launching a simple EC2 cluster, and (b)  we didn't create workflows that would be hard to support in the future.  
    
    I think the additional flag doesn't affect both of those, so I am fine with that.




[GitHub] spark pull request: [SPARK-5641] Allow spark_ec2.py to copy a wide...

Posted by florianverhein <gi...@git.apache.org>.
Github user florianverhein commented on the pull request:

    https://github.com/apache/spark/pull/4487#issuecomment-74187507
  
    Ok. So I'll close this PR and instead add a `--deploy-dir` option that rsyncs a given directory to master. 
    I'll make this step run after `deploy_files` (user should be given the "final say"), and call the new function `deploy_user_files`.
    *If* there's ever a strong need to support variable substitution for user-supplied files, we can solve it then... perhaps with a flag to turn the feature on, e.g. by factoring out the substitution code and reusing it in both places.
    This approach cleanly separates `spark/ec2` configuration from user-supplied configuration + binaries.
    I'm happy with that. 
    Does that sound good @shivaram ?
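A rough sketch of what such a `deploy_user_files` step might look like (the function name follows the discussion above; the rsync flags, helper, and parameters are assumptions for illustration):

```python
import subprocess

def rsync_command(deploy_dir, master_host, identity_file, user="root"):
    """Build the rsync invocation that pushes a user-supplied directory
    tree onto the master node, rooted at / (no template substitution)."""
    return [
        "rsync", "-rv",
        "-e", "ssh -i %s -o StrictHostKeyChecking=no" % identity_file,
        "%s/" % deploy_dir.rstrip("/"),
        "%s@%s:/" % (user, master_host),
    ]

def deploy_user_files(deploy_dir, master_host, identity_file):
    # Intended to run after deploy_files, so user-supplied files
    # get the "final say" over anything already deployed.
    subprocess.check_call(rsync_command(deploy_dir, master_host, identity_file))
```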




[GitHub] spark pull request: [SPARK-5641] Allow spark_ec2.py to copy a wide...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4487#issuecomment-73803329
  
      [Test build #27238 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27238/consoleFull) for   PR 4487 at commit [`59c192a`](https://github.com/apache/spark/commit/59c192a7b02feb9c950a6e9b69c16703690e20c6).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-5641] Allow spark_ec2.py to copy a wide...

Posted by florianverhein <gi...@git.apache.org>.
Github user florianverhein commented on the pull request:

    https://github.com/apache/spark/pull/4487#issuecomment-73817393
  
    Thanks for the comments @shivaram.
    
    The problem I see with `--copy-files` is that you don't get the variable substitution (and adding it there would probably duplicate things unnecessarily). Also, I'm not keen on a list of files, as the list can be long and hard to generate. I would prefer a single directory (like deploy.generic), with the ability to copy a directory structure anywhere on the image. This directory structure can contain a mix of binary files and configuration files (or scripts like ec2-variables.sh, or installation scripts) requiring substitution. I don't think it's a good idea to separate these groups arbitrarily. Consider copying up an entirely new spark-ec2 module complete with a jar it needs.
    In the above scenario, ec2-variables.sh just becomes a special case (that happens to be provided already).  
    The patch achieves these things with only very small changes. I thought about adding something like `--copy-dir` but then realised it would be duplicating what's already there. 
    
    I agree copying into deploy.generic seems a bit clunky (though not so bad since copying rpms is always going to be necessary -- see below). **So how about we make the deploy.generic location a CLI option? That seems pretty clean to me.** 
    (aside: briefly thought about adding an "additional deploy.generic" option that gets deployed afterwards - though this feels ugly / prone to problems). 
    
    That way, an automated deployment script would look something like:
    1. clone spark for spark/ec2
    2. clone user's specific configuration repo (containing their "deploy.generic") - alternatively this might already be part of the repo containing this automated deployment script (that's what I do actually). 
    3. build (or copy from a yum repo) the rpms into the user's configuration (we can't avoid copying binaries at some point)
    4. run spark_ec2.py (pointing deploy.generic at the user's configuration) 
    
    I do something very similar at the moment, except that I just copy everything into deploy.generic once I've cloned spark.
    
    I hope that's explained my motivation a bit better.
    What do you think?




[GitHub] spark pull request: [SPARK-5641] Allow spark_ec2.py to copy a wide...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4487#issuecomment-73623534
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-5641] Allow spark_ec2.py to copy a wide...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4487#issuecomment-73676477
  
      [Test build #27197 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27197/consoleFull) for   PR 4487 at commit [`59c192a`](https://github.com/apache/spark/commit/59c192a7b02feb9c950a6e9b69c16703690e20c6).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-5641] Allow spark_ec2.py to copy a wide...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4487#issuecomment-73813186
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27238/
    Test PASSed.




[GitHub] spark pull request: [SPARK-5641] Allow spark_ec2.py to copy a wide...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/4487#issuecomment-73676182
  
    ok to test




[GitHub] spark pull request: [SPARK-5641] Allow spark_ec2.py to copy a wide...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on the pull request:

    https://github.com/apache/spark/pull/4487#issuecomment-73825368
  
    One more clarification -- what substitutions from `ec2-variables.sh` do you find useful in other deployment scripts (i.e. scripts other than `ec2-variables.sh`)? I ask mostly to see if we can separate binary data from configuration, or if those need to be looked at together.




[GitHub] spark pull request: [SPARK-5641] Allow spark_ec2.py to copy a wide...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on the pull request:

    https://github.com/apache/spark/pull/4487#issuecomment-73805058
  
    I am not sure this is the best way to implement this -- `deploy.generic` right now is used to copy out one shell script [1] and is part of the source tree. 
    
    It seems kludgy to ask users to copy binary files like rpms or debs into `deploy.generic/` and then have it copied over. Instead can't we just take a new argument `--copy-files` or something like that which takes in a comma separated list of paths and copies these files to some well known location (either `/root` or `/root/spark-ec2` ?)
    
    [1] https://github.com/florianverhein/spark/blob/master/ec2/deploy.generic/root/spark-ec2/ec2-variables.sh




[GitHub] spark pull request: [SPARK-5641] Allow spark_ec2.py to copy a wide...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on the pull request:

    https://github.com/apache/spark/pull/4487#issuecomment-74126044
  
    @florianverhein Sorry for the delay in getting back. At a high level adding a flag to specify a `deploy.generic` directory seems better than asking people to copy binaries. However there are a couple of things that I am still concerned about
    
    - To do variable substitution we make a copy of files from deploy.generic to a temp directory and then copy files after substitution from the temp directory to the EC2 machine. This might be a bad idea for large rpms / binaries as we will unnecessarily create copies.
    
    - I am not sure why one would want to add an extra module through deploy.generic. Isn't it easier / better to have a fork of spark-ec2 where the module is placed and then just use `--spark-ec2-git-repo` ? 
    
    - Finally I am not sure I completely get the case for adding new flags (like `TEST_MODULES`). If these flags need to be set based on EC2 instances, then it looks like one does need to edit `spark_ec2.py` -- if not, new flags can be directly added to `ec2-variables.sh` (i.e. by editing the file in the source tree)





[GitHub] spark pull request: [SPARK-5641] Allow spark_ec2.py to copy a wide...

Posted by florianverhein <gi...@git.apache.org>.
Github user florianverhein commented on the pull request:

    https://github.com/apache/spark/pull/4487#issuecomment-73801881
  
    @srowen weird... is Jenkins flaky (overloaded)? Here's the failure:
    
    ```
    Error Message
    
    The code passed to failAfter did not complete within 60 seconds.
    ```
    
    **Aside**
    Even though this change doesn't affect existing tests, it would be nice to run them locally - but when I do this:
    * the spark unit tests pass except for some hive and yarn tests, which look to be intermittent problems ( ? ) (e.g. java.net.SocketException: Broken pipe). 
    * The python/run-tests (which test pyspark) failed with:
    ```
    warning: could not import accumulators. Your function may unexpectedly error due to this import failing; A version mismatch is likely. Specific error was:
    Traceback (most recent call last):
      File "pyspark/cloudpickle.py", line 870, in _modules_to_main
    ```
    ... But they seem fine on Jenkins. I notice the tests pick up Python 2.6.6 rather than the Python 2.7 on my PATH.
    
    Do you have any tips / know of any documentation for fixing this? I already had to increase my ulimit -u to get this far.
    
    The only test that touches ec2/spark_ec2.py is lint-python, which passes. 

