Posted to reviews@spark.apache.org by sarahgerweck <gi...@git.apache.org> on 2014/08/03 12:50:29 UTC

[GitHub] spark pull request: Optionally parallelize the Spark build.

GitHub user sarahgerweck opened a pull request:

    https://github.com/apache/spark/pull/1752

    Optionally parallelize the Spark build.

    This introduces a new environment variable, `SPARK_BUILD_THREADS`,
    that controls the level of Maven parallelization (if set). This uses
    Maven 3's syntax for the number of threads: either a simple integer or
    something like `1.5C` to indicate 1.5 times the number of cores on the
    build server.
    
    On my hardware (Intel Xeon E5-1620 v2, roughly equivalent to a fast
    i7), a setting of `1.5C` speeds up the build by about 15%.
    
    This will trigger some warnings that the Scala plugins are not marked as
    threadsafe. Several repeated builds show no differences in the compiled
    distribution (save timestamps), but this isn't proof that it can never
    fail.
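
As a rough illustration of the behavior described above, here is a minimal sketch of how a build script could add Maven's `-T` flag only when `SPARK_BUILD_THREADS` is set. The function name `build_command` is hypothetical and this is not the text of the patch itself; only the base command `mvn clean package -DskipTests` and the variable name come from this thread.

```shell
# Hypothetical helper (not the actual patch): assemble the Maven command
# line, adding -T only when SPARK_BUILD_THREADS is set and non-empty.
build_command() (
  cmd="mvn clean package -DskipTests"
  if [ -n "${SPARK_BUILD_THREADS:-}" ]; then
    # Pass the value through unchanged, e.g. "4" or "1.5C".
    cmd="$cmd -T $SPARK_BUILD_THREADS"
  fi
  # Append any extra arguments supplied by the caller.
  if [ "$#" -gt 0 ]; then
    cmd="$cmd $*"
  fi
  echo "$cmd"
)
```

With the variable unset, the command is unchanged, so the default build stays single-threaded.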

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/AtScaleInc/spark parBuild

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1752.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1752
    
----
commit 3cac93065df7c5620c1f4024193c1081a418cdbb
Author: Sarah Gerweck <sa...@gmail.com>
Date:   2014-08-03T09:40:59Z

    Optionally parallelize the Spark build.
    

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: Optionally parallelize the Spark build.

Posted by sarahgerweck <gi...@git.apache.org>.
Github user sarahgerweck commented on the pull request:

    https://github.com/apache/spark/pull/1752#issuecomment-51015956
  
    @pwendell BTW, yes, it works just as well if you just put `-T 1.5C` at the end of your flags, and I agree that this is the right way (unless this were to become a standard flag in the build, which I wouldn't recommend without testing beyond my own). There's no reason to invent another syntax for enabling threading.
    If you're interested, I will take some time this week to write up a paragraph for the docs and send a PR in case others are interested. (To me, shaving ninety seconds off a ten-minute build is a worthwhile speedup, but I won't waste my time unless you think you'd want to add it to the docs.)




[GitHub] spark pull request: Optionally parallelize the Spark build.

Posted by sarahgerweck <gi...@git.apache.org>.
Github user sarahgerweck commented on the pull request:

    https://github.com/apache/spark/pull/1752#issuecomment-50997026
  
    @pwendell This is a good point. I didn't realize (but probably should've) that the script passes through all the Maven build options. I'll see if there's a good place to document this and open a different PR.
    @srowen I don't know whether using this option will trigger parallel-test problems or not.




[GitHub] spark pull request: Optionally parallelize the Spark build.

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1752#discussion_r15733991
  
    --- Diff: make-distribution.sh ---
    @@ -154,7 +154,19 @@ cd "$FWDIR"
     
     export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
     
    -BUILD_COMMAND="mvn clean package -DskipTests $@"
    --- End diff --
    
    This doesn't help developers running tests, just those building the distribution. Why not make this a setting in `pom.xml`? It should be an option for the Surefire plugin too.




[GitHub] spark pull request: Optionally parallelize the Spark build.

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/1752#issuecomment-51015439
  
    @srowen @sarahgerweck We recently started running several Jenkins executors per machine anyway, so we already have to deal with parallel tests. We still have a few test cases that don't handle this properly, but that shouldn't block us from parallelizing where we can in the test setup. We should fix those cases anyway :)




[GitHub] spark pull request: Optionally parallelize the Spark build.

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/1752#issuecomment-50987445
  
    I suspect this will eventually cause tests to collide when they try to allocate the same file or port. (We already see a lot of Jenkins failures with "Address already in use".) Of course, ideally that doesn't happen. I have no idea whether every last instance can be fixed.




[GitHub] spark pull request: Optionally parallelize the Spark build.

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/1752#issuecomment-51015513
  
    Anyways, given the fact that you can pass this directly to the build, I do agree closing this is probably the way to go.




[GitHub] spark pull request: Optionally parallelize the Spark build.

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/1752#issuecomment-50996784
  
    @sarahgerweck If the user can pass this option already, it might make sense to just document it somewhere rather than add a new flag.




[GitHub] spark pull request: Optionally parallelize the Spark build.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/1752#issuecomment-50987412
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: Optionally parallelize the Spark build.

Posted by sarahgerweck <gi...@git.apache.org>.
Github user sarahgerweck closed the pull request at:

    https://github.com/apache/spark/pull/1752




[GitHub] spark pull request: Optionally parallelize the Spark build.

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/1752#issuecomment-51030127
  
    @pwendell Ah, so that's what's causing it. Yes, fix forward by all means, but can this be disabled until then? It looks like about half or more of all test runs are failing spuriously, which just means they have to be run 2-3 times. It now takes longer to get a passing test suite even when the tests really do pass. In a way, parallelizing single builds is less prone to this conflict.




[GitHub] spark pull request: Optionally parallelize the Spark build.

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/1752#issuecomment-50996284
  
    Hey @sarahgerweck thanks for sending this. I've looked into this in the past but only found something like a 10% speedup so didn't spend too much time.
    
    One question though: this script allows passing arbitrary build options to Maven. So can't a user just do this?
    
    ```
    ./make-distribution.sh ...other opts... -T 1.5C
    ```
    
    We've generally tried to avoid replicating build options/flags that directly correspond to existing maven flags.
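
For readers unfamiliar with the `-T` thread syntax mentioned in this thread (a plain integer, or a multiplier followed by `C` for "times the number of cores"), the following is a small sketch of how such a spec maps to a thread count. The function `resolve_threads` is hypothetical, the core count is passed in explicitly for reproducibility, and truncating fractional results toward zero is an assumption of this sketch rather than something stated in the thread.

```shell
# Hypothetical helper: resolve a Maven-style thread spec ("4" or "1.5C")
# to a concrete thread count for a given number of cores.
resolve_threads() (
  spec="$1"
  cores="$2"
  case "$spec" in
    *C)
      # Strip the trailing C and scale by the core count, truncating
      # toward zero (the truncation is an assumption of this sketch).
      awk -v m="${spec%C}" -v c="$cores" 'BEGIN { print int(m * c) }'
      ;;
    *)
      # A plain integer is used as-is.
      echo "$spec"
      ;;
  esac
)

resolve_threads 1.5C 4   # prints 6
resolve_threads 3 4      # prints 3
```

On a four-core machine, `-T 1.5C` would therefore correspond to six build threads under these assumptions.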


