Posted to reviews@spark.apache.org by JoshRosen <gi...@git.apache.org> on 2015/11/11 09:41:37 UTC

[GitHub] spark pull request: [SPARK-9866][SQL] Speed up VersionsSuite by us...

GitHub user JoshRosen opened a pull request:

    https://github.com/apache/spark/pull/9624

    [SPARK-9866][SQL] Speed up VersionsSuite by using standard Ivy cache

    This patch attempts to speed up VersionsSuite by storing fetched Hive JARs in the standard Ivy cache instead of copying them to a temporary directory. The only concern here is stability; in #7026, @vanzin mentioned that Ivy could become confused by existing caches. I'm curious to know whether this is still a problem; if not, this might be a cheap way to save a few minutes of build time.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark SPARK-9866

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9624.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9624
    
----
commit 6172d7abc39da5cf9db2ebcc19cd219d776acf5a
Author: Josh Rosen <jo...@databricks.com>
Date:   2015-11-11T08:34:24Z

    [SPARK-9866][SQL] Speed up VersionsSuite by using standard Ivy cache.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-9866][SQL] Speed up VersionsSuite by us...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9624#issuecomment-155705102
  
    Merged build started.


[GitHub] spark pull request: [SPARK-9866][SQL] Speed up VersionsSuite by us...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/9624#issuecomment-156893406
  
    Maybe @brkyvz has more insights on the Ivy corruption issues that you described?


[GitHub] spark pull request: [SPARK-9866][SQL] Speed up VersionsSuite by us...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/9624#issuecomment-155796802
  
    Before this patch, VersionsSuite took ~5-7 minutes in Jenkins; it's 2-3 minutes after.


[GitHub] spark pull request: [SPARK-9866][SQL] Speed up VersionsSuite by us...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/9624#issuecomment-156922650
  
    I did some more digging and it looks like this whole locking situation is a much bigger mess than I originally estimated:
    
    - SBT coordinates access to the Ivy cache using a lock file that is shared by all running SBT processes: https://github.com/sbt/sbt/blob/v0.13.8/ivy/src/main/scala/sbt/Ivy.scala#L65. [By default](https://github.com/sbt/sbt/blob/v0.13.8/ivy/src/main/scala/sbt/Ivy.scala#L119), this file is named ".sbt.ivy.lock". This will handle coordination among SBT processes but does not guard against other processes which might also use the Ivy cache.
    - Ivy's built-in locking support seems hard to configure and also seems prone to issues related to lock files not being cleaned up when processes crash: https://issues.apache.org/jira/browse/IVY-1388.
    - At http://jira.pentaho.com/browse/BISERVER-4809, someone points out that although Ivy's artifact cache can be guarded via locks, its resolution cache is not guardable by any of Ivy's locking strategies.
    
    @brkyvz, given all of this, I wonder whether Spark's default Ivy cache location should be different than `~/.ivy2`.
    
    To help clear this patch out of the queue, though, I'll just go ahead and implement the flag as originally discussed (off by default and using shared cache only in Jenkins, where it happens to be safe because each build workspace has its own isolated Maven and Ivy cache directories which are preserved across builds).
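
The lock-file coordination described in the first bullet above can be illustrated with a minimal Python sketch. It is not SBT's or Ivy's implementation: `cache_lock`, its parameters, and the polling loop are hypothetical, and only the file name `.sbt.ivy.lock` comes from the discussion. It assumes every participant opts in to the same lock file, which is the limitation noted above: a process that never checks the file is not blocked, and a process killed while holding the lock leaves the file behind.

```python
import os
import time
from contextlib import contextmanager


@contextmanager
def cache_lock(lock_path, timeout=10.0, poll=0.05):
    """Hold an exclusive lock file for the duration of the with-block.

    Hypothetical helper for illustration only; not SBT or Ivy code.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation atomic: only one process can succeed.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            # Another cooperating process holds the lock; wait and retry.
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll)
    try:
        yield
    finally:
        # Runs on normal exit and on exceptions, but not if the
        # process is killed, which is how stale lock files appear.
        os.close(fd)
        os.remove(lock_path)
```

A caller would wrap each cache access in `with cache_lock(path): ...`; a second caller using the same path waits until the first one exits or the timeout expires.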


[GitHub] spark pull request: [SPARK-9866][SQL] Speed up VersionsSuite by us...

Posted by brkyvz <gi...@git.apache.org>.
Github user brkyvz commented on the pull request:

    https://github.com/apache/spark/pull/9624#issuecomment-156914641
  
    @JoshRosen There are so many things I couldn't figure out about Ivy... The library is practically a black box. It unfortunately doesn't have a lot of documentation (at least for the library). There are two problems that I would love to fix:
    1) Ivy configurations. Basically, if you publish a library locally in SBT with publishLocal, Spark won't be able to access it due to configurations. publishM2 works fine though.
    2) This cache corruption issue.
    IMHO, I would suggest adding a flag that is off by default but on in Jenkins, where it would enable the shared cache. If this causes flakiness in the SparkSubmitSuite tests, then we'll have to turn it off though :(


[GitHub] spark pull request: [SPARK-9866][SQL] Speed up VersionsSuite by us...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/9624#issuecomment-155870366
  
    Do you know what the origin of the problem is? Is it a concurrency or locking bug in Spark? There are many applications which use Ivy that don't run into these problems (e.g. running SBT and IntelliJ at the same time), so I'm wondering if Spark is doing something non-standard that breaks this.


[GitHub] spark pull request: [SPARK-9866][SQL] Speed up VersionsSuite by us...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the pull request:

    https://github.com/apache/spark/pull/9624#issuecomment-155856310
  
    That is most definitely still a problem. I doubt it will affect Jenkins machines, but it affects local builds when you're executing that test and you also use Maven for other things. If you could make this configurable, so that based on some env variable set by Jenkins it uses the shared cache, I'd be more comfortable.
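
One way to read this suggestion, as a minimal Python sketch rather than the code that went into the patch: the environment variable name `USE_SHARED_IVY_CACHE` and the function `choose_ivy_cache_dir` are hypothetical, and `~/.ivy2` is assumed as the shared location. The shared cache is used only when the variable is set, so local builds keep an isolated temporary directory by default.

```python
import os
import tempfile


def choose_ivy_cache_dir(env, shared_dir=None):
    """Pick the Ivy cache directory for a test run.

    `env` is a mapping such as os.environ. USE_SHARED_IVY_CACHE is a
    hypothetical variable name that a CI server would set.
    """
    if env.get("USE_SHARED_IVY_CACHE", "").lower() in ("1", "true"):
        # Opt-in: reuse a cache that survives across builds.
        return shared_dir or os.path.expanduser("~/.ivy2")
    # Default: a fresh directory, so nothing is shared with other tools.
    return tempfile.mkdtemp(prefix="ivy-cache-")
```

Passing the environment as an argument keeps the choice easy to exercise in a test without modifying the real process environment.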


[GitHub] spark pull request: [SPARK-9866][SQL] Speed up VersionsSuite by us...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9624#issuecomment-155737425
  
    **[Test build #45620 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45620/consoleFull)** for PR 9624 at commit [`6172d7a`](https://github.com/apache/spark/commit/6172d7abc39da5cf9db2ebcc19cd219d776acf5a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-9866][SQL] Speed up VersionsSuite by us...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/9624#issuecomment-156914028
  
    @vanzin, I took a closer look and it appears that Ivy's default locking strategy is "no locking", which is not safe when using a shared cache. I'm going to submit a separate PR to fix this.


[GitHub] spark pull request: [SPARK-9866][SQL] Speed up VersionsSuite by us...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9624#issuecomment-155737574
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/45620/
    Test PASSed.


[GitHub] spark pull request: [SPARK-9866][SQL] Speed up VersionsSuite by us...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9624#issuecomment-155706724
  
    **[Test build #45620 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45620/consoleFull)** for PR 9624 at commit [`6172d7a`](https://github.com/apache/spark/commit/6172d7abc39da5cf9db2ebcc19cd219d776acf5a).


[GitHub] spark pull request: [SPARK-9866][SQL] Speed up VersionsSuite by us...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the pull request:

    https://github.com/apache/spark/pull/9624#issuecomment-155874229
  
    I don't really know what the source of the problem is. My best guess is some incompatibility between the version of Ivy embedded in SBT and the version of Ivy used by Spark, or differences in how they handle dependencies from the local Maven cache. I remember that nuking the offending artifacts from the local Maven cache would fix things (Ivy would then download fresh copies into its own cache), but that wasn't a good long-term solution for someone who also uses Maven.


[GitHub] spark pull request: [SPARK-9866][SQL] Speed up VersionsSuite by us...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9624#issuecomment-155705042
  
     Merged build triggered.


[GitHub] spark pull request: [SPARK-9866][SQL] Speed up VersionsSuite by us...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9624#issuecomment-155737570
  
    Merged build finished. Test PASSed.


[GitHub] spark pull request: [SPARK-9866][SQL] Speed up VersionsSuite by us...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/9624#issuecomment-156915513
  
    Isn't `spark-submit`'s default behavior to use a shared Ivy cache? If so, shouldn't we be testing with that as the default rather than the exception, so that any bugs which are painful for users are equally painful for us, too? My fear is that if there is somehow a corruption issue then we'll be masking it by not using the shared cache in our own tests. Basically, I'm arguing that the shared cache in tests should be an opt-out feature, not opt-in.

