You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by JoshRosen <gi...@git.apache.org> on 2015/02/13 23:47:21 UTC

[GitHub] spark pull request: [SPARK-5227] [SPARK-5679] Disable FileSystem c...

GitHub user JoshRosen opened a pull request:

    https://github.com/apache/spark/pull/4599

    [SPARK-5227] [SPARK-5679] Disable FileSystem cache in WholeTextFileRecordReaderSuite

    This patch fixes two difficult-to-reproduce Jenkins test failures in InputOutputMetricsSuite (SPARK-5227 and SPARK-5679).  The problem was that WholeTextFileRecordReaderSuite modifies the `fs.local.block.size` Hadoop configuration and this change was affecting subsequent test suites due to Hadoop's caching of FileSystem instances (see HADOOP-8490 for more details).
    
    The fix implemented here is to disable FileSystem caching in WholeTextFileRecordReaderSuite.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark inputoutputsuite-fix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4599.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4599
    
----
commit 47dc4473cb817f5bf14a00f0c487c984aa2cc7c6
Author: Josh Rosen <jo...@databricks.com>
Date:   2015-02-13T22:33:05Z

    [SPARK-5227] [SPARK-5679] Disable FileSystem cache in WholeTextFileRecordReaderSuite

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5227] [SPARK-5679] Disable FileSystem c...

Posted by ksakellis <gi...@git.apache.org>.

Github user ksakellis commented on the pull request:

    https://github.com/apache/spark/pull/4599#issuecomment-74352451
  
    Ok, this LTGM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5227] [SPARK-5679] Disable FileSystem c...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/4599#issuecomment-74341718
  
    I SSH'ed into an AMPLab Jenkins box to reproduce the original failures and have confirmed that this patch fixes them.
    
    /cc @ksakellis


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5227] [SPARK-5679] Disable FileSystem c...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/4599


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5227] [SPARK-5679] Disable FileSystem c...

Posted by sryza <gi...@git.apache.org>.

Github user sryza commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4599#discussion_r24708608
  
    --- Diff: core/src/test/scala/org/apache/spark/input/WholeTextFileRecordReaderSuite.scala ---
    @@ -42,7 +42,15 @@ class WholeTextFileRecordReaderSuite extends FunSuite with BeforeAndAfterAll {
       private var factory: CompressionCodecFactory = _
     
       override def beforeAll() {
    -    sc = new SparkContext("local", "test")
    +    // Hadoop's FileSystem caching does not use the Configuration as part of its cache key, which
    +    // can cause Filesystem.get(Configuration) to return a cached instance created with a different
    +    // configuration than the one passed to get() (see HADOOP-8490 for more details). This caused
    +    // hard-to-reproduce test failures, since any suites that were run after this one would inherit
    +    // the new value of "fs.local.block.size" (see SPARK-5227 and SPARK-5679). To work around this,
    +    // we disable FileSystem caching in this suite.
    +    val conf = new SparkConf().set("spark.hadoop.fs.file.impl.disable.cache", "true")
    --- End diff --
    
    My opinion is that doing it for all tests would be a bit drastic.  It's not that modifications to `Configuration` objects arbitrarily affect other `Configuration` objects.  It's that this test specifically relies on some underlying properties set on the `FileSystem`, and the FS Cache allows these to leak to other FS instances.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5227] [SPARK-5679] Disable FileSystem c...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4599#issuecomment-74341495
  
      [Test build #27461 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27461/consoleFull) for   PR 4599 at commit [`47dc447`](https://github.com/apache/spark/commit/47dc4473cb817f5bf14a00f0c487c984aa2cc7c6).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5227] [SPARK-5679] Disable FileSystem c...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4599#discussion_r24707723
  
    --- Diff: core/src/test/scala/org/apache/spark/input/WholeTextFileRecordReaderSuite.scala ---
    @@ -42,7 +42,15 @@ class WholeTextFileRecordReaderSuite extends FunSuite with BeforeAndAfterAll {
       private var factory: CompressionCodecFactory = _
     
       override def beforeAll() {
    -    sc = new SparkContext("local", "test")
    +    // Hadoop's FileSystem caching does not use the Configuration as part of its cache key, which
    +    // can cause Filesystem.get(Configuration) to return a cached instance created with a different
    +    // configuration than the one passed to get() (see HADOOP-8490 for more details). This caused
    +    // hard-to-reproduce test failures, since any suites that were run after this one would inherit
    +    // the new value of "fs.local.block.size" (see SPARK-5227 and SPARK-5679). To work around this,
    +    // we disable FileSystem caching in this suite.
    +    val conf = new SparkConf().set("spark.hadoop.fs.file.impl.disable.cache", "true")
    --- End diff --
    
    Good question.  If we wanted to disable this in all tests, then I think the right place to do that would be in the Maven and SBT builds via system properties.
    
    I chose not to do that here because I wasn't sure whether doing so might mask bugs, since most users of Spark will run with FileSystem caching enabled (I think that disabling it across the board may harm performance, since it sounds like a lot of Hadoop code assumes that FileSystem.get is cheap, and, accordingly, calls it many times).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5227] [SPARK-5679] Disable FileSystem c...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/4599#issuecomment-74341896
  
    JIRA links, for the curious:
    
    - https://issues.apache.org/jira/browse/HADOOP-8490
    - https://issues.apache.org/jira/browse/SPARK-5679
    - https://issues.apache.org/jira/browse/SPARK-5227


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5227] [SPARK-5679] Disable FileSystem c...

Posted by pwendell <gi...@git.apache.org>.

Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/4599#issuecomment-74355615
  
    LGTM .


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5227] [SPARK-5679] Disable FileSystem c...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4599#issuecomment-74349968
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27461/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5227] [SPARK-5679] Disable FileSystem c...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/4599#issuecomment-74350596
  
    /cc @pwendell @andrewor14


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5227] [SPARK-5679] Disable FileSystem c...

Posted by ksakellis <gi...@git.apache.org>.

Github user ksakellis commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4599#discussion_r24707286
  
    --- Diff: core/src/test/scala/org/apache/spark/input/WholeTextFileRecordReaderSuite.scala ---
    @@ -42,7 +42,15 @@ class WholeTextFileRecordReaderSuite extends FunSuite with BeforeAndAfterAll {
       private var factory: CompressionCodecFactory = _
     
       override def beforeAll() {
    -    sc = new SparkContext("local", "test")
    +    // Hadoop's FileSystem caching does not use the Configuration as part of its cache key, which
    +    // can cause Filesystem.get(Configuration) to return a cached instance created with a different
    +    // configuration than the one passed to get() (see HADOOP-8490 for more details). This caused
    +    // hard-to-reproduce test failures, since any suites that were run after this one would inherit
    +    // the new value of "fs.local.block.size" (see SPARK-5227 and SPARK-5679). To work around this,
    +    // we disable FileSystem caching in this suite.
    +    val conf = new SparkConf().set("spark.hadoop.fs.file.impl.disable.cache", "true")
    --- End diff --
    
    So, do you think we should disable it across all tests? just in case there are other tests that also modify the hadoop configuration thinking that the config objects are local to them? It might bite someone else in the butt later if we don't globally do this. I don't think there is a global test class that every tests inherits, maybe we can add it in SparkSparkContext since a lot of the new tests written use that trait? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-5227] [SPARK-5679] Disable FileSystem c...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4599#issuecomment-74349965
  
      [Test build #27461 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27461/consoleFull) for   PR 4599 at commit [`47dc447`](https://github.com/apache/spark/commit/47dc4473cb817f5bf14a00f0c487c984aa2cc7c6).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org