Posted to reviews@spark.apache.org by cloud-fan <gi...@git.apache.org> on 2018/02/12 04:29:55 UTC

[GitHub] spark pull request #20584: [SPARK-23390][SQL] Flaky Test Suite: FileBasedDat...

GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/20584

    [SPARK-23390][SQL] Flaky Test Suite: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7

    ## What changes were proposed in this pull request?
    
    This test fails only with sbt on Hadoop 2.7. I can't reproduce it locally, but here is my speculation from reading the code:
    1. `FileSystem.delete` doesn't delete the directory entirely, and somehow we can still open the file as a 0-length empty file (just speculation).
    2. ORC intentionally allows empty files, and the reader fails during reading without closing the file stream.
    
    This PR improves the test to make sure all files are deleted and can't be opened.
    
    ## How was this patch tested?
    
    N/A
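
The approach in the PR description above (delete each data file, then make sure none of them can still be opened) can be sketched in isolation. This is a minimal sketch and not the code from the PR: it uses `java.nio.file` from the JDK as a stand-in for Hadoop's `FileSystem`, and `DeleteAndVerify`, `deleteAll`, and `cannotOpen` are hypothetical names introduced only for illustration.

```scala
import java.nio.file.{Files, Path}
import scala.util.Try

// Hypothetical helper, assuming a local directory that contains only regular files.
object DeleteAndVerify {

  /** Deletes every regular file in `dir`, then `dir` itself, and returns the deleted file paths. */
  def deleteAll(dir: Path): List[Path] = {
    // Collect the data files before deleting anything, so they can be checked afterwards.
    val files = dir.toFile.listFiles().filter(_.isFile).map(_.toPath).toList
    files.foreach(f => Files.delete(f))
    // Files.delete throws if the directory still has entries (for example subdirectories).
    Files.delete(dir)
    files
  }

  /** True when opening `p` for reading fails, which is what "can't be opened" means here. */
  def cannotOpen(p: Path): Boolean =
    Try(Files.newInputStream(p).close()).isFailure
}
```

A test could call `deleteAll` on the directory and assert `cannotOpen` for every returned path before running the query that is expected to tolerate missing files.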

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark flaky-test

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20584.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20584
    
----
commit 51bb48a4189aeb0322dd4ccd0f02416a52e963c3
Author: Wenchen Fan <we...@...>
Date:   2018-02-12T04:24:35Z

    make sure all files are deleted when testing IGNORE_MISSING_FILES

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20584: [SPARK-23390][SQL] Flaky Test Suite: FileBasedDataSource...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/20584
  
    LGTM, I would merge this first and see whether this can help fix the flaky tests.


---



[GitHub] spark issue #20584: [SPARK-23390][SQL] Flaky Test Suite: FileBasedDataSource...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20584
  
    I won't get in the way, but I am less sure about this. I thought this was flaky in the PR builder too anyway.


---



[GitHub] spark issue #20584: [SPARK-23390][SQL] Flaky Test Suite: FileBasedDataSource...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20584
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/802/
    Test PASSed.


---



[GitHub] spark issue #20584: [SPARK-23390][SQL] Flaky Test Suite: FileBasedDataSource...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20584
  
    You are right. I have run out of ideas. LGTM too as something to try, if it happens more frequently in spark-branch-2.3-test-sbt-hadoop-2.7.


---



[GitHub] spark issue #20584: [SPARK-23390][SQL] Flaky Test Suite: FileBasedDataSource...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20584
  
    **[Test build #87321 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87321/testReport)** for PR 20584 at commit [`51bb48a`](https://github.com/apache/spark/commit/51bb48a4189aeb0322dd4ccd0f02416a52e963c3).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #20584: [SPARK-23390][SQL] Flaky Test Suite: FileBasedDataSource...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/20584
  
    My bad. Thank you, guys. For the following, I'll investigate it.
    > According to the log, the leaked file stream was created when building the ORC columnar reader.


---



[GitHub] spark issue #20584: [SPARK-23390][SQL] Flaky Test Suite: FileBasedDataSource...

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/20584
  
    I am also thinking about this. I agree with this.
    > According to the log, the leaked file stream was created when building the ORC columnar reader.
    
    I am suspicious about the relationship between `afterEach()` and `addTaskCompletionListener` (which calls `close()`), but I'm not sure. Let us try this approach first.



---



[GitHub] spark issue #20584: [SPARK-23390][SQL] Flaky Test Suite: FileBasedDataSource...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on the issue:

    https://github.com/apache/spark/pull/20584
  
    merging this to master/2.3. Thanks!


---



[GitHub] spark pull request #20584: [SPARK-23390][SQL] Flaky Test Suite: FileBasedDat...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/20584


---



[GitHub] spark issue #20584: [SPARK-23390][SQL] Flaky Test Suite: FileBasedDataSource...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/20584
  
    This patch helps `sbt/hadoop2.7`. So, I'm closely monitoring the latest consecutive failures in the `sbt` and `hadoop-2.6` branch, too.
    
    - 4210 (Running)
    - 4209 Failed with **`FileBasedDataSourceSuite`** and `ParquetQuerySuite`
    - 4208 **This patch landed here** but failed with `StreamingOuterJoinSuite` and `ReceiverSuite`.
    - 4207 Failed with `ParquetQuerySuite`
    - 4206 Failed with `BufferHolderSparkSubmitSuite`


---



[GitHub] spark issue #20584: [SPARK-23390][SQL] Flaky Test Suite: FileBasedDataSource...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20584
  
    BTW, I would bet on case 2 in the PR description (just a rough, wild guess).


---



[GitHub] spark issue #20584: [SPARK-23390][SQL] Flaky Test Suite: FileBasedDataSource...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20584
  
    I think I rushed when I took a look the first time. Thanks for fixing this.


---



[GitHub] spark issue #20584: [SPARK-23390][SQL] Flaky Test Suite: FileBasedDataSource...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20584
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87321/
    Test PASSed.


---



[GitHub] spark issue #20584: [SPARK-23390][SQL] Flaky Test Suite: FileBasedDataSource...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/20584
  
    I created a PR, https://github.com/apache/spark/pull/20590 .


---



[GitHub] spark pull request #20584: [SPARK-23390][SQL] Flaky Test Suite: FileBasedDat...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20584#discussion_r167469495
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala ---
    @@ -102,17 +104,27 @@ class FileBasedDataSourceSuite extends QueryTest with SharedSQLContext {
           def testIgnoreMissingFiles(): Unit = {
             withTempDir { dir =>
               val basePath = dir.getCanonicalPath
    +
               Seq("0").toDF("a").write.format(format).save(new Path(basePath, "first").toString)
               Seq("1").toDF("a").write.format(format).save(new Path(basePath, "second").toString)
    +
               val thirdPath = new Path(basePath, "third")
    +          val fs = thirdPath.getFileSystem(spark.sparkContext.hadoopConfiguration)
               Seq("2").toDF("a").write.format(format).save(thirdPath.toString)
    +          val files = fs.listStatus(thirdPath).filter(_.isFile).map(_.getPath)
    +
               val df = spark.read.format(format).load(
                 new Path(basePath, "first").toString,
                 new Path(basePath, "second").toString,
                 new Path(basePath, "third").toString)
     
    -          val fs = thirdPath.getFileSystem(spark.sparkContext.hadoopConfiguration)
    +          // Make sure all data files are deleted and can't be opened.
    +          files.foreach(f => fs.delete(f, false))
               assert(fs.delete(thirdPath, true))
    --- End diff --
    
    Hmm, but it asserts that the delete returned true. I would be surprised if that were not guaranteed.


---



[GitHub] spark issue #20584: [SPARK-23390][SQL] Flaky Test Suite: FileBasedDataSource...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/20584
  
    Great! https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/ becomes green again!!!


---



[GitHub] spark issue #20584: [SPARK-23390][SQL] Flaky Test Suite: FileBasedDataSource...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20584
  
    cc @sameeragarwal @dongjoon-hyun @gatorsmile 


---



[GitHub] spark issue #20584: [SPARK-23390][SQL] Flaky Test Suite: FileBasedDataSource...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20584
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20584: [SPARK-23390][SQL] Flaky Test Suite: FileBasedDataSource...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/20584
  
    For the following case, I'll make a PR for the Spark ORC columnar reader very soon.
    > 2) the orc columnar reader's close method doesn't close the file stream.


---



[GitHub] spark issue #20584: [SPARK-23390][SQL] Flaky Test Suite: FileBasedDataSource...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/20584
  
    BTW, my rough wild guess was that case 2 (reading it but not closing it) happens in the schema inference path.


---



[GitHub] spark issue #20584: [SPARK-23390][SQL] Flaky Test Suite: FileBasedDataSource...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20584
  
    **[Test build #87321 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87321/testReport)** for PR 20584 at commit [`51bb48a`](https://github.com/apache/spark/commit/51bb48a4189aeb0322dd4ccd0f02416a52e963c3).


---



[GitHub] spark issue #20584: [SPARK-23390][SQL] Flaky Test Suite: FileBasedDataSource...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20584
  
    > BTW, my rough wild guess was that case 2 (reading it but not closing it) happens in the schema inference path.
    
    According to the log, the leaked file stream was created when building the ORC columnar reader.


---



[GitHub] spark issue #20584: [SPARK-23390][SQL] Flaky Test Suite: FileBasedDataSource...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20584
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20584: [SPARK-23390][SQL] Flaky Test Suite: FileBasedDataSource...

Posted by sameeragarwal <gi...@git.apache.org>.
Github user sameeragarwal commented on the issue:

    https://github.com/apache/spark/pull/20584
  
    LGTM, seems plausible!


---



[GitHub] spark issue #20584: [SPARK-23390][SQL] Flaky Test Suite: FileBasedDataSource...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20584
  
    > I am suspicious about relationship between afterEach() and addTaskCompletionListener (call close()). But, not sure. Let us try this approach first.
    
    This is one of my speculations. There are 2 possibilities I can think of: 1) the task completion listener is not called before `afterEach`. 2) the ORC columnar reader's `close` method doesn't close the file stream.
    
    For 1), it seems we've fixed it in https://github.com/apache/spark/commit/c5a31d160f47ba51bb9f8a4f3141851034640fc7 . For 2), I'm not sure and may need help from ORC folks.
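
The two possibilities above can be told apart with a toy model. The sketch below is an illustration under stated assumptions, not Spark or ORC code: `TrackedStream`, `ToyTaskContext`, `ToyReader`, and `leaks` are hypothetical names, and the only idea borrowed from the thread is that a task completion listener calls the reader's `close`, while an `afterEach`-style check looks for streams that are still open.

```scala
import java.io.{ByteArrayInputStream, Closeable, InputStream}
import scala.collection.mutable.ArrayBuffer

// Hypothetical toy model; none of these classes exist in Spark or ORC.
object LeakSketch {

  /** In-memory stream that records whether close() was called. */
  final class TrackedStream extends ByteArrayInputStream(Array.emptyByteArray) {
    @volatile var closed = false
    override def close(): Unit = { closed = true; super.close() }
  }

  /** Toy task context: listeners run only when markTaskCompleted() is called. */
  final class ToyTaskContext {
    private val listeners = ArrayBuffer.empty[() => Unit]
    def addTaskCompletionListener(f: () => Unit): Unit = { listeners += f; () }
    def markTaskCompleted(): Unit = listeners.foreach(_.apply())
  }

  /** Toy reader; `closesStream = false` models possibility 2. */
  final class ToyReader(stream: InputStream, closesStream: Boolean) extends Closeable {
    override def close(): Unit = if (closesStream) stream.close()
  }

  /** Returns true if the stream is still open when the afterEach-style check runs. */
  def leaks(listenerRunsFirst: Boolean, closesStream: Boolean): Boolean = {
    val stream = new TrackedStream
    val ctx = new ToyTaskContext
    val reader = new ToyReader(stream, closesStream)
    ctx.addTaskCompletionListener(() => reader.close())
    // `listenerRunsFirst = false` models possibility 1: the check runs before the listener.
    if (listenerRunsFirst) ctx.markTaskCompleted()
    !stream.closed
  }
}
```

In this model a leak is reported in both failure modes, so a leak report alone cannot distinguish them; only the case where the listener runs first and `close` really closes the stream is clean.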


---
