Posted to reviews@spark.apache.org by superbobry <gi...@git.apache.org> on 2017/10/09 16:56:42 UTC

[GitHub] spark pull request #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks...

GitHub user superbobry opened a pull request:

    https://github.com/apache/spark/pull/19458

    [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now tolerates temp files

    ## What changes were proposed in this pull request?
    
    Prior to this commit, `getAllBlocks` implicitly assumed that the directories
    managed by the `DiskBlockManager` contain only the files corresponding to
    valid block IDs. In reality, this assumption was violated during shuffle,
    which produces temporary files in the same directory as the resulting
    blocks. As a result, calls to `getAllBlocks` during shuffle were unreliable.
    
    The fix could be made more efficient, but this is probably good enough.
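    The approach can be illustrated with a minimal, self-contained sketch; the `Sketch*` names and the single `rdd_*` parsing rule below are hypothetical stand-ins for Spark's actual `BlockId` hierarchy and parser:

```scala
import scala.util.matching.Regex

// Hypothetical stand-ins for Spark's BlockId hierarchy.
sealed trait SketchBlockId { def name: String }
final case class SketchRDDBlockId(rddId: Int, splitIndex: Int) extends SketchBlockId {
  def name: String = s"rdd_${rddId}_$splitIndex"
}

object SketchBlockId {
  private val RddRe: Regex = "rdd_([0-9]+)_([0-9]+)".r

  // Non-throwing parse: temp files ("temp_shuffle_<uuid>", "<name>.<uuid>", ...)
  // do not match any rule and simply yield None.
  def parse(name: String): Option[SketchBlockId] = name match {
    case RddRe(rdd, split) => Some(SketchRDDBlockId(rdd.toInt, split.toInt))
    case _                 => None
  }
}

// The analogue of getAllBlocks: tolerate unrecognized files by skipping them.
def allBlocks(fileNames: Seq[String]): Seq[SketchBlockId] =
  fileNames.flatMap(SketchBlockId.parse)
```

    Because Scala's regex patterns anchor to the full string, suffixed temp names such as `rdd_7_1.<uuid>` fall through to `None` as well.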
    
    ## How was this patch tested?
    
    `DiskBlockAggregateSuite`

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/criteo-forks/spark block-id-option

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19458.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19458
    
----
commit 9b9b86fed0e5949fd9e7abaefe08c3d9d986feb6
Author: Sergei Lebedev <s....@criteo.com>
Date:   2017-10-09T16:52:00Z

    [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now tolerates temp files
    
    Prior to this commit getAllBlocks implicitly assumed that the directories
    managed by the DiskBlockManager contain only the files corresponding to
    valid block IDs. In reality this assumption was violated during shuffle,
    which produces temporary files in the same directory as the resulting
    blocks. As a result, calls to getAllBlocks during shuffle were unreliable.
    
    The fix could be made more efficient, but this is probably good enough.

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now to...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on the issue:

    https://github.com/apache/spark/pull/19458
  
    There's a UT failure (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83014/testReport/junit/org.apache.spark.storage/BlockIdSuite/test_bad_deserialization/). @superbobry please fix this failure.


---



[GitHub] spark issue #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now to...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on the issue:

    https://github.com/apache/spark/pull/19458
  
    retest this please.


---



[GitHub] spark issue #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now to...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on the issue:

    https://github.com/apache/spark/pull/19458
  
    Instead of filtering out temp blocks, why not adding parsing rule for `TempLocalBlockId` and `TempShuffleBlockId`? That could also solve the problem. Since `DiskBlockManager#getAllFiles` doesn't filter out temp shuffle/local files, is it better to keep the same behavior for `DiskBlockManager#getAllBlocks`?
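    The alternative suggested here can be sketched as follows; the types and the `parseTempAware` helper are hypothetical, with only the `temp_local_`/`temp_shuffle_` prefixes taken from the `TempLocalBlockId`/`TempShuffleBlockId` naming schemes:

```scala
import java.util.UUID

// Hypothetical parse results; only the name prefixes mirror Spark's
// TempLocalBlockId ("temp_local_<uuid>") and TempShuffleBlockId
// ("temp_shuffle_<uuid>").
sealed trait ParsedName
final case class TempLocal(id: UUID) extends ParsedName
final case class TempShuffle(id: UUID) extends ParsedName
final case class Unrecognized(raw: String) extends ParsedName

val TempLocalRe = "temp_local_(.+)".r
val TempShuffleRe = "temp_shuffle_(.+)".r

// Recognize temp-file names as parseable IDs instead of filtering them out.
def parseTempAware(name: String): ParsedName = name match {
  case TempLocalRe(u)   => TempLocal(UUID.fromString(u))
  case TempShuffleRe(u) => TempShuffle(UUID.fromString(u))
  case other            => Unrecognized(other)
}
```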


---



[GitHub] spark pull request #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19458#discussion_r144758165
  
    --- Diff: core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala ---
    @@ -100,7 +100,16 @@ private[spark] class DiskBlockManager(conf: SparkConf, deleteFilesOnStop: Boolea
     
       /** List all the blocks currently stored on disk by the disk manager. */
       def getAllBlocks(): Seq[BlockId] = {
    -    getAllFiles().map(f => BlockId(f.getName))
    +    getAllFiles().flatMap { f =>
    +      val blockId = BlockId.guess(f.getName)
    --- End diff --
    
    I think we don't need to log here.


---



[GitHub] spark issue #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now to...

Posted by superbobry <gi...@git.apache.org>.
Github user superbobry commented on the issue:

    https://github.com/apache/spark/pull/19458
  
    > Instead of filtering out temp blocks, why not adding parsing rule for `TempLocalBlockId` and `TempShuffleBlockId`? 
    
    Because that only fixes the issue for 2 out of 3 possible kinds of temp files. The unhandled case is the files produced by `Utils.tempFileWith`. 
    
    > Since `DiskBlockManager#getAllFiles` doesn't filter out temp shuffle/local files, is it better to keep the same behavior for `DiskBlockManager#getAllBlocks`?
    
    I agree that it makes sense to keep those in sync, so I would prefer to introduce `Block.isValid` and use it in `getAllFiles`.
    
    > Also it would better for upstream code to decide whether to filter out temp files/blocks.
    
    Possibly, but in any case `getAllBlocks` should not throw, since temp blocks are an implementation detail.


---



[GitHub] spark issue #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now to...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19458
  
    **[Test build #83014 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83014/testReport)** for PR 19458 at commit [`febdd7e`](https://github.com/apache/spark/commit/febdd7ea14df896e7460991318daef85dc222dbd).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class UnrecognizedBlockId(name: String)`


---



[GitHub] spark issue #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now to...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19458
  
    **[Test build #83034 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83034/testReport)** for PR 19458 at commit [`ff9a6ae`](https://github.com/apache/spark/commit/ff9a6aed09074846e425eb614182cd6fc5ef74e7).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19458
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19458
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now to...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/19458
  
    ping @superbobry


---



[GitHub] spark issue #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now to...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19458
  
    **[Test build #83014 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83014/testReport)** for PR 19458 at commit [`febdd7e`](https://github.com/apache/spark/commit/febdd7ea14df896e7460991318daef85dc222dbd).


---



[GitHub] spark issue #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now to...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19458
  
    **[Test build #83048 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83048/testReport)** for PR 19458 at commit [`2b61347`](https://github.com/apache/spark/commit/2b6134708e1771546c3b2210b9471b51e20d4cd1).


---



[GitHub] spark issue #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19458
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19458
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83043/
    Test FAILed.


---



[GitHub] spark issue #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now to...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19458
  
    **[Test build #83048 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83048/testReport)** for PR 19458 at commit [`2b61347`](https://github.com/apache/spark/commit/2b6134708e1771546c3b2210b9471b51e20d4cd1).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now to...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19458
  
    retest this please


---



[GitHub] spark issue #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now to...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19458
  
    retest this please


---



[GitHub] spark issue #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19458
  
    Can one of the admins verify this patch?


---



[GitHub] spark pull request #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks...

Posted by superbobry <gi...@git.apache.org>.
Github user superbobry commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19458#discussion_r143586763
  
    --- Diff: core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala ---
    @@ -100,7 +102,9 @@ private[spark] class DiskBlockManager(conf: SparkConf, deleteFilesOnStop: Boolea
     
       /** List all the blocks currently stored on disk by the disk manager. */
       def getAllBlocks(): Seq[BlockId] = {
    -    getAllFiles().map(f => BlockId(f.getName))
    +    // SPARK-22227: the Try guards against temporary files written
    +    // during shuffle which do not correspond to valid block IDs.
    +    getAllFiles().flatMap(f => Try(BlockId(f.getName)).toOption)
    --- End diff --
    
    > This has the effect of swallowing a number of possible exceptions. 
    It does. Right now the only possible exception coming from `BlockId.apply` is `IllegalStateException`, but I agree that using `Try` makes the code fragile. My intention was to make an easy-to-read proof of concept and then adjust the patch according to the feedback.
    
    What do you think about adding a "safe" parsing method to `BlockId`? The method would return an `Option` instead of throwing, and would only be used in `DiskBlockManager` for now. `BlockId.apply` could then be expressed as `parse(s).getOrElse { throw ... }`.
    
    > But is there a more explicit way of excluding temp files? it seems like getAllFiles shouldn't return those either?
    
    I think this is tricky to do without leaking some abstractions, e.g. the ".UUID" naming scheme in `Utils.tempFileWith` and the `Temp*BlockId` naming schemes.
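    The proposed shape can be sketched like this, with a hypothetical `TestId`/`IdParser` pair standing in for the real `BlockId` hierarchy:

```scala
// Hypothetical sketch of the proposed split: a non-throwing `parse`
// plus a throwing entry point expressed on top of it.
class UnrecognizedId(name: String)
    extends Exception(s"Failed to parse $name into a block ID")

final case class TestId(id: String)

object IdParser {
  private val TestRe = "test_(.*)".r

  // The "safe" method: Option instead of throwing.
  def parse(s: String): Option[TestId] = s match {
    case TestRe(id) => Some(TestId(id))
    case _          => None
  }

  // The throwing entry point, expressed via the safe one.
  def apply(s: String): TestId =
    parse(s).getOrElse(throw new UnrecognizedId(s))
}
```

    Callers that can tolerate temp files use `parse`; everyone else keeps the old throwing behavior through `apply`.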


---



[GitHub] spark pull request #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19458#discussion_r144450997
  
    --- Diff: core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala ---
    @@ -100,7 +100,16 @@ private[spark] class DiskBlockManager(conf: SparkConf, deleteFilesOnStop: Boolea
     
       /** List all the blocks currently stored on disk by the disk manager. */
       def getAllBlocks(): Seq[BlockId] = {
    -    getAllFiles().map(f => BlockId(f.getName))
    +    getAllFiles().flatMap { f =>
    +      val blockId = BlockId.guess(f.getName)
    --- End diff --
    
    It does not seem necessary to define a new `guess` method just for this one use. I think we can still use `apply` here and catch/log the exception. In other words, we can simply change `apply()` and use it here; defining a new `guess` method is not really necessary.


---



[GitHub] spark pull request #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks...

Posted by superbobry <gi...@git.apache.org>.
Github user superbobry commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19458#discussion_r146491977
  
    --- Diff: core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala ---
    @@ -100,7 +100,17 @@ private[spark] class DiskBlockManager(conf: SparkConf, deleteFilesOnStop: Boolea
     
       /** List all the blocks currently stored on disk by the disk manager. */
       def getAllBlocks(): Seq[BlockId] = {
    -    getAllFiles().map(f => BlockId(f.getName))
    +    getAllFiles().flatMap { f =>
    +      try {
    +        Some(BlockId(f.getName))
    +      } catch {
    +        case _: UnrecognizedBlockId =>
    +          // This does not handle a special-case of a temporary file
    +          // created by [[SortShuffleWriter]].
    +          log.warn(s"Encountered an unexpected file in a managed directory: $f")
    --- End diff --
    
    I am not convinced we should log either. I've added this line because it was suggested by @jerryshao. 


---



[GitHub] spark issue #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19458
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83014/
    Test FAILed.


---



[GitHub] spark pull request #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19458#discussion_r146413565
  
    --- Diff: core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala ---
    @@ -100,7 +100,17 @@ private[spark] class DiskBlockManager(conf: SparkConf, deleteFilesOnStop: Boolea
     
       /** List all the blocks currently stored on disk by the disk manager. */
       def getAllBlocks(): Seq[BlockId] = {
    -    getAllFiles().map(f => BlockId(f.getName))
    +    getAllFiles().flatMap { f =>
    +      try {
    +        Some(BlockId(f.getName))
    +      } catch {
    +        case _: UnrecognizedBlockId =>
    +          // This does not handle a special-case of a temporary file
    +          // created by [[SortShuffleWriter]].
    +          log.warn(s"Encountered an unexpected file in a managed directory: $f")
    --- End diff --
    
    My question here: is this really something we need to warn users about? What should they do about it? Here we are deriving block IDs from file names, so it is legal to encounter a temporary file at this point.


---



[GitHub] spark issue #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19458
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now to...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19458
  
    **[Test build #83043 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83043/testReport)** for PR 19458 at commit [`2b61347`](https://github.com/apache/spark/commit/2b6134708e1771546c3b2210b9471b51e20d4cd1).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now to...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19458
  
    **[Test build #83034 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83034/testReport)** for PR 19458 at commit [`ff9a6ae`](https://github.com/apache/spark/commit/ff9a6aed09074846e425eb614182cd6fc5ef74e7).


---



[GitHub] spark pull request #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19458#discussion_r146412838
  
    --- Diff: core/src/main/scala/org/apache/spark/storage/BlockId.scala ---
    @@ -100,6 +100,8 @@ private[spark] case class TestBlockId(id: String) extends BlockId {
       override def name: String = "test_" + id
     }
     
    +class UnrecognizedBlockId(name: String) extends Exception
    --- End diff --
    
    nit: extend `SparkException`


---



[GitHub] spark issue #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now to...

Posted by superbobry <gi...@git.apache.org>.
Github user superbobry commented on the issue:

    https://github.com/apache/spark/pull/19458
  
    On second thought, the documentation of `DiskBlockManager.getAllFiles` states that it returns all files in the directories managed by `DiskBlockManager`. Filtering anything would violate this contract, WDYT?


---



[GitHub] spark issue #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now to...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19458
  
    retest this please


---



[GitHub] spark issue #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now to...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19458
  
    **[Test build #83043 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83043/testReport)** for PR 19458 at commit [`2b61347`](https://github.com/apache/spark/commit/2b6134708e1771546c3b2210b9471b51e20d4cd1).


---



[GitHub] spark pull request #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19458#discussion_r144456908
  
    --- Diff: core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala ---
    @@ -100,7 +100,16 @@ private[spark] class DiskBlockManager(conf: SparkConf, deleteFilesOnStop: Boolea
     
       /** List all the blocks currently stored on disk by the disk manager. */
       def getAllBlocks(): Seq[BlockId] = {
    -    getAllFiles().map(f => BlockId(f.getName))
    +    getAllFiles().flatMap { f =>
    +      val blockId = BlockId.guess(f.getName)
    --- End diff --
    
    +1 for using try-catch here.


---



[GitHub] spark issue #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now to...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on the issue:

    https://github.com/apache/spark/pull/19458
  
    Yes, I agree that in any case it should not throw an exception. But in this PR you filtered out temp shuffle/local blocks; do you consider those valid blocks or not?
    
    So I would prefer not to filter out those blocks, and instead add two parsing rules for them. For any other illegal files (those that cannot be parsed), catch and log the exception.
    



---



[GitHub] spark issue #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now to...

Posted by superbobry <gi...@git.apache.org>.
Github user superbobry commented on the issue:

    https://github.com/apache/spark/pull/19458
  
    @jerryshao the error looks a bit unrelated.
    
    ```
    sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: null
    ```
    
    The suite passes locally for me.


---



[GitHub] spark pull request #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks...

Posted by superbobry <gi...@git.apache.org>.
Github user superbobry commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19458#discussion_r146213702
  
    --- Diff: core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala ---
    @@ -100,7 +100,16 @@ private[spark] class DiskBlockManager(conf: SparkConf, deleteFilesOnStop: Boolea
     
       /** List all the blocks currently stored on disk by the disk manager. */
       def getAllBlocks(): Seq[BlockId] = {
    -    getAllFiles().map(f => BlockId(f.getName))
    +    getAllFiles().flatMap { f =>
    +      val blockId = BlockId.guess(f.getName)
    --- End diff --
    
    Do you mean we don't need to log at all?


---



[GitHub] spark issue #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19458
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83048/
    Test PASSed.


---



[GitHub] spark pull request #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/19458


---



[GitHub] spark issue #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now to...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19458
  
    thanks, merging to master/2.2


---



[GitHub] spark pull request #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19458#discussion_r143527010
  
    --- Diff: core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala ---
    @@ -100,7 +102,9 @@ private[spark] class DiskBlockManager(conf: SparkConf, deleteFilesOnStop: Boolea
     
       /** List all the blocks currently stored on disk by the disk manager. */
       def getAllBlocks(): Seq[BlockId] = {
    -    getAllFiles().map(f => BlockId(f.getName))
    +    // SPARK-22227: the Try guards against temporary files written
    +    // during shuffle which do not correspond to valid block IDs.
    +    getAllFiles().flatMap(f => Try(BlockId(f.getName)).toOption)
    --- End diff --
    
    This has the effect of swallowing a number of possible exceptions. I think at least you'd want to unpack this and log the error. But is there a more explicit way of excluding temp files? it seems like getAllFiles shouldn't return those either?
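    Unpacking the `Try` to log failures, as suggested, might look like the following sketch; `parseOrThrow` is a hypothetical stand-in for the throwing `BlockId.apply`:

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical throwing parser standing in for BlockId.apply.
def parseOrThrow(name: String): String =
  if (name.startsWith("rdd_")) name
  else throw new IllegalStateException(s"Unrecognized block name: $name")

// Unpack the Try so failures are logged rather than silently dropped.
def toBlockId(name: String): Option[String] =
  Try(parseOrThrow(name)) match {
    case Success(id) => Some(id)
    case Failure(e) =>
      Console.err.println(s"Unexpected file in managed directory: $name (${e.getMessage})")
      None
  }
```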


---



[GitHub] spark issue #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks now to...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19458
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83034/
    Test FAILed.


---



[GitHub] spark pull request #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks...

Posted by superbobry <gi...@git.apache.org>.
Github user superbobry commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19458#discussion_r144655974
  
    --- Diff: core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala ---
    @@ -100,7 +100,16 @@ private[spark] class DiskBlockManager(conf: SparkConf, deleteFilesOnStop: Boolea
     
       /** List all the blocks currently stored on disk by the disk manager. */
       def getAllBlocks(): Seq[BlockId] = {
    -    getAllFiles().map(f => BlockId(f.getName))
    +    getAllFiles().flatMap { f =>
    +      val blockId = BlockId.guess(f.getName)
    --- End diff --
    
    Will do. Should I log the exception even if the file has been produced by `Utils.tempFileWith`?


---
