You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/08/21 20:32:41 UTC

[GitHub] [spark] warrenzhu25 opened a new pull request, #37603: [SPARK-40168][CORE] Handle FileNotFoundException when shuffle file deleted in decommissioner

warrenzhu25 opened a new pull request, #37603:
URL: https://github.com/apache/spark/pull/37603

   ### What changes were proposed in this pull request?
   Explicitly handle FileNotFoundException wrapped inside SparkException, then mark this block as deleted, further avoid retry of this block and stop of current migration thread
   
   ### Why are the changes needed?
   FileNotFoundException wrapped inside SparkException is not handled correctly, causing unnecessary retry and stop of current migration thread
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Added test in BlockManagerDecommissionUnitSuite
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #37603: [SPARK-40168][CORE] Handle FileNotFoundException when shuffle file deleted in decommissioner

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #37603:
URL: https://github.com/apache/spark/pull/37603#issuecomment-1229902118

   Thank you for pinging me, @mridulm .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun closed pull request #37603: [SPARK-40168][CORE] Handle `SparkException` during shuffle block migration

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun closed pull request #37603: [SPARK-40168][CORE] Handle `SparkException` during shuffle block migration
URL: https://github.com/apache/spark/pull/37603


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mridulm commented on pull request #37603: [SPARK-40168][CORE] Handle FileNotFoundException when shuffle file deleted in decommissioner

Posted by GitBox <gi...@apache.org>.
mridulm commented on PR #37603:
URL: https://github.com/apache/spark/pull/37603#issuecomment-1223288301

   If the existing file has not gone via scalafmt (a lot of files are not), we will end up with a lot of unnecessary diffs  unfortunately - you can avoid whole file format if it is not a new file being added.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] warrenzhu25 commented on pull request #37603: [SPARK-40168][CORE] Handle FileNotFoundException when shuffle file deleted in decommissioner

Posted by GitBox <gi...@apache.org>.
warrenzhu25 commented on PR #37603:
URL: https://github.com/apache/spark/pull/37603#issuecomment-1223017121

   > +CC @holdenk
   > 
   > Can you minimize the diff by avoiding reformating the files @warrenzhu25 ?
   
   @mridulm Thanks for looking at this. Do you know how to do auto format? I used  `./dev/scalafmt`, then it becomes current format.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] warrenzhu25 commented on pull request #37603: [SPARK-40168][CORE] Handle FileNotFoundException when shuffle file deleted in decommissioner

Posted by GitBox <gi...@apache.org>.
warrenzhu25 commented on PR #37603:
URL: https://github.com/apache/spark/pull/37603#issuecomment-1223398796

   > If the existing file has not gone via scalafmt (a lot of files are not), we will end up with a lot of unnecessary diffs unfortunately - you can avoid whole file format if it is not a new file being added.
   
   Thanks, updated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] warrenzhu25 commented on pull request #37603: [SPARK-40168][CORE] Handle FileNotFoundException when shuffle file deleted in decommissioner

Posted by GitBox <gi...@apache.org>.
warrenzhu25 commented on PR #37603:
URL: https://github.com/apache/spark/pull/37603#issuecomment-1230879241

   > This PR seems to have two theme. Do you think you can spin off this counting part first, @warrenzhu25 ?
   > 
   > ```scala
   > + private[storage] val numDeletedShuffles = new AtomicInteger(0)
   > ```
   > 
   > ```scala
   > + numDeletedShuffles.incrementAndGet()
   > ```
   > 
   > ```scala
   > + numDeletedShuffles.foreach { s =>
   > +   assert(bmDecomManager.numDeletedShuffles.get() == s)
   > + }
   > ```
   > 
   > ```scala
   > +    validateDecommissionTimestampsOnManager(
   > +       bmDecomManager,
   > +       fail = false,
   > +       numDeletedShuffles = Some(1))
   > ```
   
   The reason why I add `numDeletedShuffles` here is for testing purpose. In current implementation, both deleted and failed shuffles are considered as migrated, no way to distinguish them in test case. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #37603: [SPARK-40168][CORE] Handle FileNotFoundException when shuffle file deleted in decommissioner

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #37603:
URL: https://github.com/apache/spark/pull/37603#issuecomment-1222408361

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mridulm commented on pull request #37603: [SPARK-40168][CORE] Handle FileNotFoundException when shuffle file deleted in decommissioner

Posted by GitBox <gi...@apache.org>.
mridulm commented on PR #37603:
URL: https://github.com/apache/spark/pull/37603#issuecomment-1229894961

   I dont have a lot of context about decomissioner.
   @holdenk or @dongjoon-hyun might be better at reviewing this ...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] warrenzhu25 commented on pull request #37603: [SPARK-40168][CORE] Handle FileNotFoundException when shuffle file deleted in decommissioner

Posted by GitBox <gi...@apache.org>.
warrenzhu25 commented on PR #37603:
URL: https://github.com/apache/spark/pull/37603#issuecomment-1229796821

   @mridulm @holdenk Could you help take a look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37603: [SPARK-40168][CORE] Handle FileNotFoundException when shuffle file deleted in decommissioner

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #37603:
URL: https://github.com/apache/spark/pull/37603#discussion_r956997871


##########
core/src/main/scala/org/apache/spark/storage/BlockManagerDecommissioner.scala:
##########
@@ -302,7 +307,8 @@ private[storage] class BlockManagerDecommissioner(
       stoppedShuffle = true
     }
     // If we found any new shuffles to migrate or otherwise have not migrated everything.
-    newShufflesToMigrate.nonEmpty || migratingShuffles.size > numMigratedShuffles.get()
+    newShufflesToMigrate.nonEmpty ||
+    migratingShuffles.size > numMigratedShuffles.get() + numDeletedShuffles.get()

Review Comment:
   indentation?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a diff in pull request #37603: [SPARK-40168][CORE] Handle FileNotFoundException when shuffle file deleted in decommissioner

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on code in PR #37603:
URL: https://github.com/apache/spark/pull/37603#discussion_r963768575


##########
core/src/main/scala/org/apache/spark/storage/BlockManagerDecommissioner.scala:
##########
@@ -125,21 +126,25 @@ private[storage] class BlockManagerDecommissioner(
                   logDebug(s"Migrated sub-block $blockId")
                 }
               }
+              numMigratedShuffles.incrementAndGet()
               logInfo(s"Migrated $shuffleBlockInfo to $peer")
             } catch {
-              case e: IOException =>
+              case e @ ( _ : IOException | _ : SparkException) =>
                 // If a block got deleted before netty opened the file handle, then trying to
                 // load the blocks now will fail. This is most likely to occur if we start
                 // migrating blocks and then the shuffle TTL cleaner kicks in. However this
                 // could also happen with manually managed shuffles or a GC event on the
                 // driver a no longer referenced RDD with shuffle files.
                 if (bm.migratableResolver.getMigrationBlocks(shuffleBlockInfo).size < blocks.size) {
                   logWarning(s"Skipping block $shuffleBlockInfo, block deleted.")
+                  numDeletedShuffles.incrementAndGet()

Review Comment:
   Why not resue `numMigratedShuffles`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] warrenzhu25 commented on pull request #37603: [SPARK-40168][CORE] Handle FileNotFoundException when shuffle file deleted in decommissioner

Posted by GitBox <gi...@apache.org>.
warrenzhu25 commented on PR #37603:
URL: https://github.com/apache/spark/pull/37603#issuecomment-1240947001

   @Ngone51 Thanks for reviewing this. Will you merge this into master?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mridulm commented on pull request #37603: [SPARK-40168][CORE] Handle FileNotFoundException when shuffle file deleted in decommissioner

Posted by GitBox <gi...@apache.org>.
mridulm commented on PR #37603:
URL: https://github.com/apache/spark/pull/37603#issuecomment-1222821987

   +CC @holdenk 
   
   Can you minimize the diff by avoiding reformating the files @warrenzhu25 ? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] warrenzhu25 commented on a diff in pull request #37603: [SPARK-40168][CORE] Handle FileNotFoundException when shuffle file deleted in decommissioner

Posted by GitBox <gi...@apache.org>.
warrenzhu25 commented on code in PR #37603:
URL: https://github.com/apache/spark/pull/37603#discussion_r964303436


##########
core/src/main/scala/org/apache/spark/storage/BlockManagerDecommissioner.scala:
##########
@@ -125,21 +126,25 @@ private[storage] class BlockManagerDecommissioner(
                   logDebug(s"Migrated sub-block $blockId")
                 }
               }
+              numMigratedShuffles.incrementAndGet()
               logInfo(s"Migrated $shuffleBlockInfo to $peer")
             } catch {
-              case e: IOException =>
+              case e @ ( _ : IOException | _ : SparkException) =>
                 // If a block got deleted before netty opened the file handle, then trying to
                 // load the blocks now will fail. This is most likely to occur if we start
                 // migrating blocks and then the shuffle TTL cleaner kicks in. However this
                 // could also happen with manually managed shuffles or a GC event on the
                 // driver a no longer referenced RDD with shuffle files.
                 if (bm.migratableResolver.getMigrationBlocks(shuffleBlockInfo).size < blocks.size) {
                   logWarning(s"Skipping block $shuffleBlockInfo, block deleted.")
+                  numDeletedShuffles.incrementAndGet()

Review Comment:
   Thanks for the suggestion. Updated



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a diff in pull request #37603: [SPARK-40168][CORE] Handle FileNotFoundException when shuffle file deleted in decommissioner

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on code in PR #37603:
URL: https://github.com/apache/spark/pull/37603#discussion_r963779224


##########
core/src/main/scala/org/apache/spark/storage/BlockManagerDecommissioner.scala:
##########
@@ -125,21 +126,25 @@ private[storage] class BlockManagerDecommissioner(
                   logDebug(s"Migrated sub-block $blockId")
                 }
               }
+              numMigratedShuffles.incrementAndGet()
               logInfo(s"Migrated $shuffleBlockInfo to $peer")
             } catch {
-              case e: IOException =>
+              case e @ ( _ : IOException | _ : SparkException) =>
                 // If a block got deleted before netty opened the file handle, then trying to
                 // load the blocks now will fail. This is most likely to occur if we start
                 // migrating blocks and then the shuffle TTL cleaner kicks in. However this
                 // could also happen with manually managed shuffles or a GC event on the
                 // driver a no longer referenced RDD with shuffle files.
                 if (bm.migratableResolver.getMigrationBlocks(shuffleBlockInfo).size < blocks.size) {
                   logWarning(s"Skipping block $shuffleBlockInfo, block deleted.")
+                  numDeletedShuffles.incrementAndGet()

Review Comment:
   It's still possible to reuse `numMigratedShuffles`? Just reusing the current framework but only adding one more case (i.e., `SparkException` ) seems enough to me.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a diff in pull request #37603: [SPARK-40168][CORE] Handle FileNotFoundException when shuffle file deleted in decommissioner

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on code in PR #37603:
URL: https://github.com/apache/spark/pull/37603#discussion_r963773165


##########
core/src/main/scala/org/apache/spark/storage/BlockManagerDecommissioner.scala:
##########
@@ -125,21 +126,25 @@ private[storage] class BlockManagerDecommissioner(
                   logDebug(s"Migrated sub-block $blockId")
                 }
               }
+              numMigratedShuffles.incrementAndGet()
               logInfo(s"Migrated $shuffleBlockInfo to $peer")
             } catch {
-              case e: IOException =>
+              case e @ ( _ : IOException | _ : SparkException) =>
                 // If a block got deleted before netty opened the file handle, then trying to
                 // load the blocks now will fail. This is most likely to occur if we start
                 // migrating blocks and then the shuffle TTL cleaner kicks in. However this
                 // could also happen with manually managed shuffles or a GC event on the
                 // driver a no longer referenced RDD with shuffle files.
                 if (bm.migratableResolver.getMigrationBlocks(shuffleBlockInfo).size < blocks.size) {
                   logWarning(s"Skipping block $shuffleBlockInfo, block deleted.")
+                  numDeletedShuffles.incrementAndGet()

Review Comment:
   Oh i see. The block should not be deleted before the retryable check.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org