Posted to issues@spark.apache.org by "Frank Yin (Jira)" <ji...@apache.org> on 2023/10/17 20:20:00 UTC

[jira] [Updated] (SPARK-45579) Executor hangs indefinitely due to decommissioner errors

     [ https://issues.apache.org/jira/browse/SPARK-45579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frank Yin updated SPARK-45579:
------------------------------
    Description: 
During Spark executor decommission, fallback storage uploads can fail due to a race condition, even though the code checks that the source file exists before copying: 
```
java.io.FileNotFoundException: No file: /var/data/spark-ab14b716-630d-435e-a92a-1403f6206dd8/blockmgr-7f9ab4d7-1340-4b39-9558-fde994a82090/0b/shuffle_175_66754_0.index
	at org.apache.hadoop.fs.s3a.impl.CopyFromLocalOperation.checkSource(CopyFromLocalOperation.java:314)
	at org.apache.hadoop.fs.s3a.impl.CopyFromLocalOperation.execute(CopyFromLocalOperation.java:167)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$copyFromLocalFile$26(S3AFileSystem.java:3854)
	at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.invokeTrackingDuration(IOStatisticsBinding.java:547)
	at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:528)
	at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:449)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2480)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2499)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.copyFromLocalFile(S3AFileSystem.java:3847)
	at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2558)
	at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2520)
	at org.apache.spark.storage.FallbackStorage.copy(FallbackStorage.scala:67)
	at org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.$anonfun$run$12(BlockManagerDecommissioner.scala:146)
	at org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.$anonfun$run$12$adapted(BlockManagerDecommissioner.scala:146)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.run(BlockManagerDecommissioner.scala:146)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
```
This blocks the executor from exiting properly, because the decommissioner never considers the shuffle migration complete. 
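A minimal sketch of the failure mode and one possible mitigation (hypothetical names, not the actual Spark patch): the source file can be deleted between the existence check and `copyFromLocalFile`, so the copy step could treat a vanished file as already-cleaned-up shuffle data rather than a fatal migration error, letting the decommissioner make progress.

```scala
import java.io.{File, FileNotFoundException}

// Hypothetical sketch of the race described above; MigrationSketch,
// copyToFallback, and migrateBlock are illustrative names, not Spark APIs.
object MigrationSketch {
  // Stands in for FallbackStorage.copy: the source can disappear between
  // the caller's existence check and this call, raising the exception
  // seen in the stack trace.
  def copyToFallback(src: File): Unit =
    if (!src.exists()) throw new FileNotFoundException(s"No file: ${src.getPath}")

  // Returns true when the block no longer needs migration: either the copy
  // succeeded, or the file was concurrently deleted and nothing remains to move.
  def migrateBlock(src: File): Boolean =
    try {
      copyToFallback(src)
      true
    } catch {
      case _: FileNotFoundException =>
        // File vanished after the existence check; treat the shuffle data
        // as already cleaned up instead of leaving the migration pending.
        true
    }

  def main(args: Array[String]): Unit = {
    val missing = new File("/nonexistent/shuffle_0_0_0.index")
    assert(migrateBlock(missing)) // completes instead of propagating the error
    println("migration loop can proceed")
  }
}
```

Without such handling, the exception propagates out of the migration runnable and the block stays in the pending set, which is why the decommissioner never reports completion.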

  was:
During Spark executor decommission, the fallback storage uploads can fail due to some race conditions: 



> Executor hangs indefinitely due to decommissioner errors
> --------------------------------------------------------
>
>                 Key: SPARK-45579
>                 URL: https://issues.apache.org/jira/browse/SPARK-45579
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.5.0
>            Reporter: Frank Yin
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
