Posted to issues@spark.apache.org by "Frank Yin (Jira)" <ji...@apache.org> on 2023/10/17 20:20:00 UTC
[jira] [Updated] (SPARK-45579) Executor hangs indefinitely due to decommissioner errors
[ https://issues.apache.org/jira/browse/SPARK-45579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Frank Yin updated SPARK-45579:
------------------------------
Description:
During Spark executor decommission, fallback storage uploads can fail due to a race condition: the copy throws even though we check that the shuffle file exists immediately before copying it:
```
java.io.FileNotFoundException: No file: /var/data/spark-ab14b716-630d-435e-a92a-1403f6206dd8/blockmgr-7f9ab4d7-1340-4b39-9558-fde994a82090/0b/shuffle_175_66754_0.index
at org.apache.hadoop.fs.s3a.impl.CopyFromLocalOperation.checkSource(CopyFromLocalOperation.java:314)
at org.apache.hadoop.fs.s3a.impl.CopyFromLocalOperation.execute(CopyFromLocalOperation.java:167)
at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$copyFromLocalFile$26(S3AFileSystem.java:3854)
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.invokeTrackingDuration(IOStatisticsBinding.java:547)
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:528)
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:449)
at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2480)
at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2499)
at org.apache.hadoop.fs.s3a.S3AFileSystem.copyFromLocalFile(S3AFileSystem.java:3847)
at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2558)
at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2520)
at org.apache.spark.storage.FallbackStorage.copy(FallbackStorage.scala:67)
at org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.$anonfun$run$12(BlockManagerDecommissioner.scala:146)
at org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.$anonfun$run$12$adapted(BlockManagerDecommissioner.scala:146)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.run(BlockManagerDecommissioner.scala:146)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
```
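To make the race concrete, here is a minimal, self-contained sketch of the check-then-copy (time-of-check/time-of-use) pattern. The names and the java.nio copy call are illustrative stand-ins, not the actual FallbackStorage.copy / S3AFileSystem.copyFromLocalFile code path; in the real path it is Hadoop's S3A connector that re-checks the source and throws the FileNotFoundException shown above.
```
import java.io.File
import java.nio.file.Files

object CopyRaceSketch {
  def main(args: Array[String]): Unit = {
    val src = File.createTempFile("shuffle_175_66754_0", ".index")
    val dst = new File(src.getParentFile, "uploaded.index")

    val existed = src.exists() // time of check: the file is there
    src.delete()               // a concurrent cleanup thread wins the race here

    if (existed) {
      try {
        // time of use: the copy fails even though the check passed;
        // java.nio throws NoSuchFileException where S3A throws FileNotFoundException
        Files.copy(src.toPath, dst.toPath)
      } catch {
        case e: java.io.IOException => println(s"copy failed after check passed: $e")
      }
    }
  }
}
```
No existence check closes this window; the copy has to treat "source disappeared" as an expected outcome rather than a fatal error.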
This blocks the executor from exiting, because the failed block is never marked as migrated, so the decommissioner never considers shuffle migration complete.
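For illustration, a hypothetical sketch of why a persistently failing upload becomes an indefinite hang (this is not the real BlockManagerDecommissioner code): the shutdown path keeps retrying until every block is marked migrated, and a block whose copy always throws is never marked.
```
import scala.collection.mutable

object DecommissionHangSketch {
  private val total = 3
  private val migrated = mutable.Set.empty[Int]

  // Stand-in for FallbackStorage.copy: block 2 always fails, like the
  // FileNotFoundException above.
  private def migrateOne(id: Int): Unit =
    if (id == 2) throw new java.io.FileNotFoundException(s"No file: shuffle_$id.index")

  def main(args: Array[String]): Unit = {
    var attempts = 0
    // The real shutdown path has no attempt cap; one is added here so the demo halts.
    while (migrated.size < total && attempts < 5) {
      (0 until total).filterNot(migrated).foreach { id =>
        try { migrateOne(id); migrated += id }
        catch { case _: Exception => () } // failed block stays un-migrated
      }
      attempts += 1
    }
    // Without the cap, this condition keeps the executor alive forever.
    println(s"migrated ${migrated.size}/$total after $attempts rounds")
  }
}
```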
was:
During Spark executor decommission, fallback storage uploads can fail due to a race condition:
> Executor hangs indefinitely due to decommissioner errors
> --------------------------------------------------------
>
> Key: SPARK-45579
> URL: https://issues.apache.org/jira/browse/SPARK-45579
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.5.0
> Reporter: Frank Yin
> Priority: Major
>