Posted to common-issues@hadoop.apache.org by "Steve Loughran (Jira)" <ji...@apache.org> on 2022/01/10 12:56:00 UTC

[jira] [Commented] (HADOOP-17201) Spark job with s3acommitter stuck at the last stage

    [ https://issues.apache.org/jira/browse/HADOOP-17201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17471971#comment-17471971 ] 

Steve Loughran commented on HADOOP-17201:
-----------------------------------------


If you are confident that all versions of the S3A client in use are "directory marker aware", then you can turn off those needless DELETE calls.
https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/directory_markers.html

This saves IO, reduces throttling, and avoids creating needless tombstone markers on versioned buckets. I would recommend it; the Spark-side setting is sketched below.
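A minimal sketch of what that looks like from Spark, assuming a Hadoop release new enough to support the fs.s3a.directory.marker.retention option described in the linked page; the SparkSession scaffolding around it is purely illustrative:

{{
import org.apache.spark.sql.SparkSession

object MarkerRetentionSketch {
  def main(args: Array[String]): Unit = {
    // Keep directory markers rather than issuing the bulk DELETE calls
    // visible in the stack trace below. Only safe once *every* S3A client
    // working with the bucket is marker aware.
    val spark = SparkSession.builder()
      .appName("s3a-marker-retention")
      .config("spark.hadoop.fs.s3a.directory.marker.retention", "keep")
      .getOrCreate()

    // ... run the job as usual ...
    spark.stop()
  }
}
}}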

And I would also say: don't use the staging committer with S3A as the cluster filesystem, as S3A lacks the rename semantics that the staging algorithm depends on. It is probably less critical given that you are only renaming a single manifest file rather than a directory tree, but I would still worry. Use the magic committer; the relevant Spark settings are sketched below.
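For reference, a sketch of the Spark settings for the magic committer. The PathOutputCommitProtocol and BindingParquetOutputCommitter bindings assume the spark-hadoop-cloud module is on the classpath, and the builder code itself is only illustrative:

{{
import org.apache.spark.sql.SparkSession

object MagicCommitterSketch {
  def main(args: Array[String]): Unit = {
    // Route Spark SQL writes on s3a:// paths through the S3A magic committer.
    val spark = SparkSession.builder()
      .appName("s3a-magic-committer")
      .config("spark.hadoop.fs.s3a.committer.name", "magic")
      .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
      .config("spark.sql.sources.commitProtocolClass",
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
      .config("spark.sql.parquet.output.committer.class",
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
      .getOrCreate()

    // ... write parquet output to an s3a:// destination as usual ...
    spark.stop()
  }
}
}}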

> Spark job with s3acommitter stuck at the last stage
> ---------------------------------------------------
>
>                 Key: HADOOP-17201
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17201
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 3.2.1
>         Environment: we are on Spark 2.4.5 / Hadoop 3.2.1 with the S3A committer.
> spark.hadoop.fs.s3a.committer.magic.enabled: 'true'
> spark.hadoop.fs.s3a.committer.name: magic
>            Reporter: Dyno
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: exec-120.log, exec-125.log, exec-25.log, exec-31.log, exec-36.log, exec-44.log, exec-5.log, exec-64.log, exec-7.log
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Usually our Spark job takes one to two hours to finish. Occasionally it runs for more than three hours, at which point we know it is stuck, and the stuck executors usually have a stack like this:
> {{
> "Executor task launch worker for task 78620" #265 daemon prio=5 os_prio=0 tid=0x00007f73e0005000 nid=0x12d waiting on condition [0x00007f74cb291000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
> 	at java.lang.Thread.sleep(Native Method)
> 	at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:349)
> 	at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:285)
> 	at org.apache.hadoop.fs.s3a.S3AFileSystem.deleteObjects(S3AFileSystem.java:1457)
> 	at org.apache.hadoop.fs.s3a.S3AFileSystem.removeKeys(S3AFileSystem.java:1717)
> 	at org.apache.hadoop.fs.s3a.S3AFileSystem.deleteUnnecessaryFakeDirectories(S3AFileSystem.java:2785)
> 	at org.apache.hadoop.fs.s3a.S3AFileSystem.finishedWrite(S3AFileSystem.java:2751)
> 	at org.apache.hadoop.fs.s3a.WriteOperationHelper.lambda$finalizeMultipartUpload$1(WriteOperationHelper.java:238)
> 	at org.apache.hadoop.fs.s3a.WriteOperationHelper$$Lambda$210/1059071691.execute(Unknown Source)
> 	at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:109)
> 	at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$3(Invoker.java:265)
> 	at org.apache.hadoop.fs.s3a.Invoker$$Lambda$23/586859139.execute(Unknown Source)
> 	at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:322)
> 	at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:261)
> 	at org.apache.hadoop.fs.s3a.WriteOperationHelper.finalizeMultipartUpload(WriteOperationHelper.java:226)
> 	at org.apache.hadoop.fs.s3a.WriteOperationHelper.completeMPUwithRetries(WriteOperationHelper.java:271)
> 	at org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.complete(S3ABlockOutputStream.java:660)
> 	at org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.access$200(S3ABlockOutputStream.java:521)
> 	at org.apache.hadoop.fs.s3a.S3ABlockOutputStream.close(S3ABlockOutputStream.java:385)
> 	at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
> 	at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101)
> 	at org.apache.parquet.hadoop.util.HadoopPositionOutputStream.close(HadoopPositionOutputStream.java:64)
> 	at org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:685)
> 	at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:122)
> 	at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
> 	at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
> 	at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:57)
> 	at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:74)
> 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:247)
> 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
> 	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
> 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
> 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
> 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:123)
> 	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> 	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 	at java.lang.Thread.run(Thread.java:748)
>    Locked ownable synchronizers:
> 	- <0x00000003a57332e0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
> }}
> We captured jstack output on the stuck executors in case it's useful.



