Posted to common-issues@hadoop.apache.org by "James Yu (Jira)" <ji...@apache.org> on 2020/10/16 00:42:00 UTC

[jira] [Commented] (HADOOP-17201) Spark job with s3acommitter stuck at the last stage

    [ https://issues.apache.org/jira/browse/HADOOP-17201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17215091#comment-17215091 ] 

James Yu commented on HADOOP-17201:
-----------------------------------

We hit this "never-ending last task" issue frequently as well, except that we were using the default s3a committer rather than the magic committer. When it happened, the stuck executors mostly had stack traces (see below) similar to the one reported in this ticket. We also confirmed there was no significant data skew when it happened.
{code:java}
s3a-transfer-shared-pool3-t5 (TIMED_WAITING)
java.lang.Thread.sleep(Native Method)
org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:349)
org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:285)
org.apache.hadoop.fs.s3a.S3AFileSystem.deleteObjects(S3AFileSystem.java:1416)
org.apache.hadoop.fs.s3a.S3AFileSystem.removeKeys(S3AFileSystem.java:1676)
org.apache.hadoop.fs.s3a.S3AFileSystem.deleteUnnecessaryFakeDirectories(S3AFileSystem.java:2738)
org.apache.hadoop.fs.s3a.S3AFileSystem.finishedWrite(S3AFileSystem.java:2704)
org.apache.hadoop.fs.s3a.S3AFileSystem.putObjectDirect(S3AFileSystem.java:1548)
org.apache.hadoop.fs.s3a.WriteOperationHelper.lambda$putObject$5(WriteOperationHelper.java:430)
org.apache.hadoop.fs.s3a.WriteOperationHelper$$Lambda$1846/182096913.execute(Unknown Source)
org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:109)
org.apache.hadoop.fs.s3a.Invoker.lambda$retry$3(Invoker.java:265)
org.apache.hadoop.fs.s3a.Invoker$$Lambda$466/1761475153.execute(Unknown Source)
org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:322)
org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:261)
org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:236)
org.apache.hadoop.fs.s3a.WriteOperationHelper.retry(WriteOperationHelper.java:123)
org.apache.hadoop.fs.s3a.WriteOperationHelper.putObject(WriteOperationHelper.java:428)
org.apache.hadoop.fs.s3a.S3ABlockOutputStream.lambda$putObject$0(S3ABlockOutputStream.java:438)
org.apache.hadoop.fs.s3a.S3ABlockOutputStream$$Lambda$1845/1094082085.call(Unknown Source)
...{code}
We have not been able to determine the true root cause of this issue. It appears the problem is not specific to any particular flavor of the s3a committers.
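For what it's worth, the sleeping frames above sit inside S3A's Invoker retry loop around the object delete issued by deleteUnnecessaryFakeDirectories. Below is a minimal sketch (our reading of the docs, not a confirmed fix) of the documented S3A options that bound that retry loop; the values are illustrative only:
{code:java}
import org.apache.hadoop.conf.Configuration;

public class S3ARetryTuning {
    public static Configuration tuned() {
        // Illustrative values only. fs.s3a.retry.limit (documented default: 7)
        // caps how many times the Invoker retries a failed S3 call, and
        // fs.s3a.retry.interval (documented default: 500ms) is the base sleep
        // between attempts, i.e. the Thread.sleep() visible in the stack
        // trace above.
        Configuration conf = new Configuration();
        conf.setInt("fs.s3a.retry.limit", 3);
        conf.set("fs.s3a.retry.interval", "500ms");
        return conf;
    }
}
{code}
Tightening these should at least make a stuck delete fail fast with an error instead of sleeping for long stretches, though it would not explain why the retries never succeed in the first place.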

 

> Spark job with s3acommitter stuck at the last stage
> ---------------------------------------------------
>
>                 Key: HADOOP-17201
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17201
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 3.2.1
>         Environment: we are on Spark 2.4.5 / Hadoop 3.2.1 with the s3a committer.
> spark.hadoop.fs.s3a.committer.magic.enabled: 'true'
> spark.hadoop.fs.s3a.committer.name: magic
>            Reporter: Dyno
>            Priority: Major
>         Attachments: exec-120.log, exec-125.log, exec-25.log, exec-31.log, exec-36.log, exec-44.log, exec-5.log, exec-64.log, exec-7.log
>
>
> Usually our Spark job takes 1 to 2 hours to finish. Occasionally it runs for more than 3 hours, and then we know it's stuck; the stuck executor usually has a stack like this:
> {code:java}
> "Executor task launch worker for task 78620" #265 daemon prio=5 os_prio=0 tid=0x00007f73e0005000 nid=0x12d waiting on condition [0x00007f74cb291000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
> 	at java.lang.Thread.sleep(Native Method)
> 	at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:349)
> 	at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:285)
> 	at org.apache.hadoop.fs.s3a.S3AFileSystem.deleteObjects(S3AFileSystem.java:1457)
> 	at org.apache.hadoop.fs.s3a.S3AFileSystem.removeKeys(S3AFileSystem.java:1717)
> 	at org.apache.hadoop.fs.s3a.S3AFileSystem.deleteUnnecessaryFakeDirectories(S3AFileSystem.java:2785)
> 	at org.apache.hadoop.fs.s3a.S3AFileSystem.finishedWrite(S3AFileSystem.java:2751)
> 	at org.apache.hadoop.fs.s3a.WriteOperationHelper.lambda$finalizeMultipartUpload$1(WriteOperationHelper.java:238)
> 	at org.apache.hadoop.fs.s3a.WriteOperationHelper$$Lambda$210/1059071691.execute(Unknown Source)
> 	at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:109)
> 	at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$3(Invoker.java:265)
> 	at org.apache.hadoop.fs.s3a.Invoker$$Lambda$23/586859139.execute(Unknown Source)
> 	at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:322)
> 	at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:261)
> 	at org.apache.hadoop.fs.s3a.WriteOperationHelper.finalizeMultipartUpload(WriteOperationHelper.java:226)
> 	at org.apache.hadoop.fs.s3a.WriteOperationHelper.completeMPUwithRetries(WriteOperationHelper.java:271)
> 	at org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.complete(S3ABlockOutputStream.java:660)
> 	at org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.access$200(S3ABlockOutputStream.java:521)
> 	at org.apache.hadoop.fs.s3a.S3ABlockOutputStream.close(S3ABlockOutputStream.java:385)
> 	at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
> 	at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101)
> 	at org.apache.parquet.hadoop.util.HadoopPositionOutputStream.close(HadoopPositionOutputStream.java:64)
> 	at org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:685)
> 	at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:122)
> 	at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
> 	at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
> 	at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:57)
> 	at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:74)
> 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:247)
> 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
> 	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
> 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
> 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
> 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:123)
> 	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> 	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 	at java.lang.Thread.run(Thread.java:748)
>    Locked ownable synchronizers:
> 	- <0x00000003a57332e0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
> {code}
> We captured jstack output on the stuck executors in case it's useful.
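> In case it helps anyone reproducing this, here is a minimal sketch (illustrative only, not how the attached dumps were produced) of capturing an equivalent thread dump from inside the JVM when running jstack on the executor host is not convenient:
> {code:java}
> import java.util.Map;
>
> // Illustrative only: prints every live thread's name, state and stack,
> // roughly what `jstack <pid>` shows for a stuck executor.
> public class ThreadDump {
>     public static void main(String[] args) {
>         for (Map.Entry<Thread, StackTraceElement[]> entry
>                 : Thread.getAllStackTraces().entrySet()) {
>             System.out.println("\"" + entry.getKey().getName() + "\" "
>                 + entry.getKey().getState());
>             for (StackTraceElement frame : entry.getValue()) {
>                 System.out.println("\tat " + frame);
>             }
>         }
>     }
> }
> {code}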


