Posted to issues@spark.apache.org by "Dongjoon Hyun (Jira)" <ji...@apache.org> on 2019/09/14 04:10:00 UTC

[jira] [Resolved] (SPARK-29003) Spark history server startup hang due to deadlock

     [ https://issues.apache.org/jira/browse/SPARK-29003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-29003.
-----------------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed

Issue resolved by pull request 25705
[https://github.com/apache/spark/pull/25705]

> Spark history server startup hang due to deadlock
> -------------------------------------------------
>
>                 Key: SPARK-29003
>                 URL: https://issues.apache.org/jira/browse/SPARK-29003
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.4, 2.4.4
>            Reporter: shanyu zhao
>            Assignee: shanyu zhao
>            Priority: Major
>             Fix For: 3.0.0
>
>         Attachments: sparkhistory-jstack.log
>
>
> Occasionally, when starting Spark History Server, the service process hangs before binding to the port, leaving Spark History Server unusable. One has to kill the process and start it again. A simple bash script that repeatedly stops and starts Spark History Server reproduces the problem approximately 10% of the time.
> The problem is caused by a deadlock in java.nio.file.FileSystems.getDefault(). This is what I collected with jstack:
> {code:java}
> "log-replay-executor-0" #17 daemon prio=5 os_prio=0 tid=0x00007fca90028800 nid=0x6e8 in Object.wait() [0x00007fcaa9471000]
>     java.lang.Thread.State: RUNNABLE 
>     at java.nio.file.FileSystems.getDefault(FileSystems.java:176) 
>     ... 
>     at java.lang.Runtime.loadLibrary0(Runtime.java:870) - locked <0x00000000aaac1d40> (a java.lang.Runtime) 
>     ... 
>     at org.apache.spark.deploy.history.FsHistoryProvider.mergeApplicationListing(FsHistoryProvider.scala:698)
> "main" #1 prio=5 os_prio=0 tid=0x00007fcad8016800 nid=0x6d8 waiting for monitor entry [0x00007fcae146c000]
>     java.lang.Thread.State: BLOCKED (on object monitor) 
>     at java.lang.Runtime.loadLibrary0(Runtime.java:862) - waiting to lock <0x00000000aaac1d40> (a java.lang.Runtime) 
>     ... 
>     at java.nio.file.FileSystems.getDefault(FileSystems.java:176) 
>     at java.io.File.toPath(File.java:2234) - locked <0x000000008699bb68> (a java.io.File) 
>     ... 
>     at org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:365){code}
> Basically, the "main" thread and the "log-replay-executor-0" thread call java.nio.file.FileSystems.getDefault() simultaneously and deadlock.
> This is similar to the reported JDK bug:
> [https://bugs.openjdk.java.net/browse/JDK-8037567]
> The problem is that during Spark History Server startup, there are two things happening simultaneously that call into java.nio.file.FileSystems.getDefault():
> 1) start the Jetty server
> 2) start ApplicationHistoryProvider (which reads files from HDFS)
> We should do these two things sequentially instead of in parallel.
>  
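The sequential ordering proposed in the description can be illustrated with a minimal Java sketch. It is not the change made in the pull request. The class SequentialStartup and the methods startWebServer() and replayLogs() are hypothetical stand-ins for the Jetty startup and the history provider's replay task. The only real API used for the contended call is java.nio.file.FileSystems.getDefault(). The sketch assumes that completing the first call on the main thread before any worker thread exists is enough to avoid two threads racing through the first-time initialization.

```java
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class SequentialStartup {
    // Hypothetical stand-in for starting the web server: it touches the
    // default file system, as File.toPath() does in the "main" stack trace.
    public static FileSystem startWebServer() {
        return FileSystems.getDefault();
    }

    // Hypothetical stand-in for the log replay task run by the provider.
    public static FileSystem replayLogs() {
        return FileSystems.getDefault();
    }

    public static void main(String[] args) throws Exception {
        // Step 1: run the first startup step to completion on the main thread,
        // so the first-time initialization is not raced by another thread.
        FileSystem fromMain = startWebServer();

        // Step 2: only now create the background executor and submit work.
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "log-replay-executor-0");
            t.setDaemon(true);
            return t;
        });
        try {
            Future<FileSystem> result = pool.submit(SequentialStartup::replayLogs);
            // A bounded wait so that a hang surfaces as a TimeoutException.
            FileSystem fromWorker = result.get(10, TimeUnit.SECONDS);
            System.out.println("same default file system: " + (fromMain == fromWorker));
        } finally {
            pool.shutdown();
        }
    }
}
```

The bounded Future.get is only a convenience for the sketch: if the worker were ever to hang, the program would fail with a timeout instead of blocking forever.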


