Posted to issues@spark.apache.org by "shanyu zhao (Jira)" <ji...@apache.org> on 2019/09/06 00:29:00 UTC

[jira] [Created] (SPARK-29003) Spark history server startup hang due to deadlock

shanyu zhao created SPARK-29003:
-----------------------------------

             Summary: Spark history server startup hang due to deadlock
                 Key: SPARK-29003
                 URL: https://issues.apache.org/jira/browse/SPARK-29003
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.4.4
            Reporter: shanyu zhao


Occasionally, when starting Spark History Server, the service process hangs before binding to its port, leaving Spark History Server unusable. The only recovery is to kill the process and start it again. A simple bash script that repeatedly stops and starts Spark History Server reproduces the problem approximately 10% of the time.

The hang is caused by a deadlock in java.nio.file.FileSystems.getDefault(). This is what I collected with jstack:
{code:java}
"log-replay-executor-0" #17 daemon prio=5 os_prio=0 tid=0x00007fca90028800 nid=0x6e8 in Object.wait() [0x00007fcaa9471000]
   java.lang.Thread.State: RUNNABLE
	at java.nio.file.FileSystems.getDefault(FileSystems.java:176)
	...
	at java.lang.Runtime.loadLibrary0(Runtime.java:870)
	- locked <0x00000000aaac1d40> (a java.lang.Runtime)
	...
	at org.apache.spark.deploy.history.FsHistoryProvider.mergeApplicationListing(FsHistoryProvider.scala:698)

"main" #1 prio=5 os_prio=0 tid=0x00007fcad8016800 nid=0x6d8 waiting for monitor entry [0x00007fcae146c000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at java.lang.Runtime.loadLibrary0(Runtime.java:862)
	- waiting to lock <0x00000000aaac1d40> (a java.lang.Runtime)
	...
	at java.nio.file.FileSystems.getDefault(FileSystems.java:176)
	at java.io.File.toPath(File.java:2234)
	- locked <0x000000008699bb68> (a java.io.File)
	...
	at org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:365)
{code}
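Basically, the "main" thread and the "log-replay-executor-0" thread call java.nio.file.FileSystems.getDefault() at the same time and deadlock: "log-replay-executor-0" holds the java.lang.Runtime monitor while it is inside FileSystems.getDefault(), and "main", which is also inside FileSystems.getDefault(), is blocked waiting for that same monitor.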

This is similar to the reported JDK bug:

[https://bugs.openjdk.java.net/browse/JDK-8037567]

The problem is that during Spark History Server startup, there are two things happening simultaneously that call into java.nio.file.FileSystems.getDefault():

1) Start the Jetty server
2) Start the ApplicationHistoryProvider (which reads files from HDFS)

We should do these two things sequentially instead of in parallel.
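
A minimal sketch of the sequential ordering suggested above, written in plain Java rather than against Spark's code. The class name, method names, and step labels are all hypothetical stand-ins, and each step simply touches FileSystems.getDefault() the way the two real startup steps do. It is meant only to illustrate the idea of finishing the first step on the calling thread before any background thread is started, not to describe the actual fix.

```java
import java.nio.file.FileSystems;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; none of these names come from Spark.
final class SequentialStartup {

    // Stand-in for one startup step: touches the default file system,
    // as both real steps do, then records that it finished.
    static void runStep(String name, List<String> completed) {
        FileSystems.getDefault();
        completed.add(name);
    }

    // Runs the first step to completion on the calling thread, and only
    // then hands the second step to a background thread.
    static List<String> startSequentially() throws Exception {
        List<String> completed = Collections.synchronizedList(new ArrayList<>());

        // Step 1 (assumed to correspond to starting the Jetty server).
        runStep("jetty-server", completed);

        // Step 2 (assumed to correspond to the history provider's work),
        // started only after step 1 has returned.
        ExecutorService replayPool = Executors.newSingleThreadExecutor();
        try {
            replayPool.submit(() -> runStep("history-provider", completed))
                      .get(30, TimeUnit.SECONDS);
        } finally {
            replayPool.shutdown();
        }
        return completed;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(startSequentially());
    }
}
```

Because the first call to FileSystems.getDefault() completes before the background thread exists, the two threads in this sketch can no longer race on its first-time initialization.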

 


