Posted to issues@spark.apache.org by "Dongjoon Hyun (JIRA)" <ji...@apache.org> on 2019/07/24 05:54:00 UTC

[jira] [Updated] (SPARK-26074) AsyncEventQueue.stop hangs when eventQueue is full

     [ https://issues.apache.org/jira/browse/SPARK-26074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-26074:
----------------------------------
    Fix Version/s:     (was: 2.3.1)
                       (was: 2.4.0)

> AsyncEventQueue.stop hangs when eventQueue is full
> --------------------------------------------------
>
>                 Key: SPARK-26074
>                 URL: https://issues.apache.org/jira/browse/SPARK-26074
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.0
>            Reporter: liupengcheng
>            Priority: Major
>
> In our production environment, we found a case where the driver hangs after the application has finished and is about to exit.
> Details:
>  The spark-listener-group-shared thread may exit because of an exception thrown by a plugin listener (in our case, the TaskFailedListener from XGBoost). Once that thread is gone nothing drains the eventQueue, so it can easily become full and stay full. If the application then finishes and SparkContext.stop is called, the
> driver hangs with the following stack:
> {code:java}
> "Driver" #36 prio=5 os_prio=0 tid=0x00007fb08a948800 nid=0x6edc waiting on condition [0x00007faff5017000]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x00000006c718d7d8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:350)
> at org.apache.spark.scheduler.AsyncEventQueue.stop(AsyncEventQueue.scala:117)
> at org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:202)
> at org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:202)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at org.apache.spark.scheduler.LiveListenerBus.stop(LiveListenerBus.scala:202)
> - locked <0x00000006c052f0a8> (a org.apache.spark.scheduler.LiveListenerBus)
> at org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1842)
> at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1294)
> at org.apache.spark.SparkContext.stop(SparkContext.scala:1841){code}
> The cause is that AsyncEventQueue.stop puts a POISON_PILL into the queue. Because the queue is full and its consumer thread has already exited, the put blocks forever.
> Exit message of the spark-listener-group-shared thread:
> {code:java}
> 2018-11-15,14:44:04,782 INFO org.apache.spark.scheduler.AsyncEventQueue: Stopping listener queue shared.
> java.lang.InterruptedException: ExecutorLost during XGBoost Training: TaskKilled (killed intentionally)
> at org.apache.spark.TaskFailedListener.onTaskEnd(SparkParallelismTracker.scala:116)
> at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45)
> at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:35)
> at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:35)
> at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
> at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$super$postToAll(AsyncEventQueue.scala:83)
> at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:83)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
> at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:79)
> at org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:75)
> at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1256)
> at org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:74)
> {code}
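
The blocking behaviour described above can be reproduced outside Spark with a plain java.util.concurrent.LinkedBlockingQueue. The sketch below is only an illustration under stated assumptions: the class PoisonPillDemo, its methods, and the String sentinel are hypothetical names, not Spark code, and the timed offer is shown as one possible way to avoid an unbounded wait, not necessarily how this issue was resolved in Spark.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-alone demo; none of these names come from Spark.
public class PoisonPillDemo {
    // Stand-in for the sentinel event that tells a consumer to stop.
    static final String POISON_PILL = "POISON_PILL";

    // Builds a bounded queue that is already full and has no consumer,
    // mimicking an event queue whose dispatch thread has exited.
    static LinkedBlockingQueue<String> fullQueue(int capacity) {
        LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>(capacity);
        for (int i = 0; i < capacity; i++) {
            queue.add("event-" + i);
        }
        return queue;
    }

    // Calls put() on a helper thread and reports whether it was still
    // blocked after waitMillis (must be > 0). The helper thread is then
    // interrupted so the demo itself does not hang.
    static boolean putBlocksWhenFull(LinkedBlockingQueue<String> queue, long waitMillis) {
        Thread stopper = new Thread(() -> {
            try {
                queue.put(POISON_PILL); // waits until space is available
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        stopper.setDaemon(true);
        stopper.start();
        try {
            stopper.join(waitMillis);
            boolean stillBlocked = stopper.isAlive();
            stopper.interrupt();
            stopper.join();
            return stillBlocked;
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    // Timed offer(): gives up and returns false if no space frees up
    // within timeoutMillis, instead of waiting indefinitely.
    static boolean offerWithTimeout(LinkedBlockingQueue<String> queue, long timeoutMillis) {
        try {
            return queue.offer(POISON_PILL, timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        LinkedBlockingQueue<String> queue = fullQueue(2);
        System.out.println("put still blocked: " + putBlocksWhenFull(queue, 200));
        System.out.println("timed offer accepted: " + offerWithTimeout(queue, 50));
    }
}
```

With a timed offer the caller at least regains control when the queue never drains, and can then decide whether to log, drop the sentinel, or interrupt the consumer. Which of those is appropriate for a listener bus is a design decision this sketch does not address.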


