Posted to issues@spark.apache.org by "Zongheng Yang (JIRA)" <ji...@apache.org> on 2014/08/05 20:28:12 UTC
[jira] [Created] (SPARK-2865) Potential deadlock: tasks could hang
forever waiting to fetch a remote block even though most tasks finish
Zongheng Yang created SPARK-2865:
------------------------------------
Summary: Potential deadlock: tasks could hang forever waiting to fetch a remote block even though most tasks finish
Key: SPARK-2865
URL: https://issues.apache.org/jira/browse/SPARK-2865
Project: Spark
Issue Type: Bug
Components: Shuffle, Spark Core
Affects Versions: 1.0.1, 1.1.0
Environment: 16-node EC2 r3.2xlarge cluster
Reporter: Zongheng Yang
Priority: Blocker
In the application I tested, most of the 128 tasks finished, but on nearly every run either 1 or 3 tasks would hang forever with the following stack trace. The UI showed no apparent failures, and the nodes running the stuck tasks showed no apparent memory, CPU, or disk pressure.
{noformat}
"Executor task launch worker-0" daemon prio=10 tid=0x00007f32ec003800 nid=0xaac waiting on condition [0x00007f33f4428000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f3e0d7198e8> (a scala.concurrent.impl.Promise$CompletionLatch)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.network.ConnectionManager.sendMessageReliablySync(ConnectionManager.scala:832)
at org.apache.spark.storage.BlockManagerWorker$.syncGetBlock(BlockManagerWorker.scala:122)
at org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:497)
at org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:495)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:495)
at org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:481)
at org.apache.spark.storage.BlockManager.get(BlockManager.scala:524)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{noformat}
This behavior does *not* appear on 1.0 (tested on the same cluster), but it does appear on both 1.0.1 *and* the master branch as of Aug 4, 2014. I also tried [this patch|https://github.com/apache/spark/pull/1758], and it did not fix the behavior.
When the hang occurred, the driver repeatedly printed the following line:
{noformat}
14/08/04 23:32:42 WARN storage.BlockManagerMasterActor: Removing BlockManager BlockManagerId(7, ip-172-31-6-74.us-west-1.compute.internal, 59408, 0) with no recent heart beats: 67331ms exceeds 45000ms
{noformat}
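For illustration only: the stack trace above shows the task thread in state WAITING (parking) rather than TIMED_WAITING, which is consistent with a wait that has no timeout. If the reply to the block request never arrives, such a wait never returns. The sketch below is a plain-Java analogue of bounding that wait. It is not Spark code; the class name BoundedFetch, the method awaitReply, and the 200 ms timeout are hypothetical names and values chosen for this example.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedFetch {
    /**
     * Waits for a reply, but gives up after timeoutMs instead of parking forever.
     * Returns the reply, or null if no reply arrived in time.
     * (Hypothetical helper for illustration; not part of Spark.)
     */
    public static String awaitReply(CompletableFuture<String> reply, long timeoutMs)
            throws Exception {
        try {
            // get(timeout, unit) puts the thread in TIMED_WAITING, not WAITING
            return reply.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // The caller can now try another block location or fail the task
            return null;
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulates a remote peer whose reply is never delivered
        CompletableFuture<String> lost = new CompletableFuture<>();
        System.out.println("lost reply -> " + awaitReply(lost, 200));

        // Simulates a reply that has already arrived
        CompletableFuture<String> ok = CompletableFuture.completedFuture("block-bytes");
        System.out.println("good reply -> " + awaitReply(ok, 200));
    }
}
```

With a bounded wait, a lost reply surfaces as a visible task failure or retry rather than a silent hang. Whether a timeout is the right fix here depends on why the reply is lost in the first place, which this report does not establish.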