Posted to user@spark.apache.org by csmith <ch...@zocdoc.com> on 2019/09/10 15:12:30 UTC

Deadlock using Barrier Execution

I'm using barrier execution in my Spark job, but I'm occasionally seeing
deadlocks where the task scheduler is unable to place all the tasks. The
failure is logged, but the job hangs indefinitely. I have 2 executors with 16
cores each, running in standalone mode (I think? I'm on Databricks). The
dataset has 31 partitions.
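
For context, the barrier stage is set up roughly like this. This is a
simplified sketch rather than my exact job; the partition count just mirrors
the setup above:

import org.apache.spark.BarrierTaskContext
import org.apache.spark.sql.SparkSession

object BarrierSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("barrier-sketch").getOrCreate()
    val sc = spark.sparkContext

    // 31 partitions, matching the dataset described above
    val rdd = sc.parallelize(1 to 31000, numSlices = 31)

    // barrier() makes this a barrier stage: all 31 tasks must be launched in
    // the same scheduling round, otherwise the whole round is skipped
    val count = rdd.barrier().mapPartitions { iter =>
      val ctx = BarrierTaskContext.get()
      // ... per-partition work that needs every task running at once ...
      ctx.barrier() // global sync point across all tasks in the stage
      iter
    }.count()

    println(s"processed $count records")
    spark.stop()
  }
}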


One thing I've noticed when this occurs is that the number of "Active Tasks"
exceeds the number of cores on one executor. How is the executor able to
start more tasks than it has cores? spark.executor.cores is not set.
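
If it matters, I could pin the core counts explicitly instead of relying on
the standalone default (an executor claims all cores on its worker when
spark.executor.cores is unset). Something like this, purely as an
illustration, with values mirroring the 2 x 16-core setup:

import org.apache.spark.sql.SparkSession

// Illustrative explicit settings only; both keys are standard Spark configs,
// the values just match the 2 executors x 16 cores described above
val spark = SparkSession.builder()
  .appName("barrier-job")
  .config("spark.executor.cores", "16") // cores claimed per executor
  .config("spark.task.cpus", "1")       // cores required per task (the default)
  .getOrCreate()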


Looking at the source, I came across this issue:
https://issues.apache.org/jira/browse/SPARK-24818, which looks related, but
I'm not sure how to properly address it.
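
From what I can tell (my rough reading, not the actual Spark source), the
check that trips is essentially a require() over each scheduling round, which
matches the error in the log below:

// Rough illustration of the barrier-scheduling requirement, based on the
// error message below; not the real TaskSchedulerImpl code
val totalTasks      = 31 // tasks in the barrier stage
val tasksWithOffers = 30 // tasks the scheduler managed to place this round
require(
  tasksWithOffers == totalTasks,
  s"Skip current round of resource offers for barrier stage because only " +
    s"$tasksWithOffers out of a total number of $totalTasks tasks got resource offers.")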


<http://apache-spark-user-list.1001560.n3.nabble.com/file/t10345/Screen_Shot_2019-09-10_at_11.png> 


Example log

19/09/10 13:39:55 WARN FairSchedulableBuilder: A job was submitted with
scheduler pool 7987057535266794448, which has not been configured. This can
happen when the file that pools are read from isn't set, or when that file
doesn't contain 7987057535266794448. Created 7987057535266794448 with
default configuration (schedulingMode: FIFO, minShare: 0, weight: 1)
19/09/10 13:39:55 INFO FairSchedulableBuilder: Added task set TaskSet_20.0
tasks to pool 7987057535266794448
19/09/10 13:39:55 INFO TaskSetManager: Starting task 1.0 in stage 20.0 (TID
1705, 10.222.231.128, executor 0, partition 1, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 0.0 in stage 20.0 (TID
1706, 10.222.234.66, executor 1, partition 0, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 5.0 in stage 20.0 (TID
1707, 10.222.231.128, executor 0, partition 5, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 2.0 in stage 20.0 (TID
1708, 10.222.234.66, executor 1, partition 2, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 7.0 in stage 20.0 (TID
1709, 10.222.231.128, executor 0, partition 7, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 3.0 in stage 20.0 (TID
1710, 10.222.234.66, executor 1, partition 3, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 8.0 in stage 20.0 (TID
1711, 10.222.231.128, executor 0, partition 8, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 4.0 in stage 20.0 (TID
1712, 10.222.234.66, executor 1, partition 4, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 9.0 in stage 20.0 (TID
1713, 10.222.231.128, executor 0, partition 9, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 6.0 in stage 20.0 (TID
1714, 10.222.234.66, executor 1, partition 6, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 12.0 in stage 20.0 (TID
1715, 10.222.231.128, executor 0, partition 12, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 10.0 in stage 20.0 (TID
1716, 10.222.234.66, executor 1, partition 10, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 13.0 in stage 20.0 (TID
1717, 10.222.231.128, executor 0, partition 13, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 11.0 in stage 20.0 (TID
1718, 10.222.234.66, executor 1, partition 11, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 16.0 in stage 20.0 (TID
1719, 10.222.231.128, executor 0, partition 16, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 14.0 in stage 20.0 (TID
1720, 10.222.234.66, executor 1, partition 14, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 17.0 in stage 20.0 (TID
1721, 10.222.231.128, executor 0, partition 17, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 15.0 in stage 20.0 (TID
1722, 10.222.234.66, executor 1, partition 15, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 18.0 in stage 20.0 (TID
1723, 10.222.231.128, executor 0, partition 18, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 21.0 in stage 20.0 (TID
1724, 10.222.234.66, executor 1, partition 21, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 19.0 in stage 20.0 (TID
1725, 10.222.231.128, executor 0, partition 19, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 22.0 in stage 20.0 (TID
1726, 10.222.234.66, executor 1, partition 22, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 20.0 in stage 20.0 (TID
1727, 10.222.231.128, executor 0, partition 20, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 24.0 in stage 20.0 (TID
1728, 10.222.234.66, executor 1, partition 24, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 23.0 in stage 20.0 (TID
1729, 10.222.231.128, executor 0, partition 23, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 25.0 in stage 20.0 (TID
1730, 10.222.234.66, executor 1, partition 25, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 27.0 in stage 20.0 (TID
1731, 10.222.231.128, executor 0, partition 27, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 26.0 in stage 20.0 (TID
1732, 10.222.234.66, executor 1, partition 26, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 28.0 in stage 20.0 (TID
1733, 10.222.231.128, executor 0, partition 28, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 29.0 in stage 20.0 (TID
1734, 10.222.231.128, executor 0, partition 29, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 ERROR Inbox: Ignoring error
java.lang.IllegalArgumentException: requirement failed: Skip current round
of resource offers for barrier stage 20 because only 30 out of a total
number of 31 tasks got resource offers. The resource offers may have been
blacklisted or cannot fulfill task locality requirements.
	at scala.Predef$.require(Predef.scala:224)
	at
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4.apply(TaskSchedulerImpl.scala:489)
	at
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4.apply(TaskSchedulerImpl.scala:461)
	at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at
org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:461)
	at
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$2.apply(CoarseGrainedSchedulerBackend.scala:253)
	at
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$2.apply(CoarseGrainedSchedulerBackend.scala:245)
	at
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.org$apache$spark$scheduler$cluster$CoarseGrainedSchedulerBackend$$withLock(CoarseGrainedSchedulerBackend.scala:709)
	at
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.org$apache$spark$scheduler$cluster$CoarseGrainedSchedulerBackend$DriverEndpoint$$makeOffers(CoarseGrainedSchedulerBackend.scala:245)
	at
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:140)
	at
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
	at
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:226)
	at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 30.0 in stage 20.0 (TID
1735, 10.222.231.128, executor 0, partition 30, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 ERROR Inbox: Ignoring error
java.lang.IllegalArgumentException: requirement failed: Skip current round
of resource offers for barrier stage 20 because only 1 out of a total number
of 31 tasks got resource offers. The resource offers may have been
blacklisted or cannot fulfill task locality requirements.
	at scala.Predef$.require(Predef.scala:224)
	at
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4.apply(TaskSchedulerImpl.scala:489)
	at
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4.apply(TaskSchedulerImpl.scala:461)
	at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at
org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:461)
	at
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$2.apply(CoarseGrainedSchedulerBackend.scala:253)
	at
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$2.apply(CoarseGrainedSchedulerBackend.scala:245)
	at
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.org$apache$spark$scheduler$cluster$CoarseGrainedSchedulerBackend$$withLock(CoarseGrainedSchedulerBackend.scala:709)
	at
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.org$apache$spark$scheduler$cluster$CoarseGrainedSchedulerBackend$DriverEndpoint$$makeOffers(CoarseGrainedSchedulerBackend.scala:245)
	at
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:140)
	at
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
	at
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:226)
	at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

 



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Deadlock using Barrier Execution

Posted by wuyi <yi...@databricks.com>.
Hi csmith,

Sorry to be late here. We only realized this bug recently. Here's the fix:
https://github.com/apache/spark/pull/28257. I believe we're going to
backport it to 2.4.x.

Best,

Yi Wu



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org