Posted to issues@spark.apache.org by "Andrew Rothstein (JIRA)" <ji...@apache.org> on 2015/05/12 23:41:00 UTC

[jira] [Commented] (SPARK-7580) Driver out of memory

    [ https://issues.apache.org/jira/browse/SPARK-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14540817#comment-14540817 ] 

Andrew Rothstein commented on SPARK-7580:
-----------------------------------------

Application application_1431436758838_0155 failed 2 times due to AM Container for appattempt_1431436758838_0155_000002 exited with exitCode: 143 due to: Container [pid=16359,containerID=container_1431436758838_0155_02_000001] is running beyond physical memory limits. Current usage: 3.5 GB of 3.5 GB physical memory used; 5.1 GB of 7.3 GB virtual memory used. Killing container.

From the log.
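
For context: on YARN the ApplicationMaster/driver container is capped at spark.driver.memory plus spark.yarn.driver.memoryOverhead, and the NodeManager kills the container with exit code 143 once physical usage crosses that cap, which is what the diagnostic above shows. A minimal sketch of raising the cap, using the Spark 1.3 property names and purely illustrative values:

    # spark-defaults.conf (or the equivalent --conf flags on spark-submit)
    # Illustrative values: driver heap, plus off-heap headroom in MB for the AM/driver container.
    spark.driver.memory               6g
    spark.yarn.driver.memoryOverhead  1024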

> Driver out of memory
> --------------------
>
>                 Key: SPARK-7580
>                 URL: https://issues.apache.org/jira/browse/SPARK-7580
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.3.0
>         Environment: YARN, HDP 2.1, RedHat 6.4
> 200 x HP DL185
>            Reporter: Andrew Rothstein
>
> My 200-node cluster has an 8k executor capacity. When I submitted a job with 2k executors, 2g per executor, and 4g for the driver, the ApplicationMaster/driver quickly became unresponsive. It was making progress, then threw a couple of exceptions like this:
> 2015-05-12 16:46:41,598 ERROR [Spark Context Cleaner] spark.ContextCleaner: Error cleaning broadcast 4
> java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>   at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>   at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>   at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>   at scala.concurrent.Await$.result(package.scala:107)
>   at org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:137)
>   at org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:227)
>   at org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)
>   at org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:66)
>   at org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:185)
>   at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:147)
>   at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:138)
>   at scala.Option.foreach(Option.scala:236)
>   at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:138)
>   at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:134)
>   at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:134)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1617)
>   at org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:133)
>   at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:65)
> Then the job crashed with OOM.
> 2015-05-12 16:47:53,566 ERROR [sparkDriver-akka.actor.default-dispatcher-4] actor.ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.remote.default-remote-dispatcher-8] shutting down ActorSystem [sparkDriver]
> java.lang.OutOfMemoryError: Java heap space
>   at org.spark_project.protobuf.ByteString.copyFrom(ByteString.java:216)
>   at org.spark_project.protobuf.ByteString.copyFrom(ByteString.java:229)
>   at akka.remote.transport.AkkaPduProtobufCodec$.constructPayload(AkkaPduCodec.scala:145)
>   at akka.remote.transport.AkkaProtocolHandle.write(AkkaProtocolTransport.scala:182)
>   at akka.remote.EndpointWriter.writeSend(Endpoint.scala:760)
>   at akka.remote.EndpointWriter$$anonfun$2.applyOrElse(Endpoint.scala:722)
>   at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>   at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:415)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>   at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> When I reran the job with 3g of memory per executor and 1k executors, it ran to completion more quickly than the 2k-executor run took to crash. I didn't think I was pushing the envelope by using 2k executors and the stock driver heap size. Is this a scale limitation of the driver? Any suggestions beyond increasing the driver's heap size and/or using fewer executors?
> Thanks, Andrew
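
The TimeoutException above is the driver-side ContextCleaner: it garbage-collects unused broadcasts by asking the BlockManagerMaster to remove their blocks from every executor, and with 2k executors an overloaded driver can easily blow through the 30-second wait. One mitigation, independent of giving the driver more heap, is to release large broadcasts explicitly and non-blockingly once the job that uses them has finished, so less is left for the cleaner. A minimal Scala sketch against the public 1.3 API; the app name, the lookup table, and the sizes are made up for illustration:

    import org.apache.spark.{SparkConf, SparkContext}

    object BroadcastCleanupSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("broadcast-cleanup-sketch"))

        // Hypothetical lookup table that gets shipped to every executor.
        val lookup = sc.broadcast((1 to 100000).map(i => i -> i.toString).toMap)

        // Run the job that actually reads the broadcast (countByValue is the action).
        val counts = sc.parallelize(1 to 1000000)
          .map(i => lookup.value.getOrElse(i % 100000 + 1, "missing"))
          .countByValue()

        // Drop the broadcast's blocks on the executors without blocking the driver,
        // instead of leaving it to the ContextCleaner's blocking 30s removal path.
        lookup.unpersist(blocking = false)

        println(counts.size)
        sc.stop()
      }
    }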
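
The OutOfMemoryError in the akka remoting layer is consistent with driver-side pressure that grows with the executor count: each of the 2k executors holds a control connection to the driver, and bookkeeping such as block-manager registrations and task/map-output status traffic scales with it, so the stock driver heap runs out sooner than it would with 1k executors. The rerun described above (1k executors at 3g each) is one half of the usual remedy; a larger driver container is the other. An illustrative submit line combining both; the 6g driver figure, class name, and jar are placeholders, not measured requirements:

    spark-submit --master yarn-cluster \
      --num-executors 1000 \
      --executor-memory 3g \
      --driver-memory 6g \
      --conf spark.yarn.driver.memoryOverhead=1024 \
      --class com.example.MyJob my-job.jar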


