Posted to users@zeppelin.apache.org by Mark Libucha <ml...@gmail.com> on 2016/10/06 16:06:29 UTC

No active SparkContext black hole

Hello again,

On "longer" running jobs (I'm using yarn-client mode), I sometimes get RPC
timeouts. Seems like Zeppelin is losing connectivity with the Spark
cluster. I can deal with that.

But my notebook has sections stuck in the "Cancel" state, and I can't get
them out. When I re-click on cancel, I see "No active SparkContext" in the
log. But I can't reload a new instance of the notebook, or kill the one
that's stuck, without restarting all of Zeppelin.
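One thing I haven't tried yet is restarting just the Spark interpreter
instead of the whole server. The Zeppelin REST API seems to expose this;
a sketch (the setting id below is made up, the list call returns the real ones):

```
# list interpreter settings to find the Spark setting id
GET /api/interpreter/setting

# restart only that interpreter (id is hypothetical)
PUT /api/interpreter/setting/restart/2BUDQXH2R
```

Not sure whether a restart would clear the stuck "Cancel" state, though.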

Suggestions?

Thanks,

Mark

Re: No active SparkContext black hole

Posted by Mark Libucha <ml...@gmail.com>.
Hi Jeff,

Thanks for your response. This happens during long-running yarn-client
Spark jobs: everything is going fine, with lots of output in the interpreter
log, and then we see a failed send message.

 INFO [2016-10-05 17:31:49,586] ({spark-dynamic-executor-allocation} Logging.scala[logInfo]:58) - Requesting to kill executor(s) 202
 INFO [2016-10-05 17:31:49,625] ({spark-dynamic-executor-allocation} Logging.scala[logInfo]:58) - Removing executor 202 because it has been idle for 60 seconds (new desired total will be 197)
 INFO [2016-10-05 17:31:49,626] ({spark-dynamic-executor-allocation} Logging.scala[logInfo]:58) - Requesting to kill executor(s) 201
 WARN [2016-10-05 17:33:49,630] ({spark-dynamic-executor-allocation} Logging.scala[logWarning]:91) - Error sending message [message = RequestExecutors(196,69600,Map....

Then:

org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
        at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
        at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
        at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
        at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
        at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
        at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
        at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
        at org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:62)
        at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.killExecutors(CoarseGrainedSchedulerBackend.scala:513)
        at org.apache.spark.SparkContext.killExecutors(SparkContext.scala:1472)
        at org.apache.spark.ExecutorAllocationClient$class.killExecutor(ExecutorAllocationClient.scala:61)
        at org.apache.spark.SparkContext.killExecutor(SparkContext.scala:1491)
        at org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$removeExecutor(ExecutorAllocationManager.scala:418)
        at org.apache.spark.ExecutorAllocationManager$$anonfun$org$apache$spark$ExecutorAllocationManager$$schedule$1.apply(ExecutorAllocationManager.scala:284)
        at org.apache.spark.ExecutorAllocationManager$$anonfun$org$apache$spark$ExecutorAllocationManager$$schedule$1.apply(ExecutorAllocationManager.scala:280)
        at scala.collection.mutable.MapLike$$anonfun$retain$2.apply(MapLike.scala:213)
        at scala.collection.mutable.MapLike$$anonfun$retain$2.apply(MapLike.scala:212)
        at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
        at scala.collection.mutable.MapLike$class.retain(MapLike.scala:212)
        at scala.collection.mutable.AbstractMap.retain(Map.scala:91)
        at org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:280)
        at org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:224)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
        at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
        at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
        at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
        at scala.concurrent.Await$.result(package.scala:107)
        at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
        ... 26 more

There is no recovery, even though we can see the Spark job still running on
the Hadoop cluster. Worse, sometimes the Zeppelin notebook can't be cancelled,
and we have to restart Zeppelin to reuse the notebook.
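To narrow this down, we are experimenting with raising the RPC timeouts and
with turning off dynamic allocation, since the failure always starts in the
executor-allocation thread. A sketch of the settings (the values are guesses;
they go in spark-defaults.conf or the Zeppelin Spark interpreter properties):

```properties
# Give slow RPC replies more headroom (the default is 120s, per the exception)
spark.rpc.askTimeout      600s
spark.network.timeout     600s

# Or take the executor-allocation RPC traffic out of the picture entirely
spark.dynamicAllocation.enabled  false
```

No results yet; the timeout bump only delays the failure if the real problem
is lost connectivity.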

Let me know if you'd like more info/logs.

Thanks,

Mark

On Fri, Oct 7, 2016 at 10:13 PM, Jianfeng (Jeff) Zhang <
jzhang@hortonworks.com> wrote:

>
> Could you paste the log ?
>
>
> Best Regard,
> Jeff Zhang
>
>
> From: Mark Libucha <ml...@gmail.com>
> Reply-To: "users@zeppelin.apache.org" <us...@zeppelin.apache.org>
> Date: Friday, October 7, 2016 at 12:11 AM
> To: "users@zeppelin.apache.org" <us...@zeppelin.apache.org>
> Subject: Re: No active SparkContext black hole
>
> Actually, it's stuck in the Running state. Trying to cancel it causes the
> No active SparkContext to appear in the log. Seems like a bug.
>
> On Thu, Oct 6, 2016 at 9:06 AM, Mark Libucha <ml...@gmail.com> wrote:
>
>> Hello again,
>>
>> On "longer" running jobs (I'm using yarn-client mode), I sometimes get
>> RPC timeouts. Seems like Zeppelin is losing connectivity with the Spark
>> cluster. I can deal with that.
>>
>> But my notebook has sections stuck in the "Cancel" state, and I can't get
>> them out. When I re-click on cancel, I see "No active SparkContext" in the
>> log. But I can't reload a new instance of the notebook, or kill the one
>> that's stuck, without restarting all of Zeppelin.
>>
>> Suggestions?
>>
>> Thanks,
>>
>> Mark
>>
>
>

Re: No active SparkContext black hole

Posted by "Jianfeng (Jeff) Zhang" <jz...@hortonworks.com>.
Could you paste the log ?


Best Regard,
Jeff Zhang


From: Mark Libucha <ml...@gmail.com>
Reply-To: "users@zeppelin.apache.org" <us...@zeppelin.apache.org>
Date: Friday, October 7, 2016 at 12:11 AM
To: "users@zeppelin.apache.org" <us...@zeppelin.apache.org>
Subject: Re: No active SparkContext black hole

Actually, it's stuck in the Running state. Trying to cancel it causes the No active SparkContext to appear in the log. Seems like a bug.

On Thu, Oct 6, 2016 at 9:06 AM, Mark Libucha <ml...@gmail.com> wrote:
Hello again,

On "longer" running jobs (I'm using yarn-client mode), I sometimes get RPC timeouts. Seems like Zeppelin is losing connectivity with the Spark cluster. I can deal with that.

But my notebook has sections stuck in the "Cancel" state, and I can't get them out. When I re-click on cancel, I see "No active SparkContext" in the log. But I can't reload a new instance of the notebook, or kill the one that's stuck, without restarting all of Zeppelin.

Suggestions?

Thanks,

Mark


Re: No active SparkContext black hole

Posted by Mark Libucha <ml...@gmail.com>.
Actually, it's stuck in the Running state. Trying to cancel it causes "No
active SparkContext" to appear in the log. Seems like a bug.

On Thu, Oct 6, 2016 at 9:06 AM, Mark Libucha <ml...@gmail.com> wrote:

> Hello again,
>
> On "longer" running jobs (I'm using yarn-client mode), I sometimes get RPC
> timeouts. Seems like Zeppelin is losing connectivity with the Spark
> cluster. I can deal with that.
>
> But my notebook has sections stuck in the "Cancel" state, and I can't get
> them out. When I re-click on cancel, I see "No active SparkContext" in the
> log. But I can't reload a new instance of the notebook, or kill the one
> that's stuck, without restarting all of Zeppelin.
>
> Suggestions?
>
> Thanks,
>
> Mark
>