Posted to user@spark.apache.org by Andres Perez <an...@tresata.com> on 2015/07/29 18:38:07 UTC

stopped SparkContext remaining active

Hi everyone. I'm running into an issue with SparkContexts when running on
YARN. The issue can be reproduced with the following steps in the
spark-shell (version 1.4.1):

scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@7b965dee

(Note the pointer address of sc.)

(Then run yarn application -kill <application-id> on the corresponding YARN
application.)

scala> val rdd = sc.parallelize(List(1,2,3))
java.lang.IllegalStateException: Cannot call methods on a stopped
SparkContext
  at
org.apache.spark.SparkContext.org$apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:103)
  at
org.apache.spark.SparkContext.defaultParallelism(SparkContext.scala:1914)
  at
org.apache.spark.SparkContext.parallelize$default$2(SparkContext.scala:695)
  ... 49 elided

(Great, the SparkContext has been stopped as a result of the killed YARN
application, as expected.)

Alternatively:

scala> sc.stop()
15/07/29 12:10:14 INFO SparkContext: SparkContext already stopped.

(OK, so it's confirmed that it has been stopped.)

scala> org.apache.spark.SparkContext.getOrCreate
res3: org.apache.spark.SparkContext = org.apache.spark.SparkContext@7b965dee

(Hm, that's the same SparkContext, note the pointer address.)

The issue here is that the SparkContext.getOrCreate method returns the
active SparkContext if one exists, or creates a new one otherwise. Here it
is returning the original SparkContext, meaning the context we verified was
stopped above is still registered as active. How can we recover from this?
We can't use the current one once it's been stopped (unless we allow
multiple contexts to run via the spark.driver.allowMultipleContexts flag,
but that's a band-aid solution), and we can't create a new one, because the
old one is still marked as active.
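
(For intuition, here is roughly how I read the registry logic -- a
simplified sketch in Scala, not the actual Spark source:)

object SparkContext {
  // (sketch) the registry that getOrCreate consults; it is only ever
  // cleared by clearActiveContext()
  private val activeContext =
    new java.util.concurrent.atomic.AtomicReference[SparkContext](null)

  def getOrCreate(config: SparkConf): SparkContext = {
    // Whatever is registered gets returned, even if that context's
    // internal 'stopped' flag is already true: registration and the
    // stopped flag are tracked independently.
    Option(activeContext.get).getOrElse {
      val sc = new SparkContext(config)
      activeContext.set(sc)
      sc
    }
  }
}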

Digging a little deeper: in the body of SparkContext's stop() method, it
seems we never reach the clearActiveContext() call at the end, which would
have marked the context as inactive. Any future call to stop(), however,
exits early because the stopped variable is already true (hence the
"SparkContext already stopped." log message), so I don't see any other way
to mark the context as not active. Something about how the SparkContext was
stopped after the YARN application was killed is preventing it from
cleaning up properly.
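
(Schematically, the failure mode I'm describing would look like this --
again a simplified sketch, not the real method body:)

def stop(): Unit = {
  // only the first caller gets past this guard; every later call logs
  // "SparkContext already stopped." and returns early
  if (!stopped.compareAndSet(false, true)) {
    logInfo("SparkContext already stopped.")
    return
  }
  // ... web UI, metrics system, etc. are shut down here ...
  _dagScheduler.stop()   // if this throws, nothing below ever runs
  // ... more shutdown steps ...
  clearActiveContext()   // never reached, so the registry keeps
                         // pointing at the dead context
}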

Any ideas about this?

Thanks,

Andres



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/stopped-SparkContext-remaining-active-tp24065.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: stopped SparkContext remaining active

Posted by Ted Yu <yu...@gmail.com>.
Thanks for the information, Andres.

Created the following PR:
https://github.com/apache/spark/pull/7756

BTW, bq. is used on JIRA to quote other people's comments.
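
For example, writing

bq. it seems like we never get to the clearActiveContext() call by the end

in a JIRA comment renders that line as a block quote of the earlier comment.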

Cheers

On Wed, Jul 29, 2015 at 12:46 PM, Andres Perez <an...@tresata.com> wrote:


Re: stopped SparkContext remaining active

Posted by Andres Perez <an...@tresata.com>.
Hi Ted. Taking a look at the logs, I get the feeling there may be an
uncaught exception blowing up the SparkContext.stop method, causing it
never to reach the line where the context gets marked inactive. The line
referenced below in SparkContext (SparkContext.scala:1644) is the call:
_dagScheduler.stop()

15/07/29 15:17:09 INFO Client: Deleting staging directory
.sparkStaging/application_1436825124867_0223
15/07/29 15:17:09 ERROR YarnClientSchedulerBackend: Yarn application has
already exited with state KILLED!
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/metrics/json,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/api,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/static,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/executors/threadDump,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/executors/json,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/executors,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/environment/json,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/environment,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/storage/rdd,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/storage/json,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/storage,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/stages/pool/json,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/stages/pool,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/stages/stage/json,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/stages/stage,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/stages/json,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/stages,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/jobs/job/json,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/jobs/job,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/jobs/json,null}
15/07/29 15:17:09 INFO ContextHandler: stopped
o.s.j.s.ServletContextHandler{/jobs,null}
15/07/29 15:17:09 INFO SparkUI: Stopped Spark web UI at http://<address
removed>
15/07/29 15:17:09 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:
ApplicationMaster has disassociated: <address removed>
15/07/29 15:17:09 INFO DAGScheduler: Stopping DAGScheduler
15/07/29 15:17:09 WARN ReliableDeliverySupervisor: Association with remote
system [akka.tcp://sparkYarnAM@<address removed>] has failed, address is
now gated for [5000] ms. Reason is: [Disassociated].
15/07/29 15:17:09 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:
ApplicationMaster has disassociated: <address removed>
15/07/29 15:17:09 INFO YarnClientSchedulerBackend: Shutting down all
executors
Exception in thread "Yarn application state monitor"
org.apache.spark.SparkException: Error asking standalone scheduler to shut
down executors
        at
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261)
        at
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266)
        at
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158)
        at
org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416)
        at
org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411)
        at org.apache.spark.SparkContext.stop(SparkContext.scala:1644)
        at
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139)
Caused by: java.lang.InterruptedException
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1325)
        at
scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
        at
scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
        at
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
        at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
        at
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
        at scala.concurrent.Await$.result(package.scala:190)
15/07/29 15:17:09 INFO YarnClientSchedulerBackend: Asking each executor to shut down

        at
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
        at
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
        at
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:257)
        ... 6 more
15/07/29 15:17:09 ERROR YarnScheduler: Lost executor 2 on node09.tresata.com:
remote Rpc client disassociated
15/07/29 15:17:09 ERROR YarnScheduler: Lost executor 1 on node06.tresata.com:
remote Rpc client disassociated
15/07/29 15:19:09 WARN HeartbeatReceiver: Removing executor 2 with no recent
heartbeats: 128182 ms exceeds timeout 120000 ms
15/07/29 15:19:09 ERROR YarnScheduler: Lost an executor 2 (already
removed): Executor heartbeat timed out after 128182 ms
15/07/29 15:19:09 WARN HeartbeatReceiver: Removing executor 1 with no
recent heartbeats: 126799 ms exceeds timeout 120000 ms
15/07/29 15:19:09 ERROR YarnScheduler: Lost an executor 1 (already
removed): Executor heartbeat timed out after 126799 ms
15/07/29 15:19:09 INFO YarnClientSchedulerBackend: Requesting to kill
executor(s) 2
15/07/29 15:19:09 WARN YarnClientSchedulerBackend: Executor to kill 2 does
not exist!
15/07/29 15:19:09 WARN Remoting: Tried to associate with unreachable remote
address [akka.tcp://sparkYarnAM@<address removed>]. Address is now gated
for 5000 ms, all messages to this address will be delivered to dead
letters. Reason: Connection refused: /<address removed>
15/07/29 15:19:09 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:
ApplicationMaster has disassociated: <address removed>

It seems like the call,

driverEndpoint.askWithRetry[Boolean](StopExecutors)

inside the stopExecutors() method of CoarseGrainedSchedulerBackend is
throwing an exception, which then bubbles up as a SparkException ("Error
asking standalone scheduler to shut down executors"). This is then not
caught by the SparkContext.stop() method (and thus stop() never reaches
the clearActiveContext() call). Does this sound right?
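
(If that's what's happening, one defensive fix would be to clear the
active-context registration even when a shutdown step throws -- roughly
something like this; just a sketch of the idea, not an actual patch:)

def stop(): Unit = {
  if (!stopped.compareAndSet(false, true)) {
    logInfo("SparkContext already stopped.")
    return
  }
  try {
    // ... existing shutdown steps, including _dagScheduler.stop() ...
  } finally {
    // runs even if a shutdown step above blew up, so a later
    // getOrCreate can construct a fresh context
    clearActiveContext()
  }
}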

Also, does bq == be quiet???

Thanks for the reply!

-Andres

On Wed, Jul 29, 2015 at 1:10 PM, Ted Yu <yu...@gmail.com> wrote:


Re: stopped SparkContext remaining active

Posted by Ted Yu <yu...@gmail.com>.
bq. it seems like we never get to the clearActiveContext() call by the end

Looking at the stop() method, there is only one early return, right after
the stopped.compareAndSet() call.
Is there any clue in the driver log?

Cheers

On Wed, Jul 29, 2015 at 9:38 AM, Andres Perez <an...@tresata.com> wrote:
