Posted to user@spark.apache.org by Tobias Pfeiffer <tg...@preferred.jp> on 2015/03/11 11:43:37 UTC

"Timed out while stopping the job generator" plus subsequent failures

Hi,

it seems like I am unable to shut down my StreamingContext properly, both
in local[n] and yarn-cluster mode. In addition, (only) in yarn-cluster
mode, subsequent use of a new StreamingContext will raise
an InvalidActorNameException.

I use

  logger.info("stopping StreamingContext")
  staticStreamingContext.stop(stopSparkContext = false,
    stopGracefully = true)
  logger.debug("done")

and have in my output logs

  19:16:47.708 [ForkJoinPool-2-worker-11] INFO  stopping StreamingContext
    [... output from other threads ...]
  19:17:07.729 [ForkJoinPool-2-worker-11] WARN  scheduler.JobGenerator -
Timed out while stopping the job generator (timeout = 20000)
  19:17:07.739 [ForkJoinPool-2-worker-11] DEBUG done

The processing itself is complete, i.e., the batch currently processed at
the time of stop() is finished and no further batches are processed.
However, something keeps the streaming context from stopping properly. In
local[n] mode, this is not actually a problem (other than that I have to wait
20 seconds for shutdown), but in yarn-cluster mode, I get an error

  akka.actor.InvalidActorNameException: actor name [JobGenerator] is not
unique!

when I start a (newly created) StreamingContext. I was wondering:
* what is the issue with stop()?
* what is the difference between local[n] and yarn-cluster mode?

Some possible reasons:
* On my executors, I use a networking library that depends on netty and
doesn't properly shut down the event loop. (That has not been a problem in
the past, though.)
* I have a non-empty state (from using updateStateByKey()) that is
checkpointed to /tmp/spark (in local mode) and hdfs:///tmp/spark (in
yarn-cluster mode); could that be an issue? (In fact, I have not seen this
error in any non-stateful streaming applications before.)

Any help much appreciated!

Thanks
Tobias

Re: "Timed out while stopping the job generator" plus subsequent failures

Posted by Sean Owen <so...@cloudera.com>.
I don't think that's the same issue I was seeing, but you can have a
look at https://issues.apache.org/jira/browse/SPARK-4545 for more
detail on my issue.

On Thu, Mar 12, 2015 at 12:51 AM, Tobias Pfeiffer <tg...@preferred.jp> wrote:
> Sean,
>
> On Wed, Mar 11, 2015 at 7:43 PM, Tobias Pfeiffer <tg...@preferred.jp> wrote:
>>
>> it seems like I am unable to shut down my StreamingContext properly, both
>> in local[n] and yarn-cluster mode. In addition, (only) in yarn-cluster mode,
>> subsequent use of a new StreamingContext will raise an
>> InvalidActorNameException.
>
>
> I was wondering if this is related to your question on spark-dev
>   http://tinyurl.com/q5cd5px
> Did you get any additional insight into this issue?
>
> In my case the processing of the first batch completes, but I don't know
> whether there is anything wrong with the checkpoints. When I look at the
> corresponding checkpoint directory in HDFS, it doesn't seem like all state
> RDDs are persisted there, just a subset. Any ideas?
>
> Thanks
> Tobias
>



Re: "Timed out while stopping the job generator" plus subsequent failures

Posted by Tobias Pfeiffer <tg...@preferred.jp>.
Sean,

On Wed, Mar 11, 2015 at 7:43 PM, Tobias Pfeiffer <tg...@preferred.jp> wrote:
>
> it seems like I am unable to shut down my StreamingContext properly, both
> in local[n] and yarn-cluster mode. In addition, (only) in yarn-cluster
> mode, subsequent use of a new StreamingContext will raise
> an InvalidActorNameException.
>

I was wondering if this is related to your question on spark-dev
  http://tinyurl.com/q5cd5px
Did you get any additional insight into this issue?

In my case the processing of the first batch completes, but I don't know
whether there is anything wrong with the checkpoints. When I look at the
corresponding checkpoint directory in HDFS, it doesn't seem like all state
RDDs are persisted there, just a subset. Any ideas?

Thanks
Tobias

Re: "Timed out while stopping the job generator" plus subsequent failures

Posted by Tobias Pfeiffer <tg...@preferred.jp>.
Hi,

I discovered what caused my issue when running on YARN and was able to work
around it.

On Wed, Mar 11, 2015 at 7:43 PM, Tobias Pfeiffer <tg...@preferred.jp> wrote:

> The processing itself is complete, i.e., the batch currently processed at
> the time of stop() is finished and no further batches are processed.
> However, something keeps the streaming context from stopping properly. In
> local[n] mode, this is not actually a problem (other than I have to wait 20
> seconds for shutdown), but in yarn-cluster mode, I get an error
>
>   akka.actor.InvalidActorNameException: actor name [JobGenerator] is not
> unique!
>

It seems that not all checkpointed RDDs are cleaned up (metadata cleared,
checkpoint directories deleted, etc.?) at the time the StreamingContext is
stopped, but only afterwards. In particular, when I add `Thread.sleep(5000)`
after my streamingContext.stop() call, it works and I can start a different
StreamingContext afterwards.

This is pretty ugly, so does anyone know a method to poll whether it's safe
to continue or whether there are still some RDDs waiting to be cleaned up?
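Absent such an API, what I am considering instead of a fixed sleep is a small
poll-until helper; this is only a sketch, assuming the "cleanup finished"
condition can be expressed as a caller-supplied predicate (e.g. checking that
the checkpoint directory no longer contains pending files):

```scala
// Generic poll-until helper: repeatedly evaluates `condition` until it
// holds or `timeoutMillis` elapses. Returns true if the condition was
// satisfied within the timeout. The streaming-specific part (what exactly
// to check) is left to the caller-supplied predicate.
def awaitCondition(timeoutMillis: Long, intervalMillis: Long = 200)
                  (condition: () => Boolean): Boolean = {
  val deadline = System.currentTimeMillis() + timeoutMillis
  var satisfied = condition()
  while (!satisfied && System.currentTimeMillis() < deadline) {
    Thread.sleep(intervalMillis)
    satisfied = condition()
  }
  satisfied
}
```

One would call it as e.g. `awaitCondition(5000) { () => checkpointDirLooksClean() }`
(where checkpointDirLooksClean() is a hypothetical HDFS check) before creating
the next StreamingContext. Still a workaround, but it returns as soon as the
condition holds instead of always waiting the full interval.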

Thanks
Tobias