Posted to user@spark.apache.org by Behroz Sikander <ra...@yahoo.com> on 2018/03/05 12:15:52 UTC

Properly stop applications or jobs within the application

Hello,
We are using spark-jobserver to spawn jobs in a Spark cluster. We have
recently been facing issues with zombie jobs in the Spark cluster. This
normally happens when a job is accessing external resources like Kafka/C*
and something goes wrong while consuming them, for example, if a topic
which was being consumed is suddenly deleted in Kafka, or the connection
to the whole Kafka cluster breaks.

Within spark-jobserver, we have the option to delete the context/jobs in
such scenarios.
When we delete the job
<https://github.com/spark-jobserver/spark-jobserver/blob/master/job-server/src/main/scala/spark/jobserver/JobManagerActor.scala#L228>,
internally context.cancelJobGroup(<jobId>) is used.
When we delete the context
<https://github.com/spark-jobserver/spark-jobserver/blob/master/job-server/src/main/scala/spark/jobserver/JobManagerActor.scala#L148>,
internally context.stop(true,true) is executed.
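
For reference, both calls boil down to the plain SparkContext API. A minimal
sketch of what the two delete paths amount to (the group id below is
illustrative; jobserver uses the job id it assigned):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("cancel-sketch"))

    // Jobs submitted from this thread are tagged with a group id so
    // they can be cancelled later.
    sc.setJobGroup("job-group-1", "long-running Kafka job")

    // ... submit jobs ...

    // Job delete path: asynchronously cancel all active jobs in the group.
    sc.cancelJobGroup("job-group-1")

    // Context delete path: stop the whole SparkContext.
    sc.stop()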

In both cases, even after we delete the job/context, the application on the
Spark cluster sometimes keeps running, and some jobs are still being
executed within Spark.
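
One detail that might matter here (an assumption on our side, we have not
verified the jobserver code): cancelJobGroup only interrupts the executor
threads if the group was registered with interruptOnCancel = true. Without
it, a task thread blocked inside a Kafka call never notices the
cancellation, which would look exactly like these zombies:

    // Default: cancellation only sets a flag; a thread parked in a
    // blocking Kafka call keeps running.
    sc.setJobGroup("job-group-1", "desc") // interruptOnCancel = false

    // With interruptOnCancel = true, cancelJobGroup additionally calls
    // Thread.interrupt() on the task threads, giving blocked I/O a
    // chance to abort. Spark keeps this off by default because some
    // libraries react badly to interrupts (see HDFS-1208).
    sc.setJobGroup("job-group-1", "desc", interruptOnCancel = true)
    sc.cancelJobGroup("job-group-1")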

Here are the logs of one such scenario. The job context was stopped, but the
application kept on running and became a zombie.

2018-02-28 15:36:50,931 INFO ForkJoinPool-3-worker-13 org.apache.kafka.common.utils.AppInfoParser []: Kafka version : 0.11.0.1-SNAPSHOT
2018-02-28 15:36:50,931 INFO ForkJoinPool-3-worker-13 org.apache.kafka.common.utils.AppInfoParser []: Kafka commitId : de8225b66d494cd
2018-02-28 15:36:51,144 INFO dispatcher-event-loop-5 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint []: Registered executor NettyRpcEndpointRef(null) (10.10.10.15:46224) with ID 1
2018-02-28 15:38:58,254 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -3 could not be established. Broker may not be available.
2018-02-28 15:41:05,485 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -2 could not be established. Broker may not be available.
2018-02-28 15:42:07,074 WARN JobServer-akka.actor.default-dispatcher-3 akka.cluster.ClusterCoreDaemon []: Cluster Node [akka.tcp://JobServer@127.0.0.1:43319] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://JobServer@127.0.0.1:37343, status = Up)]. Node roles [manager]


Later, at some point, we see the following logs. It seems that none of the
Kafka nodes were reachable from the Spark job. The job kept on retrying and
became a zombie.

2018-02-28 15:43:12,717 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -3 could not be established. Broker may not be available.
2018-02-28 15:45:19,949 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -1 could not be established. Broker may not be available.
2018-02-28 15:47:27,180 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -2 could not be established. Broker may not be available.
2018-02-28 15:49:34,412 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -3 could not be established. Broker may not be available.
2018-02-28 15:51:41,644 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -1 could not be established. Broker may not be available.
2018-02-28 15:53:48,877 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -2 could not be established. Broker may not be available.
2018-02-28 15:55:56,109 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -1 could not be established. Broker may not be available.
2018-02-28 15:58:03,340 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -2 could not be established. Broker may not be available.
2018-02-28 16:00:10,572 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -3 could not be established. Broker may not be available.
2018-02-28 16:02:17,804 WARN ForkJoinPool-3-worker-13 org.apache.kafka.clients.NetworkClient []: Connection to node -1 could not be established. Broker may not be available.




Similar to this, we have another scenario with zombie contexts. The logs
are in the gist below:
https://gist.github.com/bsikander/697d85e2352a650437a922752328a90f

In the gist, you can see that the topic was not created, yet the job tried
to use it. Then, when we tried to delete the job, it became a zombie and
kept on logging:
"Block rdd_13011_0 already exists on this machine; not re-adding it"


So, my question is: what is the right way to kill the jobs running within
the context, or the context/application itself, without ending up with
these zombies?


Regards,
Behroz

Re: Properly stop applications or jobs within the application

Posted by Dhaval Modi <dh...@gmail.com>.
@sagar - YARN kill is not a reliable process for Spark Streaming.
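
The in-process alternative for streaming is a graceful stop from the driver.
A minimal sketch, assuming ssc is the running StreamingContext (whether it
helps with the zombie case above is a separate question):

    // Finish processing the batches already received, then stop the
    // streaming machinery and the underlying SparkContext.
    ssc.stop(stopSparkContext = true, stopGracefully = true)

    // Alternatively, set spark.streaming.stopGracefullyOnShutdown=true
    // and a plain SIGTERM triggers the same graceful path from a JVM
    // shutdown hook.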



Regards,
Dhaval Modi
dhavalmodi24@gmail.com

On 8 March 2018 at 17:18, bsikander <be...@gmail.com> wrote:

> I am running in Spark standalone mode. No YARN.
>
> Anyway, yarn application -kill is a manual process. I do not want that. I
> want to properly kill the driver/application programmatically.

Re: Properly stop applications or jobs within the application

Posted by bsikander <be...@gmail.com>.
I am running in Spark standalone mode. No YARN.

Anyway, yarn application -kill is a manual process. I do not want that. I
want to properly kill the driver/application programmatically.
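
For standalone, the closest thing to a programmatic kill that I know of is
the master's REST submission endpoint, and it only covers drivers submitted
in cluster deploy mode (the submission id below is illustrative):

    # Kill a driver on a standalone cluster through the REST submission
    # endpoint (port 6066 by default, cluster deploy mode only).
    spark-submit --master spark://<master-host>:6066 --kill driver-20180308160341-0000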





Re: Properly stop applications or jobs within the application

Posted by sagar grover <sa...@gmail.com>.
I am assuming you are running in YARN cluster mode. Have you tried yarn
application -kill <application_id>?

With regards,
Sagar Grover
Phone - 7022175584

On Thu, Mar 8, 2018 at 4:03 PM, bsikander <be...@gmail.com> wrote:

> I have scenarios for both.
> So, I want to kill both batch and streaming jobs midway, if required.
>
> Usecase:
> Normally, if everything is okay, we don't kill the application, but sometimes
> while accessing external resources (like Kafka) something can go wrong. In
> that case, the application can become useless because it is not doing
> anything useful, so we want to kill it (midway). In such a case, when we
> kill it, sometimes the application becomes a zombie and doesn't get killed
> programmatically (at least, this is what we found). A kill through the Master
> UI or a manual kill -9 is required to clean up the zombies.

Re: Properly stop applications or jobs within the application

Posted by bsikander <be...@gmail.com>.
I have scenarios for both.
So, I want to kill both batch and streaming jobs midway, if required.

Usecase:
Normally, if everything is okay, we don't kill the application, but sometimes
while accessing external resources (like Kafka) something can go wrong. In
that case, the application can become useless because it is not doing
anything useful, so we want to kill it (midway). In such a case, when we
kill it, sometimes the application becomes a zombie and doesn't get killed
programmatically (at least, this is what we found). A kill through the Master
UI or a manual kill -9 is required to clean up the zombies.
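
One pattern we are considering (our own sketch, nothing the list has
confirmed): supervise the context from the driver's main thread and force
the JVM down if the job stops making progress, so no zombie survives:

    import org.apache.spark.streaming.StreamingContext

    // `makingProgress` is a placeholder for an application-specific
    // health check (e.g. time since the last completed batch).
    def supervise(ssc: StreamingContext, makingProgress: () => Boolean): Unit = {
      // awaitTerminationOrTimeout returns true once the context has
      // stopped; false means the timeout elapsed and the job is still up.
      while (!ssc.awaitTerminationOrTimeout(60 * 1000)) {
        if (!makingProgress()) {
          // Try a clean stop first; if threads blocked in Kafka ignore
          // it, exit the JVM so nothing lingers.
          ssc.stop(stopSparkContext = true, stopGracefully = false)
          sys.exit(1)
        }
      }
    }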





Re: Properly stop applications or jobs within the application

Posted by sagar grover <sa...@gmail.com>.
What do you mean by stopping applications?
Do you want to kill a batch application midway, or are you running
streaming jobs that you want to kill?

With regards,
Sagar Grover

On Thu, Mar 8, 2018 at 1:45 PM, bsikander <be...@gmail.com> wrote:

> Any help would be much appreciated. This seems to be a common problem.

Re: Properly stop applications or jobs within the application

Posted by bsikander <be...@gmail.com>.
Any help would be much appreciated. This seems to be a common problem.





Re: Properly stop applications or jobs within the application

Posted by bsikander <be...@gmail.com>.
It seems to be related to this issue from Kafka
https://issues.apache.org/jira/browse/KAFKA-1894
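
If so, the client-side options on the 0.11 client we run are limited; a
hedged sketch of the timeout settings that exist (values are illustrative,
and default.api.timeout.ms only arrived with KIP-266 in the 2.0 clients):

    import java.util.Properties

    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092,broker2:9092")
    // Available on 0.11 clients: bounds a single in-flight request.
    props.put("request.timeout.ms", "30000")
    // Only on 2.0+ clients (KIP-266): bounds consumer calls that could
    // otherwise block forever, e.g. position() / commitSync().
    props.put("default.api.timeout.ms", "60000")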


