Posted to user@spark.apache.org by Richard Siebeling <rs...@gmail.com> on 2016/10/05 20:55:03 UTC

How to stop a running job

Hi,

how can I stop a long-running job?

We're running Spark in Mesos coarse-grained mode. Suppose a user starts a
long-running job, makes a mistake, changes a transformation, and runs the
job again. In that case I'd like to cancel the first job and then start
the second one. It would be a waste of resources to let the first job run
to completion (it could take several hours...)

How can this be accomplished?
Thanks in advance,
Richard

Re: How to stop a running job

Posted by Richard Siebeling <rs...@gmail.com>.
I think I do mean the job that Mark is talking about, but that's also the
thing that's stopped by the dcos command and (hopefully) by the
dispatcher, isn't it?

It would be really good if SPARK-17064 were resolved, but for now I'll
make do with cancelling the remaining planned tasks of the current job
(that's already a lot better than letting the whole job run to completion).

Thanks anyway for the answers, you helped me a lot,
kind regards,
Richard

Re: How to stop a running job

Posted by Michael Gummelt <mg...@mesosphere.io>.
You're using the proper Spark definition of "job", but I believe Richard
means "driver".

-- 
Michael Gummelt
Software Engineer
Mesosphere

Re: How to stop a running job

Posted by Mark Hamstra <ma...@clearstorydata.com>.
Yes and no.  Something to be aware of is that a Job as such exists in the
DAGScheduler as part of the Application running on the Driver.  When people
talk about stopping or killing a Job, however, they often mean not just
stopping the DAGScheduler from telling the Executors to run more Tasks
associated with the Job, but also stopping any associated Tasks that are
already running on Executors.  That is something Spark doesn't try to do by
default, and changing that behavior has been an open issue for a long
time -- cf. SPARK-17064.
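
For what it's worth, the part Spark does support out of the box is
cancelling a job group from the driver side, which stops further Tasks of
that Job from being scheduled; with interruptOnCancel the Executors are
additionally asked to interrupt running task threads, subject to the caveat
above.  A rough, untested sketch in Scala (the group id and the dummy
workload are just placeholders):

    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration.Duration
    import org.apache.spark.{SparkConf, SparkContext}

    object CancellableJob {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("cancellable-job"))
        implicit val ec: ExecutionContext = ExecutionContext.global

        // Run the user's query on its own thread, tagged with a group id.
        // setJobGroup is per-thread, so call it on the thread that submits
        // the job.  interruptOnCancel = true asks Executors to interrupt the
        // running task threads; whether a Task actually stops depends on what
        // it is doing (Thread.sleep below is interruptible, so here it does)
        // -- that gap is what SPARK-17064 is about.
        val firstAttempt = Future {
          sc.setJobGroup("user-query-1", "first attempt", interruptOnCancel = true)
          sc.parallelize(1 to 100000, numSlices = 100)
            .map { i => Thread.sleep(100); i * 2 }
            .count()
        }

        // Suppose the user notices the mistake a little later and re-submits:
        Thread.sleep(5000)
        sc.cancelJobGroup("user-query-1")
        // ...then run the corrected transformation under a fresh group id.

        // The cancelled attempt fails with a SparkException; wait for it.
        Await.ready(firstAttempt, Duration.Inf)
        sc.stop()
      }
    }

In client mode this can be wired straight into the application (e.g. a
"cancel" button); in cluster mode the cancel still has to happen inside the
driver, so the application would need to expose its own hook for it.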

Re: How to stop a running job

Posted by Michael Gummelt <mg...@mesosphere.io>.
If you're running in client mode, just kill the driver process.  If you're
running in cluster mode, the Spark Dispatcher exposes an HTTP API for
killing submissions.  I don't think it's externally documented, so you may
have to check the code to find the endpoint.  If you run on DC/OS, you can
just run "dcos spark kill <id>".
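
At the time of writing the kill endpoint looks roughly like POST
/v1/submissions/kill/<submissionId> on the same host and port you point
spark-submit at; treat the exact path as an assumption and verify it
against the dispatcher's REST server code in your Spark version.  A rough,
untested sketch in Scala (host, port and submission id are hypothetical):

    import java.net.{HttpURLConnection, URL}
    import scala.io.Source

    // Untested sketch: POST to the dispatcher's kill endpoint.
    // The host, port and /v1/submissions/kill/<id> path are assumptions --
    // check the REST submission server in the Spark source for the exact route.
    object KillSubmission {
      def main(args: Array[String]): Unit = {
        val dispatcher   = "http://dispatcher.example.com:7077" // hypothetical
        val submissionId = args(0) // e.g. "driver-20161005205503-0001"
        val conn = new URL(s"$dispatcher/v1/submissions/kill/$submissionId")
          .openConnection().asInstanceOf[HttpURLConnection]
        conn.setRequestMethod("POST")
        val response = Source.fromInputStream(conn.getInputStream).mkString
        println(response) // response says whether the kill request was accepted
        conn.disconnect()
      }
    }

spark-submit also takes a --kill <submissionId> flag that should go through
the same API, if I remember correctly.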

You can also find which node is running the driver, ssh in, and kill the
process.

-- 
Michael Gummelt
Software Engineer
Mesosphere