Posted to user@spark.apache.org by Koert Kuipers <ko...@tresata.com> on 2014/03/04 23:02:50 UTC

trying to understand job cancellation

i have a running job that i cancel while keeping the spark context alive.

at the time of cancellation the active stage is 14.

i see in logs:
2014/03/04 16:43:19 INFO scheduler.DAGScheduler: Asked to cancel job group 3a25db23-2e39-4497-b7ab-b26b2a976f9c
2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 10
2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 14
2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Stage 14 was cancelled
2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Remove TaskSet 14.0 from pool x
2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 13
2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 12
2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 11
2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 15

so far it all looks good. then i get a lot of messages like this:
2014/03/04 16:43:20 INFO scheduler.TaskSchedulerImpl: Ignoring update with state FINISHED from TID 883 because its task set is gone
2014/03/04 16:43:24 INFO scheduler.TaskSchedulerImpl: Ignoring update with state KILLED from TID 888 because its task set is gone

after this, stage 14 hangs around in active stages, without any sign of
progress or cancellation. it just sits there forever, stuck. looking at the
logs of the executors confirms this. the tasks seem to be still running,
but nothing is happening. for example (by the time i look at this it's 4:58,
so this task hasn't done anything in 15 mins):

14/03/04 16:43:16 INFO Executor: Serialized size of result for 943 is 1007
14/03/04 16:43:16 INFO Executor: Sending result for 943 directly to driver
14/03/04 16:43:16 INFO Executor: Finished task ID 943
14/03/04 16:43:16 INFO Executor: Serialized size of result for 945 is 1007
14/03/04 16:43:16 INFO Executor: Sending result for 945 directly to driver
14/03/04 16:43:16 INFO Executor: Finished task ID 945
14/03/04 16:43:19 INFO BlockManager: Removing RDD 66

not sure what to make of this. any suggestions? best, koert
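For reference, the cancellation pattern under discussion can be sketched roughly as below. This is a hypothetical driver program, not code from this thread; note that the interruptOnCancel flag on setJobGroup only exists in newer Spark versions and defaults to false, in which case already-running tasks are merely marked as killed and not interrupted, which may be related to the hang described above.

```scala
import java.util.UUID
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical sketch of cancelling a job group while keeping the
// SparkContext alive, as discussed in this thread.
object CancelSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("cancel-sketch"))

    // Tag all jobs started from this thread with a fresh, random group id.
    val groupId = UUID.randomUUID.toString
    sc.setJobGroup(groupId, "long running job")

    // Kick the job off on another thread so we can cancel it from here.
    val rdd = sc.parallelize(1 to 1000000).map { x => Thread.sleep(1); x }
    new Thread(new Runnable { def run(): Unit = rdd.count() }).start()

    Thread.sleep(5000)
    // Cancel every job in the group. Unless setJobGroup was called with
    // interruptOnCancel = true (newer versions only), running tasks are
    // only marked as killed and keep running until they finish on their own.
    sc.cancelJobGroup(groupId)
    sc.stop()
  }
}
```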

Re: trying to understand job cancellation

Posted by Koert Kuipers <ko...@tresata.com>.
on spark 1.0.0 SNAPSHOT this seems to work. at least so far i have seen no
issues.


On Thu, Mar 6, 2014 at 8:44 AM, Koert Kuipers <ko...@tresata.com> wrote:

> it's a 0.9 snapshot from january running in standalone mode.
>
> have these fixes been merged into 0.9?
>
>
> On Thu, Mar 6, 2014 at 12:45 AM, Matei Zaharia <ma...@gmail.com>wrote:
>
>> Which version of Spark is this in, Koert? There might have been some
>> fixes more recently for it.
>>
>> Matei
>>
>> On Mar 5, 2014, at 5:26 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>> Sorry I meant to say: seems the issue is shared RDDs between a job that
>> got cancelled and a later job.
>>
>> However, even disregarding that, I have the other issue that the active
>> task of the cancelled job hangs around forever, not doing anything....
>> On Mar 5, 2014 7:29 PM, "Koert Kuipers" <ko...@tresata.com> wrote:
>>
>>> yes jobs on RDDs that were not part of the cancelled job work fine.
>>>
>>> so it seems the issue is the cached RDDs that are shared between the
>>> cancelled job and the jobs after that.
>>>
>>>
>>> On Wed, Mar 5, 2014 at 7:15 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>
>>>> well, the new jobs use existing RDDs that were also used in the job
>>>> that got killed.
>>>>
>>>> let me confirm that new jobs that use completely different RDDs do not
>>>> get killed.
>>>>
>>>>
>>>>
>>>> On Wed, Mar 5, 2014 at 7:00 PM, Mayur Rustagi <ma...@gmail.com>wrote:
>>>>
>>>>> Quite unlikely, as job ids are given in an incremental fashion, so your
>>>>> future job ids are not likely to be killed if your group id is not
>>>>> repeated. I guess the issue is something else.
>>>>>
>>>>> Mayur Rustagi
>>>>> Ph: +1 (760) 203 3257
>>>>> http://www.sigmoidanalytics.com
>>>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Mar 5, 2014 at 3:50 PM, Koert Kuipers <ko...@tresata.com>wrote:
>>>>>
>>>>>> i did that. my next job gets a random new job group id (a uuid).
>>>>>> however that doesn't seem to stop the job from getting sucked into the
>>>>>> cancellation.
>>>>>>
>>>>>>
>>>>>> On Wed, Mar 5, 2014 at 6:47 PM, Mayur Rustagi <
>>>>>> mayur.rustagi@gmail.com> wrote:
>>>>>>
>>>>>>> You can randomize job groups as well, to secure yourself against
>>>>>>> termination.
>>>>>>>
>>>>>>> Mayur Rustagi
>>>>>>> Ph: +1 (760) 203 3257
>>>>>>> http://www.sigmoidanalytics.com
>>>>>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Mar 5, 2014 at 3:42 PM, Koert Kuipers <ko...@tresata.com>wrote:
>>>>>>>
>>>>>>>> got it. seems like i better stay away from this feature for now..
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Mar 5, 2014 at 5:55 PM, Mayur Rustagi <
>>>>>>>> mayur.rustagi@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> One issue is that job cancellation is posted on the event loop. So
>>>>>>>>> it's possible that subsequent jobs submitted to the job queue may
>>>>>>>>> beat the job cancellation event & hence the cancellation event may
>>>>>>>>> end up closing them too.
>>>>>>>>> So there's definitely a race condition you are risking, even if not
>>>>>>>>> running into it.
>>>>>>>>>
>>>>>>>>> Mayur Rustagi
>>>>>>>>> Ph: +1 (760) 203 3257
>>>>>>>>> http://www.sigmoidanalytics.com
>>>>>>>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Mar 5, 2014 at 2:40 PM, Koert Kuipers <ko...@tresata.com>wrote:
>>>>>>>>>
>>>>>>>>>> SparkContext.cancelJobGroup
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 5, 2014 at 5:32 PM, Mayur Rustagi <
>>>>>>>>>> mayur.rustagi@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> How do you cancel the job? Which API do you use?
>>>>>>>>>>>
>>>>>>>>>>> Mayur Rustagi
>>>>>>>>>>> Ph: +1 (760) 203 3257
>>>>>>>>>>> http://www.sigmoidanalytics.com
>>>>>>>>>>>  @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Mar 5, 2014 at 2:29 PM, Koert Kuipers <koert@tresata.com
>>>>>>>>>>> > wrote:
>>>>>>>>>>>
>>>>>>>>>>>> i also noticed that jobs (with a new JobGroupId) which i run
>>>>>>>>>>>> after this and which use the same RDDs get very confused. i see
>>>>>>>>>>>> lots of cancelled stages and retries that go on forever.
>>>>>>>>>>>>
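Mayur's race-condition point (quoted above) can be illustrated without Spark. This is a toy model, not Spark's actual scheduler code, and all names in it are invented: if job submission and cancellation are both events on a single queue, a new job that is enqueued before the cancellation event is processed ("beats" it) gets swept up by the cancellation too.

```scala
import scala.collection.mutable

// Toy single-threaded event loop: a job in the same group that beats the
// cancellation event onto the queue is cancelled along with the group.
object CancelRace {
  sealed trait Event
  case class JobSubmitted(jobId: Int, group: String) extends Event
  case class CancelJobGroup(group: String) extends Event

  def run(events: Seq[Event]): Set[Int] = {
    val active = mutable.Map[Int, String]()   // jobId -> group id
    val cancelled = mutable.Set[Int]()
    events.foreach {
      case JobSubmitted(id, g) => active(id) = g
      case CancelJobGroup(g) =>
        // cancel every job currently active in the group
        val victims = active.collect { case (id, `g`) => id }.toList
        victims.foreach { id => active.remove(id); cancelled += id }
    }
    cancelled.toSet
  }

  def main(args: Array[String]): Unit = {
    val events = Seq(
      JobSubmitted(1, "group-a"),  // the job the user wants to cancel
      JobSubmitted(2, "group-a"),  // a new job that beat the cancel event
      CancelJobGroup("group-a"))   // cancellation event processed last
    println(run(events))           // job 2 is cancelled along with job 1
  }
}
```

This also shows why randomizing the group id helps: a job enqueued under a different group id is never matched by the cancellation event.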

Re: trying to understand job cancellation

Posted by Koert Kuipers <ko...@tresata.com>.
it's a 0.9 snapshot from january running in standalone mode.

have these fixes been merged into 0.9?



Re: trying to understand job cancellation

Posted by Matei Zaharia <ma...@gmail.com>.
Which version of Spark is this in, Koert? There might have been some fixes more recently for it.

Matei


Re: trying to understand job cancellation

Posted by Koert Kuipers <ko...@tresata.com>.
Sorry I meant to say: seems the issue is shared RDDs between a job that got
cancelled and a later job.

However, even disregarding that, I have the other issue that the active task
of the cancelled job hangs around forever, not doing anything....

Re: trying to understand job cancellation

Posted by Koert Kuipers <ko...@tresata.com>.
yes jobs on RDDs that were not part of the cancelled job work fine.

so it seems the issue is the cached RDDs that are shared between the
cancelled job and the jobs after that.


>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: trying to understand job cancellation

Posted by Koert Kuipers <ko...@tresata.com>.
well, the new jobs use existing RDDs that were also used in the job that
got killed.

let me confirm that new jobs that use completely different RDDs do not get
killed.

Re: trying to understand job cancellation

Posted by Mayur Rustagi <ma...@gmail.com>.
Quite unlikely, as job ids are assigned incrementally, so your future job
ids are not likely to be killed if your group id is not repeated. I guess
the issue is something else.

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>

Re: trying to understand job cancellation

Posted by Koert Kuipers <ko...@tresata.com>.
i did that. my next job gets a random new job group id (a uuid). however
that doesnt seem to stop the job from getting sucked into the cancellation.

Re: trying to understand job cancellation

Posted by Mayur Rustagi <ma...@gmail.com>.
You can randomize job group ids as well, to protect yourself against
accidental termination.
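
The randomization suggested here can be sketched in a few lines. This is a generic illustration, not Spark code: the helper name `new_job_group_id` is made up for this sketch, and the point is only that a uuid4-based group id makes a collision with a previously cancelled group vanishingly unlikely.

```python
# Hedged sketch of "randomize job groups": mint a fresh group id per
# logical job so that a stale cancellation event for an old group id
# can never match a newly submitted job. new_job_group_id is a
# made-up helper for illustration, not part of the Spark API.
import uuid

def new_job_group_id(prefix: str = "job") -> str:
    # uuid4 carries 122 random bits, so collisions are negligible
    return f"{prefix}-{uuid.uuid4()}"

ids = {new_job_group_id() for _ in range(1000)}
print(len(ids))  # 1000 distinct group ids
```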

Re: trying to understand job cancellation

Posted by Koert Kuipers <ko...@tresata.com>.
got it. seems like i better stay away from this feature for now..

Re: trying to understand job cancellation

Posted by Mayur Rustagi <ma...@gmail.com>.
One issue is that job cancellation is posted on the event loop, so it's
possible that subsequent jobs submitted to the job queue beat the job
cancellation event, in which case the cancellation event may end up
closing them too. So there's definitely a race condition you are risking,
even if you are not running into it here.
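
The race described above can be made concrete with a toy model. This is not Spark's actual DAGScheduler, just a minimal FIFO event loop with illustrative names, showing how a cancellation event already sitting in the queue can sweep up a later-submitted job that reuses the same group id:

```python
# Toy FIFO event loop illustrating the race: a cancel-group event is
# posted, a new job with the same group id enters the queue before the
# cancel is processed, and the cancel kills the new job too. Purely
# illustrative; not Spark's actual scheduler.
def run_event_loop(events):
    running = {}  # job_id -> group_id
    for kind, payload in events:
        if kind == "submit":
            job_id, group_id = payload
            running[job_id] = group_id
        elif kind == "cancel_group":
            doomed = [j for j, g in running.items() if g == payload]
            for j in doomed:
                del running[j]
    return set(running)

# Job 2 reuses group "g1" and beats the cancellation event in the queue:
same_group = [
    ("submit", (1, "g1")),
    ("submit", (2, "g1")),   # submitted after cancel was *requested*...
    ("cancel_group", "g1"),  # ...but processed last: kills 1 AND 2
]
print(run_event_loop(same_group))  # set() -- both jobs gone

# A fresh group id per job shields the new job from the stale cancel:
fresh_group = [
    ("submit", (1, "g1")),
    ("submit", (2, "g2")),
    ("cancel_group", "g1"),
]
print(run_event_loop(fresh_group))  # {2}
```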

Re: trying to understand job cancellation

Posted by Koert Kuipers <ko...@tresata.com>.
SparkContext.cancelJobGroup
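
For context, this call pairs with SparkContext.setJobGroup: work launched from a thread is tagged with a group id, and cancelJobGroup then cancels that group. Since no live SparkContext is available in this sketch, a stand-in object merely records the calls; on a real SparkContext the two method calls look the same.

```python
# Sketch of the setJobGroup / cancelJobGroup pairing discussed in this
# thread. FakeSparkContext is a stand-in that records calls so the
# example is self-contained; a real (Py)Spark SparkContext exposes the
# same two methods.
import uuid

class FakeSparkContext:
    def __init__(self):
        self.calls = []
    def setJobGroup(self, group_id, description):
        self.calls.append(("setJobGroup", group_id))
    def cancelJobGroup(self, group_id):
        self.calls.append(("cancelJobGroup", group_id))

sc = FakeSparkContext()

# Tag all jobs launched from this point with a unique group id...
group_id = str(uuid.uuid4())
sc.setJobGroup(group_id, "cancellable computation")
# ...run actions here (on a real context: sc.parallelize(...).count())...

# ...then cancel only that group, leaving other job groups untouched.
sc.cancelJobGroup(group_id)

print([name for name, _ in sc.calls])  # ['setJobGroup', 'cancelJobGroup']
```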

Re: trying to understand job cancellation

Posted by Mayur Rustagi <ma...@gmail.com>.
How do you cancel the job? Which API do you use?

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>



On Wed, Mar 5, 2014 at 2:29 PM, Koert Kuipers <ko...@tresata.com> wrote:

> i also noticed that jobs (with a new JobGroupId) which i run after this
> and which use the same RDDs get very confused. i see lots of cancelled
> stages and retries that go on forever.
>
>
> On Tue, Mar 4, 2014 at 5:02 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> i have a running job that i cancel while keeping the spark context alive.
>>
>> at the time of cancellation the active stage is 14.
>>
>> i see in logs:
>> 2014/03/04 16:43:19 INFO scheduler.DAGScheduler: Asked to cancel job
>> group 3a25db23-2e39-4497-b7ab-b26b2a976f9c
>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 10
>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 14
>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Stage 14 was
>> cancelled
>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Remove TaskSet 14.0
>> from pool x
>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 13
>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 12
>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 11
>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 15
>>
>> so far it all looks good. then i get a lot of messages like this:
>> 2014/03/04 16:43:20 INFO scheduler.TaskSchedulerImpl: Ignoring update
>> with state FINISHED from TID 883 because its task set is gone
>> 2014/03/04 16:43:24 INFO scheduler.TaskSchedulerImpl: Ignoring update
>> with state KILLED from TID 888 because its task set is gone
>>
>> after this stage 14 hangs around in active stages, without any sign of
>> progress or cancellation. it just sits there forever, stuck. looking at the
>> logs of the executors confirms this. the tasks seem to be still running,
>> but nothing is happening. for example (by the time i look at this it's 4:58
>> so this task hasn't done anything in 15 mins):
>>
>> 14/03/04 16:43:16 INFO Executor: Serialized size of result for 943 is 1007
>> 14/03/04 16:43:16 INFO Executor: Sending result for 943 directly to driver
>> 14/03/04 16:43:16 INFO Executor: Finished task ID 943
>> 14/03/04 16:43:16 INFO Executor: Serialized size of result for 945 is 1007
>> 14/03/04 16:43:16 INFO Executor: Sending result for 945 directly to driver
>> 14/03/04 16:43:16 INFO Executor: Finished task ID 945
>> 14/03/04 16:43:19 INFO BlockManager: Removing RDD 66
>>
>> not sure what to make of this. any suggestions? best, koert
>>
>
>

Re: trying to understand job cancellation

Posted by Koert Kuipers <ko...@tresata.com>.
i also noticed that jobs (with a new JobGroupId) which i run after this and
which use the same RDDs get very confused. i see lots of cancelled stages
and retries that go on forever.


On Tue, Mar 4, 2014 at 5:02 PM, Koert Kuipers <ko...@tresata.com> wrote:

> i have a running job that i cancel while keeping the spark context alive.
>
> at the time of cancellation the active stage is 14.
>
> i see in logs:
> 2014/03/04 16:43:19 INFO scheduler.DAGScheduler: Asked to cancel job group
> 3a25db23-2e39-4497-b7ab-b26b2a976f9c
> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 10
> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 14
> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Stage 14 was
> cancelled
> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Remove TaskSet 14.0
> from pool x
> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 13
> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 12
> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 11
> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 15
>
> so far it all looks good. then i get a lot of messages like this:
> 2014/03/04 16:43:20 INFO scheduler.TaskSchedulerImpl: Ignoring update with
> state FINISHED from TID 883 because its task set is gone
> 2014/03/04 16:43:24 INFO scheduler.TaskSchedulerImpl: Ignoring update with
> state KILLED from TID 888 because its task set is gone
>
> after this stage 14 hangs around in active stages, without any sign of
> progress or cancellation. it just sits there forever, stuck. looking at the
> logs of the executors confirms this. the tasks seem to be still running,
> but nothing is happening. for example (by the time i look at this it's 4:58
> so this task hasn't done anything in 15 mins):
>
> 14/03/04 16:43:16 INFO Executor: Serialized size of result for 943 is 1007
> 14/03/04 16:43:16 INFO Executor: Sending result for 943 directly to driver
> 14/03/04 16:43:16 INFO Executor: Finished task ID 943
> 14/03/04 16:43:16 INFO Executor: Serialized size of result for 945 is 1007
> 14/03/04 16:43:16 INFO Executor: Sending result for 945 directly to driver
> 14/03/04 16:43:16 INFO Executor: Finished task ID 945
> 14/03/04 16:43:19 INFO BlockManager: Removing RDD 66
>
> not sure what to make of this. any suggestions? best, koert
>
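
[editor's note] for readers landing on this thread: the job-group cancellation
pattern being discussed looks roughly like the sketch below. this is NOT the
original poster's code -- the group id, job body, and thread handling are all
illustrative. note also that the `interruptOnCancel` flag on `setJobGroup` was
added in a later Spark release than the one apparently in use here; without it,
cancellation only removes the task sets, and tasks already running on executors
continue until they finish on their own, which matches the "stuck stage"
symptom described above.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CancelSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("cancel-sketch").setMaster("local[2]"))

    // tag every job submitted from this thread with a group id so they can
    // all be cancelled as a unit later. interruptOnCancel = true asks Spark
    // to interrupt the running task threads, not just drop the task sets.
    val groupId = java.util.UUID.randomUUID.toString
    sc.setJobGroup(groupId, "long running job", interruptOnCancel = true)

    // run the job on a separate thread so the main thread can cancel it
    val worker = new Thread(new Runnable {
      def run(): Unit =
        try {
          sc.parallelize(1 to 1000000).map { i => Thread.sleep(1); i }.count()
        } catch {
          case e: Exception => println(s"job ended: ${e.getMessage}")
        }
    })
    worker.start()

    Thread.sleep(2000)
    sc.cancelJobGroup(groupId) // cancels all jobs tagged with groupId
    worker.join()
    sc.stop()
  }
}
```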