You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Weichen Xu (JIRA)" <ji...@apache.org> on 2019/07/23 13:44:00 UTC

[jira] [Updated] (SPARK-28483) Canceling a spark job using barrier mode but tasks still being printing messages

     [ https://issues.apache.org/jira/browse/SPARK-28483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weichen Xu updated SPARK-28483:
-------------------------------
    Description: 
Reproduce code:
{code:java}
import time
from pyspark import BarrierTaskContext

n = 4

def  task(x):
  context = BarrierTaskContext.get()
  this = next(x)
  if (this % 2 == 0):
    time.sleep(10000)
  context.barrier()
  return []

sc.setLogLevel("INFO")
sc.parallelize(list(range(n)), n).barrier().mapPartitions(task).collect(){code}


Run above code in pyspark shell and then print Ctrl + C to exit the job.

Get logging like:

{code}
19/02/05 01:07:42 INFO BarrierTaskContext: Task 3 from Stage 0(Attempt 0) has entered the global sync, current barrier epoch is 0.
19/02/05 01:07:42 INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) has entered the global sync, current barrier epoch is 0.
19/02/05 01:07:47 INFO Executor: Executor is trying to kill task 2.0 in stage 0.0 (TID 2), reason: Stage cancelled
19/02/05 01:07:47 INFO Executor: Executor is trying to kill task 0.0 in stage 0.0 (TID 0), reason: Stage cancelled
19/02/05 01:07:47 INFO Executor: Executor is trying to kill task 1.0 in stage 0.0 (TID 1), reason: Stage cancelled
19/02/05 01:07:47 INFO Executor: Executor is trying to kill task 3.0 in stage 0.0 (TID 3), reason: Stage cancelled
19/02/05 01:07:50 WARN PythonRunner: Incomplete task 0.0 in stage 0 (TID 0) interrupted: Attempting to kill Python Worker
19/02/05 01:07:50 INFO Executor: Executor killed task 0.0 in stage 0.0 (TID 0), reason: Stage cancelled
19/02/05 01:07:50 WARN PythonRunner: Incomplete task 3.3 in stage 0 (TID 3) interrupted: Attempting to kill Python Worker
19/02/05 01:07:50 INFO Executor: Executor killed task 3.0 in stage 0.0 (TID 3), reason: Stage cancelled
19/02/05 01:07:50 WARN PythonRunner: Incomplete task 2.2 in stage 0 (TID 2) interrupted: Attempting to kill Python Worker
19/02/05 01:07:50 INFO Executor: Executor killed task 2.0 in stage 0.0 (TID 2), reason: Stage cancelled
19/02/05 01:07:50 WARN PythonRunner: Incomplete task 1.1 in stage 0 (TID 1) interrupted: Attempting to kill Python Worker
19/02/05 01:07:50 INFO Executor: Executor killed task 1.0 in stage 0.0 (TID 1), reason: Stage cancelled
19/02/05 01:08:42 INFO BarrierTaskContext: Task 3 from Stage 0(Attempt 0) waiting under the global sync since 1549328862443, has been waiting for 60 seconds, current barrier epoch is 0.
19/02/05 01:08:42 INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) waiting under the global sync since 1549328862522, has been waiting for 60 seconds, current barrier epoch is 0.
19/02/05 01:09:42 INFO BarrierTaskContext: Task 3 from Stage 0(Attempt 0) waiting under the global sync since 1549328862443, has been waiting for 120 seconds, current barrier epoch is 0.
19/02/05 01:09:42 INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) waiting under the global sync since 1549328862522, has been waiting for 120 seconds, current barrier epoch is 0.
19/02/05 01:10:42 INFO BarrierTaskContext: Task 3 from Stage 0(Attempt 0) waiting under the global sync since 1549328862443, has been waiting for 180 seconds, current barrier epoch is 0.
19/02/05 01:10:42 INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) waiting under the global sync since 1549328862522, has been waiting for 180 seconds, current barrier epoch is 0.
19/02/05 01:11:42 INFO BarrierTaskContext: Task 3 from Stage 0(Attempt 0) waiting under the global sync since 1549328862443, has been waiting for 240 seconds, current barrier epoch is 0.
19/02/05 01:11:42 INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) waiting under the global sync since 1549328862522, has been waiting for 240 seconds, current barrier epoch is 0.
{code}


  was:
Reproduce code:
{code:java}
import time
from pyspark import BarrierTaskContext

n = 4

def  task(x):
  context = BarrierTaskContext.get()
  this = next(x)
  if (this % 2 == 0):
    time.sleep(10000)
  context.barrier()
  return []

sc.setLogLevel("INFO")
sc.parallelize(list(range(n)), n).barrier().mapPartitions(task).collect(){code}

Get logging like:

```
19/02/05 01:07:42 INFO BarrierTaskContext: Task 3 from Stage 0(Attempt 0) has entered the global sync, current barrier epoch is 0.
19/02/05 01:07:42 INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) has entered the global sync, current barrier epoch is 0.
19/02/05 01:07:47 INFO Executor: Executor is trying to kill task 2.0 in stage 0.0 (TID 2), reason: Stage cancelled
19/02/05 01:07:47 INFO Executor: Executor is trying to kill task 0.0 in stage 0.0 (TID 0), reason: Stage cancelled
19/02/05 01:07:47 INFO Executor: Executor is trying to kill task 1.0 in stage 0.0 (TID 1), reason: Stage cancelled
19/02/05 01:07:47 INFO Executor: Executor is trying to kill task 3.0 in stage 0.0 (TID 3), reason: Stage cancelled
19/02/05 01:07:50 WARN PythonRunner: Incomplete task 0.0 in stage 0 (TID 0) interrupted: Attempting to kill Python Worker
19/02/05 01:07:50 INFO Executor: Executor killed task 0.0 in stage 0.0 (TID 0), reason: Stage cancelled
19/02/05 01:07:50 WARN PythonRunner: Incomplete task 3.3 in stage 0 (TID 3) interrupted: Attempting to kill Python Worker
19/02/05 01:07:50 INFO Executor: Executor killed task 3.0 in stage 0.0 (TID 3), reason: Stage cancelled
19/02/05 01:07:50 WARN PythonRunner: Incomplete task 2.2 in stage 0 (TID 2) interrupted: Attempting to kill Python Worker
19/02/05 01:07:50 INFO Executor: Executor killed task 2.0 in stage 0.0 (TID 2), reason: Stage cancelled
19/02/05 01:07:50 WARN PythonRunner: Incomplete task 1.1 in stage 0 (TID 1) interrupted: Attempting to kill Python Worker
19/02/05 01:07:50 INFO Executor: Executor killed task 1.0 in stage 0.0 (TID 1), reason: Stage cancelled
19/02/05 01:08:42 INFO BarrierTaskContext: Task 3 from Stage 0(Attempt 0) waiting under the global sync since 1549328862443, has been waiting for 60 seconds, current barrier epoch is 0.
19/02/05 01:08:42 INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) waiting under the global sync since 1549328862522, has been waiting for 60 seconds, current barrier epoch is 0.
19/02/05 01:09:42 INFO BarrierTaskContext: Task 3 from Stage 0(Attempt 0) waiting under the global sync since 1549328862443, has been waiting for 120 seconds, current barrier epoch is 0.
19/02/05 01:09:42 INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) waiting under the global sync since 1549328862522, has been waiting for 120 seconds, current barrier epoch is 0.
19/02/05 01:10:42 INFO BarrierTaskContext: Task 3 from Stage 0(Attempt 0) waiting under the global sync since 1549328862443, has been waiting for 180 seconds, current barrier epoch is 0.
19/02/05 01:10:42 INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) waiting under the global sync since 1549328862522, has been waiting for 180 seconds, current barrier epoch is 0.
19/02/05 01:11:42 INFO BarrierTaskContext: Task 3 from Stage 0(Attempt 0) waiting under the global sync since 1549328862443, has been waiting for 240 seconds, current barrier epoch is 0.
19/02/05 01:11:42 INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) waiting under the global sync since 1549328862522, has been waiting for 240 seconds, current barrier epoch is 0.
```


> Canceling a spark job using barrier mode but tasks still being printing messages
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-28483
>                 URL: https://issues.apache.org/jira/browse/SPARK-28483
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.3
>            Reporter: Weichen Xu
>            Priority: Major
>
> Reproduce code:
> {code:java}
> import time
> from pyspark import BarrierTaskContext
> n = 4
> def  task(x):
>   context = BarrierTaskContext.get()
>   this = next(x)
>   if (this % 2 == 0):
>     time.sleep(10000)
>   context.barrier()
>   return []
> sc.setLogLevel("INFO")
> sc.parallelize(list(range(n)), n).barrier().mapPartitions(task).collect(){code}
> Run above code in pyspark shell and then print Ctrl + C to exit the job.
> Get logging like:
> {code}
> 19/02/05 01:07:42 INFO BarrierTaskContext: Task 3 from Stage 0(Attempt 0) has entered the global sync, current barrier epoch is 0.
> 19/02/05 01:07:42 INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) has entered the global sync, current barrier epoch is 0.
> 19/02/05 01:07:47 INFO Executor: Executor is trying to kill task 2.0 in stage 0.0 (TID 2), reason: Stage cancelled
> 19/02/05 01:07:47 INFO Executor: Executor is trying to kill task 0.0 in stage 0.0 (TID 0), reason: Stage cancelled
> 19/02/05 01:07:47 INFO Executor: Executor is trying to kill task 1.0 in stage 0.0 (TID 1), reason: Stage cancelled
> 19/02/05 01:07:47 INFO Executor: Executor is trying to kill task 3.0 in stage 0.0 (TID 3), reason: Stage cancelled
> 19/02/05 01:07:50 WARN PythonRunner: Incomplete task 0.0 in stage 0 (TID 0) interrupted: Attempting to kill Python Worker
> 19/02/05 01:07:50 INFO Executor: Executor killed task 0.0 in stage 0.0 (TID 0), reason: Stage cancelled
> 19/02/05 01:07:50 WARN PythonRunner: Incomplete task 3.3 in stage 0 (TID 3) interrupted: Attempting to kill Python Worker
> 19/02/05 01:07:50 INFO Executor: Executor killed task 3.0 in stage 0.0 (TID 3), reason: Stage cancelled
> 19/02/05 01:07:50 WARN PythonRunner: Incomplete task 2.2 in stage 0 (TID 2) interrupted: Attempting to kill Python Worker
> 19/02/05 01:07:50 INFO Executor: Executor killed task 2.0 in stage 0.0 (TID 2), reason: Stage cancelled
> 19/02/05 01:07:50 WARN PythonRunner: Incomplete task 1.1 in stage 0 (TID 1) interrupted: Attempting to kill Python Worker
> 19/02/05 01:07:50 INFO Executor: Executor killed task 1.0 in stage 0.0 (TID 1), reason: Stage cancelled
> 19/02/05 01:08:42 INFO BarrierTaskContext: Task 3 from Stage 0(Attempt 0) waiting under the global sync since 1549328862443, has been waiting for 60 seconds, current barrier epoch is 0.
> 19/02/05 01:08:42 INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) waiting under the global sync since 1549328862522, has been waiting for 60 seconds, current barrier epoch is 0.
> 19/02/05 01:09:42 INFO BarrierTaskContext: Task 3 from Stage 0(Attempt 0) waiting under the global sync since 1549328862443, has been waiting for 120 seconds, current barrier epoch is 0.
> 19/02/05 01:09:42 INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) waiting under the global sync since 1549328862522, has been waiting for 120 seconds, current barrier epoch is 0.
> 19/02/05 01:10:42 INFO BarrierTaskContext: Task 3 from Stage 0(Attempt 0) waiting under the global sync since 1549328862443, has been waiting for 180 seconds, current barrier epoch is 0.
> 19/02/05 01:10:42 INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) waiting under the global sync since 1549328862522, has been waiting for 180 seconds, current barrier epoch is 0.
> 19/02/05 01:11:42 INFO BarrierTaskContext: Task 3 from Stage 0(Attempt 0) waiting under the global sync since 1549328862443, has been waiting for 240 seconds, current barrier epoch is 0.
> 19/02/05 01:11:42 INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) waiting under the global sync since 1549328862522, has been waiting for 240 seconds, current barrier epoch is 0.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org