Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:20:12 UTC

[jira] [Updated] (SPARK-16244) Failed job/stage couldn't stop JobGenerator immediately.

     [ https://issues.apache.org/jira/browse/SPARK-16244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-16244:
---------------------------------
    Labels: bulk-closed  (was: )

> Failed job/stage couldn't stop JobGenerator immediately.
> --------------------------------------------------------
>
>                 Key: SPARK-16244
>                 URL: https://issues.apache.org/jira/browse/SPARK-16244
>             Project: Spark
>          Issue Type: Bug
>          Components: DStreams
>    Affects Versions: 1.5.2
>            Reporter: Liyin Tang
>            Priority: Major
>              Labels: bulk-closed
>
> This streaming job has a very simple DAG: each batch has only 1 job, and each job has only 1 stage.
> Based on the following logs, we observed a potential race condition. Stage 227 failed due to task failures, which triggers failJobAndIndependentStages.
> In the meanwhile, the next job (stage 228) was submitted and was able to successfully run a few tasks before the shutdown hook stopped the JobGenerator.
> Because the next job ran through a few tasks successfully, it corrupted the checkpoint / offset management.
> Here is the log from my job:
> {color:red}
> Stage 227 started:
> {color}
> [INFO] 2016-06-25 18:59:00,160 org.apache.spark.scheduler.DAGScheduler logInfo - Final stage: ResultStage 227(foreachRDD at DBExportStreaming.java:55)
> [INFO] 2016-06-25 18:59:00,160 org.apache.spark.scheduler.DAGScheduler logInfo - Submitting ResultStage 227 (MapPartitionsRDD[455] at foreachRDD at DBExportStreaming.java:55), which has no missing parents
> [INFO] 2016-06-25 18:59:00,171 org.apache.spark.scheduler.DAGScheduler logInfo - Submitting 1495 missing tasks from ResultStage 227 (MapPartitionsRDD[455] at foreachRDD at DBExportStreaming.java:55)
> {color:red}
> Stage 227 failed:
> {color}
> [ERROR] 2016-06-25 19:01:34,083 org.apache.spark.scheduler.TaskSetManager logError - Task 26 in stage 227.0 failed 4 times; aborting job
> [INFO] 2016-06-25 19:01:34,086 org.apache.spark.scheduler.cluster.YarnScheduler logInfo - Cancelling stage 227
> [INFO] 2016-06-25 19:01:34,088 org.apache.spark.scheduler.cluster.YarnScheduler logInfo - Stage 227 was cancelled
> [INFO] 2016-06-25 19:01:34,089 org.apache.spark.scheduler.DAGScheduler logInfo - ResultStage 227 (foreachRDD at DBExportStreaming.java:55) failed in 153.914 s
> [INFO] 2016-06-25 19:01:34,090 org.apache.spark.scheduler.DAGScheduler logInfo - Job 227 failed: foreachRDD at DBExportStreaming.java:55, took 153.930462 s
> [INFO] 2016-06-25 19:01:34,091 org.apache.spark.streaming.scheduler.JobScheduler logInfo - Finished job streaming job 1466881140000 ms.0 from job set of time 1466881140000 ms
> [INFO] 2016-06-25 19:01:34,091 org.apache.spark.streaming.scheduler.JobScheduler logInfo - Total delay: 154.091 s for time 1466881140000 ms (execution: 153.935 s)
> {color:red}
> Stage 228 started:
> {color}
> [INFO] 2016-06-25 19:01:34,094 org.apache.spark.SparkContext logInfo - Starting job: foreachRDD at DBExportStreaming.java:55
> [INFO] 2016-06-25 19:01:34,095 org.apache.spark.scheduler.DAGScheduler logInfo - Got job 228 (foreachRDD at DBExportStreaming.java:55) with 1495 output partitions
> [INFO] 2016-06-25 19:01:34,095 org.apache.spark.scheduler.DAGScheduler logInfo - Final stage: ResultStage 228(foreachRDD at DBExportStreaming.java:55)
> Exception in thread "main" [INFO] 2016-06-25 19:01:34,095 org.apache.spark.scheduler.DAGScheduler logInfo - Parents of final stage: List()
> {color:red}
> Shutdown hook was called after stage 228 started:
> {color}
> [INFO] 2016-06-25 19:01:34,099 org.apache.spark.streaming.StreamingContext logInfo - Invoking stop(stopGracefully=false) from shutdown hook
> [INFO] 2016-06-25 19:01:34,101 org.apache.spark.streaming.scheduler.JobGenerator logInfo - Stopping JobGenerator immediately
> [INFO] 2016-06-25 19:01:34,102 org.apache.spark.streaming.util.RecurringTimer logInfo - Stopped timer for JobGenerator after time 1466881260000
> [INFO] 2016-06-25 19:01:34,103 org.apache.spark.streaming.scheduler.JobGenerator logInfo - Stopped JobGenerator
> [INFO] 2016-06-25 19:01:34,106 org.apache.spark.storage.MemoryStore logInfo - ensureFreeSpace(133720) called with curMem=344903, maxMem=1159641169
> [INFO] 2016-06-25 19:01:34,106 org.apache.spark.storage.MemoryStore logInfo - Block broadcast_229 stored as values in memory (estimated size 130.6 KB, free 1105.5 MB)
> [INFO] 2016-06-25 19:01:34,107 org.apache.spark.storage.MemoryStore logInfo - ensureFreeSpace(51478) called with curMem=478623, maxMem=1159641169
> [INFO] 2016-06-25 19:01:34,107 org.apache.spark.storage.MemoryStore logInfo - Block broadcast_229_piece0 stored as bytes in memory (estimated size 50.3 KB, free 1105.4 MB)
> [INFO] 2016-06-25 19:01:34,108 org.apache.spark.storage.BlockManagerInfo logInfo - Added broadcast_229_piece0 in memory on 10.123.209.8:42154 (size: 50.3 KB, free: 1105.8 MB)
> [INFO] 2016-06-25 19:01:34,109 org.apache.spark.SparkContext logInfo - Created broadcast 229 from broadcast at DAGScheduler.scala:861
> [INFO] 2016-06-25 19:01:34,110 org.apache.spark.scheduler.DAGScheduler logInfo - Submitting 1495 missing tasks from ResultStage 228 (MapPartitionsRDD[458] at foreachRDD at DBExportStreaming.java:55)
> [INFO] 2016-06-25 19:01:34,111 org.apache.spark.scheduler.cluster.YarnScheduler logInfo - Adding task set 228.0 with 1495 tasks
> [INFO] 2016-06-25 19:01:34,329 org.apache.spark.scheduler.TaskSetManager logInfo - Starting task 0.0 in stage 228.0 (TID 340194, i-87adb702.inst.aws.airbnb.com, RACK_LOCAL, 2894 bytes)
> {color:red}
> Stage 228 has completed a few tasks even though the job is supposed to be killed:
> {color}
> [INFO] 2016-06-25 19:01:34,591 org.apache.spark.scheduler.TaskSetManager logInfo - Starting task 2.0 in stage 228.0 (TID 340196, i-fbc19629.inst.aws.airbnb.com, RACK_LOCAL, 2872 bytes)
> [INFO] 2016-06-25 19:01:34,591 org.apache.spark.scheduler.TaskSetManager logInfo - Finished task 1.0 in stage 228.0 (TID 340195) in 100 ms on i-fbc19629.inst.aws.airbnb.com (1/1495)
> [INFO] 2016-06-25 19:01:34,610 org.apache.spark.scheduler.TaskSetManager logInfo - Starting task 3.0 in stage 228.0 (TID 340197, i-fbc19629.inst.aws.airbnb.com, RACK_LOCAL, 2872 bytes)
> [INFO] 2016-06-25 19:01:34,610 org.apache.spark.scheduler.TaskSetManager logInfo - Finished task 2.0 in stage 228.0 (TID 340196) in 19 ms on i-fbc19629.inst.aws.airbnb.com (2/1495)
> [INFO] 2016-06-25 19:01:34,654 org.apache.spark.scheduler.TaskSetManager logInfo - Starting task 4.0 in stage 228.0 (TID 340198, i-fbc19629.inst.aws.airbnb.com, RACK_LOCAL, 2871 bytes)
> [INFO] 2016-06-25 19:01:34,654 org.apache.spark.scheduler.TaskSetManager logInfo - Finished task 3.0 in stage 228.0 (TID 340197) in 44 ms on i-fbc19629.inst.aws.airbnb.com (3/1495)
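The race visible in the logs above — the scheduler launching tasks for stage 228 while the shutdown hook is concurrently stopping the JobGenerator — amounts to a submit path that never re-checks the stop flag once a job is in flight. The following minimal sketch illustrates the shape of a fix; the class and method names are invented for illustration and are not Spark's actual JobGenerator code:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch, not Spark internals: a generator whose submit
// path re-checks a shared stop flag before launching each task, so a
// shutdown request observed mid-job prevents further task launches.
public class JobGeneratorSketch {
    private final AtomicBoolean stopped = new AtomicBoolean(false);

    // Called from the shutdown hook thread.
    public void stop() {
        stopped.set(true);
    }

    // Returns the number of tasks actually started. Without the
    // per-task re-check, a job submitted just before stop() could run
    // all of its tasks, as stage 228 does in the logs above.
    public int submitJob(int numTasks) {
        int started = 0;
        for (int i = 0; i < numTasks; i++) {
            if (stopped.get()) {
                break; // stop observed: launch no further tasks
            }
            started++; // stand-in for "launch task i"
        }
        return started;
    }

    public static void main(String[] args) {
        JobGeneratorSketch gen = new JobGeneratorSketch();
        System.out.println(gen.submitJob(3)); // runs tasks while active
        gen.stop();
        System.out.println(gen.submitJob(3)); // launches nothing after stop
    }
}
```

In real Spark the check would have to be atomic with respect to job submission (e.g. under the scheduler's lock), since a plain flag read still leaves a window between the check and the task launch.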



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org