Posted to issues@spark.apache.org by "liupengcheng (JIRA)" <ji...@apache.org> on 2018/07/24 07:20:00 UTC

[jira] [Commented] (SPARK-24897) DAGScheduler should not unregisterMapOutput and increaseEpoch repeatedly for stage fetchFailed

    [ https://issues.apache.org/jira/browse/SPARK-24897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16553907#comment-16553907 ] 

liupengcheng commented on SPARK-24897:
--------------------------------------

Can someone verify this issue?

I checked the 2.3.1 branch, and the problem does not appear to be fixed there.

> DAGScheduler should not unregisterMapOutput and increaseEpoch repeatedly for stage fetchFailed
> ----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24897
>                 URL: https://issues.apache.org/jira/browse/SPARK-24897
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, Spark Core
>    Affects Versions: 2.1.0, 2.3.1
>            Reporter: liupengcheng
>            Priority: Major
>
> In Spark 2.1, when a stage hits a FetchFailed, DAGScheduler retries both this stage and its parent stage. However, after the parent stage is resubmitted and starts running, the map statuses can 
> still be invalidated by the failed stage's outstanding tasks as they report further fetch failures.
> Each such outstanding task might unregister the map statuses with a new epoch, causing 
> the parent stage to hit repeated MetadataFetchFailed errors and finally failing the job.
>  
>  
> {code:java}
> 2018-07-23,01:52:33,012 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 174.0 in stage 71.0 (TID 154127, <host>, executor 96): FetchFailed(BlockManagerId(4945, <host>, 22409), shuffleId=24, mapId=667, reduceId=174, message= org.apache.spark.shuffle.FetchFailedException: Failed to connect to <host>/<ip>:22409
> 2018-07-23,01:52:33,013 INFO org.apache.spark.scheduler.DAGScheduler: Marking ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) as failed due to a fetch failure from ShuffleMapStage 69 ($plus$plus at DeviceLocateMain.scala:236) 
> 2018-07-23,01:52:33,014 INFO org.apache.spark.scheduler.DAGScheduler: ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) failed in 246.856 s 
> 2018-07-23,01:52:33,014 INFO org.apache.spark.scheduler.DAGScheduler: Resubmitting ShuffleMapStage 69 ($plus$plus at DeviceLocateMain.scala:236) and ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) due to fetch failure 
> 2018-07-23,01:52:36,004 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 120.0 in stage 71.0 (TID 154073, <host>, executor 286): FetchFailed(BlockManagerId(4208, <host>, 22409), shuffleId=24, mapId=241, reduceId=120, message= org.apache.spark.shuffle.FetchFailedException: Failed to connect to <host>/<ip>:22409 
> 2018-07-23,01:52:36,005 INFO org.apache.spark.scheduler.DAGScheduler: Resubmitting ShuffleMapStage 69 ($plus$plus at DeviceLocateMain.scala:236) and ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) due to fetch failure 
> 2018-07-23,01:52:36,017 INFO org.apache.spark.storage.MemoryStore: Block broadcast_63_piece0 stored as bytes in memory (estimated size 4.0 MB, free 26.7 MB) 
> 2018-07-23,01:52:36,025 INFO org.apache.spark.storage.BlockManagerInfo: Removed broadcast_59_piece1 on <host>:52349 in memory (size: 4.0 MB, free: 3.0 GB) 
> 2018-07-23,01:52:36,029 INFO org.apache.spark.storage.BlockManagerInfo: Removed broadcast_61_piece6 on <host>:52349 in memory (size: 4.0 MB, free: 3.0 GB) 
> 2018-07-23,01:52:36,079 INFO org.apache.spark.deploy.yarn.YarnAllocator: Canceling requests for 0 executor containers 
> 2018-07-23,01:52:36,079 WARN org.apache.spark.deploy.yarn.YarnAllocator: Expected to find pending requests, but found none.
> 2018-07-23,01:52:36,094 INFO org.apache.spark.storage.BlockManagerInfo: Added broadcast_63_piece0 in memory on <host>:56780 (size: 4.0 MB, free: 3.7 GB) 
> 2018-07-23,01:52:36,095 INFO org.apache.spark.storage.MemoryStore: Block broadcast_63_piece1 stored as bytes in memory (estimated size 4.0 MB, free 30.7 MB) 
> 2018-07-23,01:52:36,107 INFO org.apache.spark.storage.BlockManagerInfo: Added broadcast_63_piece1 in memory on <host>:56780 (size: 4.0 MB, free: 3.7 GB) 
> 2018-07-23,01:52:36,108 INFO org.apache.spark.storage.MemoryStore: Block broadcast_63_piece2 stored as bytes in memory (estimated size 4.0 MB, free 34.7 MB) 
> 2018-07-23,01:52:36,108 INFO org.apache.spark.storage.BlockManagerInfo: Added broadcast_63_piece2 in memory on <host>:56780 (size: 4.0 MB, free: 3.7 GB) 
> 2018-07-23,01:52:36,108 INFO org.apache.spark.storage.MemoryStore: Block broadcast_63_piece3 stored as bytes in memory (estimated size 4.0 MB, free 38.7 MB) 
> 2018-07-23,01:52:36,132 INFO org.apache.spark.storage.BlockManagerInfo: Added broadcast_63_piece3 in memory on <host>:56780 (size: 4.0 MB, free: 3.7 GB) 
> 2018-07-23,01:52:36,132 INFO org.apache.spark.storage.MemoryStore: Block broadcast_63_piece4 stored as bytes in memory (estimated size 3.8 MB, free 42.5 MB) 
> 2018-07-23,01:52:36,132 INFO org.apache.spark.storage.BlockManagerInfo: Added broadcast_63_piece4 in memory on <host>:56780 (size: 3.8 MB, free: 3.7 GB) 
> 2018-07-23,01:52:36,132 INFO org.apache.spark.MapOutputTracker: Broadcast mapstatuses size = 384, actual size = 20784475 
> 2018-07-23,01:52:36,132 INFO org.apache.spark.MapOutputTrackerMaster: Size of output statuses for shuffle 17 is 384 bytes 
> 2018-07-23,01:52:36,133 INFO org.apache.spark.MapOutputTrackerMaster: Epoch changed, not caching! 
> 2018-07-23,01:52:36,185 INFO org.apache.spark.storage.BlockManagerInfo: Removed broadcast_61_piece3 on <host>:52349 in memory (size: 4.0 MB, free: 3.0 GB) 
> 2018-07-23,01:52:36,185 INFO org.apache.spark.storage.BlockManagerInfo: Removed broadcast_63_piece4 on <host>:56780 in memory (size: 3.8 MB, free: 3.7 GB) 
> 2018-07-23,01:52:36,186 INFO org.apache.spark.storage.BlockManagerInfo: Removed broadcast_63_piece2 on <host>:56780 in memory (size: 4.0 MB, free: 3.7 GB) 
> 2018-07-23,01:52:36,186 INFO org.apache.spark.storage.BlockManagerInfo: Removed broadcast_63_piece1 on <host>:56780 in memory (size: 4.0 MB, free: 3.7 GB) 
> 2018-07-23,01:52:36,186 INFO org.apache.spark.storage.BlockManagerInfo: Removed broadcast_63_piece0 on <host>:56780 in memory (size: 4.0 MB, free: 3.7 GB) 
> 2018-07-23,01:52:36,186 INFO org.apache.spark.storage.BlockManagerInfo: Removed broadcast_63_piece3 on <host>:56780 in memory (size: 4.0 MB, free: 3.7 GB) 
> 2018-07-23,01:52:36,192 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in stage 69.1 (TID 154955, <host>, executor 267): FetchFailed(null, shuffleId=17, mapId=-1, reduceId=-1, message= org.apache.spark.shuffle.MetadataFetchFailedException: java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_63_piece4 of broadcast_63
> {code}
>  
>  
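
To make the sequence in the log easier to follow, here is a rough toy model of it in Python. It is not Spark code: ToyTracker, FetchFailure, handle_naive, handle_guarded and replay are all hypothetical names, and the guard shown is only one possible way to avoid the repeated epoch bump, not the actual or proposed fix. It assumes each task remembers the tracker epoch it was launched with.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTracker:
    """Toy stand-in for the driver-side map output registry (not Spark's API)."""
    epoch: int = 0
    outputs: dict = field(default_factory=dict)         # shuffle_id -> set of map ids
    invalidated_at: dict = field(default_factory=dict)  # shuffle_id -> epoch of last unregister

    def register(self, shuffle_id, map_id):
        self.outputs.setdefault(shuffle_id, set()).add(map_id)

    def unregister(self, shuffle_id, map_id):
        # Dropping an output bumps the epoch, which invalidates cached statuses.
        self.outputs.setdefault(shuffle_id, set()).discard(map_id)
        self.epoch += 1
        self.invalidated_at[shuffle_id] = self.epoch


@dataclass(frozen=True)
class FetchFailure:
    shuffle_id: int
    map_id: int
    task_epoch: int  # assumption: epoch the failing task was launched with


def handle_naive(tracker, failure):
    """Unregister on every reported failure, including late ones from old tasks."""
    tracker.unregister(failure.shuffle_id, failure.map_id)
    return True


def handle_guarded(tracker, failure):
    """Skip failures from tasks launched before the shuffle was last invalidated."""
    last = tracker.invalidated_at.get(failure.shuffle_id, -1)
    if failure.task_epoch < last:
        return False  # stale report: the shuffle is already being recomputed
    tracker.unregister(failure.shuffle_id, failure.map_id)
    return True


def replay(handler):
    """Replay the two FetchFailed events for shuffle 24 seen in the log."""
    tracker = ToyTracker()
    for map_id in (241, 667):
        tracker.register(24, map_id)
    launch_epoch = tracker.epoch  # both tasks of stage 71.0 started at this epoch
    failures = [
        FetchFailure(24, 667, launch_epoch),  # task 174.0, reported first
        FetchFailure(24, 241, launch_epoch),  # task 120.0, reported after resubmission
    ]
    for failure in failures:
        handler(tracker, failure)
    return tracker.epoch
```

In this toy model, replay(handle_naive) ends at epoch 2 because the late failure bumps the epoch a second time, while replay(handle_guarded) ends at epoch 1. The trade-off of such a guard is that a skipped failure may point at an output that really is lost, in which case it would only be noticed on a later fetch.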


