Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2019/07/08 16:06:00 UTC

[jira] [Assigned] (SPARK-28305) When the AM cannot obtain the container loss reason, the GetExecutorLossReason ask times out

     [ https://issues.apache.org/jira/browse/SPARK-28305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-28305:
------------------------------------

    Assignee:     (was: Apache Spark)

> When the AM cannot obtain the container loss reason, the GetExecutorLossReason ask times out
> ---------------------------------------------------------------------------------------------
>
>                 Key: SPARK-28305
>                 URL: https://issues.apache.org/jira/browse/SPARK-28305
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 2.4.0
>            Reporter: dzcxzl
>            Priority: Trivial
>
> In some cases, such as when the NM machine crashes or shuts down, the driver asks the AM for GetExecutorLossReason,
> but the AM's getCompletedContainersStatuses cannot obtain the failure information for the container.
> The YARN NM failure detection timeout is 10 minutes, controlled by the parameter yarn.resourcemanager.rm.container-allocation.expiry-interval-ms,
> so the AM has to wait up to 10 minutes to learn the cause of the container failure.
> Although the driver's ask fails, it will call recover.
> However, because of the 2-minute idle timeout (spark.network.timeout) enforced by IdleStateHandler, the connection between the driver and the AM is closed, the AM exits, the app finishes, and the driver exits, causing the job to fail (a sketch of the two driver-side timeouts involved follows the logs below).
> AM LOG:
> 19/07/08 16:56:48 [dispatcher-event-loop-0] INFO YarnAllocator: add executor 951 to pendingLossReasonRequests for get the loss reason
> 19/07/08 16:58:48 [dispatcher-event-loop-26] INFO ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down.
> 19/07/08 16:58:48 [dispatcher-event-loop-26] INFO ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
> Driver LOG:
> 19/07/08 16:58:48,476 [rpc-server-3-3] ERROR TransportChannelHandler: Connection to /xx.xx.xx.xx:19398 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.network.timeout if this is wrong.
> 19/07/08 16:58:48,476 [rpc-server-3-3] ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from /xx.xx.xx.xx:19398 is closed
> 19/07/08 16:58:48,510 [rpc-server-3-3] WARN NettyRpcEnv: Ignored failure: java.io.IOException: Connection from /xx.xx.xx.xx:19398 closed
> 19/07/08 16:58:48,516 [netty-rpc-env-timeout] WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to get executor loss reason for executor id 951 at RPC address xx.xx.xx.xx:49175, but got no response. Marking as slave lost.
> org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply from null in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
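
For context, here is a minimal configuration sketch (Scala, not part of the ticket) of the two driver-side timeouts involved in the failure above. The 660s value is an arbitrary illustration chosen to exceed the default 10-minute YARN detection window, not a recommendation from this report:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // Idle timeout enforced by IdleStateHandler on the driver <-> AM connection;
      // its 120s default is what closes the link while the AM is still waiting for
      // YARN to report the lost container.
      .set("spark.network.timeout", "660s")
      // Timeout for the GetExecutorLossReason ask itself; it falls back to
      // spark.network.timeout when not set explicitly.
      .set("spark.rpc.askTimeout", "660s")

Raising these values only masks the underlying problem (the AM still cannot learn the loss reason any sooner); the sketch is shown purely to make the timeout interplay concrete.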


