Posted to issues@spark.apache.org by "Josh Rosen (Jira)" <ji...@apache.org> on 2019/10/14 22:17:00 UTC

[jira] [Updated] (SPARK-29471) "TaskResultLost (result lost from block manager)" error message is misleading in case result fetch is caused by client-side network connectivity issues

     [ https://issues.apache.org/jira/browse/SPARK-29471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen updated SPARK-29471:
-------------------------------
    Summary: "TaskResultLost (result lost from block manager)" error message is misleading in case result fetch is caused by client-side network connectivity issues  (was: "TaskResultLost (result lost from block manager)" error message is misleading in case result fetch is caused by client-side issues)

> "TaskResultLost (result lost from block manager)" error message is misleading in case result fetch is caused by client-side network connectivity issues
> -------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-29471
>                 URL: https://issues.apache.org/jira/browse/SPARK-29471
>             Project: Spark
>          Issue Type: Bug
>          Components: Block Manager
>    Affects Versions: 3.0.0
>            Reporter: Josh Rosen
>            Priority: Minor
>
> I recently encountered a problem where jobs non-deterministically failed with
> {code:java}
> TaskResultLost (result lost from block manager)
> {code}
> exceptions.
> It turned out that this was caused by a networking issue: the Spark driver was unable to initiate outgoing connections to executors' block managers in order to fetch indirect task results.
> In this situation, the error message was slightly misleading: the phrase "result lost from block manager" makes it sound like we received an error / block-not-found response from the remote host, whereas in my case the problem was a network connectivity issue where we weren't able to connect in the first place.
> If it's easy to do so, it would be nice to refine the error-handling / logging code so that it distinguishes between the receipt of an error response and a lower-level networking / connectivity issue.
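
The distinction requested above could be sketched roughly as follows. This is only an illustration, not Spark's actual code: FetchFailureClassifier, its Kind enum, RemoteBlockNotFoundException, and the message wording are all hypothetical names made up for this example. It assumes the fetch failure reaches the driver as a Throwable whose cause chain contains either a standard java.net connectivity exception or an explicit "block not found" error from the remote side.

```java
import java.net.ConnectException;
import java.net.NoRouteToHostException;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;

// Illustrative sketch only: none of these names come from Spark.
final class FetchFailureClassifier {

    enum Kind { CONNECTIVITY, REMOTE_ERROR_RESPONSE, UNKNOWN }

    // Hypothetical stand-in for an explicit "block not found" reply
    // received from a remote block manager.
    static final class RemoteBlockNotFoundException extends RuntimeException {
        RemoteBlockNotFoundException(String message) {
            super(message);
        }
    }

    // Walks the cause chain (bounded, to guard against cyclic causes)
    // and reports the first recognizable failure kind.
    static Kind classify(Throwable failure) {
        int depth = 0;
        for (Throwable t = failure; t != null && depth < 32; t = t.getCause(), depth++) {
            if (t instanceof RemoteBlockNotFoundException) {
                // The remote host was reached and answered with an error.
                return Kind.REMOTE_ERROR_RESPONSE;
            }
            if (t instanceof ConnectException
                    || t instanceof NoRouteToHostException
                    || t instanceof UnknownHostException
                    || t instanceof SocketTimeoutException) {
                // The connection could not be established or timed out.
                return Kind.CONNECTIVITY;
            }
        }
        return Kind.UNKNOWN;
    }

    // Hypothetical message wording, chosen only to show the distinction.
    static String describe(Throwable failure) {
        switch (classify(failure)) {
            case CONNECTIVITY:
                return "TaskResultLost (could not connect to remote block manager)";
            case REMOTE_ERROR_RESPONSE:
                return "TaskResultLost (result lost from block manager)";
            default:
                return "TaskResultLost (result fetch failed for an unknown reason)";
        }
    }
}
```

With this sketch, a failure wrapping a java.net.ConnectException would be reported as a connectivity problem, while the existing "result lost from block manager" wording would be reserved for cases where the remote host actually responded with an error.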



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
