Posted to issues@spark.apache.org by "feiwang (Jira)" <ji...@apache.org> on 2020/03/18 02:39:00 UTC

[jira] [Updated] (SPARK-31179) Fast fail the connection while last shuffle connection failed in the last retry IO wait

     [ https://issues.apache.org/jira/browse/SPARK-31179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

feiwang updated SPARK-31179:
----------------------------
    Description: 
When reading shuffle data, several fetch requests may be sent to the same shuffle server.
There is a client pool, and these requests may share the same client.
When the shuffle server is busy, the connection attempts may time out.
For example, suppose there are two request connections, rc1 and rc2, io.numConnectionsPerPeer is 1, and the connection timeout is 2 minutes.

1: rc1 holds the client lock and times out after 2 minutes.
2: rc2 holds the client lock and times out after 2 minutes.
3: rc1 starts its second retry, holds the lock, and times out after 2 minutes.
4: rc2 starts its second retry, holds the lock, and times out after 2 minutes.
5: rc1 starts its third retry, holds the lock, and times out after 2 minutes.
6: rc2 starts its third retry, holds the lock, and times out after 2 minutes.
The two requests serialize six 2-minute timeouts, roughly 12 minutes of wall-clock time, which wastes a lot of time. A rough sketch of the proposed fast-fail behavior is shown below.
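
Below is a minimal, hypothetical sketch of the fast-fail idea. It is not the actual Spark shuffle client code; the class name FastFailClientPool and the fields lastFailureNs / connectTimeoutMs are invented for illustration. The point it shows: if the previous connection attempt to the same shuffle server already failed within the last connection-timeout window, a new attempt fails immediately instead of blocking for its own 2-minute timeout.

{code:scala}
import java.io.IOException
import java.util.concurrent.TimeUnit

// Hypothetical sketch (names invented): one pool entry per shuffle server,
// so with io.numConnectionsPerPeer = 1 all fetch requests to that server
// serialize on this object's lock, just like rc1 and rc2 above.
class FastFailClientPool(connectTimeoutMs: Long) {

  // Nanosecond timestamp of the most recent failed connect attempt; 0 = none.
  private var lastFailureNs: Long = 0L

  /** Run a connect attempt, but fail fast if the previous attempt to this
   *  server failed within the last `connectTimeoutMs` window. */
  def connect(doConnect: () => Unit): Unit = synchronized {
    if (lastFailureNs != 0L) {
      val sinceLastFailureMs =
        TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - lastFailureNs)
      if (sinceLastFailureMs < connectTimeoutMs) {
        // Fast fail: the server just timed out for another request while we
        // were waiting for the lock, so don't burn another full timeout.
        throw new IOException(
          s"Last connection to this shuffle server failed $sinceLastFailureMs ms ago; " +
            "failing fast instead of waiting for another timeout")
      }
    }
    try {
      doConnect()          // may block for up to connectTimeoutMs
      lastFailureNs = 0L   // a successful connect clears the fast-fail state
    } catch {
      case e: IOException =>
        lastFailureNs = System.nanoTime()
        throw e
    }
  }
}
{code}

With a check like this, in the scenario above rc2's first attempt and all later retries would fail almost immediately once rc1's timeout has been recorded, instead of each attempt waiting its own 2 minutes while holding the shared client lock.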

> Fast fail the connection while last shuffle connection failed in the last retry IO wait 
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-31179
>                 URL: https://issues.apache.org/jira/browse/SPARK-31179
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle
>    Affects Versions: 3.1.0
>            Reporter: feiwang
>            Priority: Major
>
> When reading shuffle data, several fetch requests may be sent to the same shuffle server.
> There is a client pool, and these requests may share the same client.
> When the shuffle server is busy, the connection attempts may time out.
> For example, suppose there are two request connections, rc1 and rc2, io.numConnectionsPerPeer is 1, and the connection timeout is 2 minutes.
> 1: rc1 holds the client lock and times out after 2 minutes.
> 2: rc2 holds the client lock and times out after 2 minutes.
> 3: rc1 starts its second retry, holds the lock, and times out after 2 minutes.
> 4: rc2 starts its second retry, holds the lock, and times out after 2 minutes.
> 5: rc1 starts its third retry, holds the lock, and times out after 2 minutes.
> 6: rc2 starts its third retry, holds the lock, and times out after 2 minutes.
> The two requests serialize six 2-minute timeouts, roughly 12 minutes of wall-clock time, which wastes a lot of time.



