Posted to issues@tez.apache.org by "Ming Ma (JIRA)" <ji...@apache.org> on 2016/05/20 22:04:12 UTC

[jira] [Updated] (TEZ-3263) Improved shuffle error handling across NM restarts

     [ https://issues.apache.org/jira/browse/TEZ-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ming Ma updated TEZ-3263:
-------------------------
    Description: 
The fix could perhaps be similar to MAPREDUCE-5891. Here is one exception seen during an NM rolling restart. At least in the unordered case, the fetcher appears able to resubmit the request to the queue and eventually succeed; however, it still sends an InputReadErrorEvent to the AM for each retry, which could cause the AM to mark the source task as bad if enough destination tasks complain.

{noformat}
java.net.ConnectException: Connection refused
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:579)
	at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
	at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:653)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1325)
	at org.apache.tez.http.HttpConnection.getInputStream(HttpConnection.java:247)
	at org.apache.tez.runtime.library.common.shuffle.Fetcher.setupConnection(Fetcher.java:464)
{noformat}
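
One possible direction, sketched under assumptions: keep connection failures silent for the first few retries against a host, and only report to the AM once they persist. The class below is purely illustrative. ConnectFailureTracker, maxSilentRetries, recordFailure and recordSuccess are hypothetical names and are not part of Tez or MapReduce; the simple per-host counter is a stand-in chosen for brevity and is not necessarily the mechanism used in MAPREDUCE-5891.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical helper (not a Tez class): decides whether a connection
 * failure to a shuffle host should be reported, or silently retried.
 */
final class ConnectFailureTracker {
    // Number of consecutive failures per host that are tolerated silently.
    private final int maxSilentRetries;
    // Consecutive connection failures seen so far, keyed by "host:port".
    private final Map<String, Integer> failuresByHost = new HashMap<>();

    ConnectFailureTracker(int maxSilentRetries) {
        if (maxSilentRetries < 0) {
            throw new IllegalArgumentException("maxSilentRetries must be >= 0");
        }
        this.maxSilentRetries = maxSilentRetries;
    }

    /**
     * Records one connection failure for the host.
     * Returns true once the host has failed more than maxSilentRetries
     * times in a row, i.e. when the caller should report a read error.
     */
    synchronized boolean recordFailure(String host) {
        int count = failuresByHost.merge(host, 1, Integer::sum);
        return count > maxSilentRetries;
    }

    /** Clears the failure count after a successful connection to the host. */
    synchronized void recordSuccess(String host) {
        failuresByHost.remove(host);
    }
}
```

In this sketch a fetcher would call recordFailure on each ConnectException and send an error event only when it returns true, and would call recordSuccess once the restarted NM accepts connections again. A time-based window instead of a counter would be an equally plausible choice.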

  was:
Maybe the fix could be something similar to MAPREDUCE-5891. Here is one exception found during NM rolling restart.

{noformat}
java.net.ConnectException: Connection refused
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:579)
	at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
	at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:653)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1325)
	at org.apache.tez.http.HttpConnection.getInputStream(HttpConnection.java:247)
	at org.apache.tez.runtime.library.common.shuffle.Fetcher.setupConnection(Fetcher.java:464)
{noformat}


> Improved shuffle error handling across NM restarts
> --------------------------------------------------
>
>                 Key: TEZ-3263
>                 URL: https://issues.apache.org/jira/browse/TEZ-3263
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Ming Ma
>
> The fix could perhaps be similar to MAPREDUCE-5891. Here is one exception seen during an NM rolling restart. At least in the unordered case, the fetcher appears able to resubmit the request to the queue and eventually succeed; however, it still sends an InputReadErrorEvent to the AM for each retry, which could cause the AM to mark the source task as bad if enough destination tasks complain.
> {noformat}
> java.net.ConnectException: Connection refused
> 	at java.net.PlainSocketImpl.socketConnect(Native Method)
> 	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
> 	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
> 	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
> 	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> 	at java.net.Socket.connect(Socket.java:579)
> 	at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
> 	at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
> 	at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
> 	at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:653)
> 	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1325)
> 	at org.apache.tez.http.HttpConnection.getInputStream(HttpConnection.java:247)
> 	at org.apache.tez.runtime.library.common.shuffle.Fetcher.setupConnection(Fetcher.java:464)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)