Posted to dev@spark.apache.org by Surendranauth Hiraman <su...@velos.io> on 2014/06/11 18:17:51 UTC

Re: Error During ReceivingConnection

It looks like this was due to another executor on a different node closing
the connection on its side. I found the entries below in the remote side's
logs.

Can anyone comment on why one ConnectionManager would close its connection
to another node, and what could be tuned to avoid this? The side that closed
the connection did not log any errors.


This is from the ConnectionManager on the side shutting down the
connection, not the ConnectionManager that had the "Connection Reset By
Peer".

14/06/10 18:51:14 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.125,45610)

14/06/10 18:51:14 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(172.16.25.125,45610)
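
For anyone trying to line up the two sides of a dropped connection, here is a
minimal sketch in Python of pulling the "Removing ...Connection" events out of
an executor log and grouping them by peer. It assumes the log format shown
above, with each entry on a single line as in the actual log file rather than
wrapped as in this mail. The names REMOVAL and removals_by_peer are made up for
illustration and are not part of Spark.

```python
import re
from collections import defaultdict

# Assumed line shape (taken from the log excerpts in this thread):
# 14/06/10 18:51:14 INFO network.ConnectionManager: Removing
#   ReceivingConnection to ConnectionManagerId(172.16.25.125,45610)
REMOVAL = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) INFO network\.ConnectionManager: "
    r"Removing (?P<kind>Receiving|Sending)Connection to "
    r"ConnectionManagerId\((?P<host>[\d.]+),(?P<port>\d+)\)"
)


def removals_by_peer(lines):
    """Group connection-removal events by the (host, port) of the peer."""
    events = defaultdict(list)
    for line in lines:
        m = REMOVAL.match(line.strip())
        if m:
            peer = (m.group("host"), int(m.group("port")))
            events[peer].append((m.group("ts"), m.group("kind")))
    return dict(events)


# Sample input: the two entries quoted above, joined onto single lines.
sample = [
    "14/06/10 18:51:14 INFO network.ConnectionManager: Removing "
    "ReceivingConnection to ConnectionManagerId(172.16.25.125,45610)",
    "14/06/10 18:51:14 INFO network.ConnectionManager: Removing "
    "SendingConnection to ConnectionManagerId(172.16.25.125,45610)",
]

for peer, evs in removals_by_peer(sample).items():
    print(peer, evs)
```

Running the same function over the log from the node that saw "Connection
reset by peer" and comparing timestamps per peer shows which side removed the
connection first.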




On Wed, Jun 11, 2014 at 8:38 AM, Surendranauth Hiraman <
suren.hiraman@velos.io> wrote:

> I have a somewhat large job (10 GB input data but generates about 500 GB
> of data after many stages).
>
> Most tasks completed but a few stragglers on the same node/executor are
> still active (but doing nothing) after about 16 hours.
>
> At about 3 to 4 hours in, the tasks that are hanging have the following in
> the work logs.
>
> Any idea what config to tweak for this?
>
>
> 14/06/10 18:51:10 WARN network.ReceivingConnection: Error reading from connection to ConnectionManagerId(172.16.25.108,37693)
> java.io.IOException: Connection reset by peer
>     at sun.nio.ch.FileDispatcher.read0(Native Method)
>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
>     at sun.nio.ch.IOUtil.read(IOUtil.java:224)
>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
>     at org.apache.spark.network.ReceivingConnection.read(Connection.scala:534)
>     at org.apache.spark.network.ConnectionManager$$anon$6.run(ConnectionManager.scala:175)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>     at java.lang.Thread.run(Thread.java:679)
> 14/06/10 18:51:10 INFO network.ConnectionManager: Handling connection error on connection to ConnectionManagerId(172.16.25.108,37693)
> 14/06/10 18:51:10 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.108,37693)
> 14/06/10 18:51:10 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(172.16.25.108,37693)
> 14/06/10 18:51:10 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.108,37693)
> 14/06/10 18:51:10 ERROR network.ConnectionManager: Corresponding SendingConnectionManagerId not found
> 14/06/10 18:51:10 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.108,37693)
> 14/06/10 18:51:10 ERROR network.ConnectionManager: Corresponding SendingConnectionManagerId not found
> 14/06/10 18:51:14 WARN network.ReceivingConnection: Error reading from connection to ConnectionManagerId(172.16.25.97,54918)
> java.io.IOException: Connection reset by peer
>     at sun.nio.ch.FileDispatcher.read0(Native Method)
>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
>     at sun.nio.ch.IOUtil.read(IOUtil.java:224)
>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
>     at org.apache.spark.network.ReceivingConnection.read(Connection.scala:534)
>     at org.apache.spark.network.ConnectionManager$$anon$6.run(ConnectionManager.scala:175)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>     at java.lang.Thread.run(Thread.java:679)
> 14/06/10 18:51:14 INFO network.ConnectionManager: Handling connection error on connection to ConnectionManagerId(172.16.25.97,54918)
> 14/06/10 18:51:14 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.97,54918)
> 14/06/10 18:51:14 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(172.16.25.97,54918)
> 14/06/10 18:51:14 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.97,54918)
> 14/06/10 18:51:14 ERROR network.ConnectionManager: Corresponding SendingConnectionManagerId not found
> 14/06/10 18:51:14 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(172.16.25.97,54918)
> 14/06/10 18:51:14 ERROR network.ConnectionManager: Corresponding SendingConnectionManagerId not found
>
> --
>
> SUREN HIRAMAN, VP TECHNOLOGY
> Velos
> Accelerating Machine Learning
>
> 440 NINTH AVENUE, 11TH FLOOR
> NEW YORK, NY 10001
> O: (917) 525-2466 ext. 105
> F: 646.349.4063
> E: suren.hiraman@velos.io
> W: www.velos.io
>
>

