You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by Jeff Jones <jj...@adaptivebiotech.com> on 2016/01/06 22:57:39 UTC

Timeout connecting between workers after upgrade to 1.6

I upgraded our Spark standalone cluster from 1.4.1 to 1.6.0 yesterday. We are now seeing regular timeouts between two of the workers when making connections. These workers and the same driver code worked fine running on 1.4.1 and finished in under a second. Any thoughts on what might have changed?

16/01/06 19:17:58 ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks (after 3 retries)
java.io.IOException: Connecting to /10.248.0.218:52104 timed out (120000 ms)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:214)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
16/01/06 19:17:58 WARN BlockManager: Failed to fetch remote block rdd_74_3 from BlockManagerId(1, 10.248.0.218, 52104) (failed attempt 1)
java.io.IOException: Connecting to /10.248.0.218:52104 timed out (120000 ms)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:214)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)



Thanks,
Jeff





This message (and any attachments) is intended only for the designated recipient(s). It
may contain confidential or proprietary information, or have other limitations on use as
indicated by the sender. If you are not a designated recipient, you may not review, use,
copy or distribute this message. If you received this in error, please notify the sender by
reply e-mail and delete this message.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Timeout connecting between workers after upgrade to 1.6

Posted by Jeff Jones <jj...@adaptivebiotech.com>.
Did you get a chance to look at the logs? Any explanation?

From: Jeff Jones <jj...@adaptivebiotech.com>>
Date: Wednesday, January 6, 2016 at 2:07 PM
To: Michael Armbrust <mi...@databricks.com>>
Cc: "user@spark.apache.org<ma...@spark.apache.org>" <us...@spark.apache.org>>
Subject: Re: Timeout connecting between workers after upgrade to 1.6

Attached. Filled with lots of what you see below.

From: Michael Armbrust <mi...@databricks.com>>
Date: Wednesday, January 6, 2016 at 2:01 PM
To: Jeff Jones <jj...@adaptivebiotech.com>>
Cc: "user@spark.apache.org<ma...@spark.apache.org>" <us...@spark.apache.org>>
Subject: Re: Timeout connecting between workers after upgrade to 1.6

Logs from the workers?

On Wed, Jan 6, 2016 at 1:57 PM, Jeff Jones <jj...@adaptivebiotech.com>> wrote:
I upgraded our Spark standalone cluster from 1.4.1 to 1.6.0 yesterday. We are now seeing regular timeouts between two of the workers when making connections. These workers and the same driver code worked fine running on 1.4.1 and finished in under a second. Any thoughts on what might have changed?

16/01/06 19:17:58 ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks (after 3 retries)
java.io.IOException: Connecting to /10.248.0.218:52104<http://10.248.0.218:52104> timed out (120000 ms)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:214)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
16/01/06 19:17:58 WARN BlockManager: Failed to fetch remote block rdd_74_3 from BlockManagerId(1, 10.248.0.218, 52104) (failed attempt 1)
java.io.IOException: Connecting to /10.248.0.218:52104<http://10.248.0.218:52104> timed out (120000 ms)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:214)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)



Thanks,
Jeff





This message (and any attachments) is intended only for the designated recipient(s). It
may contain confidential or proprietary information, or have other limitations on use as
indicated by the sender. If you are not a designated recipient, you may not review, use,
copy or distribute this message. If you received this in error, please notify the sender by
reply e-mail and delete this message.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org<ma...@spark.apache.org>
For additional commands, e-mail: user-help@spark.apache.org<ma...@spark.apache.org>




This message (and any attachments) is intended only for the designated recipient(s). It
may contain confidential or proprietary information, or have other limitations on use as
indicated by the sender. If you are not a designated recipient, you may not review, use,
copy or distribute this message. If you received this in error, please notify the sender by
reply e-mail and delete this message.


This message (and any attachments) is intended only for the designated recipient(s). It
may contain confidential or proprietary information, or have other limitations on use as
indicated by the sender. If you are not a designated recipient, you may not review, use,
copy or distribute this message. If you received this in error, please notify the sender by
reply e-mail and delete this message.

Re: Timeout connecting between workers after upgrade to 1.6

Posted by Jeff Jones <jj...@adaptivebiotech.com>.
Attached. Filled with lots of what you see below.

From: Michael Armbrust <mi...@databricks.com>>
Date: Wednesday, January 6, 2016 at 2:01 PM
To: Jeff Jones <jj...@adaptivebiotech.com>>
Cc: "user@spark.apache.org<ma...@spark.apache.org>" <us...@spark.apache.org>>
Subject: Re: Timeout connecting between workers after upgrade to 1.6

Logs from the workers?

On Wed, Jan 6, 2016 at 1:57 PM, Jeff Jones <jj...@adaptivebiotech.com>> wrote:
I upgraded our Spark standalone cluster from 1.4.1 to 1.6.0 yesterday. We are now seeing regular timeouts between two of the workers when making connections. These workers and the same driver code worked fine running on 1.4.1 and finished in under a second. Any thoughts on what might have changed?

16/01/06 19:17:58 ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks (after 3 retries)
java.io.IOException: Connecting to /10.248.0.218:52104<http://10.248.0.218:52104> timed out (120000 ms)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:214)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
16/01/06 19:17:58 WARN BlockManager: Failed to fetch remote block rdd_74_3 from BlockManagerId(1, 10.248.0.218, 52104) (failed attempt 1)
java.io.IOException: Connecting to /10.248.0.218:52104<http://10.248.0.218:52104> timed out (120000 ms)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:214)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)



Thanks,
Jeff





This message (and any attachments) is intended only for the designated recipient(s). It
may contain confidential or proprietary information, or have other limitations on use as
indicated by the sender. If you are not a designated recipient, you may not review, use,
copy or distribute this message. If you received this in error, please notify the sender by
reply e-mail and delete this message.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org<ma...@spark.apache.org>
For additional commands, e-mail: user-help@spark.apache.org<ma...@spark.apache.org>




This message (and any attachments) is intended only for the designated recipient(s). It
may contain confidential or proprietary information, or have other limitations on use as
indicated by the sender. If you are not a designated recipient, you may not review, use,
copy or distribute this message. If you received this in error, please notify the sender by
reply e-mail and delete this message.

Re: Timeout connecting between workers after upgrade to 1.6

Posted by Michael Armbrust <mi...@databricks.com>.
Logs from the workers?

On Wed, Jan 6, 2016 at 1:57 PM, Jeff Jones <jj...@adaptivebiotech.com>
wrote:

> I upgraded our Spark standalone cluster from 1.4.1 to 1.6.0 yesterday. We
> are now seeing regular timeouts between two of the workers when making
> connections. These workers and the same driver code worked fine running on
> 1.4.1 and finished in under a second. Any thoughts on what might have
> changed?
>
> 16/01/06 19:17:58 ERROR RetryingBlockFetcher: Exception while beginning
> fetch of 1 outstanding blocks (after 3 retries)
> java.io.IOException: Connecting to /10.248.0.218:52104 timed out (120000
> ms)
> at
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:214)
> at
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
> at
> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90)
> at
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
> at
> org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
> at
> org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
> at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
> at java.util.concurrent.FutureTask.run(Unknown Source)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> at java.lang.Thread.run(Unknown Source)
> 16/01/06 19:17:58 WARN BlockManager: Failed to fetch remote block rdd_74_3
> from BlockManagerId(1, 10.248.0.218, 52104) (failed attempt 1)
> java.io.IOException: Connecting to /10.248.0.218:52104 timed out (120000
> ms)
> at
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:214)
> at
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
> at
> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90)
> at
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
> at
> org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
> at
> org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
> at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
> at java.util.concurrent.FutureTask.run(Unknown Source)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> at java.lang.Thread.run(Unknown Source)
>
>
>
> Thanks,
> Jeff
>
>
>
>
>
> This message (and any attachments) is intended only for the designated
> recipient(s). It
> may contain confidential or proprietary information, or have other
> limitations on use as
> indicated by the sender. If you are not a designated recipient, you may
> not review, use,
> copy or distribute this message. If you received this in error, please
> notify the sender by
> reply e-mail and delete this message.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>
>