Posted to issues@spark.apache.org by "Luc Bourlier (JIRA)" <ji...@apache.org> on 2016/04/22 15:56:12 UTC

[jira] [Commented] (SPARK-14849) shuffle broken when accessing standalone cluster through NAT

    [ https://issues.apache.org/jira/browse/SPARK-14849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253968#comment-15253968 ] 

Luc Bourlier commented on SPARK-14849:
--------------------------------------

I have dug into the problem. It occurs during the registration of the executor with the driver (spark-shell).

Because it is running in standalone mode (I assume), the executor doesn't have its own Netty instance, but uses the worker's. So in [NettyRpcEnv|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala#L125] the address is set to {{null}}.
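
For reference, the relevant logic in NettyRpcEnv looks roughly like this (paraphrased from the linked line, not a verbatim copy):
{code}
// Paraphrased from NettyRpcEnv.scala: an advertised address only exists
// when this RpcEnv started its own server. An executor reusing the
// worker's Netty instance has server == null, so its address is null.
override lazy val address: RpcAddress = {
  if (server != null) RpcAddress(host, server.getPort()) else null
}
{code}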
This information (or lack thereof) is sent to the driver in the registration message. The driver (in [CoarseGrainedSchedulerBackend|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L158]), seeing that the information is missing, attempts to guess the IP address of the worker by looking at the TCP/IP connection that is up, and in this case picks the external IP address of the machine acting as router.
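
The driver-side fallback is roughly this (again paraphrased, not verbatim):
{code}
// Paraphrased from CoarseGrainedSchedulerBackend: if the executor did not
// advertise an address, fall back to the remote end of the TCP connection.
// Behind NAT, senderAddress is the router's external IP, not the worker's.
val executorAddress = if (executorRef.address != null) {
  executorRef.address
} else {
  context.senderAddress
}
{code}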
This bad information is then sent back to the worker in the registration confirmation message, and used by the worker as its 'external' IP address.

Later in the execution, the worker needs to share information about the blocks it holds, and uses the bad IP address in the BlockManagerIds. These BlockManagerIds are then unusable by the rest of the system.
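
Concretely, the driver ends up with a block manager ID equivalent to this (illustrative; the values are taken from the FetchFailed reported below):
{code}
import org.apache.spark.storage.BlockManagerId

// The host here should be a worker-side IP, but it is the NAT address,
// so every shuffle fetch targeting this block manager fails.
val badId = BlockManagerId("0", "10.110.101.1", 42842)
{code}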

I'll push a PR with a fix shortly. The executor should always send its 'public' address, and the driver should not try to derive an address just by looking at the other side of a TCP connection; that can easily be wrong.
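
Hypothetically, the shape of that fix could be something like the following. This is only a sketch of the idea, not the actual patch:
{code}
// Hypothetical sketch only: the registration message always carries the
// executor's own public host, and the driver trusts it instead of
// guessing from context.senderAddress.
case class RegisterExecutor(
    executorId: String,
    hostPort: String,  // always filled in by the executor, never null
    cores: Int)
{code}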

> shuffle broken when accessing standalone cluster through NAT
> ------------------------------------------------------------
>
>                 Key: SPARK-14849
>                 URL: https://issues.apache.org/jira/browse/SPARK-14849
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.1
>            Reporter: Luc Bourlier
>              Labels: nat, network
>
> I have the following network configuration:
> {code}
>              +--------------------+
>              |                    |
>              |  spark-shell       |
>              |                    |
>              +- ip: 10.110.101.2 -+
>                        |
>                        |
>              +- ip: 10.110.101.1 -+
>              |                    | NAT + routing
>              |  spark-master      | configured
>              |                    |
>              +- ip: 10.110.100.1 -+
>                        |
>           +------------------------+
>           |                        |
> +- ip: 10.110.100.2 -+    +- ip: 10.110.100.3 -+
> |                    |    |                    |
> |  spark-worker 1    |    |  spark-worker 2    |
> |                    |    |                    |
> +--------------------+    +--------------------+
> {code}
> I have NAT, DNS, and routing correctly configured, such that each machine can communicate with the others.
> Launching spark-shell against the cluster works well. Simple map operations work too:
> {code}
> scala> sc.makeRDD(1 to 5).map(_ * 5).collect
> res0: Array[Int] = Array(5, 10, 15, 20, 25)
> {code}
> But operations requiring shuffling fail:
> {code}
> scala> sc.makeRDD(1 to 5).map(i => (i,1)).reduceByKey(_ + _).collect
> 16/04/22 15:33:17 WARN TaskSetManager: Lost task 4.0 in stage 2.0 (TID 19, 10.110.101.1): FetchFailed(BlockManagerId(0, 10.110.101.1, 42842), shuffleId=0, mapId=6, reduceId=4, message=
> org.apache.spark.shuffle.FetchFailedException: Failed to connect to /10.110.101.1:42842
> 	at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323)
> [ ... ]
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to connect to /10.110.101.1:42842
> 	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
> [ ... ]
> 	at org.apache.spark.network.shuffle.RetryingBlockFetcher.access
> [ ... ]
> {code}
> It makes sense that a connection to 10.110.101.1:42842 would fail; no part of the system should have direct knowledge of the IP address 10.110.101.1.
> So some part of the system is wrongly discovering this IP address.
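>
> One way to observe the bad address from the driver is {{sc.getExecutorMemoryStatus}}, whose result map is keyed by block manager {{host:port}}. In this setup it lists the NAT address from the FetchFailed above (other entries elided):
> {code}
> scala> sc.getExecutorMemoryStatus.keys.foreach(println)
> 10.110.101.1:42842
> ...
> {code}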


