Posted to issues@spark.apache.org by "Andrew Ash (JIRA)" <ji...@apache.org> on 2014/11/14 10:43:35 UTC

[jira] [Commented] (SPARK-625) Client hangs when connecting to standalone cluster using wrong address

    [ https://issues.apache.org/jira/browse/SPARK-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212062#comment-14212062 ] 

Andrew Ash commented on SPARK-625:
----------------------------------

Spark is very sensitive to hostnames in Spark URLs, and that comes from Akka being very sensitive.  I've personally been bitten by hostnames vs FQDNs vs external IP address vs loopback IP address, and it's really a pain.
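
To illustrate the kind of strictness I mean, here is a rough Python sketch of a literal host/port comparison between the URL a client asks for and the URL the master advertises. It is not Akka's actual matching code; the function names {{master_authority}} and {{matches_advertised}} and the hostname {{master-host.example}} are made up for the example. The point is only that no DNS resolution happens, so two names for the same machine do not compare equal.

```python
from urllib.parse import urlsplit


def master_authority(spark_url):
    """Return (host, port) from a spark://host:port URL without resolving the host."""
    parts = urlsplit(spark_url)
    if parts.scheme != "spark" or parts.hostname is None or parts.port is None:
        raise ValueError(f"not a spark://host:port URL: {spark_url!r}")
    # urlsplit lowercases the hostname; nothing else is normalized.
    return parts.hostname, parts.port


def matches_advertised(requested_url, advertised_url):
    """Literal comparison (assumption: this mimics strict matching, not Akka's real code)."""
    return master_authority(requested_url) == master_authority(advertised_url)


if __name__ == "__main__":
    advertised = "spark://master-host.example:7077"  # hypothetical advertised URL
    for requested in ("spark://master-host.example:7077", "spark://127.0.0.1:7077"):
        print(requested, "->", matches_advertised(requested, advertised))
```

A check like this before launching a shell would at least flag a loopback address or short hostname that differs from what the master web UI shows.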

On the current master branch (1.2), with the Spark standalone master listening on {{spark://aash-mbp.local:7077}} (as confirmed by the master web UI) and the spark shell attempting to connect to {{spark://127.0.0.1:7077}} via the {{--master}} parameter, the driver makes 3 attempts and then fails with this message:

{noformat}
14/11/14 01:37:56 INFO AppClient$ClientActor: Connecting to master spark://127.0.0.1:7077...
14/11/14 01:37:56 WARN AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster@127.0.0.1:7077: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@127.0.0.1:7077
14/11/14 01:37:56 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster@127.0.0.1:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: /127.0.0.1:7077
14/11/14 01:38:16 INFO AppClient$ClientActor: Connecting to master spark://127.0.0.1:7077...
14/11/14 01:38:16 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster@127.0.0.1:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: /127.0.0.1:7077
14/11/14 01:38:16 WARN AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster@127.0.0.1:7077: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@127.0.0.1:7077
14/11/14 01:38:36 INFO AppClient$ClientActor: Connecting to master spark://127.0.0.1:7077...
14/11/14 01:38:36 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster@127.0.0.1:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: /127.0.0.1:7077
14/11/14 01:38:36 WARN AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster@127.0.0.1:7077: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@127.0.0.1:7077
14/11/14 01:38:56 ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
14/11/14 01:38:56 WARN SparkDeploySchedulerBackend: Application ID is not initialized yet.
14/11/14 01:38:56 ERROR TaskSchedulerImpl: Exiting due to error from cluster scheduler: All masters are unresponsive! Giving up.
{noformat}

So the hang seems to be gone, replaced by a reasonable three connection attempts followed by a clean failure.
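
For anyone wanting the same bounded behavior in their own client code, here is a minimal Python sketch of "try a fixed number of times, then give up". It is not Spark's {{AppClient}} implementation. The 3 attempts and the 20-second spacing are simply read off the timestamps in the log above, and {{connect_with_retries}} and its parameters are names I invented for the example.

```python
import time


def connect_with_retries(connect, attempts=3, interval_s=20.0, sleep=time.sleep):
    """Call connect() up to `attempts` times, pausing after each failure, then give up.

    `connect` is any zero-argument callable that raises OSError on failure.
    The defaults mirror the timing seen in the log; they are assumptions, not Spark settings.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return connect()
        except OSError as err:  # includes ConnectionRefusedError
            last_error = err
            sleep(interval_s)  # the log shows a pause after the last attempt too
    raise ConnectionError(f"all {attempts} attempts failed; giving up") from last_error


if __name__ == "__main__":
    import socket

    def dial():
        # Port 7077 on loopback, as in the log; fails fast if nothing is listening.
        return socket.create_connection(("127.0.0.1", 7077), timeout=5)

    try:
        connect_with_retries(dial, interval_s=1.0)
    except ConnectionError as err:
        print("gave up:", err)
```

The key property is that the caller always gets either a connection or an exception in bounded time, which is exactly what was missing in the original hang.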

[~joshrosen], short of changing Akka ourselves to make it less strict on exact URL matches, is there anything else we can do for this ticket?  I think we can reasonably close as fixed.

> Client hangs when connecting to standalone cluster using wrong address
> ----------------------------------------------------------------------
>
>                 Key: SPARK-625
>                 URL: https://issues.apache.org/jira/browse/SPARK-625
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 0.7.0, 0.7.1, 0.8.0
>            Reporter: Josh Rosen
>            Priority: Minor
>
> I launched a standalone cluster on my laptop, connecting the workers to the master using my machine's public IP address (128.32.*.*:7077).  If I try to connect spark-shell to the master using "spark://0.0.0.0:7077", it successfully brings up a Scala prompt but hangs when I try to run a job.
> From the standalone master's log, it looks like the client's messages are being dropped without the client discovering that the connection has failed:
> {code}
> 12/11/27 14:00:52 ERROR NettyRemoteTransport(null): dropping message RegisterJob(JobDescription(Spark shell)) for non-local recipient akka://spark@0.0.0.0:7077/user/Master at akka://spark@128.32.*.*:7077 local is akka://spark@128.32.*.*:7077
> 12/11/27 14:00:52 ERROR NettyRemoteTransport(null): dropping message DaemonMsgWatch(Actor[akka://spark@128.32.*.*:57518/user/$a],Actor[akka://spark@0.0.0.0:7077/user/Master]) for non-local recipient akka://spark@0.0.0.0:7077/remote at akka://spark@128.32.*.*:7077 local is akka://spark@128.32.*.*:7077
> {code}



