Posted to issues@spark.apache.org by "Aaron Davidson (JIRA)" <ji...@apache.org> on 2014/10/13 09:44:34 UTC

[jira] [Commented] (SPARK-3923) All Standalone Mode services time out with each other

    [ https://issues.apache.org/jira/browse/SPARK-3923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169044#comment-14169044 ] 

Aaron Davidson commented on SPARK-3923:
---------------------------------------

I did a little digging in the hope of finding a post about this, but had no particular luck. I did find [this post|https://groups.google.com/forum/#!topic/akka-user/X3xzpTCbEFs], which recommends using a heartbeat interval shorter than the pause (interval < pause), which we are not doing. That doesn't seem to explain why the services all time out after the heartbeat interval (currently 1000 seconds), but it may be good to know in the future.
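For reference, the "interval < pause" recommendation from that post can be written down as a small sanity check. This is only a sketch: the function name is made up for illustration, the 1000 second interval is the value mentioned above, and the 600 second pause is a hypothetical placeholder, not a value read from our configuration.

```python
def heartbeat_settings_ok(interval_s: float, pause_s: float) -> bool:
    """Return True when the heartbeat interval is strictly shorter than
    the acceptable heartbeat pause (the "interval < pause" rule)."""
    if interval_s <= 0 or pause_s <= 0:
        raise ValueError("interval and pause must be positive")
    return interval_s < pause_s


if __name__ == "__main__":
    # 1000 s is the interval mentioned in the comment; 600 s is a
    # hypothetical pause chosen only to illustrate a failing check.
    print(heartbeat_settings_ok(1000, 600))
    # A hypothetical shorter interval that satisfies the rule.
    print(heartbeat_settings_ok(10, 600))
```

With an interval longer than the pause, a peer could legitimately stay silent for longer than the failure detector tolerates, which is presumably why the post advises against it.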

> All Standalone Mode services time out with each other
> -----------------------------------------------------
>
>                 Key: SPARK-3923
>                 URL: https://issues.apache.org/jira/browse/SPARK-3923
>             Project: Spark
>          Issue Type: Bug
>          Components: Deploy
>    Affects Versions: 1.2.0
>            Reporter: Aaron Davidson
>            Priority: Blocker
>
> I'm seeing an issue where components in Standalone Mode (Worker, Master, Driver, and Executor) all seem to time out with each other after around 1000 seconds. Here is an example log:
> {code}
> 14/10/13 06:43:55 INFO Master: Registering worker ip-10-0-147-189.us-west-2.compute.internal:38922 with 4 cores, 29.0 GB RAM
> 14/10/13 06:43:55 INFO Master: Registering worker ip-10-0-175-214.us-west-2.compute.internal:42918 with 4 cores, 59.0 GB RAM
> 14/10/13 06:43:56 INFO Master: Registering app Databricks Shell
> 14/10/13 06:43:56 INFO Master: Registered app Databricks Shell with ID app-20141013064356-0000
> ... precisely 1000 seconds later ...
> 14/10/13 07:00:35 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkWorker@ip-10-0-147-189.us-west-2.compute.internal:38922] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
> 14/10/13 07:00:35 INFO Master: akka.tcp://sparkWorker@ip-10-0-147-189.us-west-2.compute.internal:38922 got disassociated, removing it.
> 14/10/13 07:00:35 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.147.189%3A54956-1#1529980245] was not delivered. [2] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
> 14/10/13 07:00:35 INFO Master: akka.tcp://sparkWorker@ip-10-0-175-214.us-west-2.compute.internal:42918 got disassociated, removing it.
> 14/10/13 07:00:35 INFO Master: Removing worker worker-20141013064354-ip-10-0-175-214.us-west-2.compute.internal-42918 on ip-10-0-175-214.us-west-2.compute.internal:42918
> 14/10/13 07:00:35 INFO Master: Telling app of lost executor: 1
> 14/10/13 07:00:35 INFO Master: akka.tcp://sparkWorker@ip-10-0-175-214.us-west-2.compute.internal:42918 got disassociated, removing it.
> 14/10/13 07:00:35 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkWorker@ip-10-0-175-214.us-west-2.compute.internal:42918] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
> 14/10/13 07:00:35 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.214%3A35958-2#314633324] was not delivered. [3] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
> 14/10/13 07:00:35 INFO LocalActorRef: Message [akka.remote.transport.AssociationHandle$Disassociated] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.214%3A35958-2#314633324] was not delivered. [4] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
> 14/10/13 07:00:36 INFO ProtocolStateActor: No response from remote. Handshake timed out or transport failure detector triggered.
> 14/10/13 07:00:36 INFO Master: akka.tcp://sparkDriver@ip-10-0-175-215.us-west-2.compute.internal:58259 got disassociated, removing it.
> 14/10/13 07:00:36 INFO LocalActorRef: Message [akka.remote.transport.AssociationHandle$InboundPayload] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.215%3A41987-3#1944377249] was not delivered. [5] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
> 14/10/13 07:00:36 INFO Master: Removing app app-20141013064356-0000
> 14/10/13 07:00:36 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@ip-10-0-175-215.us-west-2.compute.internal:58259] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
> 14/10/13 07:00:36 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.215%3A41987-3#1944377249] was not delivered. [6] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
> 14/10/13 07:00:36 INFO LocalActorRef: Message [akka.remote.transport.AssociationHandle$Disassociated] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.215%3A41987-3#1944377249] was not delivered. [7] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
> 14/10/13 07:00:36 INFO Master: akka.tcp://sparkDriver@ip-10-0-175-215.us-west-2.compute.internal:58259 got disassociated, removing it.
> {code}
> Note that the driver and master are running on the same machine, and there was no load to speak of at the time (so no GC pauses). Also, everything disconnecting exactly 1000 seconds after the initial connection is pretty suspicious.
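
The "exactly 1000 seconds" observation can be checked directly against the timestamps in the log excerpt. The sketch below assumes the log prefix format is `yy/MM/dd HH:mm:ss` (inferred from the excerpt) and uses a helper name invented for illustration.

```python
from datetime import datetime

# Assumed timestamp layout of the log prefix, e.g. "14/10/13 06:43:55".
LOG_TIME_FORMAT = "%y/%m/%d %H:%M:%S"


def seconds_between(start: str, end: str) -> float:
    """Return the elapsed seconds between two log timestamps."""
    t0 = datetime.strptime(start, LOG_TIME_FORMAT)
    t1 = datetime.strptime(end, LOG_TIME_FORMAT)
    return (t1 - t0).total_seconds()


if __name__ == "__main__":
    # Worker registration -> worker disassociation.
    print(seconds_between("14/10/13 06:43:55", "14/10/13 07:00:35"))  # 1000.0
    # App registration -> driver disassociation.
    print(seconds_between("14/10/13 06:43:56", "14/10/13 07:00:36"))  # 1000.0
```

Both the worker and the driver drop 1000 seconds after they registered, which matches the description above.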


