Posted to issues@spark.apache.org by "Andrew Or (JIRA)" <ji...@apache.org> on 2014/10/17 03:59:33 UTC

[jira] [Closed] (SPARK-3923) All Standalone Mode services time out with each other

     [ https://issues.apache.org/jira/browse/SPARK-3923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or closed SPARK-3923.
----------------------------
          Resolution: Fixed
       Fix Version/s: 1.2.0
            Assignee: Aaron Davidson
    Target Version/s: 1.2.0

> All Standalone Mode services time out with each other
> -----------------------------------------------------
>
>                 Key: SPARK-3923
>                 URL: https://issues.apache.org/jira/browse/SPARK-3923
>             Project: Spark
>          Issue Type: Bug
>          Components: Deploy
>    Affects Versions: 1.2.0
>            Reporter: Aaron Davidson
>            Assignee: Aaron Davidson
>            Priority: Blocker
>             Fix For: 1.2.0
>
>
> I'm seeing an issue where the components in Standalone Mode (Worker, Master, Driver, and Executor) all time out with each other after around 1000 seconds. Here is an example log:
> {code}
> 14/10/13 06:43:55 INFO Master: Registering worker ip-10-0-147-189.us-west-2.compute.internal:38922 with 4 cores, 29.0 GB RAM
> 14/10/13 06:43:55 INFO Master: Registering worker ip-10-0-175-214.us-west-2.compute.internal:42918 with 4 cores, 59.0 GB RAM
> 14/10/13 06:43:56 INFO Master: Registering app Databricks Shell
> 14/10/13 06:43:56 INFO Master: Registered app Databricks Shell with ID app-20141013064356-0000
> ... precisely 1000 seconds later ...
> 14/10/13 07:00:35 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkWorker@ip-10-0-147-189.us-west-2.compute.internal:38922] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
> 14/10/13 07:00:35 INFO Master: akka.tcp://sparkWorker@ip-10-0-147-189.us-west-2.compute.internal:38922 got disassociated, removing it.
> 14/10/13 07:00:35 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.147.189%3A54956-1#1529980245] was not delivered. [2] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
> 14/10/13 07:00:35 INFO Master: akka.tcp://sparkWorker@ip-10-0-175-214.us-west-2.compute.internal:42918 got disassociated, removing it.
> 14/10/13 07:00:35 INFO Master: Removing worker worker-20141013064354-ip-10-0-175-214.us-west-2.compute.internal-42918 on ip-10-0-175-214.us-west-2.compute.internal:42918
> 14/10/13 07:00:35 INFO Master: Telling app of lost executor: 1
> 14/10/13 07:00:35 INFO Master: akka.tcp://sparkWorker@ip-10-0-175-214.us-west-2.compute.internal:42918 got disassociated, removing it.
> 14/10/13 07:00:35 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkWorker@ip-10-0-175-214.us-west-2.compute.internal:42918] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
> 14/10/13 07:00:35 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.214%3A35958-2#314633324] was not delivered. [3] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
> 14/10/13 07:00:35 INFO LocalActorRef: Message [akka.remote.transport.AssociationHandle$Disassociated] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.214%3A35958-2#314633324] was not delivered. [4] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
> 14/10/13 07:00:36 INFO ProtocolStateActor: No response from remote. Handshake timed out or transport failure detector triggered.
> 14/10/13 07:00:36 INFO Master: akka.tcp://sparkDriver@ip-10-0-175-215.us-west-2.compute.internal:58259 got disassociated, removing it.
> 14/10/13 07:00:36 INFO LocalActorRef: Message [akka.remote.transport.AssociationHandle$InboundPayload] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.215%3A41987-3#1944377249] was not delivered. [5] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
> 14/10/13 07:00:36 INFO Master: Removing app app-20141013064356-0000
> 14/10/13 07:00:36 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@ip-10-0-175-215.us-west-2.compute.internal:58259] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
> 14/10/13 07:00:36 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.215%3A41987-3#1944377249] was not delivered. [6] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
> 14/10/13 07:00:36 INFO LocalActorRef: Message [akka.remote.transport.AssociationHandle$Disassociated] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.215%3A41987-3#1944377249] was not delivered. [7] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
> 14/10/13 07:00:36 INFO Master: akka.tcp://sparkDriver@ip-10-0-175-215.us-west-2.compute.internal:58259 got disassociated, removing it.
> {code}
> Note that the driver and master are running on the same machine, and there was no load to speak of at the time (so no GC pauses). Also, everything disconnecting exactly 1000 seconds after the initial connection is pretty suspicious.
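
The 1000-second gap can be checked directly from the timestamps in the log above. The following is a minimal sketch, not part of Spark: the helper names parse_log_time and seconds_between are hypothetical, and it assumes every log line starts with a 17-character "yy/MM/dd HH:mm:ss" timestamp, as the quoted lines do.

```python
from datetime import datetime

# Assumed timestamp layout of the quoted log lines: yy/MM/dd HH:mm:ss
LOG_TIME_FORMAT = "%y/%m/%d %H:%M:%S"
TIMESTAMP_LENGTH = 17  # len("14/10/13 06:43:55")


def parse_log_time(line: str) -> datetime:
    """Parse the leading timestamp of a single log line (hypothetical helper)."""
    return datetime.strptime(line[:TIMESTAMP_LENGTH], LOG_TIME_FORMAT)


def seconds_between(first_line: str, second_line: str) -> float:
    """Return the elapsed seconds between two log lines (hypothetical helper)."""
    delta = parse_log_time(second_line) - parse_log_time(first_line)
    return delta.total_seconds()


# Lines copied from the log above: worker registration and its later removal.
worker_registered = (
    "14/10/13 06:43:55 INFO Master: Registering worker "
    "ip-10-0-147-189.us-west-2.compute.internal:38922 with 4 cores, 29.0 GB RAM"
)
worker_lost = (
    "14/10/13 07:00:35 INFO Master: akka.tcp://sparkWorker@"
    "ip-10-0-147-189.us-west-2.compute.internal:38922 got disassociated, removing it."
)

# Lines copied from the log above: app registration and its later removal.
app_registered = "14/10/13 06:43:56 INFO Master: Registering app Databricks Shell"
app_removed = "14/10/13 07:00:36 INFO Master: Removing app app-20141013064356-0000"

print(seconds_between(worker_registered, worker_lost))  # 1000.0
print(seconds_between(app_registered, app_removed))     # 1000.0
```

Both the worker and the app are dropped 1000 seconds after they registered, which matches the observation in the report that the disconnects track time since initial connection rather than load.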



