Posted to issues@spark.apache.org by "Aaron Davidson (JIRA)" <ji...@apache.org> on 2014/10/13 09:30:34 UTC

[jira] [Created] (SPARK-3923) All Standalone Mode services time out with each other

Aaron Davidson created SPARK-3923:
-------------------------------------

             Summary: All Standalone Mode services time out with each other
                 Key: SPARK-3923
                 URL: https://issues.apache.org/jira/browse/SPARK-3923
             Project: Spark
          Issue Type: Bug
          Components: Deploy
    Affects Versions: 1.2.0
            Reporter: Aaron Davidson
            Priority: Blocker


I'm seeing an issue where components in Standalone Mode (Worker, Master, Driver, and Executor) all appear to time out with each other after around 1000 seconds. Here is an example log:

{code}
14/10/13 06:43:55 INFO Master: Registering worker ip-10-0-147-189.us-west-2.compute.internal:38922 with 4 cores, 29.0 GB RAM
14/10/13 06:43:55 INFO Master: Registering worker ip-10-0-175-214.us-west-2.compute.internal:42918 with 4 cores, 59.0 GB RAM
14/10/13 06:43:56 INFO Master: Registering app Databricks Shell
14/10/13 06:43:56 INFO Master: Registered app Databricks Shell with ID app-20141013064356-0000

... precisely 1000 seconds later ...

14/10/13 07:00:35 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkWorker@ip-10-0-147-189.us-west-2.compute.internal:38922] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
14/10/13 07:00:35 INFO Master: akka.tcp://sparkWorker@ip-10-0-147-189.us-west-2.compute.internal:38922 got disassociated, removing it.
14/10/13 07:00:35 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.147.189%3A54956-1#1529980245] was not delivered. [2] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/10/13 07:00:35 INFO Master: akka.tcp://sparkWorker@ip-10-0-175-214.us-west-2.compute.internal:42918 got disassociated, removing it.
14/10/13 07:00:35 INFO Master: Removing worker worker-20141013064354-ip-10-0-175-214.us-west-2.compute.internal-42918 on ip-10-0-175-214.us-west-2.compute.internal:42918
14/10/13 07:00:35 INFO Master: Telling app of lost executor: 1
14/10/13 07:00:35 INFO Master: akka.tcp://sparkWorker@ip-10-0-175-214.us-west-2.compute.internal:42918 got disassociated, removing it.
14/10/13 07:00:35 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkWorker@ip-10-0-175-214.us-west-2.compute.internal:42918] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
14/10/13 07:00:35 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.214%3A35958-2#314633324] was not delivered. [3] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/10/13 07:00:35 INFO LocalActorRef: Message [akka.remote.transport.AssociationHandle$Disassociated] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.214%3A35958-2#314633324] was not delivered. [4] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/10/13 07:00:36 INFO ProtocolStateActor: No response from remote. Handshake timed out or transport failure detector triggered.
14/10/13 07:00:36 INFO Master: akka.tcp://sparkDriver@ip-10-0-175-215.us-west-2.compute.internal:58259 got disassociated, removing it.
14/10/13 07:00:36 INFO LocalActorRef: Message [akka.remote.transport.AssociationHandle$InboundPayload] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.215%3A41987-3#1944377249] was not delivered. [5] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/10/13 07:00:36 INFO Master: Removing app app-20141013064356-0000
14/10/13 07:00:36 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@ip-10-0-175-215.us-west-2.compute.internal:58259] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
14/10/13 07:00:36 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.215%3A41987-3#1944377249] was not delivered. [6] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/10/13 07:00:36 INFO LocalActorRef: Message [akka.remote.transport.AssociationHandle$Disassociated] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.215%3A41987-3#1944377249] was not delivered. [7] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/10/13 07:00:36 INFO Master: akka.tcp://sparkDriver@ip-10-0-175-215.us-west-2.compute.internal:58259 got disassociated, removing it.
{code}

Note that the driver and master are running on the same machine, and there was no load to speak of at the time (so no GC pauses). Also, everything disconnecting exactly 1000 seconds after the initial connection is pretty suspicious.
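For anyone who wants to double-check the interval, here is a minimal sketch that recomputes it from the timestamps in the log above. The helper name seconds_between and the format constant are only illustrative (they are not part of Spark), and the sketch assumes the log timestamps are laid out as yy/MM/dd HH:mm:ss.

```python
from datetime import datetime

# Assumed layout of the timestamps in the log above: yy/MM/dd HH:mm:ss
LOG_TIME_FORMAT = "%y/%m/%d %H:%M:%S"


def seconds_between(first: str, second: str) -> float:
    """Return the number of seconds elapsed from `first` to `second`."""
    start = datetime.strptime(first, LOG_TIME_FORMAT)
    end = datetime.strptime(second, LOG_TIME_FORMAT)
    return (end - start).total_seconds()


# Worker registration and the first worker disassociation, copied from the log.
worker_gap = seconds_between("14/10/13 06:43:55", "14/10/13 07:00:35")

# App registration and the driver disassociation, copied from the log.
driver_gap = seconds_between("14/10/13 06:43:56", "14/10/13 07:00:36")

print(worker_gap, driver_gap)
```

Both gaps come out the same, which is what makes a fixed timeout look more likely than load or network trouble.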



