Posted to user@spark.apache.org by twinkle sachdeva <tw...@gmail.com> on 2015/03/03 11:08:25 UTC

Re: One of the executors not getting StopExecutor message

Hi,

The operations are not very intensive, and this scenario is not always
reproducible. One of the executors starts behaving in this manner. For this
particular application, we are using 8 cores per executor, and in practice,
4 executors are launched on one machine.

This machine has a good configuration with respect to the number of cores.

To me it seems to be some Akka communication issue. If I try to take a
thread dump of the executor once it appears to be in trouble, the request
times out.

Could it be something related to spark.akka.threads?
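
In case it helps, this is how those Akka settings could be bumped; a
minimal sketch, assuming the Spark 1.2.x property names (the values are
placeholders, not recommendations):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: spark.akka.threads sizes the Akka actor thread pool
// (default 4), and spark.akka.timeout is the communication timeout
// in seconds (default 100).
val conf = new SparkConf()
  .setAppName("my-app")              // hypothetical application name
  .set("spark.akka.threads", "16")   // more actor threads on a many-core machine
  .set("spark.akka.timeout", "300")  // tolerate slower control-plane replies
val sc = new SparkContext(conf)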



On Fri, Feb 27, 2015 at 3:55 PM, Akhil Das <ak...@sigmoidanalytics.com>
wrote:

> Most likely, that particular executor is stuck in a GC pause. What
> operation are you performing? You can try increasing the parallelism if
> you see that only one executor is doing the work.
>
> Thanks
> Best Regards
>
> On Fri, Feb 27, 2015 at 11:39 AM, twinkle sachdeva <
> twinkle.sachdeva@gmail.com> wrote:
>
>> Hi,
>>
>> I am running a Spark application on YARN in cluster mode.
>> One of my executors appears to be in a hung state for a long time, and
>> finally gets killed by the driver.
>>
>> Unlike the other executors, it has not received the StopExecutor message
>> from the driver.
>>
>> Here are the logs at the end of this container (C_1):
>>
>> --------------------------------------------------------------------------------
>> 15/02/26 18:17:07 DEBUG storage.BlockManagerSlaveActor: Done removing
>> broadcast 36, response is 2
>> 15/02/26 18:17:07 DEBUG storage.BlockManagerSlaveActor: Sent response: 2
>> to Actor[akka.tcp://sparkDriver@TMO-DN73:37906/temp/$aB]
>> 15/02/26 18:17:09 DEBUG ipc.Client: IPC Client (1206963429) connection to
>> TMO-GCR70/192.168.162.70:9000 from admin: closed
>> 15/02/26 18:17:09 DEBUG ipc.Client: IPC Client (1206963429) connection to
>> TMO-GCR70/192.168.162.70:9000 from admin: stopped, remaining connections
>> 0
>> 15/02/26 18:17:32 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for []
>> with renew id 1 executed
>> 15/02/26 18:18:00 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for []
>> with renew id 1 expired
>> 15/02/26 18:18:00 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for []
>> with renew id 1 exited
>> 15/02/26 20:33:13 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED
>> SIGNAL 15: SIGTERM
>>
>> NOTE that it has no logs for more than 2 hours.
>>
>> Here are the logs at the end of a normal container (C_2):
>>
>> ------------------------------------------------------------------------------------
>> 15/02/26 20:33:09 DEBUG storage.BlockManagerSlaveActor: Sent response: 2
>> to Actor[akka.tcp://sparkDriver@TMO-DN73:37906/temp/$D+b]
>> 15/02/26 20:33:10 DEBUG executor.CoarseGrainedExecutorBackend: [actor]
>> received message StopExecutor from Actor[akka.tcp://sparkDriver@TMO-DN73
>> :37906/user/CoarseGrainedScheduler#160899257]
>> 15/02/26 20:33:10 INFO executor.CoarseGrainedExecutorBackend: Driver
>> commanded a shutdown
>> 15/02/26 20:33:10 INFO storage.MemoryStore: MemoryStore cleared
>> 15/02/26 20:33:10 INFO storage.BlockManager: BlockManager stopped
>> 15/02/26 20:33:10 DEBUG executor.CoarseGrainedExecutorBackend: [actor] handled
>> message (181.499835 ms) StopExecutor from
>> Actor[akka.tcp://sparkDriver@TMO-DN73
>> :37906/user/CoarseGrainedScheduler#160899257]
>> 15/02/26 20:33:10 INFO remote.RemoteActorRefProvider$RemotingTerminator:
>> Shutting down remote daemon.
>> 15/02/26 20:33:10 INFO remote.RemoteActorRefProvider$RemotingTerminator:
>> Remote daemon shut down; proceeding with flushing remote transports.
>> 15/02/26 20:33:10 INFO remote.RemoteActorRefProvider$RemotingTerminator:
>> Remoting shut down.
>> 15/02/26 20:33:10 DEBUG ipc.Client: stopping client from cache:
>> org.apache.hadoop.ipc.Client@76a68bd4
>> 15/02/26 20:33:10 DEBUG ipc.Client: stopping client from cache:
>> org.apache.hadoop.ipc.Client@76a68bd4
>> 15/02/26 20:33:10 DEBUG ipc.Client: removing client from cache:
>> org.apache.hadoop.ipc.Client@76a68bd4
>> 15/02/26 20:33:10 DEBUG ipc.Client: stopping actual client because no
>> more references remain: org.apache.hadoop.ipc.Client@76a68bd4
>> 15/02/26 20:33:10 DEBUG ipc.Client: Stopping client
>> 15/02/26 20:33:10 DEBUG storage.DiskBlockManager: Shutdown hook called
>> 15/02/26 20:33:10 DEBUG util.Utils: Shutdown hook called
>>
>> On the driver side, I can see the logs related to heartbeat messages from
>> C_1 until 20:05:00:
>>
>> ------------------------------------------------------------------------------------------
>> 15/02/26 20:05:00 DEBUG spark.HeartbeatReceiver: [actor] received message
>> Heartbeat(7,[Lscala.Tuple2;@151e5ce6,BlockManagerId(7, TMO-DN73, 34106))
>> from Actor[akka.tcp://sparkExecutor@TMO-DN73:43671/temp/$fn]
>>
>> After this, it continues to receive heartbeats from the other executors,
>> but not from this one, and here is the message responsible for its SIGTERM:
>>
>>
>> ------------------------------------------------------------------------------------------------------------
>>
>> 15/02/26 20:06:20 WARN storage.BlockManagerMasterActor: Removing
>> BlockManager BlockManagerId(7, TMO-DN73, 34106) with no recent heart beats:
>> 80515ms exceeds 45000ms
>>
>>
>> I am using Spark 1.2.1.
>>
>> Any pointers?
>>
>>
>> Thanks,
>>
>> Twinkle
>>
>
>
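
As a side note, the 45000ms in the driver warning quoted above is the
BlockManager heartbeat limit, which is configurable. A minimal sketch,
assuming the Spark 1.2.x property names (the values are placeholders):

import org.apache.spark.SparkConf

// Sketch only: raise the limit so a slow (but still alive) executor's
// BlockManager is not removed after 45 s; names assumed from Spark 1.2.x.
val conf = new SparkConf()
  .set("spark.storage.blockManagerSlaveTimeoutMs", "300000")    // 5 minutes instead of 45 s
  .set("spark.storage.blockManagerTimeoutIntervalMs", "60000")  // master's check interval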

Re: One of the executors not getting StopExecutor message

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
Not quite sure, but you can try increasing spark.akka.threads; it is most
likely a YARN-related issue.
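
If it instead turns out to be GC pressure from a few oversized partitions,
the parallelism suggestion from the earlier mail can be tried along these
lines (a sketch only; the input path and partition count are made up):

// Sketch: spread the same work across more, smaller tasks so no single
// executor carries an oversized partition.
val input = sc.textFile("hdfs:///path/to/input")   // hypothetical input
val widened = input.repartition(200)               // more partitions, smaller tasks
// or, as a default for shuffle operations:
// conf.set("spark.default.parallelism", "200")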

Thanks
Best Regards
