Posted to user@spark.apache.org by Thomas Gerber <th...@radius.com> on 2015/03/04 22:39:53 UTC

Driver disassociated

Hello,

Sometimes a job stops in the *middle* of its run (the master then shows its
status as FINISHED).

Nothing looks wrong in the shell/submit output.

The executor logs contain entries like this:

15/03/04 21:24:51 INFO MapOutputTrackerWorker: Doing the fetch; tracker actor = Actor[akka.tcp://sparkDriver@ip-10-0-10-17.ec2.internal:40019/user/MapOutputTracker#893807065]
15/03/04 21:24:51 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 38, fetching them
15/03/04 21:24:55 ERROR CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@ip-10-0-11-9.ec2.internal:54766] -> [akka.tcp://sparkDriver@ip-10-0-10-17.ec2.internal:40019] disassociated! Shutting down.
15/03/04 21:24:55 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@ip-10-0-10-17.ec2.internal:40019] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].

How can I investigate further?
Thanks

Re: Driver disassociated

Posted by Thomas Gerber <th...@radius.com>.

Thanks.
I was already setting those (spark.akka.heartbeat.pauses and
spark.akka.heartbeat.interval), and I confirmed in the Environment tab of the
UI that they were in use.

They were set to 10 times their default values: 60000 and 10000, respectively.

I'll start poking at spark.shuffle.io.retryWait.
Thanks!

On Wed, Mar 4, 2015 at 7:02 PM, Ted Yu <yu...@gmail.com> wrote:

> See this thread:
> https://groups.google.com/forum/#!topic/akka-user/X3xzpTCbEFs
>
> Here're the relevant config parameters in Spark:
>     val akkaHeartBeatPauses = conf.getInt("spark.akka.heartbeat.pauses", 6000)
>     val akkaHeartBeatInterval = conf.getInt("spark.akka.heartbeat.interval", 1000)
>
> Cheers
>
> On Wed, Mar 4, 2015 at 4:09 PM, Thomas Gerber <th...@radius.com>
> wrote:
>
>> Also,
>>
>> I was experiencing another problem which might be related:
>> "Error communicating with MapOutputTracker" (see email in the ML today).
>>
>> I just thought I would mention it in case it is relevant.
>>
>> On Wed, Mar 4, 2015 at 4:07 PM, Thomas Gerber <th...@radius.com>
>> wrote:
>>
>>> 1.2.1
>>>
>>> Also, I was using the following parameters, which are 10 times the
>>> default ones:
>>> spark.akka.timeout 1000
>>> spark.akka.heartbeat.pauses 60000
>>> spark.akka.failure-detector.threshold 3000.0
>>> spark.akka.heartbeat.interval 10000
>>>
>>> which should have helped *avoid* the problem if I understand correctly.
>>>
>>> Thanks,
>>> Thomas
>>>
>>> On Wed, Mar 4, 2015 at 3:21 PM, Ted Yu <yu...@gmail.com> wrote:
>>>
>>>> What release are you using ?
>>>>
>>>> SPARK-3923 went into 1.2.0 release.
>>>>
>>>> Cheers
>>>>
>>>> On Wed, Mar 4, 2015 at 1:39 PM, Thomas Gerber <thomas.gerber@radius.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> sometimes, in the *middle* of a job, the job stops (status is then
>>>>> seen as FINISHED in the master).
>>>>>
>>>>> There isn't anything wrong in the shell/submit output.
>>>>>
>>>>> When looking at the executor logs, I see logs like this:
>>>>>
>>>>> 15/03/04 21:24:51 INFO MapOutputTrackerWorker: Doing the fetch; tracker actor = Actor[akka.tcp://sparkDriver@ip-10-0-10-17.ec2.internal:40019/user/MapOutputTracker#893807065]
>>>>> 15/03/04 21:24:51 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 38, fetching them
>>>>> 15/03/04 21:24:55 ERROR CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@ip-10-0-11-9.ec2.internal:54766] -> [akka.tcp://sparkDriver@ip-10-0-10-17.ec2.internal:40019] disassociated! Shutting down.
>>>>> 15/03/04 21:24:55 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@ip-10-0-10-17.ec2.internal:40019] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
>>>>>
>>>>> How can I investigate further?
>>>>> Thanks
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Driver disassociated

Posted by Ted Yu <yu...@gmail.com>.

See this thread:
https://groups.google.com/forum/#!topic/akka-user/X3xzpTCbEFs

Here're the relevant config parameters in Spark:
    val akkaHeartBeatPauses = conf.getInt("spark.akka.heartbeat.pauses", 6000)
    val akkaHeartBeatInterval = conf.getInt("spark.akka.heartbeat.interval", 1000)

Cheers
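
For readers who want to see how an override interacts with the two lookups quoted above, here is a minimal sketch. It assumes spark-core 1.2.x is on the classpath; the object name `AkkaHeartbeatSettings` is made up for illustration, and the override values are the ones reported elsewhere in this thread, not recommendations.

```scala
import org.apache.spark.SparkConf

// Hypothetical standalone check, not part of Spark itself.
object AkkaHeartbeatSettings {
  def main(args: Array[String]): Unit = {
    // loadDefaults = false: skip spark.* Java system properties so that
    // only the values set explicitly below are visible.
    val conf = new SparkConf(false)
      .set("spark.akka.heartbeat.pauses", "60000")
      .set("spark.akka.heartbeat.interval", "10000")

    // Same lookups as in the snippet above; the second argument is the
    // fallback used only when the key has not been set.
    val akkaHeartBeatPauses = conf.getInt("spark.akka.heartbeat.pauses", 6000)
    val akkaHeartBeatInterval = conf.getInt("spark.akka.heartbeat.interval", 1000)

    println(s"pauses=$akkaHeartBeatPauses interval=$akkaHeartBeatInterval")
  }
}
```

If a key is left unset, `getInt` falls back to the default passed as its second argument, so comparing the printed values against the Environment tab of the UI is one way to confirm which settings a job actually picked up.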

Re: Driver disassociated

Posted by Thomas Gerber <th...@radius.com>.

Also,

I was also seeing another problem that might be related:
"Error communicating with MapOutputTracker" (see my email to the mailing list today).

I just thought I would mention it in case it is relevant.

Re: Driver disassociated

Posted by Thomas Gerber <th...@radius.com>.

1.2.1

Also, I was using the following parameters, which are 10 times the default
ones:
spark.akka.timeout 1000
spark.akka.heartbeat.pauses 60000
spark.akka.failure-detector.threshold 3000.0
spark.akka.heartbeat.interval 10000

which should have helped *avoid* the problem if I understand correctly.

Thanks,
Thomas

Re: Driver disassociated

Posted by Ted Yu <yu...@gmail.com>.

What release are you using?

SPARK-3923 went into the 1.2.0 release.

Cheers
