Posted to dev@spark.apache.org by Debasish Das <de...@gmail.com> on 2014/08/20 05:51:39 UTC

Lost executor on YARN ALS iterations

Hi,

During the 4th ALS iteration, I am noticing that one of the executors gets
disconnected:

14/08/19 23:40:00 ERROR network.ConnectionManager: Corresponding
SendingConnectionManagerId not found

14/08/19 23:40:00 INFO cluster.YarnClientSchedulerBackend: Executor 5
disconnected, so removing it

14/08/19 23:40:00 ERROR cluster.YarnClientClusterScheduler: Lost executor 5
on tblpmidn42adv-hdp.tdc.vzwcorp.com: remote Akka client disassociated

14/08/19 23:40:00 INFO scheduler.DAGScheduler: Executor lost: 5 (epoch 12)
Any idea if this is a bug related to Akka on YARN?

I am using master.

Thanks.
Deb

Re: Lost executor on YARN ALS iterations

Posted by Nishkam Ravi <nr...@cloudera.com>.
Can someone from Databricks test and commit this PR? It is not a complete
solution, but it would provide some relief.
https://github.com/apache/spark/pull/1391

Thanks,
Nishkam



Re: Lost executor on YARN ALS iterations

Posted by Debasish Das <de...@gmail.com>.
Sandy,

I put spark.yarn.executor.memoryOverhead 1024 in spark-defaults.conf, but I
don't see the property under Spark Properties on the web UI's Environment
tab.

Does it need to go in spark-env.sh?

Thanks.
Deb
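
For reference, a minimal sketch of the two usual places this property can
go (the 1024 value is just an example; the unit is megabytes):

```
# spark-defaults.conf: whitespace-separated key and value, one per line
spark.yarn.executor.memoryOverhead   1024

# Or per job at submit time, without touching any config file:
spark-submit --conf spark.yarn.executor.memoryOverhead=1024 ...
```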



Re: Lost executor on YARN ALS iterations

Posted by Sandy Ryza <sa...@cloudera.com>.
Hi Debasish,

The fix is to raise spark.yarn.executor.memoryOverhead until this goes
away.  This controls the buffer between the JVM heap size and the amount of
memory requested from YARN (JVMs can take up memory beyond their heap
size). You should also make sure that, in the YARN NodeManager
configuration, yarn.nodemanager.vmem-check-enabled is set to false.

-Sandy
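
A sketch of the NodeManager side of this advice, assuming a stock Hadoop 2.x
yarn-site.xml on each node (a NodeManager restart is needed to pick it up):

```xml
<!-- yarn-site.xml: disable YARN's virtual-memory check so containers are
     not killed for exceeding the vmem limit while still within physical
     memory. -->
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
```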



Re: Lost executor on YARN ALS iterations

Posted by Debasish Das <de...@gmail.com>.
I could reproduce the issue in both 1.0 and 1.1 using YARN, so this is
definitely a YARN-related problem...

At least for me, right now the only possible deployment option is
standalone...




Re: Lost executor on YARN ALS iterations

Posted by Sandy Ryza <sa...@cloudera.com>.
That's right.


Re: Lost executor on YARN ALS iterations

Posted by Debasish Das <de...@gmail.com>.
Last time it did not show up on the Environment tab, but I will give it
another shot... The expected behavior is that this setting will show up
there, right?


Re: Lost executor on YARN ALS iterations

Posted by Sandy Ryza <sa...@cloudera.com>.
I would expect 2 GB to be enough, or more than enough, for 16 GB executors
(unless ALS is using a bunch of off-heap memory?). You mentioned earlier
in this thread that the property wasn't showing up in the Environment tab.
Are you sure it's making it in?

-Sandy


Re: Lost executor on YARN ALS iterations

Posted by Debasish Das <de...@gmail.com>.
Hmm... I did try increasing it to a few GB but did not get a successful run
yet...

Any idea, if I am using say 40 executors, each running 16 GB, what's the
typical spark.yarn.executor.memoryOverhead for, say, 100M x 10M matrices
with a few billion ratings?


Re: Lost executor on YARN ALS iterations

Posted by Sandy Ryza <sa...@cloudera.com>.
Hi Deb,

The current state of the art is to increase
spark.yarn.executor.memoryOverhead until the job stops failing.  We do have
plans to try to automatically scale this based on the amount of memory
requested, but it will still just be a heuristic.

-Sandy
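
As an illustration of what such an automatic heuristic might look like (the
7% factor and 384 MB floor below are assumptions that roughly mirror the
defaults Spark later shipped for this calculation, not anything promised in
this thread):

```python
# Illustrative only: a rough sizing heuristic for the YARN container
# request. Treat the factor and floor as assumptions; memory-hungry jobs
# such as ALS may need considerably more.

def yarn_memory_overhead_mb(executor_memory_mb, factor=0.07, floor_mb=384):
    """Suggested spark.yarn.executor.memoryOverhead in megabytes."""
    return max(int(executor_memory_mb * factor), floor_mb)

# For a 16 GB executor this suggests ~1146 MB of overhead; YARN would then
# be asked for the executor heap plus this amount.
print(yarn_memory_overhead_mb(16 * 1024))  # prints 1146
```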


Re: Lost executor on YARN ALS iterations

Posted by Debasish Das <de...@gmail.com>.
Hi Sandy,

Any resolution for the YARN failures? It's a blocker for running Spark on
top of YARN.

Thanks.
Deb


Re: Lost executor on YARN ALS iterations

Posted by Xiangrui Meng <me...@gmail.com>.
Hi Deb,

I think this may be the same issue as described in
https://issues.apache.org/jira/browse/SPARK-2121 . We know that the
container got killed by YARN because it used much more memory than it
requested. But we haven't figured out the root cause yet.

+Sandy

Best,
Xiangrui

