Posted to user@spark.apache.org by Yana Kadiyska <ya...@gmail.com> on 2014/06/13 14:57:47 UTC

Master not seeing recovered nodes("Got heartbeat from unregistered worker ....")

Hi, I see this has been asked before but has not gotten any satisfactory
answer so I'll try again:

(here is the original thread I found:
http://mail-archives.apache.org/mod_mbox/spark-user/201403.mbox/%3C1394044078706-2312.post@n3.nabble.com%3E
)

I have a set of workers dying and coming back again. The master prints the
following warning:

"Got heartbeat from unregistered worker ...."

What is the solution to this -- rolling the master is very undesirable to
me as I have a Shark context sitting on top of it (it's meant to be highly
available).

Insights appreciated -- I don't think an executor going down is very
unexpected but it does seem odd that it won't be able to rejoin the working
set.

I'm running Spark 0.9.1 on CDH

Re: Master not seeing recovered nodes("Got heartbeat from unregistered worker ....")

Posted by Piotr Kołaczkowski <pk...@datastax.com>.

We are having the same problem. We're running Spark 0.9.1 in standalone
mode, and on some heavy jobs workers become unresponsive and are marked
dead by the master, even though the worker process is still running. They
then never rejoin the cluster, and the cluster becomes essentially unusable
until we restart each worker.

We'd like to know:
1. Why can a worker become unresponsive? Are there any well-known config /
usage pitfalls that we could have fallen into? We're still investigating
the issue, but maybe there are some hints?
2. Is there an option to auto-recover a worker, e.g. automatically start a
new one if the old one failed? Or at least some hooks to implement
functionality like that?

Thanks,
Piotr


2014-06-13 22:58 GMT+02:00 Gino Bustelo <gi...@bustelos.com>:

> I get the same problem, but I'm running in a dev environment based on
> docker scripts. The additional issue is that the worker processes do not
> die and so the docker container does not exit. So I end up with worker
> containers that are not participating in the cluster.
>
>
> On Fri, Jun 13, 2014 at 9:44 AM, Mayur Rustagi <ma...@gmail.com>
> wrote:
>
>> I have also had trouble with workers joining the working set. I have
>> typically moved to a Mesos-based setup. Frankly, for high availability you
>> are better off using a cluster manager.
>>
>> Mayur Rustagi
>> Ph: +1 (760) 203 3257
>> http://www.sigmoidanalytics.com
>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>
>>
>>
>> On Fri, Jun 13, 2014 at 8:57 AM, Yana Kadiyska <ya...@gmail.com>
>> wrote:
>>
>>> Hi, I see this has been asked before but has not gotten any satisfactory
>>> answer so I'll try again:
>>>
>>> (here is the original thread I found:
>>> http://mail-archives.apache.org/mod_mbox/spark-user/201403.mbox/%3C1394044078706-2312.post@n3.nabble.com%3E
>>> )
>>>
>>> I have a set of workers dying and coming back again. The master prints
>>> the following warning:
>>>
>>> "Got heartbeat from unregistered worker ...."
>>>
>>> What is the solution to this -- rolling the master is very undesirable
>>> to me as I have a Shark context sitting on top of it (it's meant to be
>>> highly available).
>>>
>>> Insights appreciated -- I don't think an executor going down is very
>>> unexpected but it does seem odd that it won't be able to rejoin the working
>>> set.
>>>
>>> I'm running Spark 0.9.1 on CDH
>>>
>>>
>>>
>>
>


-- 
Piotr Kolaczkowski, Lead Software Engineer
pkolaczk@datastax.com

http://www.datastax.com/
777 Mariners Island Blvd., Suite 510
San Mateo, CA 94404

Re: Master not seeing recovered nodes("Got heartbeat from unregistered worker ....")

Posted by Gino Bustelo <gi...@bustelos.com>.

I get the same problem, but I'm running in a dev environment based on
docker scripts. The additional issue is that the worker processes do not
die and so the docker container does not exit. So I end up with worker
containers that are not participating in the cluster.



Re: Master not seeing recovered nodes("Got heartbeat from unregistered worker ....")

Posted by Mayur Rustagi <ma...@gmail.com>.

I have also had trouble with workers joining the working set. I have
typically moved to a Mesos-based setup. Frankly, for high availability you
are better off using a cluster manager.

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>


