Posted to user@mesos.apache.org by Vova Shelgunov <vv...@gmail.com> on 2017/01/23 16:05:20 UTC

Framework stops to receive the heartbeats and events and gets removed from master

Hi,

I have run into a very strange situation with my framework, which talks to
the Mesos master via the Scheduler HTTP API:

Sometimes my framework stops receiving heartbeats and task updates
from the master.
I read the *Network partitions* section of the Mesos documentation
(http://mesos.apache.org/documentation/latest/scheduler-http-api/)
and I see that if a framework does not receive heartbeats within some
time, it should reconnect to the master.

I have written a heartbeat monitor that checks whether any heartbeats
arrived in the last n seconds and reconnects if none did, but after the
reconnection I always receive an ERROR from the Mesos master saying that
my framework has been removed.
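
A heartbeat monitor of the kind described here can be sketched roughly as
follows. This is only an illustration, not the framework's actual code: the
class name HeartbeatMonitor, the injectable clock, and the tolerance of 5
missed heartbeat intervals are all assumptions made for the example.

```python
import time


class HeartbeatMonitor:
    """Tracks when the last event arrived on the subscription stream."""

    def __init__(self, heartbeat_interval_seconds, missed_intervals=5,
                 clock=time.monotonic):
        # heartbeat_interval_seconds is expected to come from the SUBSCRIBED
        # event. missed_intervals is an assumed tolerance, not a value taken
        # from this thread.
        self._timeout = heartbeat_interval_seconds * missed_intervals
        self._clock = clock
        self._last_event = clock()

    def record_event(self):
        # Call this for every event read from the stream, HEARTBEAT included.
        self._last_event = self._clock()

    def is_stale(self):
        # True once no event has arrived for longer than the timeout.
        return self._clock() - self._last_event > self._timeout
```

The reader loop would call record_event() for each event, and a periodic
check of is_stale() would decide when to close the connection and
resubscribe.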

Why is this happening?

Regards,
Uladzimir

Re: Framework stops to receive the heartbeats and events and gets removed from master

Posted by Vinod Kone <vi...@mesosphere.io>.
Can you paste the logs of the master and the framework?

@vinodkone


Re: Framework stops to receive the heartbeats and events and gets removed from master

Posted by Vinod Kone <vi...@gmail.com>.
No problem. Glad you figured it out.

@vinodkone


Re: Framework stops to receive the heartbeats and events and gets removed from master

Posted by Vova Shelgunov <vv...@gmail.com>.
Yes, it works. Sorry for the trouble; the first time I looked at the
logs I did not notice that failover_timeout was zero.


Re: Framework stops to receive the heartbeats and events and gets removed from master

Posted by Vova Shelgunov <vv...@gmail.com>.
Logs from the Mesos master:

I0123 15:53:44.523613     7 http.cpp:391] HTTP POST for /master/api/v1/scheduler from 172.18.0.1:58864 with User-Agent='AHC/2.0'
I0123 15:53:44.524159     7 master.cpp:4827] Processing ACKNOWLEDGE call ac9a6e5e-67b3-490a-930f-0024eab734b4 for task 10336 of framework 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP Framework) on agent 16c100c1-13fe-47b8-a2a0-aed9bafbbf8c-S0
I0123 15:53:44.524849     7 master.cpp:7744] Removing task 10336 with resources cpus(*):0.1; mem(*):32 of framework 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 on agent 16c100c1-13fe-47b8-a2a0-aed9bafbbf8c-S0 at slave(1)@172.18.0.3:5051 (mesos-slave)
I0123 15:53:44.529033     7 master.cpp:1297] Framework 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP Framework) disconnected
I0123 15:53:44.529636     7 master.cpp:2902] Disconnecting framework 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP Framework)
I0123 15:53:44.529974     7 master.cpp:2926] Deactivating framework 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP Framework)
I0123 15:53:44.530299     7 master.cpp:1310] Giving framework 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP Framework) 0ns to failover
I0123 15:53:44.530594     7 hierarchical.cpp:386] Deactivated framework 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005
I0123 15:53:44.531962     7 master.cpp:6369] Framework failover timeout, removing framework 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP Framework)
I0123 15:53:44.534992     7 master.cpp:7103] Removing framework 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP Framework)
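
The telling entry in these logs is the "Giving framework ... 0ns to failover"
one. As a rough aid for spotting it, a small scan over master log lines might
look like the sketch below. The function name failover_grants and the regular
expression are illustrative assumptions based only on the message wording
seen in this thread, and the sketch assumes each log entry sits on a single
line (not hard-wrapped by a mail client).

```python
import re

# Matches the master log entry that reports how long a disconnected framework
# is given to fail over, e.g. "... (Test HTTP Framework) 0ns to failover".
FAILOVER_RE = re.compile(r"Giving framework (\S+) \((.+)\) (\S+) to failover")


def failover_grants(log_lines):
    """Yield (framework_id, framework_name, duration) for each matching line."""
    for line in log_lines:
        match = FAILOVER_RE.search(line)
        if match:
            yield match.group(1), match.group(2), match.group(3)


# Two entries copied from the logs in this thread.
sample = [
    "I0123 15:53:44.529974     7 master.cpp:2926] Deactivating framework "
    "3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP Framework)",
    "I0123 15:53:44.530299     7 master.cpp:1310] Giving framework "
    "3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP Framework) "
    "0ns to failover",
]

for framework_id, name, duration in failover_grants(sample):
    print(framework_id, name, duration)
```

A duration of 0ns means the master removes the framework as soon as it
disconnects, which matches the removal entries that follow.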

It seems the failover timeout is set to zero for the framework.

This may be a coding error on my side that shows up when the framework loses
its connection to the master multiple times (I see that I do not pass the
failover_timeout value during reconnection).
I will check whether passing it solves my issue.
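
For anyone hitting the same problem, the fix amounts to sending
failover_timeout in the framework info on every subscription, including
resubscriptions. A minimal sketch of building such a SUBSCRIBE body follows.
The helper name build_subscribe_call and the example values (user "root",
3600 seconds) are assumptions for illustration; the field layout follows the
SUBSCRIBE examples in the Scheduler HTTP API documentation as I understand
them, so check it against the Mesos version in use.

```python
import json


def build_subscribe_call(user, name, failover_timeout, framework_id=None):
    """Build the JSON body of a SUBSCRIBE call for the v1 scheduler API."""
    framework_info = {
        "user": user,
        "name": name,
        # Seconds the master waits for the scheduler to come back before
        # removing the framework. It is included on every call here, so a
        # resubscription does not fall back to the default of zero.
        "failover_timeout": float(failover_timeout),
    }
    call = {"type": "SUBSCRIBE", "subscribe": {"framework_info": framework_info}}
    if framework_id is not None:
        # Resubscription: identify the already registered framework.
        framework_info["id"] = {"value": framework_id}
        call["framework_id"] = {"value": framework_id}
    return call


# Placeholder values; only the framework name and ID come from this thread.
first = build_subscribe_call("root", "Test HTTP Framework", 3600)
again = build_subscribe_call(
    "root", "Test HTTP Framework", 3600,
    framework_id="3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005",
)
print(json.dumps(again, indent=2))
```

Building both the initial and the repeated call through the same helper is
one way to avoid the mistake described above, where the reconnection path
silently dropped the timeout.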

Thanks
