Posted to user@mesos.apache.org by Jeff Pollard <je...@gmail.com> on 2019/02/05 20:04:27 UTC

Check failed: reservationScalarQuantities.contains(role)

We recently upgraded our Mesos cluster from version 1.3 to 1.5, and since
then have been getting periodic master crashes due to this error:

Feb  5 15:53:57 ip-10-0-16-140 mesos-master[8414]: F0205 15:53:57.385118
8434 hierarchical.cpp:2630] Check failed:
reservationScalarQuantities.contains(role)

Full stack trace is at the end of this email. When the master fails, we
automatically restart it and it rejoins the cluster just fine. I did some
initial searching and was unable to find any existing bug reports or other
people experiencing this issue. We run a cluster of 3 masters, and see
crashes on all 3 instances.

I hope to get some guidance on what is going on and/or where to start looking
for more information.

Feb  5 15:53:57 ip-10-0-16-140 mesos-master[8414]:     @
 0x7f87e9170a7d  google::LogMessage::Fail()
Feb  5 15:53:57 ip-10-0-16-140 mesos-master[8414]:     @
 0x7f87e9172830  google::LogMessage::SendToLog()
Feb  5 15:53:57 ip-10-0-16-140 mesos-master[8414]:     @
 0x7f87e9170663  google::LogMessage::Flush()
Feb  5 15:53:57 ip-10-0-16-140 mesos-master[8414]:     @
 0x7f87e9173259  google::LogMessageFatal::~LogMessageFatal()
Feb  5 15:53:57 ip-10-0-16-140 mesos-master[8414]:     @
 0x7f87e8443cbd
mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::untrackReservations()
Feb  5 15:53:57 ip-10-0-16-140 mesos-master[8414]:     @
 0x7f87e8448fcd
mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::removeSlave()
Feb  5 15:53:57 ip-10-0-16-140 mesos-master[8414]:     @
 0x7f87e90c4f11  process::ProcessBase::consume()
Feb  5 15:53:57 ip-10-0-16-140 mesos-master[8414]:     @
 0x7f87e90dea4a  process::ProcessManager::resume()
Feb  5 15:53:57 ip-10-0-16-140 mesos-master[8414]:     @
 0x7f87e90e25d6
_ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
Feb  5 15:53:57 ip-10-0-16-140 mesos-master[8414]:     @
 0x7f87e6700c80  (unknown)
Feb  5 15:53:57 ip-10-0-16-140 mesos-master[8414]:     @
 0x7f87e5f136ba  start_thread
Feb  5 15:53:57 ip-10-0-16-140 mesos-master[8414]:     @
 0x7f87e5c4941d  (unknown)
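
For anyone landing on this thread later: the "Check failed: ..." line and the
google::LogMessage frames in the trace come from a glog-style CHECK macro,
which logs the stringified condition at FATAL severity (the leading "F" in
F0205) and aborts the process when the condition is false. A minimal
standalone sketch of that pattern, using a placeholder std::set rather than
the allocator's actual data structures:

    #include <glog/logging.h>

    #include <set>
    #include <string>

    int main()
    {
      // Placeholder state; the real allocator tracks per-role scalar
      // quantities, not just role names.
      std::set<std::string> reservedRoles;

      // When the condition is false, glog logs
      // "F<date> <time> <tid> <file>:<line>] Check failed: <condition>"
      // and aborts, producing a fatal line like the one reported above.
      CHECK(reservedRoles.count("some-role") > 0);

      return 0;
    }

Running this aborts immediately, which is what the master does when the
check in untrackReservations() fails.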

Re: Check failed: reservationScalarQuantities.contains(role)

Posted by Benjamin Mahler <bm...@apache.org>.
Thanks for reporting this; we can help investigate it with you in JIRA.

Re: Check failed: reservationScalarQuantities.contains(role)

Posted by Jeff Pollard <je...@gmail.com>.
Thanks for the info. I did find the "Removed agent" line as you suspected,
but not much else in the logs looked promising. I opened a JIRA to track this
from here on out: https://issues.apache.org/jira/browse/MESOS-9555.

Re: Check failed: reservationScalarQuantities.contains(role)

Posted by Joseph Wu <jo...@mesosphere.io>.
From the stack, it looks like the master is attempting to remove an agent
from the master's in-memory state.  In the master's logs you should find a
line shortly before the exit, like:

<timestamp> master.cpp:nnnn] Removed agent <ID of agent>: <reason>

The agent's ID should at least give you some pointer to which agent is
causing the problem.  Feel free to create a JIRA (
https://issues.apache.org/jira/) with any information you can glean. This
particular type of failure, a CHECK failure, means some invariant has been
violated, which usually means we missed a corner case.
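
To make "some invariant has been violated" concrete: the stack trace at the
top of the thread suggests the allocator tracks per-role reservation
quantities when agents are added and untracks them when agents are removed,
and the failing CHECK asserts that untracking only ever sees roles that are
currently tracked. A toy sketch of that bookkeeping pattern (purely
illustrative names and types, not the real allocator code):

    #include <glog/logging.h>

    #include <map>
    #include <string>

    class ReservationTracker
    {
    public:
      // Called when an agent carrying a reservation for `role` is added.
      void track(const std::string& role, double cpus)
      {
        reservationScalarQuantities[role] += cpus;
      }

      // Called when that agent is removed. The invariant: only roles
      // that are currently tracked may be untracked. If some other code
      // path already dropped the entry (the missed corner case), this
      // CHECK fails and aborts the process.
      void untrack(const std::string& role, double cpus)
      {
        CHECK(reservationScalarQuantities.count(role) > 0);

        reservationScalarQuantities[role] -= cpus;
        if (reservationScalarQuantities[role] <= 0.0) {
          reservationScalarQuantities.erase(role);
        }
      }

    private:
      std::map<std::string, double> reservationScalarQuantities;
    };

In the crash above, removeSlave() reaches the equivalent of untrack() for a
role that has no tracked entry, so the process aborts rather than continuing
with inconsistent accounting.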
