Posted to dev@mesos.apache.org by "Yan Xu (JIRA)" <ji...@apache.org> on 2014/05/01 20:49:17 UTC

[jira] [Commented] (MESOS-1250) ReservationAllocatorTest.ResourcesReturned flaky - repeated_field.h:824 CHECK failed: (index) < (size())

    [ https://issues.apache.org/jira/browse/MESOS-1250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986867#comment-13986867 ] 

Yan Xu commented on MESOS-1250:
-------------------------------

So... I haven't fixed this bug, but here are my findings:

The CHECK failure occurred in [RepeatedPtrFieldBase::Get(int index)|https://code.google.com/p/protobuf/source/browse/tags/2.5.0/src/google/protobuf/repeated_field.h#824].

The sequence of events:
# A slave 'slave1' in the test with resources {{cpus(role1):1; mem(role1):200; cpus(role2):2; mem(role2):600; cpus\(\*):1; mem\(\*):200; disk\(\*):0; ports\(\*):\[31000-32000]}} is [added by the master|https://github.com/apache/mesos/blob/81a63ee53932c2fc372b7ba0d881596cd44a2850/src/tests/allocator_tests.cpp#L536].
# Master [adds it to the allocator|https://github.com/apache/mesos/blob/81a63ee53932c2fc372b7ba0d881596cd44a2850/src/master/master.cpp#L3308] with no 'used' resources.
# Allocator first [creates a Slave object|https://github.com/apache/mesos/blob/81a63ee53932c2fc372b7ba0d881596cd44a2850/src/master/hierarchical_allocator_process.hpp#L423] and copies {{slaveInfo.resources()}} to its 'available' field.
# Allocator then adds the slave's resources already used by frameworks; there are none, so this is a no-op.
# It then [assigns the unused resources back to 'available'|https://github.com/apache/mesos/blob/81a63ee53932c2fc372b7ba0d881596cd44a2850/src/master/hierarchical_allocator_process.hpp#L443]. 'unused' should be identical to 'available' before the assignment because they both come from (are copies of) {{slaveInfo.resources()}} and there is no mutation.
# The assignment uses a [MergeFrom|https://github.com/apache/mesos/blob/81a63ee53932c2fc372b7ba0d881596cd44a2850/include/mesos/resources.hpp#L95], which does a {{Get\(i)}} and trips the CHECK failure (sketched below).
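
Roughly, the steps above correspond to the following (a paraphrased and simplified sketch of {{slaveAdded}} based on the links above, not the exact code; sorter bookkeeping is omitted and names are approximate):
{code}
// Paraphrased/simplified sketch of HierarchicalAllocatorProcess::slaveAdded()
// at the commit linked above -- not the exact implementation; sorter
// bookkeeping is omitted. Everything below runs inside a single libprocess
// actor dispatch, i.e. on one thread.
void slaveAdded(
    const SlaveID& slaveId,
    const SlaveInfo& slaveInfo,
    const hashmap<FrameworkID, Resources>& used)
{
  // Copy #1 of slaveInfo.resources(), stored in the Slave's 'available'.
  slaves[slaveId] = Slave(slaveInfo);

  // Copy #2 of slaveInfo.resources().
  Resources unused = slaveInfo.resources();

  foreachvalue (const Resources& resources, used) {
    unused -= resources; // A no-op in this test: 'used' is empty.
  }

  // The assignment that goes through MergeFrom and trips the CHECK.
  slaves[slaveId].available = unused;
}
{code}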

Given that the code does check the index before using it, and the failure is only intermittent:
{code}
for (int i = 0; i < other.current_size_; i++) {
  TypeHandler::Merge(other.template Get<TypeHandler>(i), Add<TypeHandler>());
}
{code}
It appears that something has modified {{current_size_}} after the check.
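
For reference, the {{Get}} that fires the CHECK looks roughly like this (paraphrased from protobuf 2.5.0's {{repeated_field.h}}; the exact code may differ slightly):
{code}
// Paraphrased from RepeatedPtrFieldBase::Get() in repeated_field.h (protobuf
// 2.5.0). size() just returns current_size_, so the loop bound above and this
// CHECK read the same member of 'other'; the CHECK can only fail if
// current_size_ changes between the loop's comparison and this call.
template <typename TypeHandler>
inline const typename TypeHandler::Type&
RepeatedPtrFieldBase::Get(int index) const {
  GOOGLE_DCHECK_LT(index, size());
  return *cast<TypeHandler>(elements_[index]);
}
{code}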
In fact, I modified the CHECK to print the values of {{index}} and {{size()}}:
{code}
GOOGLE_DCHECK_LT(index, size()) << "index: " << index << "; size: " << size();
{code}
and it prints
{noformat}
[libprotobuf FATAL ../3rdparty/libprocess/3rdparty/protobuf-2.5.0/src/google/protobuf/repeated_field.h:825] CHECK failed: (index) < (size()): index: 7; size: 0
{noformat}

{{size: 0}}? For the loop to reach {{i == 7}}, {{other.current_size_}} must have been at least 8 when the loop condition was checked, yet by the time {{Get\(7)}} runs, {{size()}} on the same object returns 0.

I don't understand how this is possible: the two Resources objects on the two sides of the [assignment|https://github.com/apache/mesos/blob/81a63ee53932c2fc372b7ba0d881596cd44a2850/src/master/hierarchical_allocator_process.hpp#L443] are both copied from {{slaveInfo.resources()}} within the same libprocess Process method, [slaveAdded|https://github.com/apache/mesos/blob/81a63ee53932c2fc372b7ba0d881596cd44a2850/src/master/hierarchical_allocator_process.hpp#L414], so there should be no thread-safety issue here.

And the flakiness is tricky to chase: once I print too many log lines around these statements or set conditional gdb breakpoints, it simply stops recurring...

> ReservationAllocatorTest.ResourcesReturned flaky - repeated_field.h:824 CHECK failed: (index) < (size())
> --------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-1250
>                 URL: https://issues.apache.org/jira/browse/MESOS-1250
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Yan Xu
>            Assignee: Yan Xu
>
> {noformat:title=The CHECK failure}
> [libprotobuf FATAL ../3rdparty/libprocess/3rdparty/protobuf-2.5.0/src/google/protobuf/repeated_field.h:824] CHECK failed: (index) < (size()): 
> {noformat}
> {noformat:title=Full log}
> [ RUN      ] ReservationAllocatorTest.ResourcesReturned
> I0425 18:24:35.375020 23940 leveldb.cpp:174] Opened db in 80.11386ms
> I0425 18:24:35.385205 23940 leveldb.cpp:181] Compacted db in 10.16346ms
> I0425 18:24:35.385226 23940 leveldb.cpp:196] Created db iterator in 3534ns
> I0425 18:24:35.385233 23940 leveldb.cpp:202] Seeked to beginning of db in 513ns
> I0425 18:24:35.385239 23940 leveldb.cpp:271] Iterated through 0 keys in the db in 200ns
> I0425 18:24:35.385251 23940 replica.cpp:729] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
> I0425 18:24:35.385484 23963 recover.cpp:425] Starting replica recovery
> I0425 18:24:35.385722 23961 recover.cpp:451] Replica is in EMPTY status
> I0425 18:24:35.385973 23963 master.cpp:266] Master 20140425-182435-1032504131-35984-23940 (juno.apache.org) started on 67.195.138.61:35984
> I0425 18:24:35.385993 23963 master.cpp:304] Master only allowing authenticated frameworks to register
> I0425 18:24:35.386001 23963 credentials.hpp:35] Loading credentials for authentication
> W0425 18:24:35.386051 23963 credentials.hpp:48] Failed to stat credentials file 'file:///tmp/ReservationAllocatorTest_ResourcesReturned_K72a9s/credentials': No such file or directory
> I0425 18:24:35.386209 23966 replica.cpp:626] Replica in EMPTY status received a broadcasted recover request
> I0425 18:24:35.386451 23964 recover.cpp:188] Received a recover response from a replica in EMPTY status
> I0425 18:24:35.386603 23966 hierarchical_allocator_process.hpp:302] Initializing hierarchical allocator process with master : master@67.195.138.61:35984
> I0425 18:24:35.386672 23966 master.cpp:104] No whitelist given. Advertising offers for all slaves
> I0425 18:24:35.386672 23964 recover.cpp:542] Updating replica status to STARTING
> I0425 18:24:35.387045 23968 master.cpp:918] The newly elected leader is master@67.195.138.61:35984 with id 20140425-182435-1032504131-35984-23940
> I0425 18:24:35.387058 23968 master.cpp:928] Elected as the leading master!
> I0425 18:24:35.387064 23968 master.cpp:749] Recovering from registrar
> I0425 18:24:35.387130 23968 registrar.cpp:275] Recovering registrar
> I0425 18:24:35.416841 23964 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 30.043801ms
> I0425 18:24:35.416859 23964 replica.cpp:320] Persisted replica status to STARTING
> I0425 18:24:35.416947 23964 recover.cpp:451] Replica is in STARTING status
> I0425 18:24:35.417343 23968 replica.cpp:626] Replica in STARTING status received a broadcasted recover request
> I0425 18:24:35.417513 23968 recover.cpp:188] Received a recover response from a replica in STARTING status
> I0425 18:24:35.417755 23967 recover.cpp:542] Updating replica status to VOTING
> I0425 18:24:35.428218 23968 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 10.386348ms
> I0425 18:24:35.428267 23968 replica.cpp:320] Persisted replica status to VOTING
> I0425 18:24:35.428336 23967 recover.cpp:556] Successfully joined the Paxos group
> I0425 18:24:35.428447 23967 recover.cpp:440] Recover process terminated
> I0425 18:24:35.428591 23967 log.cpp:656] Attempting to start the writer
> I0425 18:24:35.429041 23966 replica.cpp:474] Replica received implicit promise request with proposal 1
> I0425 18:24:35.436518 23966 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 7.457582ms
> I0425 18:24:35.436539 23966 replica.cpp:342] Persisted promised to 1
> I0425 18:24:35.436861 23964 coordinator.cpp:229] Coordinator attemping to fill missing position
> I0425 18:24:35.436869 23967 hierarchical_allocator_process.hpp:726] No resources available to allocate!
> I0425 18:24:35.436878 23967 hierarchical_allocator_process.hpp:688] Performed allocation for 0 slaves in 15305ns
> I0425 18:24:35.437417 23968 replica.cpp:375] Replica received explicit promise request for position 0 with proposal 2
> I0425 18:24:35.444845 23968 leveldb.cpp:341] Persisting action (8 bytes) to leveldb took 7.408147ms
> I0425 18:24:35.444866 23968 replica.cpp:664] Persisted action at 0
> I0425 18:24:35.445366 23961 replica.cpp:508] Replica received write request for position 0
> I0425 18:24:35.445392 23961 leveldb.cpp:436] Reading position from leveldb took 12659ns
> I0425 18:24:35.453176 23961 leveldb.cpp:341] Persisting action (14 bytes) to leveldb took 7.767811ms
> I0425 18:24:35.453198 23961 replica.cpp:664] Persisted action at 0
> I0425 18:24:35.453548 23965 replica.cpp:643] Replica received learned notice for position 0
> I0425 18:24:35.461506 23965 leveldb.cpp:341] Persisting action (16 bytes) to leveldb took 7.919252ms
> I0425 18:24:35.461527 23965 replica.cpp:664] Persisted action at 0
> I0425 18:24:35.461536 23965 replica.cpp:649] Replica learned NOP action at position 0
> I0425 18:24:35.461809 23966 log.cpp:672] Writer started with ending position 0
> I0425 18:24:35.462334 23961 leveldb.cpp:436] Reading position from leveldb took 15347ns
> I0425 18:24:35.464083 23967 registrar.cpp:308] Successfully recovered registrar
> I0425 18:24:35.464109 23967 registrar.cpp:379] Attempting to update the 'registry'
> I0425 18:24:35.465710 23962 log.cpp:680] Attempting to append 137 bytes to the log
> I0425 18:24:35.465852 23963 coordinator.cpp:339] Coordinator attempting to write APPEND action at position 1
> I0425 18:24:35.466189 23968 replica.cpp:508] Replica received write request for position 1
> I0425 18:24:35.475623 23968 leveldb.cpp:341] Persisting action (156 bytes) to leveldb took 9.417683ms
> I0425 18:24:35.475642 23968 replica.cpp:664] Persisted action at 1
> I0425 18:24:35.475893 23968 replica.cpp:643] Replica received learned notice for position 1
> I0425 18:24:35.483952 23968 leveldb.cpp:341] Persisting action (158 bytes) to leveldb took 8.039894ms
> I0425 18:24:35.483976 23968 replica.cpp:664] Persisted action at 1
> I0425 18:24:35.483985 23968 replica.cpp:649] Replica learned APPEND action at position 1
> I0425 18:24:35.484416 23965 registrar.cpp:427] Successfully updated 'registry'
> I0425 18:24:35.484467 23968 log.cpp:699] Attempting to truncate the log to 1
> I0425 18:24:35.484530 23962 coordinator.cpp:339] Coordinator attempting to write TRUNCATE action at position 2
> I0425 18:24:35.484622 23961 master.cpp:776] Recovered 0 slaves from the Registry (99B) ; allowing 10mins for slaves to re-register
> I0425 18:24:35.485131 23968 replica.cpp:508] Replica received write request for position 2
> I0425 18:24:35.487083 23965 hierarchical_allocator_process.hpp:726] No resources available to allocate!
> I0425 18:24:35.487120 23965 hierarchical_allocator_process.hpp:688] Performed allocation for 0 slaves in 51286ns
> I0425 18:24:35.487566 23963 slave.cpp:130] Slave started on 6)@67.195.138.61:35984
> I0425 18:24:35.487763 23963 slave.cpp:218] Slave resources: cpus(role1):1; mem(role1):200; cpus(role2):2; mem(role2):600; cpus(*):1; mem(*):200; disk(*):0; ports(*):[31000-32000]
> I0425 18:24:35.487838 23963 slave.cpp:246] Slave hostname: juno.apache.org
> I0425 18:24:35.487845 23963 slave.cpp:247] Slave checkpoint: false
> I0425 18:24:35.488287 23965 state.cpp:33] Recovering state from '/tmp/ReservationAllocatorTest_ResourcesReturned_cRsGAx/meta'
> I0425 18:24:35.488456 23963 status_update_manager.cpp:193] Recovering status update manager
> I0425 18:24:35.488628 23963 slave.cpp:2804] Finished recovery
> I0425 18:24:35.488865 23963 slave.cpp:499] New master detected at master@67.195.138.61:35984
> I0425 18:24:35.488914 23963 slave.cpp:733] Will retry registration in 7.800343128secs if necessary
> I0425 18:24:35.488931 23963 slave.cpp:524] Detecting new master
> I0425 18:24:35.488968 23963 status_update_manager.cpp:167] New master detected at master@67.195.138.61:35984
> I0425 18:24:35.489169 23963 registrar.cpp:379] Attempting to update the 'registry'
> I0425 18:24:35.494820 23968 leveldb.cpp:341] Persisting action (16 bytes) to leveldb took 9.670136ms
> I0425 18:24:35.494839 23968 replica.cpp:664] Persisted action at 2
> I0425 18:24:35.495146 23962 replica.cpp:643] Replica received learned notice for position 2
> I0425 18:24:35.503154 23962 leveldb.cpp:341] Persisting action (18 bytes) to leveldb took 7.987444ms
> I0425 18:24:35.503190 23962 leveldb.cpp:399] Deleting ~1 keys from leveldb took 16160ns
> I0425 18:24:35.503201 23962 replica.cpp:664] Persisted action at 2
> I0425 18:24:35.503209 23962 replica.cpp:649] Replica learned TRUNCATE action at position 2
> I0425 18:24:35.503527 23961 log.cpp:680] Attempting to append 438 bytes to the log
> I0425 18:24:35.503597 23968 coordinator.cpp:339] Coordinator attempting to write APPEND action at position 3
> I0425 18:24:35.503970 23967 replica.cpp:508] Replica received write request for position 3
> I0425 18:24:35.511482 23967 leveldb.cpp:341] Persisting action (457 bytes) to leveldb took 7.473886ms
> I0425 18:24:35.511502 23967 replica.cpp:664] Persisted action at 3
> I0425 18:24:35.511749 23968 replica.cpp:643] Replica received learned notice for position 3
> I0425 18:24:35.519814 23968 leveldb.cpp:341] Persisting action (459 bytes) to leveldb took 8.043537ms
> I0425 18:24:35.519836 23968 replica.cpp:664] Persisted action at 3
> I0425 18:24:35.519845 23968 replica.cpp:649] Replica learned APPEND action at position 3
> I0425 18:24:35.520158 23965 registrar.cpp:427] Successfully updated 'registry'
> I0425 18:24:35.520236 23964 log.cpp:699] Attempting to truncate the log to 3
> I0425 18:24:35.520279 23965 master.cpp:2139] Admitted slave on juno.apache.org at slave(6)@67.195.138.61:35984
> I0425 18:24:35.520285 23964 coordinator.cpp:339] Coordinator attempting to write TRUNCATE action at position 4
> I0425 18:24:35.520306 23965 master.cpp:3213] Adding slave 20140425-182435-1032504131-35984-23940-0 at juno.apache.org with cpus(role1):1; mem(role1):200; cpus(role2):2; mem(role2):600; cpus(*):1; mem(*):200; disk(*):0; ports(*):[31000-32000]
> I0425 18:24:35.520434 23967 slave.cpp:542] Registered with master master@67.195.138.61:35984; given slave ID 20140425-182435-1032504131-35984-23940-0
> [libprotobuf FATAL ../3rdparty/libprocess/3rdparty/protobuf-2.5.0/src/google/protobuf/repeated_field.h:824] CHECK failed: (index) < (size()): 
> I0425 18:24:35.520853 23964 replica.cpp:508] Replica received write request for position 4
> I0425 18:24:35.520884 23963 hierarchical_allocator_process.hpp:445] Added slave 20140425-182435-1032504131-35984-23940-0 (juno.apache.org) with cpus(role1):1; mem(role1):200; cpus(role2):2; mem(role2):600; cpus(*):1; mem(*):200; disk(*):0; ports(*):[31000-32000] (and cpus(role1):1; mem(role1):200; cpus(role2):2; mem(role2):600; cpus(*):1; mem(*):200; disk(*):0; ports(*):[31000-32000] available)
> I0425 18:24:35.520987 23963 hierarchical_allocator_process.hpp:708] Performed allocation for slave 20140425-182435-1032504131-35984-23940-0 in 17123ns
> ../../src/tests/allocator_tests.cpp:523: Failure
> Actual function call count doesn't match EXPECT_CALL(allocator, slaveAdded(_, _, _))...
>          Expected: to be called twice
>            Actual: called once - unsatisfied and active
> unknown file: Failure
> C++ exception with description "CHECK failed: (index) < (size()): " thrown in the test body.
> I0425 18:24:35.521409 23940 master.cpp:547] Master terminating
> I0425 18:24:35.521536 23965 slave.cpp:2012] master@67.195.138.61:35984 exited
> W0425 18:24:35.521553 23965 slave.cpp:2015] Master disconnected! Waiting for a new master to be elected
> I0425 18:24:35.528180 23964 leveldb.cpp:341] Persisting action (16 bytes) to leveldb took 7.286467ms
> I0425 18:24:35.528199 23964 replica.cpp:664] Persisted action at 4
> I0425 18:24:35.528659 23940 slave.cpp:388] Slave terminating
> [  FAILED  ] ReservationAllocatorTest.ResourcesReturned (235 ms)
> {noformat}
> {noformat:title=This is only the result of the CHECK failure and irrelevant}
> Actual function call count doesn't match EXPECT_CALL(allocator, slaveAdded(_, _, _))...
>          Expected: to be called twice
>            Actual: called once - unsatisfied and active
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)