You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@mesos.apache.org by "Cody Maloney (JIRA)" <ji...@apache.org> on 2014/11/01 01:39:34 UTC

[jira] [Commented] (MESOS-1250) ReservationAllocatorTest.ResourcesReturned flaky - repeated_field.h:824 CHECK failed: (index) < (size())

    [ https://issues.apache.org/jira/browse/MESOS-1250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14192827#comment-14192827 ] 

Cody Maloney commented on MESOS-1250:
-------------------------------------

Just hit this today accompanied by a segfault on the Mesosphere buildbot. Definitely doesn't seem to be frequent.

{noformat:title=Check}
[libprotobuf FATAL ../3rdparty/libprocess/3rdparty/protobuf-2.5.0/src/google/protobuf/repeated_field.h:824] CHECK failed: (index) < (size()): 
../../src/tests/allocator_tests.cpp:753: Failure
Actual function call count doesn't match EXPECT_CALL(allocator, slaveAdded(_, _, _))...
         Expected: to be called twice
           Actual: called once - unsatisfied and active
unknown file: Failure
C++ exception with description "CHECK failed: (index) < (size()): " thrown in the test body.
{noformat}

{noformat:title=Log}
[ RUN      ] ReservationAllocatorTest.ResourcesReturned
[libprotobuf FATAL ../3rdparty/libprocess/3rdparty/protobuf-2.5.0/src/google/protobuf/repeated_field.h:824] CHECK failed: (index) < (size()): 
../../src/tests/allocator_tests.cpp:753: Failure
Actual function call count doesn't match EXPECT_CALL(allocator, slaveAdded(_, _, _))...
         Expected: to be called twice
           Actual: called once - unsatisfied and active
unknown file: Failure
C++ exception with description "CHECK failed: (index) < (size()): " thrown in the test body.
*** Aborted at 1414800164 (unix time) try "date -d @1414800164" if you are using GNU date ***
PC: @     0x2b1ec26c05c8 (unknown)
*** SIGSEGV (@0x0) received by PID 28529 (TID 0x2b1ebd74e100) from PID 0; stack trace: ***
    @     0x2b1ec2e35340 (unknown)
    @     0x2b1ec26c05c8 (unknown)
    @           0x8fffe3 process::UPID::UPID()
    @           0x902ebe process::ProcessBase::self()
    @           0x90303f process::terminate()
    @           0x90b127 mesos::internal::master::allocator::Allocator::~Allocator()
    @           0x90b1aa mesos::internal::master::allocator::Allocator::~Allocator()
    @           0xe0fdcc process::Owned<>::Data::~Data()
    @           0xe13dfc std::_Sp_counted_ptr<>::_M_dispose()
    @           0x935a1a std::_Sp_counted_base<>::_M_release()
    @           0x916e5b std::__shared_count<>::~__shared_count()
    @           0x90f8f6 std::__shared_ptr<>::~__shared_ptr()
    @           0x90f910 std::shared_ptr<>::~shared_ptr()
    @           0x90f92a process::Owned<>::~Owned()
    @           0x91010c mesos::internal::tests::Cluster::Masters::Master::~Master()
    @           0x980eb8 std::pair<>::~pair()
    @           0x980ee2 std::_Rb_tree_node<>::~_Rb_tree_node()
    @           0x980f00 __gnu_cxx::new_allocator<>::destroy<>()
    @           0x9687a1 std::_Rb_tree<>::_M_destroy_node()
    @           0x943d21 std::_Rb_tree<>::_M_erase()
    @           0x92555c std::_Rb_tree<>::~_Rb_tree()
    @           0x90fdea std::map<>::~map()
    @           0x910047 mesos::internal::tests::Cluster::Masters::shutdown()
    @           0xdf3eaa mesos::internal::tests::MesosTest::ShutdownMasters()
    @           0xdf3e73 mesos::internal::tests::MesosTest::Shutdown()
    @           0xdf16ad mesos::internal::tests::MesosTest::TearDown()
    @          0x10d4d04 testing::internal::HandleSehExceptionsInMethodIfSupported<>()
    @          0x10cfe5a testing::internal::HandleExceptionsInMethodIfSupported<>()
    @          0x10b81b4 testing::Test::Run()
    @          0x10b888e testing::TestInfo::Run()
    @          0x10b8e16 testing::TestCase::Run()
    @          0x10bdae0 testing::internal::UnitTestImpl::RunAllTests()
make[4]: *** [check-local] Segmentation fault (core dumped)
{noformat}

> ReservationAllocatorTest.ResourcesReturned flaky - repeated_field.h:824 CHECK failed: (index) < (size())
> --------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-1250
>                 URL: https://issues.apache.org/jira/browse/MESOS-1250
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Yan Xu
>
> {noformat:title=The CHECK failure}
> [libprotobuf FATAL ../3rdparty/libprocess/3rdparty/protobuf-2.5.0/src/google/protobuf/repeated_field.h:824] CHECK failed: (index) < (size()): 
> {noformat}
> {noformat:title=Full log}
> [ RUN      ] ReservationAllocatorTest.ResourcesReturned
> I0425 18:24:35.375020 23940 leveldb.cpp:174] Opened db in 80.11386ms
> I0425 18:24:35.385205 23940 leveldb.cpp:181] Compacted db in 10.16346ms
> I0425 18:24:35.385226 23940 leveldb.cpp:196] Created db iterator in 3534ns
> I0425 18:24:35.385233 23940 leveldb.cpp:202] Seeked to beginning of db in 513ns
> I0425 18:24:35.385239 23940 leveldb.cpp:271] Iterated through 0 keys in the db in 200ns
> I0425 18:24:35.385251 23940 replica.cpp:729] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
> I0425 18:24:35.385484 23963 recover.cpp:425] Starting replica recovery
> I0425 18:24:35.385722 23961 recover.cpp:451] Replica is in EMPTY status
> I0425 18:24:35.385973 23963 master.cpp:266] Master 20140425-182435-1032504131-35984-23940 (juno.apache.org) started on 67.195.138.61:35984
> I0425 18:24:35.385993 23963 master.cpp:304] Master only allowing authenticated frameworks to register
> I0425 18:24:35.386001 23963 credentials.hpp:35] Loading credentials for authentication
> W0425 18:24:35.386051 23963 credentials.hpp:48] Failed to stat credentials file 'file:///tmp/ReservationAllocatorTest_ResourcesReturned_K72a9s/credentials': No such file or directory
> I0425 18:24:35.386209 23966 replica.cpp:626] Replica in EMPTY status received a broadcasted recover request
> I0425 18:24:35.386451 23964 recover.cpp:188] Received a recover response from a replica in EMPTY status
> I0425 18:24:35.386603 23966 hierarchical_allocator_process.hpp:302] Initializing hierarchical allocator process with master : master@67.195.138.61:35984
> I0425 18:24:35.386672 23966 master.cpp:104] No whitelist given. Advertising offers for all slaves
> I0425 18:24:35.386672 23964 recover.cpp:542] Updating replica status to STARTING
> I0425 18:24:35.387045 23968 master.cpp:918] The newly elected leader is master@67.195.138.61:35984 with id 20140425-182435-1032504131-35984-23940
> I0425 18:24:35.387058 23968 master.cpp:928] Elected as the leading master!
> I0425 18:24:35.387064 23968 master.cpp:749] Recovering from registrar
> I0425 18:24:35.387130 23968 registrar.cpp:275] Recovering registrar
> I0425 18:24:35.416841 23964 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 30.043801ms
> I0425 18:24:35.416859 23964 replica.cpp:320] Persisted replica status to STARTING
> I0425 18:24:35.416947 23964 recover.cpp:451] Replica is in STARTING status
> I0425 18:24:35.417343 23968 replica.cpp:626] Replica in STARTING status received a broadcasted recover request
> I0425 18:24:35.417513 23968 recover.cpp:188] Received a recover response from a replica in STARTING status
> I0425 18:24:35.417755 23967 recover.cpp:542] Updating replica status to VOTING
> I0425 18:24:35.428218 23968 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 10.386348ms
> I0425 18:24:35.428267 23968 replica.cpp:320] Persisted replica status to VOTING
> I0425 18:24:35.428336 23967 recover.cpp:556] Successfully joined the Paxos group
> I0425 18:24:35.428447 23967 recover.cpp:440] Recover process terminated
> I0425 18:24:35.428591 23967 log.cpp:656] Attempting to start the writer
> I0425 18:24:35.429041 23966 replica.cpp:474] Replica received implicit promise request with proposal 1
> I0425 18:24:35.436518 23966 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 7.457582ms
> I0425 18:24:35.436539 23966 replica.cpp:342] Persisted promised to 1
> I0425 18:24:35.436861 23964 coordinator.cpp:229] Coordinator attemping to fill missing position
> I0425 18:24:35.436869 23967 hierarchical_allocator_process.hpp:726] No resources available to allocate!
> I0425 18:24:35.436878 23967 hierarchical_allocator_process.hpp:688] Performed allocation for 0 slaves in 15305ns
> I0425 18:24:35.437417 23968 replica.cpp:375] Replica received explicit promise request for position 0 with proposal 2
> I0425 18:24:35.444845 23968 leveldb.cpp:341] Persisting action (8 bytes) to leveldb took 7.408147ms
> I0425 18:24:35.444866 23968 replica.cpp:664] Persisted action at 0
> I0425 18:24:35.445366 23961 replica.cpp:508] Replica received write request for position 0
> I0425 18:24:35.445392 23961 leveldb.cpp:436] Reading position from leveldb took 12659ns
> I0425 18:24:35.453176 23961 leveldb.cpp:341] Persisting action (14 bytes) to leveldb took 7.767811ms
> I0425 18:24:35.453198 23961 replica.cpp:664] Persisted action at 0
> I0425 18:24:35.453548 23965 replica.cpp:643] Replica received learned notice for position 0
> I0425 18:24:35.461506 23965 leveldb.cpp:341] Persisting action (16 bytes) to leveldb took 7.919252ms
> I0425 18:24:35.461527 23965 replica.cpp:664] Persisted action at 0
> I0425 18:24:35.461536 23965 replica.cpp:649] Replica learned NOP action at position 0
> I0425 18:24:35.461809 23966 log.cpp:672] Writer started with ending position 0
> I0425 18:24:35.462334 23961 leveldb.cpp:436] Reading position from leveldb took 15347ns
> I0425 18:24:35.464083 23967 registrar.cpp:308] Successfully recovered registrar
> I0425 18:24:35.464109 23967 registrar.cpp:379] Attempting to update the 'registry'
> I0425 18:24:35.465710 23962 log.cpp:680] Attempting to append 137 bytes to the log
> I0425 18:24:35.465852 23963 coordinator.cpp:339] Coordinator attempting to write APPEND action at position 1
> I0425 18:24:35.466189 23968 replica.cpp:508] Replica received write request for position 1
> I0425 18:24:35.475623 23968 leveldb.cpp:341] Persisting action (156 bytes) to leveldb took 9.417683ms
> I0425 18:24:35.475642 23968 replica.cpp:664] Persisted action at 1
> I0425 18:24:35.475893 23968 replica.cpp:643] Replica received learned notice for position 1
> I0425 18:24:35.483952 23968 leveldb.cpp:341] Persisting action (158 bytes) to leveldb took 8.039894ms
> I0425 18:24:35.483976 23968 replica.cpp:664] Persisted action at 1
> I0425 18:24:35.483985 23968 replica.cpp:649] Replica learned APPEND action at position 1
> I0425 18:24:35.484416 23965 registrar.cpp:427] Successfully updated 'registry'
> I0425 18:24:35.484467 23968 log.cpp:699] Attempting to truncate the log to 1
> I0425 18:24:35.484530 23962 coordinator.cpp:339] Coordinator attempting to write TRUNCATE action at position 2
> I0425 18:24:35.484622 23961 master.cpp:776] Recovered 0 slaves from the Registry (99B) ; allowing 10mins for slaves to re-register
> I0425 18:24:35.485131 23968 replica.cpp:508] Replica received write request for position 2
> I0425 18:24:35.487083 23965 hierarchical_allocator_process.hpp:726] No resources available to allocate!
> I0425 18:24:35.487120 23965 hierarchical_allocator_process.hpp:688] Performed allocation for 0 slaves in 51286ns
> I0425 18:24:35.487566 23963 slave.cpp:130] Slave started on 6)@67.195.138.61:35984
> I0425 18:24:35.487763 23963 slave.cpp:218] Slave resources: cpus(role1):1; mem(role1):200; cpus(role2):2; mem(role2):600; cpus(*):1; mem(*):200; disk(*):0; ports(*):[31000-32000]
> I0425 18:24:35.487838 23963 slave.cpp:246] Slave hostname: juno.apache.org
> I0425 18:24:35.487845 23963 slave.cpp:247] Slave checkpoint: false
> I0425 18:24:35.488287 23965 state.cpp:33] Recovering state from '/tmp/ReservationAllocatorTest_ResourcesReturned_cRsGAx/meta'
> I0425 18:24:35.488456 23963 status_update_manager.cpp:193] Recovering status update manager
> I0425 18:24:35.488628 23963 slave.cpp:2804] Finished recovery
> I0425 18:24:35.488865 23963 slave.cpp:499] New master detected at master@67.195.138.61:35984
> I0425 18:24:35.488914 23963 slave.cpp:733] Will retry registration in 7.800343128secs if necessary
> I0425 18:24:35.488931 23963 slave.cpp:524] Detecting new master
> I0425 18:24:35.488968 23963 status_update_manager.cpp:167] New master detected at master@67.195.138.61:35984
> I0425 18:24:35.489169 23963 registrar.cpp:379] Attempting to update the 'registry'
> I0425 18:24:35.494820 23968 leveldb.cpp:341] Persisting action (16 bytes) to leveldb took 9.670136ms
> I0425 18:24:35.494839 23968 replica.cpp:664] Persisted action at 2
> I0425 18:24:35.495146 23962 replica.cpp:643] Replica received learned notice for position 2
> I0425 18:24:35.503154 23962 leveldb.cpp:341] Persisting action (18 bytes) to leveldb took 7.987444ms
> I0425 18:24:35.503190 23962 leveldb.cpp:399] Deleting ~1 keys from leveldb took 16160ns
> I0425 18:24:35.503201 23962 replica.cpp:664] Persisted action at 2
> I0425 18:24:35.503209 23962 replica.cpp:649] Replica learned TRUNCATE action at position 2
> I0425 18:24:35.503527 23961 log.cpp:680] Attempting to append 438 bytes to the log
> I0425 18:24:35.503597 23968 coordinator.cpp:339] Coordinator attempting to write APPEND action at position 3
> I0425 18:24:35.503970 23967 replica.cpp:508] Replica received write request for position 3
> I0425 18:24:35.511482 23967 leveldb.cpp:341] Persisting action (457 bytes) to leveldb took 7.473886ms
> I0425 18:24:35.511502 23967 replica.cpp:664] Persisted action at 3
> I0425 18:24:35.511749 23968 replica.cpp:643] Replica received learned notice for position 3
> I0425 18:24:35.519814 23968 leveldb.cpp:341] Persisting action (459 bytes) to leveldb took 8.043537ms
> I0425 18:24:35.519836 23968 replica.cpp:664] Persisted action at 3
> I0425 18:24:35.519845 23968 replica.cpp:649] Replica learned APPEND action at position 3
> I0425 18:24:35.520158 23965 registrar.cpp:427] Successfully updated 'registry'
> I0425 18:24:35.520236 23964 log.cpp:699] Attempting to truncate the log to 3
> I0425 18:24:35.520279 23965 master.cpp:2139] Admitted slave on juno.apache.org at slave(6)@67.195.138.61:35984
> I0425 18:24:35.520285 23964 coordinator.cpp:339] Coordinator attempting to write TRUNCATE action at position 4
> I0425 18:24:35.520306 23965 master.cpp:3213] Adding slave 20140425-182435-1032504131-35984-23940-0 at juno.apache.org with cpus(role1):1; mem(role1):200; cpus(role2):2; mem(role2):600; cpus(*):1; mem(*):200; disk(*):0; ports(*):[31000-32000]
> I0425 18:24:35.520434 23967 slave.cpp:542] Registered with master master@67.195.138.61:35984; given slave ID 20140425-182435-1032504131-35984-23940-0
> [libprotobuf FATAL ../3rdparty/libprocess/3rdparty/protobuf-2.5.0/src/google/protobuf/repeated_field.h:824] CHECK failed: (index) < (size()): 
> I0425 18:24:35.520853 23964 replica.cpp:508] Replica received write request for position 4
> I0425 18:24:35.520884 23963 hierarchical_allocator_process.hpp:445] Added slave 20140425-182435-1032504131-35984-23940-0 (juno.apache.org) with cpus(role1):1; mem(role1):200; cpus(role2):2; mem(role2):600; cpus(*):1; mem(*):200; disk(*):0; ports(*):[31000-32000] (and cpus(role1):1; mem(role1):200; cpus(role2):2; mem(role2):600; cpus(*):1; mem(*):200; disk(*):0; ports(*):[31000-32000] available)
> I0425 18:24:35.520987 23963 hierarchical_allocator_process.hpp:708] Performed allocation for slave 20140425-182435-1032504131-35984-23940-0 in 17123ns
> ../../src/tests/allocator_tests.cpp:523: Failure
> Actual function call count doesn't match EXPECT_CALL(allocator, slaveAdded(_, _, _))...
>          Expected: to be called twice
>            Actual: called once - unsatisfied and active
> unknown file: Failure
> C++ exception with description "CHECK failed: (index) < (size()): " thrown in the test body.
> I0425 18:24:35.521409 23940 master.cpp:547] Master terminating
> I0425 18:24:35.521536 23965 slave.cpp:2012] master@67.195.138.61:35984 exited
> W0425 18:24:35.521553 23965 slave.cpp:2015] Master disconnected! Waiting for a new master to be elected
> I0425 18:24:35.528180 23964 leveldb.cpp:341] Persisting action (16 bytes) to leveldb took 7.286467ms
> I0425 18:24:35.528199 23964 replica.cpp:664] Persisted action at 4
> I0425 18:24:35.528659 23940 slave.cpp:388] Slave terminating
> [  FAILED  ] ReservationAllocatorTest.ResourcesReturned (235 ms)
> {noformat}
> {noformat:title=This is only the result of the CHECK failure and irrelevant}
> Actual function call count doesn't match EXPECT_CALL(allocator, slaveAdded(_, _, _))...
>          Expected: to be called twice
>            Actual: called once - unsatisfied and active
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)