Posted to issues@kudu.apache.org by "Alexey Serbin (Jira)" <ji...@apache.org> on 2019/08/20 19:56:00 UTC

[jira] [Updated] (KUDU-2923) RaftConsensusITest.MultiThreadedInsertWithFailovers is flaky

     [ https://issues.apache.org/jira/browse/KUDU-2923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Serbin updated KUDU-2923:
--------------------------------
    Description: 
In rare cases, the {{RaftConsensusITest.MultiThreadedInsertWithFailovers}} test scenario crashes with the following output:

{noformat}
I0820 18:21:28.614696  1042 raft_consensus.cc:2890] T a0ea2bbe8bad446b8782b902ca670735 P b1747d43333448629e1fb9b7c7dff193 [term 13 FOLLOWER]: Advancing to term 14
I0820 18:21:28.615350  1042 raft_consensus.cc:1184] T a0ea2bbe8bad446b8782b902ca670735 P b1747d43333448629e1fb9b7c7dff193 [term 14 FOLLOWER]: Refusing update from remote peer 83184eab2eae4146956292754c8fe346: Log matching property violated. Preceding OpId in replica: term: 13 index: 3478. Preceding OpId from leader: term: 14 index: 3480. (index mismatch)
F0820 18:21:28.637754   225 raft_consensus-itest.cc:394] Check failed: _s.ok() Bad status: Not found: leader replica not found
*** Check failure stack trace: ***
*** Aborted at 1566325288 (unix time) try "date -d @1566325288" if you are using GNU date ***
PC: @     0x7f76cf228c37 gsignal
*** SIGABRT (@0x3e8000000e1) received by PID 225 (TID 0x7f76d45f3000) from PID 225; stack trace: ***
    @     0x7f76d185a330 (unknown) at ??:0
    @     0x7f76cf228c37 gsignal at ??:0
    @     0x7f76cf22c028 abort at ??:0
    @     0x7f76d0297e09 google::logging_fail() at ??:0
    @     0x7f76d029962d google::LogMessage::Fail() at ??:0
    @     0x7f76d029b64c google::LogMessage::SendToLog() at ??:0
    @     0x7f76d0299189 google::LogMessage::Flush() at ??:0
    @     0x7f76d029bfdf google::LogMessageFatal::~LogMessageFatal() at ??:0
    @           0x42d346 kudu::tserver::RaftConsensusITest::StopOrKillLeaderAndElectNewOne() at /data0/jenkins/workspace/kudu-pre-commit-unittest-RELEASE/src/kudu/integration-tests/raft_consensus-itest.cc:399 (discriminator 1)
    @           0x43734b kudu::tserver::RaftConsensusITest_MultiThreadedInsertWithFailovers_Test::TestBody() at /data0/jenkins/workspace/kudu-pre-commit-unittest-RELEASE/src/kudu/integration-tests/raft_consensus-itest.cc:1005
    @     0x7f76d0aeeb89 testing::internal::HandleExceptionsInMethodIfSupported<>() at ??:0
    @     0x7f76d0adf68f testing::Test::Run() at ??:0
    @     0x7f76d0adf74d testing::TestInfo::Run() at ??:0
    @     0x7f76d0adf865 testing::TestCase::Run() at ??:0
    @     0x7f76d0adfb28 testing::internal::UnitTestImpl::RunAllTests() at ??:0
    @     0x7f76d0adfdc9 testing::UnitTest::Run() at ??:0
    @     0x7f76d3d7e502 main at ??:0
    @     0x7f76cf213f45 __libc_start_main at ??:0
    @           0x42adb3 (unknown) at ??:?

{noformat}

The issue seems to be in https://github.com/apache/kudu/blob/413396c85d7dd56830f563ec754653b2b0ae26fd/src/kudu/integration-tests/ts_itest-base.cc#L294-L313 : if a snapshot of a tablet's Raft configuration is captured while a leader election is in progress, the snapshot might contain no leader replica.  In such a case, {{TabletServerIntegrationTestBase::GetTabletLeaderAndFollowers()}} returns {{Status::NotFound()}} and the test aborts on the failed status check at https://github.com/apache/kudu/blob/413396c85d7dd56830f563ec754653b2b0ae26fd/src/kudu/integration-tests/raft_consensus-itest.cc#L394
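
One possible way to tolerate such a transient state is to retry the leader lookup for a bounded time instead of failing on the first {{Status::NotFound()}}. The block below is only a sketch under stated assumptions, not the actual Kudu code or a proposed patch: the {{Status}} struct is a simplified stand-in for {{kudu::Status}}, and {{FindLeaderWithRetries()}} along with its parameters is a hypothetical helper introduced here for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <string>
#include <thread>
#include <utility>

// Simplified stand-in for kudu::Status (assumption: only what this sketch needs).
struct Status {
  bool ok_;
  std::string msg_;

  static Status OK() { return {true, ""}; }
  static Status NotFound(std::string msg) {
    return {false, "Not found: " + std::move(msg)};
  }
  bool ok() const { return ok_; }
  const std::string& ToString() const { return msg_; }
};

// Hypothetical helper: calls 'find_leader' repeatedly until it succeeds or
// there is no time left for another attempt before 'timeout' expires.
// Returns the status of the last attempt.
Status FindLeaderWithRetries(const std::function<Status()>& find_leader,
                             std::chrono::milliseconds timeout,
                             std::chrono::milliseconds backoff) {
  assert(backoff.count() > 0);  // a zero backoff would turn this into a busy loop
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (true) {
    Status s = find_leader();
    if (s.ok()) {
      return s;  // a leader was present in the captured snapshot
    }
    // No leader in this snapshot: an election may still be in progress.
    if (std::chrono::steady_clock::now() + backoff > deadline) {
      return s;  // out of time, report the last failure to the caller
    }
    std::this_thread::sleep_for(backoff);
  }
}
```

In a test, the callback would wrap the real lookup, and the returned status would be checked only after the retries are exhausted. Bounding the wait keeps a genuinely leaderless tablet from hanging the test.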

The full log of the test scenario's run is attached for reference.

  was:
In rare cases, the {{}} test scenario crashes with the following output:

{noformat}
I0820 18:21:28.614696  1042 raft_consensus.cc:2890] T a0ea2bbe8bad446b8782b902ca670735 P b1747d43333448629e1fb9b7c7dff193 [term 13 FOLLOWER]: Advancing to term 14
I0820 18:21:28.615350  1042 raft_consensus.cc:1184] T a0ea2bbe8bad446b8782b902ca670735 P b1747d43333448629e1fb9b7c7dff193 [term 14 FOLLOWER]: Refusing update from remote peer 83184eab2eae4146956292754c8fe346: Log matching property violated. Preceding OpId in replica: term: 13 index: 3478. Preceding OpId from leader: term: 14 index: 3480. (index mismatch)
F0820 18:21:28.637754   225 raft_consensus-itest.cc:394] Check failed: _s.ok() Bad status: Not found: leader replica not found
*** Check failure stack trace: ***
*** Aborted at 1566325288 (unix time) try "date -d @1566325288" if you are using GNU date ***
PC: @     0x7f76cf228c37 gsignal
*** SIGABRT (@0x3e8000000e1) received by PID 225 (TID 0x7f76d45f3000) from PID 225; stack trace: ***
    @     0x7f76d185a330 (unknown) at ??:0
    @     0x7f76cf228c37 gsignal at ??:0
    @     0x7f76cf22c028 abort at ??:0
    @     0x7f76d0297e09 google::logging_fail() at ??:0
    @     0x7f76d029962d google::LogMessage::Fail() at ??:0
    @     0x7f76d029b64c google::LogMessage::SendToLog() at ??:0
    @     0x7f76d0299189 google::LogMessage::Flush() at ??:0
    @     0x7f76d029bfdf google::LogMessageFatal::~LogMessageFatal() at ??:0
    @           0x42d346 kudu::tserver::RaftConsensusITest::StopOrKillLeaderAndElectNewOne() at /data0/jenkins/workspace/kudu-pre-commit-unittest-RELEASE/src/kudu/integration-tests/raft_consensus-itest.cc:399 (discriminator 1)
    @           0x43734b kudu::tserver::RaftConsensusITest_MultiThreadedInsertWithFailovers_Test::TestBody() at /data0/jenkins/workspace/kudu-pre-commit-unittest-RELEASE/src/kudu/integration-tests/raft_consensus-itest.cc:1005
    @     0x7f76d0aeeb89 testing::internal::HandleExceptionsInMethodIfSupported<>() at ??:0
    @     0x7f76d0adf68f testing::Test::Run() at ??:0
    @     0x7f76d0adf74d testing::TestInfo::Run() at ??:0
    @     0x7f76d0adf865 testing::TestCase::Run() at ??:0
    @     0x7f76d0adfb28 testing::internal::UnitTestImpl::RunAllTests() at ??:0
    @     0x7f76d0adfdc9 testing::UnitTest::Run() at ??:0
    @     0x7f76d3d7e502 main at ??:0
    @     0x7f76cf213f45 __libc_start_main at ??:0
    @           0x42adb3 (unknown) at ??:?

{noformat}

The issue seems to be in https://github.com/apache/kudu/blob/413396c85d7dd56830f563ec754653b2b0ae26fd/src/kudu/integration-tests/ts_itest-base.cc#L294-L313 : if a snapshot of a tablet's Raft configuration is captured while a leader election is in progress, the snapshot might contain no leader replica.  In such a case, {{TabletServerIntegrationTestBase::GetTabletLeaderAndFollowers()}} returns {{Status::NotFound()}} and the test aborts on the failed status check at https://github.com/apache/kudu/blob/413396c85d7dd56830f563ec754653b2b0ae26fd/src/kudu/integration-tests/raft_consensus-itest.cc#L394

The full log of the test scenario's run is attached for reference.


> RaftConsensusITest.MultiThreadedInsertWithFailovers is flaky
> ------------------------------------------------------------
>
>                 Key: KUDU-2923
>                 URL: https://issues.apache.org/jira/browse/KUDU-2923
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus, test
>    Affects Versions: 1.11.0
>            Reporter: Alexey Serbin
>            Priority: Minor
>         Attachments: raft_consensus-itest.txt.xz


