Posted to issues@kudu.apache.org by "ZhangYao (Jira)" <ji...@apache.org> on 2019/11/01 02:25:00 UTC

[jira] [Updated] (KUDU-2987) Intra location rebalance will crash in special case

     [ https://issues.apache.org/jira/browse/KUDU-2987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ZhangYao updated KUDU-2987:
---------------------------
    Description: 
While running a proof of concept (POC) of the rebalancer, I got a core dump when running intra-location rebalancing.

Here is the log:
{code:java}
I2019-10-30 20:02:17.843044 40915 rebalancer_tool.cc:225] running rebalancer within location '/location/2044'
F2019-10-30 20:02:17.884591 40915 map-util.h:109] Check failed: it != collection.end() Map key not found: a9119004b2d24f42a1acf09d142565fb
*** Check failure stack trace: ***
    @          0x111a75d  google::LogMessage::Fail()
    @          0x111c6d3  google::LogMessage::SendToLog()
    @          0x111a2b9  google::LogMessage::Flush()
    @          0x111d0ef  google::LogMessageFatal::~LogMessageFatal()
    @           0xe26da7  FindOrDie<>()
    @           0xe1f204  kudu::tools::RebalancerTool::AlgoBasedRunner::GetNextMovesImpl()
    @           0xe162e0  kudu::tools::RebalancerTool::BaseRunner::GetNextMoves()
    @           0xe15bf5  kudu::tools::RebalancerTool::RunWith()
    @           0xe1db0e  kudu::tools::RebalancerTool::Run()
    @           0xb6fea1  kudu::tools::(anonymous namespace)::RunRebalance()
    @           0xb70e14  std::_Function_handler<>::_M_invoke()
    @          0x11714a2  kudu::tools::Action::Run()
    @           0xc00587  kudu::tools::DispatchCommand()
    @           0xc00f4b  kudu::tools::RunTool()
    @           0xb0fd6d  main
    @     0x7f37086a4b15  __libc_start_main
    @           0xb6b399  (unknown)

{code}
I think the problem is in {{RebalancerTool::AlgoBasedRunner::GetNextMovesImpl}}: when building extra_info_by_tablet_id, it checks that the table id of every tablet is present in the table info. But when we build ClusterRawInfo in {{RebalancerTool::KsckResultsToClusterRawInfo}}, we collect only the tables that occur in the location, yet we collect all tablets in the cluster.

 This problem occurs when the location does not hold a replica of every table. It is likely to happen when the number of locations is far greater than a table's replication factor.
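
To illustrate the mismatch described above, here is a minimal sketch of a non-fatal lookup. It is not the actual Kudu code or the actual fix: {{TableSummary}}, {{TabletSummary}} and {{ReplicationFactorByTabletId}} are hypothetical, simplified names, and the only assumption is that tablets refer to their table by id while the table list may have been filtered by location.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for the per-table and per-tablet
// summaries; these are NOT Kudu's real types.
struct TableSummary {
  std::string id;
  int replication_factor;
};

struct TabletSummary {
  std::string id;
  std::string table_id;
};

// Builds a map of tablet id -> replication factor of the tablet's table.
// A tablet whose table is missing from 'tables' (for example, because the
// table list was filtered by location while the tablet list was not) is
// skipped instead of triggering a fatal "Map key not found" check.
std::map<std::string, int> ReplicationFactorByTabletId(
    const std::vector<TableSummary>& tables,
    const std::vector<TabletSummary>& tablets) {
  std::map<std::string, int> rf_by_table_id;
  for (const auto& table : tables) {
    rf_by_table_id[table.id] = table.replication_factor;
  }

  std::map<std::string, int> rf_by_tablet_id;
  for (const auto& tablet : tablets) {
    const auto it = rf_by_table_id.find(tablet.table_id);
    if (it == rf_by_table_id.end()) {
      // The tablet's table has no replica in this location: ignore it.
      continue;
    }
    rf_by_tablet_id[tablet.id] = it->second;
  }
  return rf_by_tablet_id;
}
```

With one table in the location and two tablets, one of which belongs to a table outside the location, the sketch returns an entry only for the first tablet, whereas a lookup that dies on a missing key would abort as in the log above. The alternative would be to filter the tablet list by location when the raw cluster info is built, so that both lists stay consistent.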

> Intra location rebalance will crash in special case
> ---------------------------------------------------
>
>                 Key: KUDU-2987
>                 URL: https://issues.apache.org/jira/browse/KUDU-2987
>             Project: Kudu
>          Issue Type: Bug
>            Reporter: ZhangYao
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)