Posted to issues@kudu.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2019/11/12 05:57:00 UTC
[jira] [Commented] (KUDU-2987) Intra location rebalance will crash in special case
[ https://issues.apache.org/jira/browse/KUDU-2987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972089#comment-16972089 ]
ASF subversion and git services commented on KUDU-2987:
-------------------------------------------------------
Commit a8733419fd43c488586172f71cab0892581146e8 in kudu's branch refs/heads/branch-1.11.x from triplesheep
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=a873341 ]
KUDU-2987 Intra location rebalance crashes in special case.
The crash manifested itself in cases where a Kudu cluster
had a location that didn't host even a single replica of
a tablet.
Change-Id: Iea39472e55178fca688b390249432a0f7fefaaba
Reviewed-on: http://gerrit.cloudera.org:8080/14608
Tested-by: Alexey Serbin <as...@cloudera.com>
Reviewed-by: Alexey Serbin <as...@cloudera.com>
(cherry picked from commit 0fea1cb8ede852a87efc422b394ffe8d1e89bc6c)
Reviewed-on: http://gerrit.cloudera.org:8080/14686
Tested-by: Kudu Jenkins
Reviewed-by: Grant Henke <gr...@apache.org>
> Intra location rebalance will crash in special case
> ---------------------------------------------------
>
> Key: KUDU-2987
> URL: https://issues.apache.org/jira/browse/KUDU-2987
> Project: Kudu
> Issue Type: Bug
> Reporter: ZhangYao
> Assignee: ZhangYao
> Priority: Major
>
> Recently, while doing a POC of the rebalancer, I got a core dump when running intra-location rebalancing.
> Here is the log:
> {code:java}
> I2019-10-30 20:02:17.843044 40915 rebalancer_tool.cc:225] running rebalancer within location '/location/2044'
> F2019-10-30 20:02:17.884591 40915 map-util.h:109] Check failed: it != collection.end() Map key not found: a9119004b2d24f42a1acf09d142565fb
> *** Check failure stack trace: ***
> @ 0x111a75d google::LogMessage::Fail()
> @ 0x111c6d3 google::LogMessage::SendToLog()
> @ 0x111a2b9 google::LogMessage::Flush()
> @ 0x111d0ef google::LogMessageFatal::~LogMessageFatal()
> @ 0xe26da7 FindOrDie<>()
> @ 0xe1f204 kudu::tools::RebalancerTool::AlgoBasedRunner::GetNextMovesImpl()
> @ 0xe162e0 kudu::tools::RebalancerTool::BaseRunner::GetNextMoves()
> @ 0xe15bf5 kudu::tools::RebalancerTool::RunWith()
> @ 0xe1db0e kudu::tools::RebalancerTool::Run()
> @ 0xb6fea1 kudu::tools::(anonymous namespace)::RunRebalance()
> @ 0xb70e14 std::_Function_handler<>::_M_invoke()
> @ 0x11714a2 kudu::tools::Action::Run()
> @ 0xc00587 kudu::tools::DispatchCommand()
> @ 0xc00f4b kudu::tools::RunTool()
> @ 0xb0fd6d main
> @ 0x7f37086a4b15 __libc_start_main
> @ 0xb6b399 (unknown)
> {code}
> I found that the problem may be in {{RebalancerTool::AlgoBasedRunner::GetNextMovesImpl}}: when building extra_info_by_tablet_id, it checks that the table id of every tablet is present in the table info. But when we build ClusterRawInfo in {{RebalancerTool::KsckResultsToClusterRawInfo}}, we collect only the tables that occur in the location, yet all tablets in the cluster.
> This problem occurs when a location doesn't host replicas of every table. It is likely to happen when the number of locations is much larger than a table's replication factor.
>
>
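The mismatch described in the report can be reproduced in isolation. The sketch below is not Kudu's code and not the committed fix: TabletSummary, FindOrThrow and BuildTableNameByTabletId are hypothetical names, and the lookup throws instead of aborting the way FindOrDie does. It only shows how a strict lookup fails when the table map is filtered by location while the tablet list is not, and one possible way to tolerate that.

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a tablet summary; not Kudu's type.
struct TabletSummary {
  std::string tablet_id;
  std::string table_id;
};

using TableNamesById = std::map<std::string, std::string>;

// Strict lookup in the spirit of FindOrDie(); it throws instead of aborting
// so that a caller can observe the failure.
const std::string& FindOrThrow(const TableNamesById& tables,
                               const std::string& table_id) {
  auto it = tables.find(table_id);
  if (it == tables.end()) {
    throw std::out_of_range("Map key not found: " + table_id);
  }
  return it->second;
}

// Maps tablet id -> table name. With strict == true every tablet's table must
// be present in 'tables' (the failing pattern); otherwise tablets whose table
// is unknown to the location are skipped.
std::map<std::string, std::string> BuildTableNameByTabletId(
    const std::vector<TabletSummary>& tablets,
    const TableNamesById& tables,
    bool strict) {
  std::map<std::string, std::string> result;
  for (const auto& tablet : tablets) {
    if (!strict && tables.count(tablet.table_id) == 0) {
      continue;  // the location hosts no replica of this table
    }
    result[tablet.tablet_id] = FindOrThrow(tables, tablet.table_id);
  }
  return result;
}
```

With a table map that holds only the tables seen in one location and a tablet list that covers the whole cluster, the strict variant fails on the first tablet of a table absent from that location, while the tolerant variant returns entries only for the tablets it can resolve.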