Posted to user@kudu.apache.org by "Ray Liu (rayliu)" <ra...@cisco.com> on 2020/07/15 06:41:34 UTC
Master cannot rejoin cluster after failure.
We have a Kudu cluster with 3 Masters and 9 tablet servers.
When we tried to drop a table with more than a thousand tablets, the leader master crashed.
The last logs from the crashed master are a bunch of entries like:
W0715 04:00:57.330158 30337 catalog_manager.cc:3485] TS cd17b92888a84d39b2adcad1ca947037 (hdsj1kud005.webex.com:7050): delete failed for tablet 4250e813a29e4ca7a2633c6015c5530d because the tablet was not found. No further retry: Not found: Tablet not found: 4250e813a29e4ca7a2633c6015c5530d
Before these delete-failure logs, there are many entries like:
W0715 03:59:40.047675 30336 connection.cc:361] RPC call timeout handler was delayed by 11.8487s! This may be due to a process-wide pause such as swapping, logging-related delays, or allocator lock contention. Will allow an additional 3s for a response.
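A pause like this can usually be confirmed from the host side. A minimal sketch using standard Linux tools, to be run on the master's host around the time of the incident (nothing here is Kudu-specific):

  # Check overall memory and swap usage on the master host
  free -h
  # Watch swap-in/swap-out activity (the si/so columns) for a few seconds
  vmstat 1 5
  # Look for OOM-killer activity around the crash time
  dmesg -T | grep -iE 'oom|killed process'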
So, when this leader master crashed, a new leader master was elected from the remaining two masters.
But when I tried to restart the crashed master, it got stuck (it has been stuck for 2 hours now).
The logs are a repetition of the following:
I0715 06:30:36.438797 18042 raft_consensus.cc:465] T 00000000000000000000000000000000 P 81338568ef854b10ac0acac1d9eeeb6c [term 4 FOLLOWER]: Starting pre-election (no leader contacted us within the election timeout)
I0715 06:30:36.438868 18042 raft_consensus.cc:487] T 00000000000000000000000000000000 P 81338568ef854b10ac0acac1d9eeeb6c [term 4 FOLLOWER]: Starting pre-election with config: opid_index: -1 OBSOLETE_local: false peers { permanent_uuid: "e8f90d84b4754a379ffedaa32b528fb4" member_type: VOTER last_known_addr { host: "master1" port: 7051 } } peers { permanent_uuid: "ba89996893e44391a10a2fc1f2c2ada3" member_type: VOTER last_known_addr { host: "master2" port: 7051 } } peers { permanent_uuid: "81338568ef854b10ac0acac1d9eeeb6c" member_type: VOTER last_known_addr { host: "master3" port: 7051 } }
I0715 06:30:36.439072 18042 leader_election.cc:296] T 00000000000000000000000000000000 P 81338568ef854b10ac0acac1d9eeeb6c [CANDIDATE]: Term 5 pre-election: Requested pre-vote from peers e8f90d84b4754a379ffedaa32b528fb4 (master1:7051), ba89996893e44391a10a2fc1f2c2ada3 (master2:7051)
W0715 06:30:36.439657 13256 leader_election.cc:341] T 00000000000000000000000000000000 P 81338568ef854b10ac0acac1d9eeeb6c [CANDIDATE]: Term 5 pre-election: RPC error from VoteRequest() call to peer ba89996893e44391a10a2fc1f2c2ada3 (master2:7051): Remote error: Not authorized: unauthorized access to method: RequestConsensusVote
W0715 06:30:36.439787 13255 leader_election.cc:341] T 00000000000000000000000000000000 P 81338568ef854b10ac0acac1d9eeeb6c [CANDIDATE]: Term 5 pre-election: RPC error from VoteRequest() call to peer e8f90d84b4754a379ffedaa32b528fb4 (master1:7051): Remote error: Not authorized: unauthorized access to method: RequestConsensusVote
I0715 06:30:36.439808 13255 leader_election.cc:310] T 00000000000000000000000000000000 P 81338568ef854b10ac0acac1d9eeeb6c [CANDIDATE]: Term 5 pre-election: Election decided. Result: candidate lost. Election summary: received 3 responses out of 3 voters: 1 yes votes; 2 no votes. yes voters: 81338568ef854b10ac0acac1d9eeeb6c; no voters: ba89996893e44391a10a2fc1f2c2ada3, e8f90d84b4754a379ffedaa32b528fb4
I0715 06:30:36.439839 18042 raft_consensus.cc:2597] T 00000000000000000000000000000000 P 81338568ef854b10ac0acac1d9eeeb6c [term 4 FOLLOWER]: Leader pre-election lost for term 5. Reason: could not achieve majority
W0715 06:30:36.531898 13465 server_base.cc:587] Unauthorized access attempt to method kudu.consensus.ConsensusService.UpdateConsensus from {username='kudu'} at ip:port
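In hindsight, this last warning turned out to be the key symptom: the other masters are rejecting consensus RPCs issued by the OS user 'kudu' that this master runs as. One way to inspect the authorization settings of a running master is its embedded web UI's /varz page; a sketch, assuming the default master web UI port 8051:

  # Dump the ACL-related flags of a healthy master via the embedded web UI
  curl -s http://master1:8051/varz | grep -E 'superuser_acl|user_acl'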
The master summary from ksck is:
Master Summary
UUID | Address | Status
----------------------------------+-----------------------+--------------
ba89996893e44391a10a2fc1f2c2ada3 | master1 | HEALTHY
e8f90d84b4754a379ffedaa32b528fb4 | master2 | HEALTHY
81338568ef854b10ac0acac1d9eeeb6c | master3 | UNAUTHORIZED
Error from master3: Remote error: could not fetch consensus info from master: Not authorized: unauthorized access to method: GetConsensusState (UNAUTHORIZED)
All reported replicas are:
A = ba89996893e44391a10a2fc1f2c2ada3
B = e8f90d84b4754a379ffedaa32b528fb4
C = 81338568ef854b10ac0acac1d9eeeb6c
The consensus matrix is:
Config source | Replicas | Current term | Config index | Committed?
---------------+------------------------+--------------+--------------+------------
A | A* B C | 4 | -1 | Yes
B | A* B C | 4 | -1 | Yes
C | [config not available] | | |
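For reference, a summary like the one above comes from the ksck health-check tool; a sketch of the invocation, assuming the masters listen on the default RPC port 7051:

  # Run a cluster health check against all three masters
  kudu cluster ksck master1:7051,master2:7051,master3:7051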
What can I do if these three masters can never achieve consensus?
Is it safe to delete --fs_data_dirs / --fs_metadata_dir / --fs_wal_dir for the crashed master in order to get it back online without any data loss?
Thanks
Re: Master cannot rejoin cluster after failure.
Posted by "Ray Liu (rayliu)" <ra...@cisco.com>.
The root cause is that we launched the kudu master process as a user that doesn't have superuser privileges.
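In other words, the masters typically all need to run as the same service account, or the mismatched account needs to be authorized explicitly. A minimal sketch of both checks, assuming the binary is named kudu-master (the example ACL value is likewise an assumption):

  # On each master host, check which OS user the master process runs as;
  # all three should match
  ps -o user= -C kudu-master

  # Alternatively, authorize the mismatched user on the other masters by
  # adding it to their superuser ACL (a gflag passed at startup), e.g.:
  #   --superuser_acl=kudu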
On 7/15/20, 21:23, "Attila Bukor" <ab...@apache.org> wrote:
Hi Ray,
It seems the problem is that the kudu user is not authorized to call UpdateConsensus on
the other masters. What user are the other two masters started as?
I wouldn't recommend wiping the master; it most likely wouldn't solve the
problem, and Kudu can't automatically recover from a deleted master, so you would
need to recreate it manually.
Attila
Re: Master cannot rejoin cluster after failure.
Posted by Attila Bukor <ab...@apache.org>.
Hi Ray,
It seems the problem is that the kudu user is not authorized to call UpdateConsensus on
the other masters. What user are the other two masters started as?
I wouldn't recommend wiping the master; it most likely wouldn't solve the
problem, and Kudu can't automatically recover from a deleted master, so you would
need to recreate it manually.
Attila
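For reference, the manual recreation Attila mentions roughly corresponds to Kudu's documented dead-master recovery: re-initialize the replacement master with the dead master's UUID, then copy the system catalog from the current leader. A sketch with placeholder directories (the UUID below is master3's, taken from the logs above), run on the replacement host while its master process is stopped:

  # Format a fresh filesystem layout that reuses the dead master's UUID
  kudu fs format --fs_wal_dir=<wal_dir> --fs_data_dirs=<data_dirs> \
      --uuid=81338568ef854b10ac0acac1d9eeeb6c

  # Copy the system catalog tablet (its ID is all zeros) from the current
  # leader master, then start the master process again
  kudu local_replica copy_from_remote 00000000000000000000000000000000 \
      <leader_master_host>:7051 --fs_wal_dir=<wal_dir> --fs_data_dirs=<data_dirs>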