You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Will Berkeley (Code Review)" <ge...@cloudera.org> on 2018/10/26 18:23:57 UTC

[kudu-CR] Improve consensus queue overflow logging

Will Berkeley has uploaded this change for review. ( http://gerrit.cloudera.org:8080/11801


Change subject: Improve consensus queue overflow logging
......................................................................

Improve consensus queue overflow logging

Suppose tablet server X is a leader of T tablets for which tablet server Y is a
follower. The relevant situation is when T is on the order of 100-1000. If Y
strains under its consensus load and falls behind processing consensus service
requests, UpdateConsensus requests from the leader will get rejected and cause
a message to be logged on the leader X for each of the T tablets. The message
looks like:

W1022 17:20:59.767554 13057 consensus_peers.cc:422] T 9255fdf03ad4451e9fcd62f26741bfe6 P 892cc0d4442c4cdaaf633ed2732f9246 -> Peer dc0af5867d52468f8fd47abf13c08040 (tablet_server_Y.kudu.com:7050): Couldn't send request to peer dc0af5867d52468f8fd47abf13c08040 for tablet 9255fdf03ad4451e9fcd62f26741bfe6. Status: Remote error: Service unavailable: UpdateConsensus request on kudu.consensus.ConsensusService from 10.1.1.1:55528 dropped due to backpressure. The service queue is full; it has 50 items.. Retrying in the next heartbeat period. Already tried 1 times.

Y's consensus service pool also logs the same thing, but it doesn't have the
information about the tablet id or peer ids available to it, and it is throttled
to occur no more than once per second:

W1022 17:37:33.535168  4330 service_pool.cc:130] UpdateConsensus request on kudu.consensus.ConsensusService from 10.45.26.115:36820 dropped due to backpressure. The service queue is full; it has 50 items.

This patch attempts to reduce the spam of the first message in the logs
by throttling it to occur once every 5 retries. It still is logged for
every tablet peer, but those messages are useful if one wants to trace
the history of a particular tablet.

I also added the throttling messages to Y's output, so it's now

W1022 17:37:33.535168  4330 service_pool.cc:130] UpdateConsensus request on kudu.consensus.ConsensusService from 10.45.26.115:36820 dropped due to backpressure. The service queue is full; it has 50 items. [suppressed 5 similar messages]

when e.g. 5 other messages have been suppressed.

Change-Id: I7697c63babefac0f76bcc8c87d70f7e7125e55cc
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_peers.h
M src/kudu/rpc/service_pool.cc
M src/kudu/util/logging.h
4 files changed, 23 insertions(+), 10 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/01/11801/1
-- 
To view, visit http://gerrit.cloudera.org:8080/11801
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I7697c63babefac0f76bcc8c87d70f7e7125e55cc
Gerrit-Change-Number: 11801
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley <wd...@gmail.com>

[kudu-CR] Improve consensus queue overflow logging

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Will Berkeley has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/11801 )

Change subject: Improve consensus queue overflow logging
......................................................................

Improve consensus queue overflow logging

Suppose tablet server X is a leader of T tablets for which tablet server Y is a
follower. The relevant situation is when T is on the order of 100-1000. If Y
strains under its consensus load and falls behind processing consensus service
requests, UpdateConsensus requests from the leader will get rejected and cause
a message to be logged on the leader X for each of the T tablets. The message
looks like:

W1022 17:20:59.767554 13057 consensus_peers.cc:422] T 9255fdf03ad4451e9fcd62f26741bfe6 P 892cc0d4442c4cdaaf633ed2732f9246 -> Peer dc0af5867d52468f8fd47abf13c08040 (tablet_server_Y.kudu.com:7050): Couldn't send request to peer dc0af5867d52468f8fd47abf13c08040 for tablet 9255fdf03ad4451e9fcd62f26741bfe6. Status: Remote error: Service unavailable: UpdateConsensus request on kudu.consensus.ConsensusService from 10.1.1.1:55528 dropped due to backpressure. The service queue is full; it has 50 items.. Retrying in the next heartbeat period. Already tried 1 times.

Y's consensus service pool also logs the same thing, but it doesn't have the
information about the tablet id or peer ids available to it, and it is throttled
to occur no more than once per second:

W1022 17:37:33.535168  4330 service_pool.cc:130] UpdateConsensus request on kudu.consensus.ConsensusService from 10.45.26.115:36820 dropped due to backpressure. The service queue is full; it has 50 items.

This patch attempts to reduce the spam of the first message in the logs
by throttling it to occur once every 5 retries. It still is logged for
every tablet peer, but those messages are useful if one wants to trace
the history of a particular tablet.

I also added the throttling messages to Y's output, so it's now

W1022 17:37:33.535168  4330 service_pool.cc:130] UpdateConsensus request on kudu.consensus.ConsensusService from 10.45.26.115:36820 dropped due to backpressure. The service queue is full; it has 50 items. [suppressed 5 similar messages]

when e.g. 5 other messages have been suppressed.

Change-Id: I7697c63babefac0f76bcc8c87d70f7e7125e55cc
Reviewed-on: http://gerrit.cloudera.org:8080/11801
Tested-by: Will Berkeley <wd...@gmail.com>
Reviewed-by: Alexey Serbin <as...@cloudera.com>
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_peers.h
M src/kudu/rpc/service_pool.cc
M src/kudu/util/logging.h
4 files changed, 22 insertions(+), 10 deletions(-)

Approvals:
  Will Berkeley: Verified
  Alexey Serbin: Looks good to me, approved

-- 
To view, visit http://gerrit.cloudera.org:8080/11801
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I7697c63babefac0f76bcc8c87d70f7e7125e55cc
Gerrit-Change-Number: 11801
Gerrit-PatchSet: 3
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>

[kudu-CR] Improve consensus queue overflow logging

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/11801 )

Change subject: Improve consensus queue overflow logging
......................................................................


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/11801/1/src/kudu/consensus/consensus_peers.cc
File src/kudu/consensus/consensus_peers.cc:

http://gerrit.cloudera.org:8080/#/c/11801/1/src/kudu/consensus/consensus_peers.cc@458
PS1, Line 458: .$2
> How is it going to look like and why is it separated by a dot from the tabl
See L446- it's either an empty string or an actual messages depending on the sort of error that happened, so the Substitute string is funny-looking so the actual message is normal-looking in both situations.


http://gerrit.cloudera.org:8080/#/c/11801/1/src/kudu/consensus/consensus_peers.cc@461
PS1, Line 461: tablet_id_
> The LOG_WITH_PREFIX_UNLOCKED() produces a string prefix that already contai
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/11801
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7697c63babefac0f76bcc8c87d70f7e7125e55cc
Gerrit-Change-Number: 11801
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Fri, 26 Oct 2018 19:15:36 +0000
Gerrit-HasComments: Yes

[kudu-CR] Improve consensus queue overflow logging

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/11801 )

Change subject: Improve consensus queue overflow logging
......................................................................


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/11801/1/src/kudu/consensus/consensus_peers.cc
File src/kudu/consensus/consensus_peers.cc:

http://gerrit.cloudera.org:8080/#/c/11801/1/src/kudu/consensus/consensus_peers.cc@458
PS1, Line 458: .$2
How is it going to look like and why is it separated by a dot from the tablet id?


http://gerrit.cloudera.org:8080/#/c/11801/1/src/kudu/consensus/consensus_peers.cc@461
PS1, Line 461: tablet_id_
The LOG_WITH_PREFIX_UNLOCKED() produces a string prefix that already contains information on tablet identifier.  Is it really necessary to duplicate it in the message itself?



-- 
To view, visit http://gerrit.cloudera.org:8080/11801
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7697c63babefac0f76bcc8c87d70f7e7125e55cc
Gerrit-Change-Number: 11801
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Fri, 26 Oct 2018 18:50:34 +0000
Gerrit-HasComments: Yes

[kudu-CR] Improve consensus queue overflow logging

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/11801 )

Change subject: Improve consensus queue overflow logging
......................................................................


Patch Set 1:

Yes. KUDU-1707 should help greatly. I'm not working it on it presently but it's on my 3-month list of objectives, let's say.


-- 
To view, visit http://gerrit.cloudera.org:8080/11801
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7697c63babefac0f76bcc8c87d70f7e7125e55cc
Gerrit-Change-Number: 11801
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Fri, 26 Oct 2018 18:49:24 +0000
Gerrit-HasComments: No

[kudu-CR] Improve consensus queue overflow logging

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/11801 )

Change subject: Improve consensus queue overflow logging
......................................................................


Patch Set 2: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/11801/1/src/kudu/consensus/consensus_peers.cc
File src/kudu/consensus/consensus_peers.cc:

http://gerrit.cloudera.org:8080/#/c/11801/1/src/kudu/consensus/consensus_peers.cc@458
PS1, Line 458: . T
> See L446- it's either an empty string or an actual messages depending on th
Ah, indeed -- I missed that part.



-- 
To view, visit http://gerrit.cloudera.org:8080/11801
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7697c63babefac0f76bcc8c87d70f7e7125e55cc
Gerrit-Change-Number: 11801
Gerrit-PatchSet: 2
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Fri, 26 Oct 2018 20:29:15 +0000
Gerrit-HasComments: Yes

[kudu-CR] Improve consensus queue overflow logging

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/11801 )

Change subject: Improve consensus queue overflow logging
......................................................................


Patch Set 1: Code-Review+2

This will become less of an issue when KUDU-1707 is fixed, right? Is that something you're working on?


-- 
To view, visit http://gerrit.cloudera.org:8080/11801
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7697c63babefac0f76bcc8c87d70f7e7125e55cc
Gerrit-Change-Number: 11801
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Fri, 26 Oct 2018 18:30:53 +0000
Gerrit-HasComments: No

[kudu-CR] Improve consensus queue overflow logging

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/11801 )

Change subject: Improve consensus queue overflow logging
......................................................................


Patch Set 2: Verified+1

Failure is (unrealted) KUDU-1736.


-- 
To view, visit http://gerrit.cloudera.org:8080/11801
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7697c63babefac0f76bcc8c87d70f7e7125e55cc
Gerrit-Change-Number: 11801
Gerrit-PatchSet: 2
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Fri, 26 Oct 2018 20:19:27 +0000
Gerrit-HasComments: No

[kudu-CR] Improve consensus queue overflow logging

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Hello Alexey Serbin, Kudu Jenkins, Adar Dembo, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/11801

to look at the new patch set (#2).

Change subject: Improve consensus queue overflow logging
......................................................................

Improve consensus queue overflow logging

Suppose tablet server X is a leader of T tablets for which tablet server Y is a
follower. The relevant situation is when T is on the order of 100-1000. If Y
strains under its consensus load and falls behind processing consensus service
requests, UpdateConsensus requests from the leader will get rejected and cause
a message to be logged on the leader X for each of the T tablets. The message
looks like:

W1022 17:20:59.767554 13057 consensus_peers.cc:422] T 9255fdf03ad4451e9fcd62f26741bfe6 P 892cc0d4442c4cdaaf633ed2732f9246 -> Peer dc0af5867d52468f8fd47abf13c08040 (tablet_server_Y.kudu.com:7050): Couldn't send request to peer dc0af5867d52468f8fd47abf13c08040 for tablet 9255fdf03ad4451e9fcd62f26741bfe6. Status: Remote error: Service unavailable: UpdateConsensus request on kudu.consensus.ConsensusService from 10.1.1.1:55528 dropped due to backpressure. The service queue is full; it has 50 items.. Retrying in the next heartbeat period. Already tried 1 times.

Y's consensus service pool also logs the same thing, but it doesn't have the
information about the tablet id or peer ids available to it, and it is throttled
to occur no more than once per second:

W1022 17:37:33.535168  4330 service_pool.cc:130] UpdateConsensus request on kudu.consensus.ConsensusService from 10.45.26.115:36820 dropped due to backpressure. The service queue is full; it has 50 items.

This patch attempts to reduce the spam of the first message in the logs
by throttling it to occur once every 5 retries. It still is logged for
every tablet peer, but those messages are useful if one wants to trace
the history of a particular tablet.

I also added the throttling messages to Y's output, so it's now

W1022 17:37:33.535168  4330 service_pool.cc:130] UpdateConsensus request on kudu.consensus.ConsensusService from 10.45.26.115:36820 dropped due to backpressure. The service queue is full; it has 50 items. [suppressed 5 similar messages]

when e.g. 5 other messages have been suppressed.

Change-Id: I7697c63babefac0f76bcc8c87d70f7e7125e55cc
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_peers.h
M src/kudu/rpc/service_pool.cc
M src/kudu/util/logging.h
4 files changed, 22 insertions(+), 10 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/01/11801/2
-- 
To view, visit http://gerrit.cloudera.org:8080/11801
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7697c63babefac0f76bcc8c87d70f7e7125e55cc
Gerrit-Change-Number: 11801
Gerrit-PatchSet: 2
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>

[kudu-CR] Improve consensus queue overflow logging

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Will Berkeley has removed Kudu Jenkins from this change.  ( http://gerrit.cloudera.org:8080/11801 )

Change subject: Improve consensus queue overflow logging
......................................................................


Removed reviewer Kudu Jenkins with the following votes:

* Verified-1 by Kudu Jenkins (120)
-- 
To view, visit http://gerrit.cloudera.org:8080/11801
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: deleteReviewer
Gerrit-Change-Id: I7697c63babefac0f76bcc8c87d70f7e7125e55cc
Gerrit-Change-Number: 11801
Gerrit-PatchSet: 2
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>