You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Todd Lipcon (Code Review)" <ge...@cloudera.org> on 2017/11/16 02:40:31 UTC
[kudu-CR] KUDU-1078. Fix 'ops in future' error under load
Hello Mike Percy,
I'd like you to do a code review. Please visit
http://gerrit.cloudera.org:8080/8563
to review the following change.
Change subject: KUDU-1078. Fix 'ops in future' error under load
......................................................................
KUDU-1078. Fix 'ops in future' error under load
This fixes an error which often ends up in the logs when the WAL is
under a lot of load:
E0823 01:22:37.049101 24094 consensus_queue.cc:415] T 6ffafa79dd91459fb6a55da4b52f477d P 1a6f2219a4cc41f5803b03f8c4498f56 [LEADER]: Error trying to read ahead of the log whi le preparing peer request: Incomplete: Op with index 294511 is ahead of the local log (next sequential op: 294511). Destination peer: Peer: b039d94e5c3f486e907e7567d148cf21, Is new: false, Last received: 3.294511, Next index: 294512, Last known committed idx: 294508, Last exchange result: SUCCESS, Needs tablet copy: false
As described in the JIRA, the issue was the following:
- in LogCache::AppendOperations, we do the following order:
1. under lock: add new operations to the cache map
2. outside lock: Log::AsyncAppendReplicates (which may block when under load)
3. under lock: increase next_sequential_op_index_
- in LogCache::ReadOps(...), we do:
1. Look for the operation in the cache.
2. If it's a hit, iterate forward in the cache until we stop "hitting"
So the following interleaving caused the error to be loggged:
Thread 1
---------
- call LogCache::AppendOperations() with batch 10..20
- inserts 10..20 into the cache
- blocks on AsyncAppendReplicates
Thread 2
---------
- preparing a request for a follower, calls ReadOps(10)
--- results in ops 10..20 being copied into the outgoing request
- sends the request
Later, while Thread 1 is still blocked:
- response received from the peer for 10..20. 'last_replicated' = 20, so next_index = 21
- we call ReadOps(21)
--- next_sequential_op_id_ has still not advanced from 10
--- this triggers the error message indicated above
The fix, as hinted in the JIRA, is to update 'next_sequential_op_index_' before
appending to the log, rather than after. This doesn't have any looser
guarantees in terms of durability, since the WAL append itself is
completely asynchronous -- appending to the WAL just enqueues the
entries onto an in-memory queue and then asynchronously wakes up the
background WAL appender thread.
The new unit test crashed very rapidly without the fix with a CHECK
failure showing the same Status::Incomplete() message as expected. With
the fix it runs reliably.
Change-Id: Ie062adcc7fcaa48f09bbe382267d8f755353a443
---
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/log_cache.cc
2 files changed, 48 insertions(+), 7 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/63/8563/1
--
To view, visit http://gerrit.cloudera.org:8080/8563
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ie062adcc7fcaa48f09bbe382267d8f755353a443
Gerrit-Change-Number: 8563
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
[kudu-CR] KUDU-1078. Fix 'ops in future' error under load
Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Hello Mike Percy, Kudu Jenkins,
I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/8563
to look at the new patch set (#3).
Change subject: KUDU-1078. Fix 'ops in future' error under load
......................................................................
KUDU-1078. Fix 'ops in future' error under load
This fixes an error which often ends up in the logs when the WAL is
under a lot of load:
E0823 01:22:37.049101 24094 consensus_queue.cc:415] T 6ffafa79dd91459fb6a55da4b52f477d P 1a6f2219a4cc41f5803b03f8c4498f56 [LEADER]: Error trying to read ahead of the log while preparing peer request: Incomplete: Op with index 294511 is ahead of the local log (next sequential op: 294511). Destination peer: Peer: b039d94e5c3f486e907e7567d148cf21, Is new: false, Last received: 3.294511, Next index: 294512, Last known committed idx: 294508, Last exchange result: SUCCESS, Needs tablet copy: false
As described in the JIRA, the issue was the following:
- in LogCache::AppendOperations, we do the following order:
1. under lock: add new operations to the cache map
2. outside lock: Log::AsyncAppendReplicates (which may block when under load)
3. under lock: increase next_sequential_op_index_
- in LogCache::ReadOps(...), we do:
1. Look for the operation in the cache.
2. If it's a hit, iterate forward in the cache until we stop "hitting"
So the following interleaving caused the error to be loggged:
Thread 1
---------
- call LogCache::AppendOperations() with batch 10..20
- inserts 10..20 into the cache
- blocks on AsyncAppendReplicates
Thread 2
---------
- preparing a request for a follower, calls ReadOps(10)
--- results in ops 10..20 being copied into the outgoing request
- sends the request
Later, while Thread 1 is still blocked:
- response received from the peer for 10..20. 'last_replicated' = 20, so next_index = 21
- we call ReadOps(21)
--- next_sequential_op_id_ has still not advanced from 10
--- this triggers the error message indicated above
The fix, as hinted in the JIRA, is to update 'next_sequential_op_index_' before
appending to the log, rather than after. This doesn't have any looser
guarantees in terms of durability, since the WAL append itself is
completely asynchronous -- appending to the WAL just enqueues the
entries onto an in-memory queue and then asynchronously wakes up the
background WAL appender thread.
The new unit test crashed very rapidly without the fix with a CHECK
failure showing the same Status::Incomplete() message as expected. With
the fix it runs reliably.
Change-Id: Ie062adcc7fcaa48f09bbe382267d8f755353a443
---
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/log_cache.cc
2 files changed, 48 insertions(+), 7 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/63/8563/3
--
To view, visit http://gerrit.cloudera.org:8080/8563
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie062adcc7fcaa48f09bbe382267d8f755353a443
Gerrit-Change-Number: 8563
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
[kudu-CR] KUDU-1078. Fix 'ops in future' error under load
Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/8563 )
Change subject: KUDU-1078. Fix 'ops in future' error under load
......................................................................
Patch Set 2:
(3 comments)
Nice, I wasn't sure of the cause of this warning.
http://gerrit.cloudera.org:8080/#/c/8563/2//COMMIT_MSG
Commit Message:
http://gerrit.cloudera.org:8080/#/c/8563/2//COMMIT_MSG@12
PS2, Line 12:
nit: extra space
http://gerrit.cloudera.org:8080/#/c/8563/2/src/kudu/consensus/log_cache-test.cc
File src/kudu/consensus/log_cache-test.cc:
http://gerrit.cloudera.org:8080/#/c/8563/2/src/kudu/consensus/log_cache-test.cc@381
PS2, Line 381: atomic<bool> stop { false };
nit: Perhaps we should add: if (!AllowSlowTests()) return;
http://gerrit.cloudera.org:8080/#/c/8563/2/src/kudu/consensus/log_cache-test.cc@399
PS2, Line 399: .
nit: extra period
--
To view, visit http://gerrit.cloudera.org:8080/8563
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie062adcc7fcaa48f09bbe382267d8f755353a443
Gerrit-Change-Number: 8563
Gerrit-PatchSet: 2
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 16 Nov 2017 22:25:40 +0000
Gerrit-HasComments: Yes
[kudu-CR] KUDU-1078. Fix 'ops in future' error under load
Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/8563 )
Change subject: KUDU-1078. Fix 'ops in future' error under load
......................................................................
KUDU-1078. Fix 'ops in future' error under load
This fixes an error which often ends up in the logs when the WAL is
under a lot of load:
E0823 01:22:37.049101 24094 consensus_queue.cc:415] T 6ffafa79dd91459fb6a55da4b52f477d P 1a6f2219a4cc41f5803b03f8c4498f56 [LEADER]: Error trying to read ahead of the log while preparing peer request: Incomplete: Op with index 294511 is ahead of the local log (next sequential op: 294511). Destination peer: Peer: b039d94e5c3f486e907e7567d148cf21, Is new: false, Last received: 3.294511, Next index: 294512, Last known committed idx: 294508, Last exchange result: SUCCESS, Needs tablet copy: false
As described in the JIRA, the issue was the following:
- in LogCache::AppendOperations, we do the following order:
1. under lock: add new operations to the cache map
2. outside lock: Log::AsyncAppendReplicates (which may block when under load)
3. under lock: increase next_sequential_op_index_
- in LogCache::ReadOps(...), we do:
1. Look for the operation in the cache.
2. If it's a hit, iterate forward in the cache until we stop "hitting"
So the following interleaving caused the error to be loggged:
Thread 1
---------
- call LogCache::AppendOperations() with batch 10..20
- inserts 10..20 into the cache
- blocks on AsyncAppendReplicates
Thread 2
---------
- preparing a request for a follower, calls ReadOps(10)
--- results in ops 10..20 being copied into the outgoing request
- sends the request
Later, while Thread 1 is still blocked:
- response received from the peer for 10..20. 'last_replicated' = 20, so next_index = 21
- we call ReadOps(21)
--- next_sequential_op_id_ has still not advanced from 10
--- this triggers the error message indicated above
The fix, as hinted in the JIRA, is to update 'next_sequential_op_index_' before
appending to the log, rather than after. This doesn't have any looser
guarantees in terms of durability, since the WAL append itself is
completely asynchronous -- appending to the WAL just enqueues the
entries onto an in-memory queue and then asynchronously wakes up the
background WAL appender thread.
The new unit test crashed very rapidly without the fix with a CHECK
failure showing the same Status::Incomplete() message as expected. With
the fix it runs reliably.
Change-Id: Ie062adcc7fcaa48f09bbe382267d8f755353a443
Reviewed-on: http://gerrit.cloudera.org:8080/8563
Tested-by: Kudu Jenkins
Reviewed-by: Mike Percy <mp...@apache.org>
---
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/log_cache.cc
2 files changed, 48 insertions(+), 7 deletions(-)
Approvals:
Kudu Jenkins: Verified
Mike Percy: Looks good to me, approved
--
To view, visit http://gerrit.cloudera.org:8080/8563
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ie062adcc7fcaa48f09bbe382267d8f755353a443
Gerrit-Change-Number: 8563
Gerrit-PatchSet: 5
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
[kudu-CR] KUDU-1078. Fix 'ops in future' error under load
Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/8563 )
Change subject: KUDU-1078. Fix 'ops in future' error under load
......................................................................
Patch Set 2:
actually decided that the test failures were unrelated flakes. Rebased this on top of a fix for them.
--
To view, visit http://gerrit.cloudera.org:8080/8563
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie062adcc7fcaa48f09bbe382267d8f755353a443
Gerrit-Change-Number: 8563
Gerrit-PatchSet: 2
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 16 Nov 2017 05:03:08 +0000
Gerrit-HasComments: No
[kudu-CR] KUDU-1078. Fix 'ops in future' error under load
Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/8563 )
Change subject: KUDU-1078. Fix 'ops in future' error under load
......................................................................
Patch Set 3: Code-Review+2
--
To view, visit http://gerrit.cloudera.org:8080/8563
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie062adcc7fcaa48f09bbe382267d8f755353a443
Gerrit-Change-Number: 8563
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Fri, 17 Nov 2017 00:31:00 +0000
Gerrit-HasComments: No
[kudu-CR] KUDU-1078. Fix 'ops in future' error under load
Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has removed a vote on this change.
Change subject: KUDU-1078. Fix 'ops in future' error under load
......................................................................
Removed Verified-1 by Kudu Jenkins (120)
--
To view, visit http://gerrit.cloudera.org:8080/8563
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: deleteVote
Gerrit-Change-Id: Ie062adcc7fcaa48f09bbe382267d8f755353a443
Gerrit-Change-Number: 8563
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
[kudu-CR] KUDU-1078. Fix 'ops in future' error under load
Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/8563 )
Change subject: KUDU-1078. Fix 'ops in future' error under load
......................................................................
Patch Set 3: Verified+1
Unrelated test flakes
--
To view, visit http://gerrit.cloudera.org:8080/8563
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie062adcc7fcaa48f09bbe382267d8f755353a443
Gerrit-Change-Number: 8563
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Fri, 17 Nov 2017 21:49:37 +0000
Gerrit-HasComments: No
[kudu-CR] KUDU-1078. Fix 'ops in future' error under load
Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/8563 )
Change subject: KUDU-1078. Fix 'ops in future' error under load
......................................................................
Patch Set 1: Verified-1
seems this patch introduces some issue with log retention. investigating
--
To view, visit http://gerrit.cloudera.org:8080/8563
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie062adcc7fcaa48f09bbe382267d8f755353a443
Gerrit-Change-Number: 8563
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 16 Nov 2017 03:18:55 +0000
Gerrit-HasComments: No
[kudu-CR] KUDU-1078. Fix 'ops in future' error under load
Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/8563 )
Change subject: KUDU-1078. Fix 'ops in future' error under load
......................................................................
Patch Set 2:
(3 comments)
http://gerrit.cloudera.org:8080/#/c/8563/2//COMMIT_MSG
Commit Message:
http://gerrit.cloudera.org:8080/#/c/8563/2//COMMIT_MSG@12
PS2, Line 12:
> nit: extra space
Done
http://gerrit.cloudera.org:8080/#/c/8563/2/src/kudu/consensus/log_cache-test.cc
File src/kudu/consensus/log_cache-test.cc:
http://gerrit.cloudera.org:8080/#/c/8563/2/src/kudu/consensus/log_cache-test.cc@381
PS2, Line 381: atomic<bool> stop { false };
> nit: Perhaps we should add: if (!AllowSlowTests()) return;
I'll make it run shorter in slow-test mode
http://gerrit.cloudera.org:8080/#/c/8563/2/src/kudu/consensus/log_cache-test.cc@399
PS2, Line 399: .
> nit: extra period
Done
--
To view, visit http://gerrit.cloudera.org:8080/8563
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie062adcc7fcaa48f09bbe382267d8f755353a443
Gerrit-Change-Number: 8563
Gerrit-PatchSet: 2
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 16 Nov 2017 22:46:32 +0000
Gerrit-HasComments: Yes
[kudu-CR] KUDU-1078. Fix 'ops in future' error under load
Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/8563 )
Change subject: KUDU-1078. Fix 'ops in future' error under load
......................................................................
Patch Set 4: Code-Review+2
--
To view, visit http://gerrit.cloudera.org:8080/8563
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie062adcc7fcaa48f09bbe382267d8f755353a443
Gerrit-Change-Number: 8563
Gerrit-PatchSet: 4
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Mon, 20 Nov 2017 05:51:16 +0000
Gerrit-HasComments: No