You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Todd Lipcon (Code Review)" <ge...@cloudera.org> on 2017/11/16 02:40:31 UTC

[kudu-CR] KUDU-1078. Fix 'ops in future' error under load

Hello Mike Percy,

I'd like you to do a code review. Please visit

    http://gerrit.cloudera.org:8080/8563

to review the following change.


Change subject: KUDU-1078. Fix 'ops in future' error under load
......................................................................

KUDU-1078. Fix 'ops in future' error under load

This fixes an error which often ends up in the logs when the WAL is
under a lot of load:

E0823 01:22:37.049101 24094 consensus_queue.cc:415] T 6ffafa79dd91459fb6a55da4b52f477d P 1a6f2219a4cc41f5803b03f8c4498f56 [LEADER]: Error trying to read ahead of the log whi le preparing peer request: Incomplete: Op with index 294511 is ahead of the local log (next sequential op: 294511). Destination peer: Peer: b039d94e5c3f486e907e7567d148cf21, Is new: false, Last received: 3.294511, Next index: 294512, Last known committed idx: 294508, Last exchange result: SUCCESS, Needs tablet copy: false

As described in the JIRA, the issue was the following:

- in LogCache::AppendOperations, we do the following order:
  1. under lock: add new operations to the cache map
  2. outside lock: Log::AsyncAppendReplicates (which may block when under load)
  3. under lock: increase next_sequential_op_index_

- in LogCache::ReadOps(...), we do:
   1. Look for the operation in the cache.
   2. If it's a hit, iterate forward in the cache until we stop "hitting"

So the following interleaving caused the error to be loggged:

Thread 1
---------
- call LogCache::AppendOperations() with batch 10..20
- inserts 10..20 into the cache
- blocks on AsyncAppendReplicates

Thread 2
---------
- preparing a request for a follower, calls ReadOps(10)
--- results in ops 10..20 being copied into the outgoing request
- sends the request

Later, while Thread 1 is still blocked:

- response received from the peer for 10..20. 'last_replicated' = 20, so next_index = 21
- we call ReadOps(21)
--- next_sequential_op_id_ has still not advanced from 10
--- this triggers the error message indicated above

The fix, as hinted in the JIRA, is to update 'next_sequential_op_index_' before
appending to the log, rather than after. This doesn't have any looser
guarantees in terms of durability, since the WAL append itself is
completely asynchronous -- appending to the WAL just enqueues the
entries onto an in-memory queue and then asynchronously wakes up the
background WAL appender thread.

The new unit test crashed very rapidly without the fix with a CHECK
failure showing the same Status::Incomplete() message as expected. With
the fix it runs reliably.

Change-Id: Ie062adcc7fcaa48f09bbe382267d8f755353a443
---
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/log_cache.cc
2 files changed, 48 insertions(+), 7 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/63/8563/1
-- 
To view, visit http://gerrit.cloudera.org:8080/8563
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ie062adcc7fcaa48f09bbe382267d8f755353a443
Gerrit-Change-Number: 8563
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Mike Percy <mp...@apache.org>

[kudu-CR] KUDU-1078. Fix 'ops in future' error under load

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Hello Mike Percy, Kudu Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8563

to look at the new patch set (#3).

Change subject: KUDU-1078. Fix 'ops in future' error under load
......................................................................

KUDU-1078. Fix 'ops in future' error under load

This fixes an error which often ends up in the logs when the WAL is
under a lot of load:

E0823 01:22:37.049101 24094 consensus_queue.cc:415] T 6ffafa79dd91459fb6a55da4b52f477d P 1a6f2219a4cc41f5803b03f8c4498f56 [LEADER]: Error trying to read ahead of the log while preparing peer request: Incomplete: Op with index 294511 is ahead of the local log (next sequential op: 294511). Destination peer: Peer: b039d94e5c3f486e907e7567d148cf21, Is new: false, Last received: 3.294511, Next index: 294512, Last known committed idx: 294508, Last exchange result: SUCCESS, Needs tablet copy: false

As described in the JIRA, the issue was the following:

- in LogCache::AppendOperations, we do the following order:
  1. under lock: add new operations to the cache map
  2. outside lock: Log::AsyncAppendReplicates (which may block when under load)
  3. under lock: increase next_sequential_op_index_

- in LogCache::ReadOps(...), we do:
   1. Look for the operation in the cache.
   2. If it's a hit, iterate forward in the cache until we stop "hitting"

So the following interleaving caused the error to be loggged:

Thread 1
---------
- call LogCache::AppendOperations() with batch 10..20
- inserts 10..20 into the cache
- blocks on AsyncAppendReplicates

Thread 2
---------
- preparing a request for a follower, calls ReadOps(10)
--- results in ops 10..20 being copied into the outgoing request
- sends the request

Later, while Thread 1 is still blocked:

- response received from the peer for 10..20. 'last_replicated' = 20, so next_index = 21
- we call ReadOps(21)
--- next_sequential_op_id_ has still not advanced from 10
--- this triggers the error message indicated above

The fix, as hinted in the JIRA, is to update 'next_sequential_op_index_' before
appending to the log, rather than after. This doesn't have any looser
guarantees in terms of durability, since the WAL append itself is
completely asynchronous -- appending to the WAL just enqueues the
entries onto an in-memory queue and then asynchronously wakes up the
background WAL appender thread.

The new unit test crashed very rapidly without the fix with a CHECK
failure showing the same Status::Incomplete() message as expected. With
the fix it runs reliably.

Change-Id: Ie062adcc7fcaa48f09bbe382267d8f755353a443
---
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/log_cache.cc
2 files changed, 48 insertions(+), 7 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/63/8563/3
-- 
To view, visit http://gerrit.cloudera.org:8080/8563
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie062adcc7fcaa48f09bbe382267d8f755353a443
Gerrit-Change-Number: 8563
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-1078. Fix 'ops in future' error under load

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/8563 )

Change subject: KUDU-1078. Fix 'ops in future' error under load
......................................................................


Patch Set 2:

(3 comments)

Nice, I wasn't sure of the cause of this warning.

http://gerrit.cloudera.org:8080/#/c/8563/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8563/2//COMMIT_MSG@12
PS2, Line 12:  
nit: extra space


http://gerrit.cloudera.org:8080/#/c/8563/2/src/kudu/consensus/log_cache-test.cc
File src/kudu/consensus/log_cache-test.cc:

http://gerrit.cloudera.org:8080/#/c/8563/2/src/kudu/consensus/log_cache-test.cc@381
PS2, Line 381:   atomic<bool> stop { false };
nit: Perhaps we should add: if (!AllowSlowTests()) return;


http://gerrit.cloudera.org:8080/#/c/8563/2/src/kudu/consensus/log_cache-test.cc@399
PS2, Line 399: .
nit: extra period



-- 
To view, visit http://gerrit.cloudera.org:8080/8563
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie062adcc7fcaa48f09bbe382267d8f755353a443
Gerrit-Change-Number: 8563
Gerrit-PatchSet: 2
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 16 Nov 2017 22:25:40 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1078. Fix 'ops in future' error under load

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/8563 )

Change subject: KUDU-1078. Fix 'ops in future' error under load
......................................................................

KUDU-1078. Fix 'ops in future' error under load

This fixes an error which often ends up in the logs when the WAL is
under a lot of load:

E0823 01:22:37.049101 24094 consensus_queue.cc:415] T 6ffafa79dd91459fb6a55da4b52f477d P 1a6f2219a4cc41f5803b03f8c4498f56 [LEADER]: Error trying to read ahead of the log while preparing peer request: Incomplete: Op with index 294511 is ahead of the local log (next sequential op: 294511). Destination peer: Peer: b039d94e5c3f486e907e7567d148cf21, Is new: false, Last received: 3.294511, Next index: 294512, Last known committed idx: 294508, Last exchange result: SUCCESS, Needs tablet copy: false

As described in the JIRA, the issue was the following:

- in LogCache::AppendOperations, we do the following order:
  1. under lock: add new operations to the cache map
  2. outside lock: Log::AsyncAppendReplicates (which may block when under load)
  3. under lock: increase next_sequential_op_index_

- in LogCache::ReadOps(...), we do:
   1. Look for the operation in the cache.
   2. If it's a hit, iterate forward in the cache until we stop "hitting"

So the following interleaving caused the error to be loggged:

Thread 1
---------
- call LogCache::AppendOperations() with batch 10..20
- inserts 10..20 into the cache
- blocks on AsyncAppendReplicates

Thread 2
---------
- preparing a request for a follower, calls ReadOps(10)
--- results in ops 10..20 being copied into the outgoing request
- sends the request

Later, while Thread 1 is still blocked:

- response received from the peer for 10..20. 'last_replicated' = 20, so next_index = 21
- we call ReadOps(21)
--- next_sequential_op_id_ has still not advanced from 10
--- this triggers the error message indicated above

The fix, as hinted in the JIRA, is to update 'next_sequential_op_index_' before
appending to the log, rather than after. This doesn't have any looser
guarantees in terms of durability, since the WAL append itself is
completely asynchronous -- appending to the WAL just enqueues the
entries onto an in-memory queue and then asynchronously wakes up the
background WAL appender thread.

The new unit test crashed very rapidly without the fix with a CHECK
failure showing the same Status::Incomplete() message as expected. With
the fix it runs reliably.

Change-Id: Ie062adcc7fcaa48f09bbe382267d8f755353a443
Reviewed-on: http://gerrit.cloudera.org:8080/8563
Tested-by: Kudu Jenkins
Reviewed-by: Mike Percy <mp...@apache.org>
---
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/log_cache.cc
2 files changed, 48 insertions(+), 7 deletions(-)

Approvals:
  Kudu Jenkins: Verified
  Mike Percy: Looks good to me, approved

-- 
To view, visit http://gerrit.cloudera.org:8080/8563
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ie062adcc7fcaa48f09bbe382267d8f755353a443
Gerrit-Change-Number: 8563
Gerrit-PatchSet: 5
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-1078. Fix 'ops in future' error under load

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/8563 )

Change subject: KUDU-1078. Fix 'ops in future' error under load
......................................................................


Patch Set 2:

actually decided that the test failures were unrelated flakes. Rebased this on top of a fix for them.


-- 
To view, visit http://gerrit.cloudera.org:8080/8563
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie062adcc7fcaa48f09bbe382267d8f755353a443
Gerrit-Change-Number: 8563
Gerrit-PatchSet: 2
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 16 Nov 2017 05:03:08 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-1078. Fix 'ops in future' error under load

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/8563 )

Change subject: KUDU-1078. Fix 'ops in future' error under load
......................................................................


Patch Set 3: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/8563
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie062adcc7fcaa48f09bbe382267d8f755353a443
Gerrit-Change-Number: 8563
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Fri, 17 Nov 2017 00:31:00 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-1078. Fix 'ops in future' error under load

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has removed a vote on this change.

Change subject: KUDU-1078. Fix 'ops in future' error under load
......................................................................


Removed Verified-1 by Kudu Jenkins (120)
-- 
To view, visit http://gerrit.cloudera.org:8080/8563
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: deleteVote
Gerrit-Change-Id: Ie062adcc7fcaa48f09bbe382267d8f755353a443
Gerrit-Change-Number: 8563
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-1078. Fix 'ops in future' error under load

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/8563 )

Change subject: KUDU-1078. Fix 'ops in future' error under load
......................................................................


Patch Set 3: Verified+1

Unrelated test flakes


-- 
To view, visit http://gerrit.cloudera.org:8080/8563
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie062adcc7fcaa48f09bbe382267d8f755353a443
Gerrit-Change-Number: 8563
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Fri, 17 Nov 2017 21:49:37 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-1078. Fix 'ops in future' error under load

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/8563 )

Change subject: KUDU-1078. Fix 'ops in future' error under load
......................................................................


Patch Set 1: Verified-1

seems this patch introduces some issue with log retention. investigating


-- 
To view, visit http://gerrit.cloudera.org:8080/8563
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie062adcc7fcaa48f09bbe382267d8f755353a443
Gerrit-Change-Number: 8563
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 16 Nov 2017 03:18:55 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-1078. Fix 'ops in future' error under load

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/8563 )

Change subject: KUDU-1078. Fix 'ops in future' error under load
......................................................................


Patch Set 2:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/8563/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8563/2//COMMIT_MSG@12
PS2, Line 12:  
> nit: extra space
Done


http://gerrit.cloudera.org:8080/#/c/8563/2/src/kudu/consensus/log_cache-test.cc
File src/kudu/consensus/log_cache-test.cc:

http://gerrit.cloudera.org:8080/#/c/8563/2/src/kudu/consensus/log_cache-test.cc@381
PS2, Line 381:   atomic<bool> stop { false };
> nit: Perhaps we should add: if (!AllowSlowTests()) return;
I'll make it run shorter in slow-test mode


http://gerrit.cloudera.org:8080/#/c/8563/2/src/kudu/consensus/log_cache-test.cc@399
PS2, Line 399: .
> nit: extra period
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/8563
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie062adcc7fcaa48f09bbe382267d8f755353a443
Gerrit-Change-Number: 8563
Gerrit-PatchSet: 2
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 16 Nov 2017 22:46:32 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1078. Fix 'ops in future' error under load

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/8563 )

Change subject: KUDU-1078. Fix 'ops in future' error under load
......................................................................


Patch Set 4: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/8563
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie062adcc7fcaa48f09bbe382267d8f755353a443
Gerrit-Change-Number: 8563
Gerrit-PatchSet: 4
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Mon, 20 Nov 2017 05:51:16 +0000
Gerrit-HasComments: No