You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org> on 2016/12/01 12:19:04 UTC

[kudu-CR] WIP KUDU-1127 Don't hang scanner threads waiting for safe time

David Ribeiro Alves has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/5305

Change subject: WIP KUDU-1127 Don't hang scanner threads waiting for safe time
......................................................................

WIP KUDU-1127 Don't hang scanner threads waiting for safe time

This makes it so we don't hang scanner threads on snapshot reads
when the difference between safe time and a requested timestamp
is bigger than the difference between 'now' and the client's
deadline. This makes the server return a Status::ServiceUnavailable()
with a THROTTLED error back to the client so that it can retry.

This allowed to swap linked_list-test to finish with snapshot scans
in the present and have it run on dist-test for 500 loops without
errors.

WIP as this requires a small directed test.

Change-Id: Ic7cd0b0749e715c5d9e665a8e37d0f1c95af574e
---
M src/kudu/consensus/time_manager.cc
M src/kudu/tserver/tablet_service.cc
2 files changed, 49 insertions(+), 15 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/05/5305/1
-- 
To view, visit http://gerrit.cloudera.org:8080/5305
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Ic7cd0b0749e715c5d9e665a8e37d0f1c95af574e
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>

[kudu-CR] WIP KUDU-1127 Don't hang scanner threads waiting for safe time

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change.

Change subject: WIP KUDU-1127 Don't hang scanner threads waiting for safe time
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5305/4/src/kudu/integration-tests/consistency-itest.cc
File src/kudu/integration-tests/consistency-itest.cc:

PS4, Line 639: 3
Interesting, why 1 row is not enough?


-- 
To view, visit http://gerrit.cloudera.org:8080/5305
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ic7cd0b0749e715c5d9e665a8e37d0f1c95af574e
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] WIP KUDU-1127 Don't hang scanner threads waiting for safe time

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: WIP KUDU-1127 Don't hang scanner threads waiting for safe time
......................................................................


Patch Set 1:

(3 comments)

seems like a reasonable heuristic.

The only kinda funny thing is that in the normal case, where the safetime moves stepwise every raft heartbeat (eg once every 500ms or once a second), then the heuristic has the opposite effect from desired. In other words, just after the safetime has been updated, we won't reject anything (even though that's precisely the time when the next update is farthest off).

Put another way, there are basically two "modes" to worry about. In the lagging/abandoned mode, the current time gets farther and farther ahead of the safetime, and thus it's reasonable to assume "the longer we have been abandoned, the more likely we are to be abandoned for a longer time". In the non-failure mode, it's the opposite "the longer we've been waiting for a heartbeat, the more likely it is that our next heartbeat is about to arrive".

I don't know if it's worth trying to adjust for this based on knowledge of the raft heartbeat interval or empirical knowledge of the timing between the (n-2th) safetime update and the (n-1th) update, but maybe worth a note in the code about this weird effect?

http://gerrit.cloudera.org:8080/#/c/5305/1//COMMIT_MSG
Commit Message:

Line 15: This allowed to swap linked_list-test to finish with snapshot scans
why not merge the test change in, so this goes in with its end-to-end test coverage?


http://gerrit.cloudera.org:8080/#/c/5305/1/src/kudu/consensus/time_manager.cc
File src/kudu/consensus/time_manager.cc:

PS1, Line 133: LOG(WARNING) 
probably worth throttling this, otherwise a server that got abandoned might spew these warnings


PS1, Line 135: deadline
remaining time budget? remaining timeout?


-- 
To view, visit http://gerrit.cloudera.org:8080/5305
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ic7cd0b0749e715c5d9e665a8e37d0f1c95af574e
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] WIP KUDU-1127 Don't hang scanner threads waiting for safe time

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has abandoned this change.

Change subject: WIP KUDU-1127 Don't hang scanner threads waiting for safe time
......................................................................


Abandoned

decided to go another way

-- 
To view, visit http://gerrit.cloudera.org:8080/5305
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: abandon
Gerrit-Change-Id: Ic7cd0b0749e715c5d9e665a8e37d0f1c95af574e
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] WIP KUDU-1127 Don't hang scanner threads waiting for safe time

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5305

to look at the new patch set (#4).

Change subject: WIP KUDU-1127 Don't hang scanner threads waiting for safe time
......................................................................

WIP KUDU-1127 Don't hang scanner threads waiting for safe time

This makes it so we don't hang scanner threads on snapshot reads
when the difference between safe time and a requested timestamp
is bigger than the difference between 'now' and the client's
deadline. This makes the server return a Status::ServiceUnavailable()
with a THROTTLED error back to the client so that it can retry.

This allowed to swap linked_list-test to finish with snapshot scans
in the present and have it run on dist-test for 500 loops without
errors.

WIP as this requires a small directed test.

Change-Id: Ic7cd0b0749e715c5d9e665a8e37d0f1c95af574e
---
M src/kudu/consensus/time_manager.cc
M src/kudu/integration-tests/consistency-itest.cc
M src/kudu/integration-tests/linked_list-test-util.h
M src/kudu/integration-tests/linked_list-test.cc
M src/kudu/tserver/tablet_service.cc
5 files changed, 153 insertions(+), 60 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/05/5305/4
-- 
To view, visit http://gerrit.cloudera.org:8080/5305
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic7cd0b0749e715c5d9e665a8e37d0f1c95af574e
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] WIP KUDU-1127 Don't hang scanner threads waiting for safe time

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: WIP KUDU-1127 Don't hang scanner threads waiting for safe time
......................................................................


Patch Set 1:

btw I think the solution for that would actually more in the line of making time manager call an "update me" callback that would make the consensus replica try and get an update from the leader regarding safe time.
After all theres a 1/3 chance that this would happen on another replica if we just returned and let the client retry.

-- 
To view, visit http://gerrit.cloudera.org:8080/5305
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ic7cd0b0749e715c5d9e665a8e37d0f1c95af574e
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] WIP KUDU-1127 Don't hang scanner threads waiting for safe time

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has restored this change.

Change subject: WIP KUDU-1127 Don't hang scanner threads waiting for safe time
......................................................................


Restored

oops abandoned wrong patch

-- 
To view, visit http://gerrit.cloudera.org:8080/5305
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: restore
Gerrit-Change-Id: Ic7cd0b0749e715c5d9e665a8e37d0f1c95af574e
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] WIP KUDU-1127 Don't hang scanner threads waiting for safe time

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change.

Change subject: WIP KUDU-1127 Don't hang scanner threads waiting for safe time
......................................................................


Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/5305/3/src/kudu/consensus/time_manager.cc
File src/kudu/consensus/time_manager.cc:

PS3, Line 130: requested_time.GetDeltaSince
Could you use the operator-(x, y) here:

MonoDelta safe_time_diff = requested_time - safe_time;

?


PS3, Line 131: deadline.GetDeltaSince
ditto


-- 
To view, visit http://gerrit.cloudera.org:8080/5305
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ic7cd0b0749e715c5d9e665a8e37d0f1c95af574e
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] WIP KUDU-1127 Don't hang scanner threads waiting for safe time

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change.

Change subject: WIP KUDU-1127 Don't hang scanner threads waiting for safe time
......................................................................


Patch Set 3:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/5305/3/src/kudu/consensus/time_manager.cc
File src/kudu/consensus/time_manager.cc:

PS3, Line 126:     MonoTime safe_time = clock_->PhysicalComponentOfTimestamp(last_safe_ts_copy_);
             :     MonoTime requested_time = clock_->PhysicalComponentOfTimestamp(timestamp);
> Instead of the line of conversions like
It seems how there is GetPhysicalComponentDifference(), right?


http://gerrit.cloudera.org:8080/#/c/5305/3/src/kudu/integration-tests/linked_list-test-util.h
File src/kudu/integration-tests/linked_list-test-util.h:

PS3, Line 594: snapshot_timestamp == kSnapshotAtNow
This is also kNoSnapshot constant, if I'm not mistaken.  What if this method called with kNoSnapshot?


PS3, Line 608: snapshot_timestamp != kSnapshotAtNow
ditto


http://gerrit.cloudera.org:8080/#/c/5305/3/src/kudu/integration-tests/linked_list-test.cc
File src/kudu/integration-tests/linked_list-test.cc:

PS3, Line 88:     vector<string> common_flags;
            : 
            :     common_flags.push_back("--skip_remove_old_recovery_dir");
            : 
            :     // Set history retention to one day, so that we don't GC history in this test.
            :     // We rely on verifying "back in time" with snapshot scans.
            :     common_flags.push_back("--tablet_history_max_age_sec=86400");
            :     common_flags.push_back("--scanner_max_safe_time_wait_msecs=20000");
nit: while you at it, consider replacing sequence of push_back() calls with initializers list for 'common_flags'.


http://gerrit.cloudera.org:8080/#/c/5305/3/src/kudu/tserver/tablet_service.cc
File src/kudu/tserver/tablet_service.cc:

PS3, Line 1383: '
nit: remove


PS3, Line 1384: Status::OK() if it's not
This wording and the name of the method look contradictory to me.  Is there any typo by chance?


PS3, Line 1386: tablet::HistoryGcOpts history_gc_opts = tablet->GetHistoryGcOpts();
Nit: consider moving this into the if (read_mode == READ_AT_SNAPSHOT) {} scope to avoid making this unnecessary  call in other modes.


PS3, Line 1796:  
nit: is it worth adding a comma here?


PS3, Line 1796: ahm
nit: AHM


PS3, Line 1809: - MonoDelta::FromMilliseconds(10)
Why is it necessary to have this extra-margin?  May be, it's worth dropping it?  If not, it would be nice to have a comment why it's here.


-- 
To view, visit http://gerrit.cloudera.org:8080/5305
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ic7cd0b0749e715c5d9e665a8e37d0f1c95af574e
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] WIP KUDU-1127 Don't hang scanner threads waiting for safe time

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change.

Change subject: WIP KUDU-1127 Don't hang scanner threads waiting for safe time
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5305/3/src/kudu/consensus/time_manager.cc
File src/kudu/consensus/time_manager.cc:

PS3, Line 126:     MonoTime safe_time = clock_->PhysicalComponentOfTimestamp(last_safe_ts_copy_);
             :     MonoTime requested_time = clock_->PhysicalComponentOfTimestamp(timestamp);
Instead of the line of conversions like

(Timestamp, Timestamp) --> (MonoTime, MonoTime) --> MonoDelta

and creating MonoTime objects in different domains, why not to implement operator-(const Timestamp&, const Timestamp&) for getting difference of timestamps and then just do

MonoDelta diff(MonoDelta::FromMicroseconds(clock.GetPhysicalValueMicros(t1 - t0)))

?


-- 
To view, visit http://gerrit.cloudera.org:8080/5305
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ic7cd0b0749e715c5d9e665a8e37d0f1c95af574e
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] WIP KUDU-1127 Don't hang scanner threads waiting for safe time

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5305

to look at the new patch set (#2).

Change subject: WIP KUDU-1127 Don't hang scanner threads waiting for safe time
......................................................................

WIP KUDU-1127 Don't hang scanner threads waiting for safe time

This makes it so we don't hang scanner threads on snapshot reads
when the difference between safe time and a requested timestamp
is bigger than the difference between 'now' and the client's
deadline. This makes the server return a Status::ServiceUnavailable()
with a THROTTLED error back to the client so that it can retry.

This allowed to swap linked_list-test to finish with snapshot scans
in the present and have it run on dist-test for 500 loops without
errors.

WIP as this requires a small directed test.

Change-Id: Ic7cd0b0749e715c5d9e665a8e37d0f1c95af574e
---
M src/kudu/consensus/time_manager.cc
M src/kudu/integration-tests/linked_list-test-util.h
M src/kudu/integration-tests/linked_list-test.cc
M src/kudu/tserver/tablet_service.cc
4 files changed, 158 insertions(+), 61 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/05/5305/2
-- 
To view, visit http://gerrit.cloudera.org:8080/5305
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic7cd0b0749e715c5d9e665a8e37d0f1c95af574e
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] WIP KUDU-1127 Don't hang scanner threads waiting for safe time

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has abandoned this change.

Change subject: WIP KUDU-1127 Don't hang scanner threads waiting for safe time
......................................................................


Abandoned

-- 
To view, visit http://gerrit.cloudera.org:8080/5305
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: abandon
Gerrit-Change-Id: Ic7cd0b0749e715c5d9e665a8e37d0f1c95af574e
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] WIP KUDU-1127 Don't hang scanner threads waiting for safe time

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: WIP KUDU-1127 Don't hang scanner threads waiting for safe time
......................................................................


Patch Set 1:

(1 comment)

Yeah this doesn't try to make it not wait at all, just tries to avoid waiting when it would likely be pointless because safe time is so far in the past that it's probably not getting to get to 'timestamp' before the client's deadline expires.
I assume that in most cases the client's deadline would be bigger than the heartbeat interval.

http://gerrit.cloudera.org:8080/#/c/5305/1//COMMIT_MSG
Commit Message:

Line 15: This allowed to swap linked_list-test to finish with snapshot scans
> why not merge the test change in, so this goes in with its end-to-end test 
sure, will do. think it's still worth a unit test or would that be enough?


-- 
To view, visit http://gerrit.cloudera.org:8080/5305
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ic7cd0b0749e715c5d9e665a8e37d0f1c95af574e
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes