You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Dinesh Bhat (Code Review)" <ge...@cloudera.org> on 2016/10/25 10:37:49 UTC

[kudu-CR] [tools] Manual recovery tools (part 1)

Hello Mike Percy, Adar Dembo, Todd Lipcon,

I'd like you to do a code review.  Please visit

    http://gerrit.cloudera.org:8080/4834

to review the following change.

Change subject: [tools] Manual recovery tools (part 1)
......................................................................

[tools] Manual recovery tools (part 1)

This change introduces two recovery tools:
1) 'kudu tablet copy' copies the given tablet onto
   a destination server irrespective of the state of the
   destination replica. This is a port from
   https://gerrit.cloudera.org/#/c/3582/
2) 'kudu local_replica delete' (KUDU-1618) deletes a local replica
   of a tablet when the tablet server can not come up due to a
   bad replica of a tablet.

Also added few tests exercising typical usage scenario for these tools.

Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
---
M src/kudu/integration-tests/cluster_itest_util.cc
M src/kudu/tools/kudu-tool-test.cc
M src/kudu/tools/tool_action_local_replica.cc
M src/kudu/tools/tool_action_tablet.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
6 files changed, 372 insertions(+), 23 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/34/4834/1
-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [tools] Manual recovery tools (part 1)

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: [tools] Manual recovery tools (part 1)
......................................................................


Patch Set 5:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/tools/kudu-tool-test.cc
File src/kudu/tools/kudu-tool-test.cc:

PS1, Line 974: er_->tablet_ser
> Picked 1000 out of thin air, to have few KB worth of data in the table, and
Well, even a tablet with no data still has metadata that needs to be copied; an empty tablet is never a "no-op" in terms of copying. Since we're never measuring the size of the data, I don't see why we'd need to insert any rows at all.

That said, I see that you're passing workload.batches_completed() into the various WaitUntilAllReplicasHaveOp() calls. Why are those calls needed? Does the test fail without them?


http://gerrit.cloudera.org:8080/#/c/4834/2/src/kudu/tools/kudu-tool-test.cc
File src/kudu/tools/kudu-tool-test.cc:

PS2, Line 936: }
             : 
             : // Test 'kudu tablet copy' tool when the destination tablet server is online.
             : // 1. Test the copy tool when the destination replica is healthy
             : // 2. Test the copy tool when the 
> Done
Need to update the comments I highlighted too. At least the very first line.


http://gerrit.cloudera.org:8080/#/c/4834/5/src/kudu/tools/kudu-tool-test.cc
File src/kudu/tools/kudu-tool-test.cc:

PS5, Line 1004:   string stdout;
              :   string stderr;
Can remove these. Check the rest too.


PS5, Line 1052:  Explicit RPC to
              :   // delete-with-retries
What does this mean? DeleteTabletWithRetries() is a function; what does "explicit RPC" mean in this context?


PS5, Line 1101: replica data size to be visible.
What does this mean?


http://gerrit.cloudera.org:8080/#/c/4834/2/src/kudu/tools/tool_action_remote_replica.cc
File src/kudu/tools/tool_action_remote_replica.cc:

Line 312:   req.set_caller_term(std::numeric_limits<int64_t>::max());
> That's the intent yeah, although now that I think about it, it may be a nic
Yeah, I think that may be good. This would be -1 or 0 normally, and max when --force is set.

Run it by Mike though.


-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [tools] Manual recovery tools (part 1)

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: [tools] Manual recovery tools (part 1)
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/tools/tool_action_local_replica.cc
File src/kudu/tools/tool_action_local_replica.cc:

PS1, Line 276:   RETURN_NOT_OK(TSTabletManager::DeleteTabletData(
             :       meta, TabletDataState::TABLET_DATA_DELETED, boost::none));
> Yeah sure, my first approach was to provide last_logged_opid, but I think t
I am a bit torn on this one.

On the one hand, I think tombstoning the replica by default would be much better, while also providing an option to delete it. Deletion is not generally safe and should only happen in an emergency scenario or when the table itself is being deleted.

However, in order to properly tombstone a tablet, we should pass in the last_logged_opid. That comes from the WAL. I'm not sure how easy that is to do that without starting up lots of other components.

I think it's worth attempting to get the last logged opid by spinning up a LogReader, and if it's actually not that difficult then we should do it. If it's very complicated then I suppose we can just rely on the remote replica tool for tombstoning and just use this one for local deletion.


-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [tools] Manual recovery tools (part 1)

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Hello Mike Percy, Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4834

to look at the new patch set (#8).

Change subject: [tools] Manual recovery tools (part 1)
......................................................................

[tools] Manual recovery tools (part 1)

This change introduces two recovery tools:
1) 'kudu remote_replica copy' copies the given tablet onto
   a destination server irrespective of the state of the
   destination replica (--force option provided).
   This is a port from https://gerrit.cloudera.org/#/c/3582/
2) 'kudu local_replica delete' (KUDU-1618) deletes a local replica
   of a tablet when the tablet server can not come up due to a
   bad replica of a tablet.

Also added few tests exercising typical usage scenario for these tools.
Added trace to stderr under existing RunAction routines to throw more
info in the log if an action fails.

Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
---
M src/kudu/integration-tests/cluster_itest_util.cc
M src/kudu/integration-tests/cluster_itest_util.h
M src/kudu/integration-tests/delete_table-test.cc
M src/kudu/tools/kudu-tool-test.cc
M src/kudu/tools/tool_action_local_replica.cc
M src/kudu/tools/tool_action_remote_replica.cc
M src/kudu/tserver/heartbeater.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
9 files changed, 445 insertions(+), 63 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/34/4834/8
-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [tools] Manual recovery tools (part 1)

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Hello David Ribeiro Alves, Mike Percy, Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4834

to look at the new patch set (#6).

Change subject: [tools] Manual recovery tools (part 1)
......................................................................

[tools] Manual recovery tools (part 1)

This change introduces two recovery tools:
1) 'kudu remote_replica copy' copies the given tablet onto
   a destination server irrespective of the state of the
   destination replica (--force option provided).
   This is a port from https://gerrit.cloudera.org/#/c/3582/
2) 'kudu local_replica delete' (KUDU-1618) deletes a local replica
   of a tablet when the tablet server can not come up due to a
   bad replica of a tablet.

Also added few tests exercising typical usage scenario for these tools.
Added trace to stderr under existing RunAction routines to throw more
info in the log if an action fails.

Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
---
M src/kudu/integration-tests/cluster_itest_util.cc
M src/kudu/integration-tests/cluster_itest_util.h
M src/kudu/integration-tests/delete_table-test.cc
M src/kudu/tools/kudu-tool-test.cc
M src/kudu/tools/tool_action_local_replica.cc
M src/kudu/tools/tool_action_remote_replica.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
8 files changed, 487 insertions(+), 61 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/34/4834/6
-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [tools] Manual recovery tools (part 1)

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Hello Mike Percy, Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4834

to look at the new patch set (#15).

Change subject: [tools] Manual recovery tools (part 1)
......................................................................

[tools] Manual recovery tools (part 1)

This change introduces two recovery tools:
1) 'kudu remote_replica copy' copies the given tablet onto
   a destination server irrespective of the state of the
   destination replica (--force option provided).
   This is a port from https://gerrit.cloudera.org/#/c/3582/
2) 'kudu local_replica delete' (KUDU-1618) deletes a local replica
   of a tablet when the tablet server can not come up due to a
   bad replica of a tablet. This tool should be used with caution,
   hence added a --clean_unsafe flag to warn the user with
   consequences. As of this change, tool supports
   deleting the tablet permanently if used with --clean_unsafe flag,
   and fails with 'unsupported action error' without the flag.
   A future patch will add a default action of tombstoning
   the tablet instead of permanently deleting it.

Also added few tests exercising typical usage scenario for these tools.
Added trace to stderr under existing RunAction routines to throw more
info in the log if an action fails.

Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
---
M src/kudu/integration-tests/cluster_itest_util.cc
M src/kudu/integration-tests/cluster_itest_util.h
M src/kudu/integration-tests/delete_table-test.cc
M src/kudu/tools/CMakeLists.txt
M src/kudu/tools/ksck_remote.cc
M src/kudu/tools/kudu-tool-test.cc
M src/kudu/tools/tool_action_common.cc
M src/kudu/tools/tool_action_local_replica.cc
M src/kudu/tools/tool_action_remote_replica.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
11 files changed, 456 insertions(+), 65 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/34/4834/15
-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 15
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [tools] Manual recovery tools (part 1)

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Hello Mike Percy, Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4834

to look at the new patch set (#16).

Change subject: [tools] Manual recovery tools (part 1)
......................................................................

[tools] Manual recovery tools (part 1)

This change introduces two recovery tools:
1) 'kudu remote_replica copy' copies the given tablet onto
   a destination server irrespective of the state of the
   destination replica (--force option provided).
   This is a port from https://gerrit.cloudera.org/#/c/3582/
2) 'kudu local_replica delete' (KUDU-1618) deletes a local replica
   of a tablet when the tablet server can not come up due to a
   bad replica of a tablet. This tool should be used with caution,
   hence added a --clean_unsafe flag to warn the user with
   consequences. As of this change, tool supports
   deleting the tablet permanently if used with --clean_unsafe flag,
   and fails with 'unsupported action error' without the flag.
   A future patch will add a default action of tombstoning
   the tablet instead of permanently deleting it.

Also added few tests exercising typical usage scenario for these tools.
Added trace to stderr under existing RunAction routines to throw more
info in the log if an action fails.

Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
---
M src/kudu/integration-tests/cluster_itest_util.cc
M src/kudu/integration-tests/cluster_itest_util.h
M src/kudu/integration-tests/delete_table-test.cc
M src/kudu/tools/CMakeLists.txt
M src/kudu/tools/ksck_remote.cc
M src/kudu/tools/kudu-tool-test.cc
M src/kudu/tools/tool_action_common.cc
M src/kudu/tools/tool_action_local_replica.cc
M src/kudu/tools/tool_action_remote_replica.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
11 files changed, 456 insertions(+), 65 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/34/4834/16
-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 16
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [tools] Manual recovery tools (part 1)

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: [tools] Manual recovery tools (part 1)
......................................................................


Patch Set 1:

(32 comments)

http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/integration-tests/cluster_itest_util.cc
File src/kudu/integration-tests/cluster_itest_util.cc:

PS1, Line 224:   return Status::TimedOut(Substitute("Index $0 not available on all replicas after $1. "
             :                                               "Replicas: [ $2 ]",
             :                                               log_index, passed.ToString(), replicas_str));
Good catch. Could you reindent the continuation lines though?


http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/tools/kudu-tool-test.cc
File src/kudu/tools/kudu-tool-test.cc:

PS1, Line 172: RunActionStderrString
Nit: RunActionStdoutStderrString (since it's returning both).


Line 215:  protected:
Why this change?


Line 216:   string tool_path_;
Remember to order functions first, then fields.


PS1, Line 225:   gscoped_ptr<ExternalMiniCluster> cluster_;
             :   gscoped_ptr<ExternalMiniClusterFsInspector> inspect_;
             :   unordered_map<string, TServerDetails*> ts_map_;
             :   gscoped_ptr<MiniCluster> mini_cluster_;
Can we use unique_ptrs here instead?


Line 842:   NO_FATALS(StartExternalMiniCluster({}, {"--never_fsync"}, num_tservers));
Okay but we lost the "// fsync causes ..." comment.


Line 937: // Test 'kudu tablet copy' tool when the destination tablet server is online.
Mike should definitely review this new test; I'm not familiar with the tablet deletion logic the way he is, and so I don't know how to assess the various WaitForFoo() calls.


Line 941: TEST_F(ToolTest, TestTabletCopy) {
In the various RunActionFoo() calls, you're not using the resulting stdout/stderr; can you replace them with no-arg calls?


Line 944:   const int kSrcTsIndex = 1, kDstTsIndex = 2, kAddOnTsIndex = 3;
Nit: declare each of these on its own line.


Line 947:       {}, {"--follower_unavailable_considered_failed_sec=3"}, 4));
Nit: should also explain (in a comment) why we need 4 tservers.


PS1, Line 951: to add this add-on tserver for all the tablets which are under-replicated.
Nit: to replicate all under-replicated tablets to the add-on tserver.


PS1, Line 974: consistent rows
What does "consistent" mean in this context? And how does waiting for the client to write at least 1000 rows guarantee it?


PS1, Line 984:  ;  i++) {
Nit: there's extra space here, both before and after the semicolon.


Line 985:     tservers.push_back(ts_map_[cluster_->tablet_server(i)->uuid()]);
Maybe loop from i to num_tservers, then skip when i == kAddOnTsIndex? That'll be robust to changes in kAddOnTsIndex; right now this loop assumes that kAddOnTsIndex is the last index.


Line 996:   string stdout, stderr;
Nit: declare on separate lines.


PS1, Line 1018: "local_replica delete $0 --fs_wal_dir=$1 "
              :                                              "--fs_data_dirs=$2
See my other comment about substitution and fs_wal_dir/fs_data_dirs.


PS1, Line 1033: auto
Can this be const auto?


Line 1067:   NO_FATALS(StartMiniCluster());
Why not use an external mini cluster here too, and avoid the need for code that can start either kind? Can we get by with that?

If the need is to force a flush on the particular tablet, you can set flush_threshold_mb very low, insert rows in a loop, and periodically look at a maintenance manager metric to see if a flush happened.


Line 1079:   // Grab the tablet's local replica to delete
You mean grab its tablet_id, right?


PS1, Line 1084:   TabletServerServiceProxy* ts_proxy = new TabletServerServiceProxy(
              :       ts->server()->messenger(), ts->bound_rpc_addr());
Do we have to allocate this from the heap? Can we just declare it on the stack?


PS1, Line 1088:   vector<ListTabletsResponsePB::StatusAndSchemaPB> tablets;
              :   tablets.assign(resp.status_and_schema().begin(), resp.status_and_schema().end());
              :   const string& tablet_id = tablets.at(0).tablet_status().tablet_id();
Why do we need the indirection of 'tablets'? Can't we reach into resp.status_and_schema() directly to get the first tablet? It's the only tablet right? We should ASSERT that.


Line 1107:   const string& data_dir = JoinPathSegments(tserver_dir, "data");
Why must we consider the data subdirectory specifically? Can't we just compare the size of tserver_dir before and after?


PS1, Line 1111: "local_replica delete $0 --fs_wal_dir=$1 "
              :                                               "--fs_data_dirs=$2",
You can omit --fs_data_dirs. Or you can reuse $1, then pass tserver_dir just once into the substitution.


PS1, Line 1113: stdout
stdout is never used; can you call the variant that doesn't provide any output?


http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/tools/tool_action_local_replica.cc
File src/kudu/tools/tool_action_local_replica.cc:

PS1, Line 276:   RETURN_NOT_OK(TSTabletManager::DeleteTabletData(
             :       meta, TabletDataState::TABLET_DATA_DELETED, boost::none));
Looks reasonable to me, but can you make sure Mike looks at this? I'm specifically wondering whether 1) we should tombstone the replica, and 2) whether we need to provide a last_logged_opid.


http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/tools/tool_action_tablet.cc
File src/kudu/tools/tool_action_tablet.cc:

PS1, Line 215:   const string& src_address = FindOrDie(context.required_args, "src_tserver_address");
             :   const string& dst_address = FindOrDie(context.required_args, "dst_tserver_address");
Define the two strings as static constants earlier in the file, then use those here and in BuildTabletMode().


Line 232:   rpc.set_timeout(MonoDelta::FromSeconds(30)); // Should this be higher ?
If you don't set a timeout at all, won't the request never time out? Then you won't need to LOG that warning below either.


PS1, Line 247: StartTabletCopy
As per KUDU-921, we'll modify the StartTabletCopy() implementation in the future so that it doesn't block on the entire copy procedure finishing. We should adjust the name/description of this action accordingly, to imply that it merely starts the copy operation, but doesn't wait for it to finish.

If you think it's important for it to wait on the copy to finish, we'll need to add some way to ask a tserver about its outstanding tablet copies.


PS1, Line 300:       .AddRequiredParameter({ "src_tserver_address", "Source tablet server to copy the tablet replica from" })
             :       .AddRequiredParameter({ "dst_tserver_address", "Destination tablet server to copy the tablet replica" })
How about just "src_address" and "dst_address", since the descriptions do a good job of explaining that these are tserver addresses?

Also, try to use the same description as is used in tool_action_tserver.cc's kTServerAddressDesc.


Line 315:       .AddAction(std::move(copy_tablet))
I think the remote_replica mode would be a more appropriate home, since this operation has to do with a single replica rather than an entire tablet.


http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 916:   LOG(INFO) << "T " + tablet_id + " P " + meta->fs_manager()->uuid() + ": "
Can you make a static two-arg variant of LogPrefix() and chain the non-static variant into it? That way the actual logging format (i.e. "T " tablet_id ...") is still described in just one place.


http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/tserver/ts_tablet_manager.h
File src/kudu/tserver/ts_tablet_manager.h:

PS1, Line 180:   // This allows the tablet cleanup even when the tablet server is offline.
Nit: I don't think this new sentence is necessary; we generally don't use a function comment to justify the existence of the function.


-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [tools] Manual recovery tools (part 1)

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Dinesh Bhat has posted comments on this change.

Change subject: [tools] Manual recovery tools (part 1)
......................................................................


Patch Set 8:

(1 comment)

Hi Mike,

I have reworked the entire tablet-copy test to make it less convoluted and much simpler this time, also made it pass 7000 iterations of dist_test run for this test without any race/flakiness observed. Also, the test was taking 15 secs earlier, now roughly finishes under 1-2 secs.

http://gerrit.cloudera.org:8080/#/c/4834/6/src/kudu/tools/kudu-tool-test.cc
File src/kudu/tools/kudu-tool-test.cc:

Line 1069:   // There is only one tserver in the minicluster.
> You bet it does :). That's precisely I added a new routine @L1066 which wai
I ended up removing some of these altogether as this became useless after I re-worked the test.


-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [tools] Manual recovery tools (part 1)

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Hello Mike Percy, Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4834

to look at the new patch set (#11).

Change subject: [tools] Manual recovery tools (part 1)
......................................................................

[tools] Manual recovery tools (part 1)

This change introduces two recovery tools:
1) 'kudu remote_replica copy' copies the given tablet onto
   a destination server irrespective of the state of the
   destination replica (--force option provided).
   This is a port from https://gerrit.cloudera.org/#/c/3582/
2) 'kudu local_replica delete' (KUDU-1618) deletes a local replica
   of a tablet when the tablet server can not come up due to a
   bad replica of a tablet.

Also added few tests exercising typical usage scenario for these tools.
Added trace to stderr under existing RunAction routines to throw more
info in the log if an action fails.

Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
---
M src/kudu/integration-tests/cluster_itest_util.cc
M src/kudu/integration-tests/cluster_itest_util.h
M src/kudu/integration-tests/delete_table-test.cc
M src/kudu/tools/CMakeLists.txt
M src/kudu/tools/ksck_remote.cc
M src/kudu/tools/kudu-tool-test.cc
M src/kudu/tools/tool_action_common.cc
M src/kudu/tools/tool_action_local_replica.cc
M src/kudu/tools/tool_action_remote_replica.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
11 files changed, 435 insertions(+), 65 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/34/4834/11
-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 11
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [tools] Manual recovery tools (part 1)

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Hello David Ribeiro Alves, Mike Percy, Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4834

to look at the new patch set (#4).

Change subject: [tools] Manual recovery tools (part 1)
......................................................................

[tools] Manual recovery tools (part 1)

This change introduces two recovery tools:
1) 'kudu tablet copy' copies the given tablet onto
   a destination server irrespective of the state of the
   destination replica. This is a port from
   https://gerrit.cloudera.org/#/c/3582/
2) 'kudu local_replica delete' (KUDU-1618) deletes a local replica
   of a tablet when the tablet server can not come up due to a
   bad replica of a tablet.

Also added few tests exercising typical usage scenario for these tools.

Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
---
M src/kudu/integration-tests/cluster_itest_util.cc
M src/kudu/integration-tests/cluster_itest_util.h
M src/kudu/integration-tests/delete_table-test.cc
M src/kudu/tools/kudu-tool-test.cc
M src/kudu/tools/tool_action_local_replica.cc
M src/kudu/tools/tool_action_remote_replica.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
8 files changed, 417 insertions(+), 58 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/34/4834/4
-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [tools] Manual recovery tools (part 1)

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: [tools] Manual recovery tools (part 1)
......................................................................


Patch Set 6:

(16 comments)

http://gerrit.cloudera.org:8080/#/c/4834/6/src/kudu/integration-tests/cluster_itest_util.h
File src/kudu/integration-tests/cluster_itest_util.h:

Line 267:     const shared_ptr<master::MasterServiceProxy>& master_proxy,
> error: expected ')' [clang-diagnostic-error]
std::shared_ptr


Line 268:     const string& tablet_id,
std::string here and below


http://gerrit.cloudera.org:8080/#/c/4834/6/src/kudu/tools/kudu-tool-test.cc
File src/kudu/tools/kudu-tool-test.cc:

Is it time to start breaking these tool tests out into separate tests? This file is getting quite large.


Line 950:   // Force a change_config within 3 seconds of follower unavailability.
Why 3 seconds? Seems like it could make the test a bit unstable, no?


Line 1069: 
Won't the below stuff race with the master trying to delete them because they aren't supposed to be there?


Line 1126:   ts->Shutdown();
Try running the tool before shutdown to ensure that it fails?


Line 1149: }
I think also restarting the TS and ensuring that the tablet doesn't exist anymore would also be useful


http://gerrit.cloudera.org:8080/#/c/4834/6/src/kudu/tools/tool_action_local_replica.cc
File src/kudu/tools/tool_action_local_replica.cc:

Line 267:   // This action cleans up metadata/data/WAL for a tablet on this node when
I think the information in this comment should instead be moved to the tool help.


Line 736:       .Description("Delete Kudu replica from the local filesystem")
s/replica/tablet replica/ here and elsewhere in this file's help text


Line 743:       .Description("Operate on local Kudu replicas via the local filesystem")
tablet replicas


http://gerrit.cloudera.org:8080/#/c/4834/6/src/kudu/tools/tool_action_remote_replica.cc
File src/kudu/tools/tool_action_remote_replica.cc:

Line 299:   RETURN_NOT_OK(GetServerStatus(dst_address, TabletServer::kDefaultPort,
Is there no way to override the port in these tools? Here and below. Seems like a problem


Line 315:   // This is also applicable for replicas which are tombstoned.
What does "applicable" mean here?


Line 322:   RETURN_NOT_OK(HostPortToPB(src_host_port, req.mutable_copy_peer_addr()));
Why all these conversions? How about:

*req.mutable_copy_peer_addr() = src_status.bound_rpc_addresses(0);


Line 331:     return Status::RemoteError(resp.ShortDebugString());
I think something like this would be better:

  return StatusFromPB(resp.error().status()).CloneAndPrepend(Substitute("Remote server returned error code $0", TabletServerErrorPB_Code(resp.error().code())));

Maybe there is a helper function somewhere for that or we could create one.


http://gerrit.cloudera.org:8080/#/c/4834/6/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 563:   DCHECK_NOTNULL(fs_manager);
nit: DCHECK_NOTNULL macro returns its argument so consider using DCHECK(fs_manager != nullptr) instead to avoid a compiler warning when not using the argument


Line 564:   return "T " + tablet_id + " P " + fs_manager->uuid() + ": ";
nit: consider using strings::Substitute here to cut down on memory allocations (I know it was already like that)


-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [tools] Manual recovery tools (part 1)

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Hello Mike Percy, Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4834

to look at the new patch set (#17).

Change subject: [tools] Manual recovery tools (part 1)
......................................................................

[tools] Manual recovery tools (part 1)

This change introduces two recovery tools:
1) 'kudu remote_replica copy' copies the given tablet onto
   a destination server irrespective of the state of the
   destination replica (--force option provided).
   This is a port from https://gerrit.cloudera.org/#/c/3582/
2) 'kudu local_replica delete' (KUDU-1618) deletes a local replica
   of a tablet when the tablet server can not come up due to a
   bad replica of a tablet. This tool should be used with caution,
   hence added a --clean_unsafe flag to warn the user with
   consequences. As of this change, tool supports
   deleting the tablet permanently if used with --clean_unsafe flag,
   and fails with 'unsupported action error' without the flag.
   A future patch will add a default action of tombstoning
   the tablet instead of permanently deleting it.

Also added few tests exercising typical usage scenario for these tools.
Added trace to stderr under existing RunAction routines to throw more
info in the log if an action fails.

Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
---
M src/kudu/integration-tests/cluster_itest_util.cc
M src/kudu/integration-tests/cluster_itest_util.h
M src/kudu/integration-tests/delete_table-test.cc
M src/kudu/tools/CMakeLists.txt
M src/kudu/tools/ksck_remote.cc
M src/kudu/tools/kudu-tool-test.cc
M src/kudu/tools/tool_action_common.cc
M src/kudu/tools/tool_action_local_replica.cc
M src/kudu/tools/tool_action_remote_replica.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
11 files changed, 456 insertions(+), 65 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/34/4834/17
-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 17
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [tools] Manual recovery tools (part 1)

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: [tools] Manual recovery tools (part 1)
......................................................................


Patch Set 13:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/4834/13/src/kudu/tools/tool_action_local_replica.cc
File src/kudu/tools/tool_action_local_replica.cc:

Line 80: DEFINE_bool(force_unsafe, false,
Hmm, --force_unsafe, I guess I would prefer either --force or --clean_unsafe since "force" implies unsafe in my mind but "clean" is more descriptive.


Line 81:             "Delete the local replica permanently which may "
How about: Delete the local replica completely, not leaving a tombstone. This is not guaranteed to be safe because it also removes the consensus metadata (including Raft voting record) for the specified tablet, which violates the Raft vote durability requirements.


Line 270:         "not supported without --force_unsafe flag", tablet_id));
s/not supported/currently not supported/


-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 13
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [tools] Manual recovery tools (part 1)

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Dinesh Bhat has posted comments on this change.

Change subject: [tools] Manual recovery tools (part 1)
......................................................................


Patch Set 7:

(16 comments)

http://gerrit.cloudera.org:8080/#/c/4834/6/src/kudu/integration-tests/cluster_itest_util.h
File src/kudu/integration-tests/cluster_itest_util.h:

Line 268:     const std::string& tablet_id,
> std::string here and below
done, unfortunately these are not caught while compiling on OS X :(.


http://gerrit.cloudera.org:8080/#/c/4834/6/src/kudu/tools/kudu-tool-test.cc
File src/kudu/tools/kudu-tool-test.cc:

> Is it time to start breaking these tool tests out into separate tests? This
I think having all the tests related to ./bin/kudu toolset in one file is better. In fact we have few more tests in kudu-admin-test which we wanted to migrate here. Is there a concern about how long this test will run ? As of now it finishes roughly under 25 secs (entire test suite).


Line 950:   // Force a change_config within 3 seconds of follower unavailability.
> Why 3 seconds? Seems like it could make the test a bit unstable, no?
Can you explain bit more what you meant by test becoming unstable ? The intent here was to kick the change_config as quickly as possible, but not as aggressive as the heartbeat timeout itself. Hence chose 2x factor here.


PS6, Line 1013:   ASSERT_STR_CONTAINS(stderr, "Rejecting tablet copy request");
              : 
> Nit: this text is only used in one place, so perhaps combine the two lines 
Done


Line 1069:   // Verify that tombstoned_tablet_id is copied successfully.
> Won't the below stuff race with the master trying to delete them because th
You bet it does :). That's precisely I added a new routine @L1066 which waits till master evicts destination replica from its config for the tombstoned tablet we are about to copy.  That way we know that tablet is in steady tombstoned state and copy below is expected to succeed.

For eg, this is the race you are probably indicating which I tried to fix above: 
http://104.196.14.100/job/kudu-gerrit/4202/BUILD_TYPE=RELEASE/testReport/junit/(root)/ToolTest/TestRemoteReplicaCopy/


Line 1126:   const string& tserver_dir = ts->options()->fs_opts.wal_path;
> Try running the tool before shutdown to ensure that it fails?
Done


Line 1149:   const string& meta_dir = JoinPathSegments(tserver_dir,
> I think also restarting the TS and ensuring that the tablet doesn't exist a
Done, although I am trying to see why I am not able to run ts->Start() again on OS X, it complains about already bound RPCs. But linux seems fine, I will get back on this, but updated the diffs in the meanwhile.


http://gerrit.cloudera.org:8080/#/c/4834/6/src/kudu/tools/tool_action_local_replica.cc
File src/kudu/tools/tool_action_local_replica.cc:

Line 267:   // This action cleans up metadata/data/WAL for a tablet on this node when
> I think the information in this comment should instead be moved to the tool
Hmmm... I am not sure if we want to be more more verbose in the tool help than 'this tool will make this tablet go away' - we don't need need to be explicit about wal/metadata going away I think ?


Line 736:       .Description("Delete Kudu replica from the local filesystem")
> s/replica/tablet replica/ here and elsewhere in this file's help text
This was intentional, when we were categorizing the toolset into multiple sub-modes, we decided to group them into relatively newer terminologies like this: 
'kudu tablet'  actions pertaining to tablet (or which affects tablet)
'kudu local_replica' actions on local fs when tserver not running
'kudu remote_replica' actions on the tablet replica running on another server

So we abstained from using the word 'tablet' except for actions related to tablet under tool_action_tablet.cc. I also changed the help description of 'remote_replica copy' to be more consistent with this idea.


Line 743:       .Description("Operate on local Kudu replicas via the local filesystem")
> tablet replicas
same as above.


http://gerrit.cloudera.org:8080/#/c/4834/6/src/kudu/tools/tool_action_remote_replica.cc
File src/kudu/tools/tool_action_remote_replica.cc:

Line 299:   RETURN_NOT_OK(GetServerStatus(dst_address, TabletServer::kDefaultPort,
> I believe GetServerStatus() will use the port embedded in the first arg and
That's correct:
https://github.com/apache/kudu/blob/master/src/kudu/util/net/net_util.cc#L101


Line 315:   // Provide a force option if the destination tablet server has the
> What does "applicable" mean here?
deleted the sentence, I was trying to indicate tablet-copy may fail irrespective of the state of the tablet as long as replica exists on the node, but that's probably redundant.


Line 322:   LOG(WARNING) << "NOTE: this copy may happen asynchronously "
> Why all these conversions? How about:
Done


Line 331:                             tserver::TabletServerErrorPB_Code(resp.error().code())));
> I think something like this would be better:
Done, wrapped with RETURN_NOT_OK_PREPEND.


http://gerrit.cloudera.org:8080/#/c/4834/6/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 563:   DCHECK(fs_manager != nullptr);
> nit: DCHECK_NOTNULL macro returns its argument so consider using DCHECK(fs_
Done


Line 564:   return Substitute("T $0 P $1: ", tablet_id, fs_manager->uuid());
> nit: consider using strings::Substitute here to cut down on memory allocati
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [tools] Manual recovery tools (part 1)

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Hello Mike Percy, Adar Dembo, Todd Lipcon, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4834

to look at the new patch set (#2).

Change subject: [tools] Manual recovery tools (part 1)
......................................................................

[tools] Manual recovery tools (part 1)

This change introduces two recovery tools:
1) 'kudu tablet copy' copies the given tablet onto
   a destination server irrespective of the state of the
   destination replica. This is a port from
   https://gerrit.cloudera.org/#/c/3582/
2) 'kudu local_replica delete' (KUDU-1618) deletes a local replica
   of a tablet when the tablet server can not come up due to a
   bad replica of a tablet.

Also added few tests exercising typical usage scenario for these tools.

Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
---
M src/kudu/integration-tests/cluster_itest_util.cc
M src/kudu/tools/kudu-tool-test.cc
M src/kudu/tools/tool_action_local_replica.cc
M src/kudu/tools/tool_action_remote_replica.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
6 files changed, 383 insertions(+), 28 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/34/4834/2
-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [tools] Manual recovery tools (part 1)

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Dinesh Bhat has posted comments on this change.

Change subject: [tools] Manual recovery tools (part 1)
......................................................................


Patch Set 9:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/4834/6/src/kudu/tools/kudu-tool-test.cc
File src/kudu/tools/kudu-tool-test.cc:

> > I think having all the tests related to ./bin/kudu toolset in one file is
I see, I didn't think about parallelization earlier, I was pointing out from source files grouping point of view earlier. Sure, let's do that in a follow-up patch, I will track this in a jira so that we won't miss this comment.


Line 1149
> Not sure about macOS, but on the ExternalMiniCluster the proper thing to do
it didn't look buggy after going through Restart/Start/Shutdown paths, it's just following a different Semantics than ExternalMiniCluster. Restart() here assumes that server should be running so that it does a 'Shotdown and Start'. Since we have manually shut it down earlier, Start() has brought up the tserver on same ports again.


http://gerrit.cloudera.org:8080/#/c/4834/9/src/kudu/tools/kudu-tool-test.cc
File src/kudu/tools/kudu-tool-test.cc:

Line 944: TEST_F(ToolTest, TestRemoteReplicaCopy) {
> If we're testing copying, should we write some data in this test? You could
Yeah, I think that was the test I had earlier if you recall PS6. I resorted to keeping it simple, because the intent here is to test the functionality of the tool and not the actual tablet copy itself (I think Adar also had similar opinion here). Keeping that in mind, I thought it was sufficient to execute tablet copy tool and check whether or not we were able to bring the tablet or not.


Line 994:   // by attempting to copy tablets while we tombstone the tablet on destination server.
> I don't think you ever elected a leader, so I don't know who would interfer
Good point - honestly I didn't test this without shutting down the master or other tservers to begin with, I updated diffs to keep all of them running, and this also removed several unnecessary checks, thanks you!


Line 1043:   NO_FATALS(RunActionStdoutNone(Substitute("remote_replica copy $0 $1 $2 --force_copy",
> Should --force be needed if it's tombstoned?
Yeah, here. https://github.com/apache/kudu/blob/master/src/kudu/tserver/ts_tablet_manager.cc#L378
Was that unexpected ?


Line 1048:   // deleted_tablet_id is copied from source to destination.
> nit: use active, not passive voice, i.e. 
Done


Line 1050:   ASSERT_OK(WaitUntilTabletInState(src_ts, deleted_tablet_id,
> Why is this needed? We just did this on line 1039
That was for a different tablet_id, but in any case I removed these checks since we are never shuttong down the source tserver now.


Line 1135:   ASSERT_OK(ts->Start());
> Should be Restart() I think? But looking at the impl it seems that it just 
Start() is correct here, otherwise Restart() hits a check since it expects the server to be running, a slightly different semantics than we follow with ExternalMiniCluster. I will fix this separately if I find anything here, but for now jenkins/linux test runs are happy.


http://gerrit.cloudera.org:8080/#/c/4834/6/src/kudu/tools/tool_action_remote_replica.cc
File src/kudu/tools/tool_action_remote_replica.cc:

Line 347:       .Description("Copy a replica from one Kudu Tablet Server to another")
> any thoughts on using "tablet replica" throughout this file instead of just
If you see the latest patch, I updated this to just 'replica'. Since we use 'replica' in only one context under Kudu, tablet is probably redundant IMHO.


http://gerrit.cloudera.org:8080/#/c/4834/9/src/kudu/tools/tool_action_remote_replica.cc
File src/kudu/tools/tool_action_remote_replica.cc:

Line 331:                             tserver::TabletServerErrorPB_Code(resp.error().code())));
> oops, typo in my original suggestion, should be TabletServerErrorPB::Code_N
Done, I am just curious how did we generate Code_Name from the proto file ? I am not able to navigate the generator of that routine.


-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [tools] Manual recovery tools (part 1)

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has submitted this change and it was merged.

Change subject: [tools] Manual recovery tools (part 1)
......................................................................


[tools] Manual recovery tools (part 1)

This change introduces two recovery tools:
1) 'kudu remote_replica copy' copies the given tablet onto
   a destination server irrespective of the state of the
   destination replica (--force option provided).
   This is a port from https://gerrit.cloudera.org/#/c/3582/
2) 'kudu local_replica delete' (KUDU-1618) deletes a local replica
   of a tablet when the tablet server can not come up due to a
   bad replica of a tablet. This tool should be used with caution,
   hence added a --clean_unsafe flag to warn the user with
   consequences. As of this change, tool supports
   deleting the tablet permanently if used with --clean_unsafe flag,
   and fails with 'unsupported action error' without the flag.
   A future patch will add a default action of tombstoning
   the tablet instead of permanently deleting it.

Also added few tests exercising typical usage scenario for these tools.
Added trace to stderr under existing RunAction routines to throw more
info in the log if an action fails.

Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Reviewed-on: http://gerrit.cloudera.org:8080/4834
Tested-by: Kudu Jenkins
Reviewed-by: Mike Percy <mp...@apache.org>
---
M src/kudu/integration-tests/cluster_itest_util.cc
M src/kudu/integration-tests/cluster_itest_util.h
M src/kudu/integration-tests/delete_table-test.cc
M src/kudu/tools/CMakeLists.txt
M src/kudu/tools/ksck_remote.cc
M src/kudu/tools/kudu-tool-test.cc
M src/kudu/tools/tool_action_common.cc
M src/kudu/tools/tool_action_local_replica.cc
M src/kudu/tools/tool_action_remote_replica.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
11 files changed, 456 insertions(+), 65 deletions(-)

Approvals:
  Mike Percy: Looks good to me, approved
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 18
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [tools] Manual recovery tools (part 1)

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: [tools] Manual recovery tools (part 1)
......................................................................


Patch Set 17: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 17
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] [tools] Manual recovery tools (part 1)

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Dinesh Bhat has posted comments on this change.

Change subject: [tools] Manual recovery tools (part 1)
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/tools/kudu-tool-test.cc
File src/kudu/tools/kudu-tool-test.cc:

Line 941:   const int kAddOnTsIndex = 3;
> Sounds like good idea, let me do that in a follow-up patch, updated current
Upon second thoughts, I included this suggestion in this patch itself, also updated commit_msg. We are mostly interested in doing SCOPE_TRACE of stderr from the RunAction routines, hence prototypes of existing RunAction* did not change.


-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [tools] Manual recovery tools (part 1)

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Hello Mike Percy, Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4834

to look at the new patch set (#12).

Change subject: [tools] Manual recovery tools (part 1)
......................................................................

[tools] Manual recovery tools (part 1)

This change introduces two recovery tools:
1) 'kudu remote_replica copy' copies the given tablet onto
   a destination server irrespective of the state of the
   destination replica (--force option provided).
   This is a port from https://gerrit.cloudera.org/#/c/3582/
2) 'kudu local_replica delete' (KUDU-1618) deletes a local replica
   of a tablet when the tablet server can not come up due to a
   bad replica of a tablet.

Also added few tests exercising typical usage scenario for these tools.
Added trace to stderr under existing RunAction routines to throw more
info in the log if an action fails.

Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
---
M src/kudu/integration-tests/cluster_itest_util.cc
M src/kudu/integration-tests/cluster_itest_util.h
M src/kudu/integration-tests/delete_table-test.cc
M src/kudu/tools/CMakeLists.txt
M src/kudu/tools/ksck_remote.cc
M src/kudu/tools/kudu-tool-test.cc
M src/kudu/tools/tool_action_common.cc
M src/kudu/tools/tool_action_local_replica.cc
M src/kudu/tools/tool_action_remote_replica.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
11 files changed, 454 insertions(+), 65 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/34/4834/12
-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 12
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [tools] Manual recovery tools (part 1)

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Dinesh Bhat has posted comments on this change.

Change subject: [tools] Manual recovery tools (part 1)
......................................................................


Patch Set 10:

(2 comments)

> (2 comments)
 > 
 > Would you mind also adding a test that makes sure that attempting
 > to copy over an existing (running) tablet, even in --force mode,
 > fails?
 
BTW, this doesn't fail if I am copying over an existing healthy replica with --force, because we set the caller id to INT_MAX from the tool.

 > Also, I wonder what the value of the --force flag is, since this is
 > a manual tool maybe we should always force. This seems like a tool
 > that you would only typically use --force with, so why bother
 > requiring people to specify it.
Precisely the above case, we don't want user to accidentally copy on an existing healthy replica, but if he wants to, we can override using force option.

http://gerrit.cloudera.org:8080/#/c/4834/6/src/kudu/tools/kudu-tool-test.cc
File src/kudu/tools/kudu-tool-test.cc:

Line 1149
> It's bad that they have different semantics, because they implement the sam
Yeah, I root-caused it to be a missing piece in socket teardown path for macOS, but I am gonna fix that in a different patch.


http://gerrit.cloudera.org:8080/#/c/4834/9/src/kudu/tools/tool_action_remote_replica.cc
File src/kudu/tools/tool_action_remote_replica.cc:

Line 331:         strings::Substitute("Remote server returned error code $0",
> it's generated and placed in build/latest/src/kudu/tserver/tserver.pb.cc
Yeah, I was more curious about what makes this Enum special ? It turns out the protobuf always generate one such routines for Enum types, that's cool :)


-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 10
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [tools] Manual recovery tools (part 1)

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: [tools] Manual recovery tools (part 1)
......................................................................


Patch Set 10:

(2 comments)

> Would you mind also adding a test that makes sure that attempting
> to copy over an existing (running) tablet, even in --force mode,
> fails?

Forget this comment, after looking at the code it would not fail, it would actually tombstone the existing one and then overwrite it.

http://gerrit.cloudera.org:8080/#/c/4834/10/src/kudu/tools/tool_action_remote_replica.cc
File src/kudu/tools/tool_action_remote_replica.cc:

Line 52: DEFINE_bool(force_copy, false,
Hmm, can we just make this --force instead of --force_copy ? --force seems a lot easier to remember.


Line 320:   }
I think by default we should pass in the source server's term. This can be obtained via the consensus_proxy->GetConsensusState() call.

If we don't do this, people will have to just use --force all of the time which seems silly.


-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 10
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [tools] Manual recovery tools (part 1)

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Hello Mike Percy, Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4834

to look at the new patch set (#9).

Change subject: [tools] Manual recovery tools (part 1)
......................................................................

[tools] Manual recovery tools (part 1)

This change introduces two recovery tools:
1) 'kudu remote_replica copy' copies the given tablet onto
   a destination server irrespective of the state of the
   destination replica (--force option provided).
   This is a port from https://gerrit.cloudera.org/#/c/3582/
2) 'kudu local_replica delete' (KUDU-1618) deletes a local replica
   of a tablet when the tablet server can not come up due to a
   bad replica of a tablet.

Also added few tests exercising typical usage scenario for these tools.
Added trace to stderr under existing RunAction routines to throw more
info in the log if an action fails.

Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
---
M src/kudu/integration-tests/cluster_itest_util.cc
M src/kudu/integration-tests/cluster_itest_util.h
M src/kudu/integration-tests/delete_table-test.cc
M src/kudu/tools/kudu-tool-test.cc
M src/kudu/tools/tool_action_local_replica.cc
M src/kudu/tools/tool_action_remote_replica.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
8 files changed, 439 insertions(+), 61 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/34/4834/9
-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [tools] Manual recovery tools (part 1)

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Dinesh Bhat has posted comments on this change.

Change subject: [tools] Manual recovery tools (part 1)
......................................................................


Patch Set 1:

(38 comments)

http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/integration-tests/cluster_itest_util.cc
File src/kudu/integration-tests/cluster_itest_util.cc:

PS1, Line 224:   return Status::TimedOut(Substitute("Index $0 not available on all replicas after $1. "
             :                                               "Replicas: [ $2 ]",
             :                                               log_index, passed.ToString(), replicas_str));
> Good catch. Could you reindent the continuation lines though?
Done


http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/tools/kudu-tool-test.cc
File src/kudu/tools/kudu-tool-test.cc:

Line 56: #include "kudu/tablet/tablet.h"
> warning: #includes are not sorted properly [llvm-include-order]
Done


Line 84: using consensus::ConsensusConfigType;
> warning: using decl 'ConsensusConfigType' is unused [misc-unused-using-decl
Done


Line 85: using consensus::ConsensusStatePB;
> warning: using decl 'ConsensusStatePB' is unused [misc-unused-using-decls]
Done


Line 87: using consensus::RaftPeerPB;
> warning: using decl 'RaftPeerPB' is unused [misc-unused-using-decls]
Done


PS1, Line 172: RunActionStderrString
> Nit: RunActionStdoutStderrString (since it's returning both).
Done


Line 215:  protected:
> Why this change?
TEST_F macro eventually defines a class for each of the tests defined below and they inherit ToolTest. So if I want to be able to access these members in ToolTest, I have to keep them protected(as opposed to private). Also looked at few other examples too following this approach.


Line 216:   string tool_path_;
> Remember to order functions first, then fields.
thank you for catching, done.


PS1, Line 225:   gscoped_ptr<ExternalMiniCluster> cluster_;
             :   gscoped_ptr<ExternalMiniClusterFsInspector> inspect_;
             :   unordered_map<string, TServerDetails*> ts_map_;
             :   gscoped_ptr<MiniCluster> mini_cluster_;
> Can we use unique_ptrs here instead?
Done


Line 842:   NO_FATALS(StartExternalMiniCluster({}, {"--never_fsync"}, num_tservers));
> Okay but we lost the "// fsync causes ..." comment.
Done


Line 937: // Test 'kudu tablet copy' tool when the destination tablet server is online.
> Mike should definitely review this new test; I'm not familiar with the tabl
Sure, I have added him, will poke him in hipchat too.


Line 941: TEST_F(ToolTest, TestTabletCopy) {
> In the various RunActionFoo() calls, you're not using the resulting stdout/
Essentially, I wanted to do this:
s = RunTool(std_args, stdout, stderr)
SCOPED_TRACE(stdout)
SCOPED_TRACE(stderr)
ASSERT_OK(s);

While running the tablet copy tool test below(with RunActionStdoutNone), because of a bug I had in the test, the RunTool was hitting a RunTimeError, and the only error I was seeing in the log was 'bin/kudu exited with non-zero status' which led me to add RunToolStdoutStdErr() variant to print more scope logs in place. Hence, I have kept them not because this routine uses it, but because the callee uses to SCOPE_TRACE stdout/stderr which is useful at times.


Line 944:   const int kSrcTsIndex = 1, kDstTsIndex = 2, kAddOnTsIndex = 3;
> Nit: declare each of these on its own line.
Done


Line 947:       {}, {"--follower_unavailable_considered_failed_sec=3"}, 4));
> Nit: should also explain (in a comment) why we need 4 tservers.
Done


PS1, Line 951: to add this add-on tserver for all the tablets which are under-replicated.
> Nit: to replicate all under-replicated tablets to the add-on tserver.
Done


PS1, Line 974: consistent rows
> What does "consistent" mean in this context? And how does waiting for the c
No, this was a stale comment leftover, I think I was doing a  WaitForServersToAgree(workload.batches_completed()) earlier which got removed. Removed the comment.


PS1, Line 984:  ;  i++) {
> Nit: there's extra space here, both before and after the semicolon.
Done


Line 985:     tservers.push_back(ts_map_[cluster_->tablet_server(i)->uuid()]);
> Maybe loop from i to num_tservers, then skip when i == kAddOnTsIndex? That'
Done


Line 996:   string stdout, stderr;
> Nit: declare on separate lines.
Done


PS1, Line 1018: "local_replica delete $0 --fs_wal_dir=$1 "
              :                                              "--fs_data_dirs=$2
> See my other comment about substitution and fs_wal_dir/fs_data_dirs.
Done


PS1, Line 1033: auto
> Can this be const auto?
Done


Line 1067:   NO_FATALS(StartMiniCluster());
> Why not use an external mini cluster here too, and avoid the need for code 
I added it mainly to have more control over MiniTabletServer operations, and yes flush probably is the main reason at this point, but I think it could be useful for future tests I have in mind. I prefer to not to rely on flags/metrics for flush. However, I am not fully against the idea and I could replace the MiniCluster for this test with the ideas you suggested, please lemme know where I could look at such metrics.


Line 1079:   // Grab the tablet's local replica to delete
> You mean grab its tablet_id, right?
Done


PS1, Line 1084:   TabletServerServiceProxy* ts_proxy = new TabletServerServiceProxy(
              :       ts->server()->messenger(), ts->bound_rpc_addr());
> Do we have to allocate this from the heap? Can we just declare it on the st
Done


PS1, Line 1088:   vector<ListTabletsResponsePB::StatusAndSchemaPB> tablets;
              :   tablets.assign(resp.status_and_schema().begin(), resp.status_and_schema().end());
              :   const string& tablet_id = tablets.at(0).tablet_status().tablet_id();
> Why do we need the indirection of 'tablets'? Can't we reach into resp.statu
Done


Line 1107:   const string& data_dir = JoinPathSegments(tserver_dir, "data");
> Why must we consider the data subdirectory specifically? Can't we just comp
Yeah, since delete is a 3-step operation, I wanted to make sure either subdirectory size is reduced(data) or they are gone(WAL/tablet-meta). Hence added checks against individual subdirectories.


PS1, Line 1111: "local_replica delete $0 --fs_wal_dir=$1 "
              :                                               "--fs_data_dirs=$2",
> You can omit --fs_data_dirs. Or you can reuse $1, then pass tserver_dir jus
Done


PS1, Line 1113: stdout
> stdout is never used; can you call the variant that doesn't provide any out
Done


http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/tools/tool_action_local_replica.cc
File src/kudu/tools/tool_action_local_replica.cc:

Line 115: using tablet::Tablet;
> warning: using decl 'Tablet' is unused [misc-unused-using-decls]
Done


Line 120: using tserver::WriteRequestPB;
> warning: using decl 'WriteRequestPB' is unused [misc-unused-using-decls]
Done


PS1, Line 276:   RETURN_NOT_OK(TSTabletManager::DeleteTabletData(
             :       meta, TabletDataState::TABLET_DATA_DELETED, boost::none));
> Looks reasonable to me, but can you make sure Mike looks at this? I'm speci
Yeah sure, my first approach was to provide last_logged_opid, but I think that may be more relevant when we provide a --archive option along with 'delete'. One thing I can add in the tests is to check if WAL.recovery files are deleted too with this.


http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/tools/tool_action_tablet.cc
File src/kudu/tools/tool_action_tablet.cc:

PS1, Line 215:   const string& src_address = FindOrDie(context.required_args, "src_tserver_address");
             :   const string& dst_address = FindOrDie(context.required_args, "dst_tserver_address");
> Define the two strings as static constants earlier in the file, then use th
Done


Line 232:   rpc.set_timeout(MonoDelta::FromSeconds(30)); // Should this be higher ?
> If you don't set a timeout at all, won't the request never time out? Then y
Honestly, I haven't gone down that road to verify that, but in theory that seems like right and I will confirm with a  large tablet copy. However, I still want to keep the warning to keep the user aware of this op taking long time.


PS1, Line 247: StartTabletCopy
> As per KUDU-921, we'll modify the StartTabletCopy() implementation in the f
Thank you for pointing out that JIRA, I have re-worded the warning slightly to indicate that it could be an async op, please suggest a better wording if it's not clear. Ideally, this tool should block for the tablet copy, and the solution you suggested can be incorporated once resolve kudu-921. But it would be interesting to know how the background thread will report back to the source tserver that copy failed.


PS1, Line 300:       .AddRequiredParameter({ "src_tserver_address", "Source tablet server to copy the tablet replica from" })
             :       .AddRequiredParameter({ "dst_tserver_address", "Destination tablet server to copy the tablet replica" })
> How about just "src_address" and "dst_address", since the descriptions do a
Done


Line 315:       .AddAction(std::move(copy_tablet))
> I think the remote_replica mode would be a more appropriate home, since thi
Done


http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 916:   LOG(INFO) << "T " + tablet_id + " P " + meta->fs_manager()->uuid() + ": "
> Can you make a static two-arg variant of LogPrefix() and chain the non-stat
Done


http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/tserver/ts_tablet_manager.h
File src/kudu/tserver/ts_tablet_manager.h:

PS1, Line 180:   // This allows the tablet cleanup even when the tablet server is offline.
> Nit: I don't think this new sentence is necessary; we generally don't use a
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [tools] Manual recovery tools (part 1)

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Hello Mike Percy, Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4834

to look at the new patch set (#13).

Change subject: [tools] Manual recovery tools (part 1)
......................................................................

[tools] Manual recovery tools (part 1)

This change introduces two recovery tools:
1) 'kudu remote_replica copy' copies the given tablet onto
   a destination server irrespective of the state of the
   destination replica (--force option provided).
   This is a port from https://gerrit.cloudera.org/#/c/3582/
2) 'kudu local_replica delete' (KUDU-1618) deletes a local replica
   of a tablet when the tablet server can not come up due to a
   bad replica of a tablet.

Also added few tests exercising typical usage scenario for these tools.
Added trace to stderr under existing RunAction routines to throw more
info in the log if an action fails.

Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
---
M src/kudu/integration-tests/cluster_itest_util.cc
M src/kudu/integration-tests/cluster_itest_util.h
M src/kudu/integration-tests/delete_table-test.cc
M src/kudu/tools/CMakeLists.txt
M src/kudu/tools/ksck_remote.cc
M src/kudu/tools/kudu-tool-test.cc
M src/kudu/tools/tool_action_common.cc
M src/kudu/tools/tool_action_local_replica.cc
M src/kudu/tools/tool_action_remote_replica.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
11 files changed, 455 insertions(+), 65 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/34/4834/13
-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 13
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [tools] Manual recovery tools (part 1)

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: [tools] Manual recovery tools (part 1)
......................................................................


Patch Set 6:

(3 comments)

Looks good to me. I'll defer to Mike for the +2 from here on out.

http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/tools/kudu-tool-test.cc
File src/kudu/tools/kudu-tool-test.cc:

PS1, Line 974: er_->tablet_ser
> Sorry, I think I didn't understand your question clearly before.  Test does
Hmm, "replica catch-up" doesn't necessarily need to be in scope for this test. After all, we're just verifying the functionality of a CLI tool here, so verifying that the copy was successful (for some basic definition of "success") and that there's something on the other end is sufficient, I think.

Anyway, since you've got it working, I guess it's fine to leave in, provided Mike thinks it's a good idea.


http://gerrit.cloudera.org:8080/#/c/4834/6/src/kudu/tools/kudu-tool-test.cc
File src/kudu/tools/kudu-tool-test.cc:

PS6, Line 1013:   string expected = "Rejecting tablet copy request";
              :   ASSERT_STR_CONTAINS(stderr, expected);
Nit: this text is only used in one place, so perhaps combine the two lines for brevity?


http://gerrit.cloudera.org:8080/#/c/4834/6/src/kudu/tools/tool_action_remote_replica.cc
File src/kudu/tools/tool_action_remote_replica.cc:

Line 299:   RETURN_NOT_OK(GetServerStatus(dst_address, TabletServer::kDefaultPort,
> Is there no way to override the port in these tools? Here and below. Seems 
I believe GetServerStatus() will use the port embedded in the first arg and only fall back to the default port (second arg) if there's no port in the first arg.


-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [tools] Manual recovery tools (part 1)

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Hello David Ribeiro Alves, Mike Percy, Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4834

to look at the new patch set (#5).

Change subject: [tools] Manual recovery tools (part 1)
......................................................................

[tools] Manual recovery tools (part 1)

This change introduces two recovery tools:
1) 'kudu tablet copy' copies the given tablet onto
   a destination server irrespective of the state of the
   destination replica. This is a port from
   https://gerrit.cloudera.org/#/c/3582/
2) 'kudu local_replica delete' (KUDU-1618) deletes a local replica
   of a tablet when the tablet server can not come up due to a
   bad replica of a tablet.

Also added few tests exercising typical usage scenario for these tools.
Added trace to stderr under existing RunAction routines
to throw more info in the log if an action fails.

Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
---
M src/kudu/integration-tests/cluster_itest_util.cc
M src/kudu/integration-tests/cluster_itest_util.h
M src/kudu/integration-tests/delete_table-test.cc
M src/kudu/tools/kudu-tool-test.cc
M src/kudu/tools/tool_action_local_replica.cc
M src/kudu/tools/tool_action_remote_replica.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
8 files changed, 426 insertions(+), 61 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/34/4834/5
-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [tools] Manual recovery tools (part 1)

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Dinesh Bhat has posted comments on this change.

Change subject: [tools] Manual recovery tools (part 1)
......................................................................


Patch Set 10:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/4834/6/src/kudu/tools/kudu-tool-test.cc
File src/kudu/tools/kudu-tool-test.cc:

Line 1149
> Are you saying that committing this patch will break the macOS build? Seems
It doesn't fail the build, it fails the test on OS x only. Since my follow-up patch fixes it, I wouldn't see os x fix as a blocker to this, but I am fine submitting the os x fix first.


http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/tools/tool_action_local_replica.cc
File src/kudu/tools/tool_action_local_replica.cc:

PS1, Line 276:   return Status::OK();
             : }
> I am a bit torn on this one.
Hmmm.... current notion is that we use 'local_replica delete' when we couldn't bring up tablet server due to one or more bad tablets. It makes sense to keep the bad tablet around with --archive option for debugging purposes later. At high level, I am still trying to think of a situation where we want to tombstone a bad tablet... today we bootstrap tombstoned tablet when we bring up the server right ? and chances are that we may still crash ? So, I am not sure how tombstoning would help here.

I think one important piece as Todd pointed out in recovery doc is to be able to exactly identify which tablet is causing the crash. Otherwise, we wouldn't know what tablet this tool is supposed to act on.

Having a tombstone option for a remote_replica (i.e, when destination tserver is running) may be useful. I will go down the path of last_logged id with 'remote_replica tombstone' may-be ?


-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 10
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [tools] Manual recovery tools (part 1)

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: [tools] Manual recovery tools (part 1)
......................................................................


Patch Set 9:

(14 comments)

http://gerrit.cloudera.org:8080/#/c/4834/6/src/kudu/integration-tests/cluster_itest_util.h
File src/kudu/integration-tests/cluster_itest_util.h:

Line 268:     int count,
> done, unfortunately these are not caught while compiling on OS X :(.
yeah it's just good hygiene, std::string isn't always caught on linux either due to some header files in gutil using std::string :(


http://gerrit.cloudera.org:8080/#/c/4834/6/src/kudu/tools/kudu-tool-test.cc
File src/kudu/tools/kudu-tool-test.cc:

> I think having all the tests related to ./bin/kudu toolset in one file is b
> I think having all the tests related to ./bin/kudu toolset in one file is better.

Why do you think it is better? The tests may do many different types of things and they can't run in parallel if they are all clumped together in the same files. Also, source files that are many thousands of lines are difficult to navigate - at least that is my subjective opinion. I don't think it's reached catastrophic proportions yet in this case, though.


Line 950:   const int kNumTservers = 3;
> Can you explain bit more what you meant by test becoming unstable ? The int
Under extreme load, such as on an AWS / GCE Jenkins test box, it's possible that a tablet could miss a heartbeat for 3 seconds because of some IO or thread creation latency. If that happens, then the node will get evicted. I am not sure if that will cause test flakiness in these particular tests, I just thought the timeout was quite short.


Line 1149
> Done, although I am trying to see why I am not able to run ts->Start() agai
Not sure about macOS, but on the ExternalMiniCluster the proper thing to do after Shutdown() is Restart() if you want to reuse the same ports. Typically, Start() should only be called once by design.

It's possible that the MiniTabletServer::Restart() semantics are buggy and untested.


http://gerrit.cloudera.org:8080/#/c/4834/9/src/kudu/tools/kudu-tool-test.cc
File src/kudu/tools/kudu-tool-test.cc:

Line 944: TEST_F(ToolTest, TestRemoteReplicaCopy) {
If we're testing copying, should we write some data in this test? You could need to initiate the leader election manually, ideally with the helper functions in itest::


Line 994:   // by attempting to copy tablets while we tombstone the tablet on destination server.
I don't think you ever elected a leader, so I don't know who would interfere


Line 1042:   // tombstoned_tablet_id is copied from source to destination.
passive voice nit: Copy tombstoned_tablet_id from source to destination.


Line 1043:   NO_FATALS(RunActionStdoutNone(Substitute("remote_replica copy $0 $1 $2 --force_copy",
Should --force be needed if it's tombstoned?


Line 1048:   // deleted_tablet_id is copied from source to destination.
nit: use active, not passive voice, i.e. 

Copy deleted_tablet_id from source to destination.


Line 1050:   ASSERT_OK(WaitUntilTabletInState(src_ts, deleted_tablet_id,
Why is this needed? We just did this on line 1039


Line 1135:   ASSERT_OK(ts->Start());
Should be Restart() I think? But looking at the impl it seems that it just defers to Start(). Weird. What even is the point of such an implementation? It seems broken to me...


http://gerrit.cloudera.org:8080/#/c/4834/6/src/kudu/tools/tool_action_remote_replica.cc
File src/kudu/tools/tool_action_remote_replica.cc:

Line 299:   RETURN_NOT_OK(GetServerStatus(dst_address, TabletServer::kDefaultPort,
> That's correct:
Cool, thanks for the explanation.


Line 347:       .Description("Copy a replica from one Kudu Tablet Server to another")
any thoughts on using "tablet replica" throughout this file instead of just "replica"?


http://gerrit.cloudera.org:8080/#/c/4834/9/src/kudu/tools/tool_action_remote_replica.cc
File src/kudu/tools/tool_action_remote_replica.cc:

Line 331:                             tserver::TabletServerErrorPB_Code(resp.error().code())));
oops, typo in my original suggestion, should be TabletServerErrorPB::Code_Name()


-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [tools] Manual recovery tools (part 1)

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Dinesh Bhat has posted comments on this change.

Change subject: [tools] Manual recovery tools (part 1)
......................................................................


Patch Set 3:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/tools/kudu-tool-test.cc
File src/kudu/tools/kudu-tool-test.cc:

Line 941:   const int kAddOnTsIndex = 3;
> Hmm. Instead, maybe you can change the various RunAction routines to always
Sounds like good idea, let me do that in a follow-up patch, updated current callsite to RunActionStdoutNone.


PS1, Line 974: 
> OK. Why do we need to wait for 1000 rows to be inserted, though?
Picked 1000 out of thin air, to have few KB worth of data in the table, and thereby some data on the replica for the copy to actually matter. Will adding a small comment help here ?


http://gerrit.cloudera.org:8080/#/c/4834/2/src/kudu/tools/kudu-tool-test.cc
File src/kudu/tools/kudu-tool-test.cc:

Line 121:   }
> This would be more robust if placed in ToolTest's destructor.
Done


PS2, Line 936: TEST_F(ToolTest, TestCopyRemoteReplicaCopy) {
             :   const string kTestDir = GetTestPath("test");
             :   MonoDelta kTimeout = MonoDelta::FromSeconds(30);
             :   const int kSrcTsIndex = 1;
             :   const int kDstTsIndex = 2;
> Should now refer to "remote_replica copy" rather than "tablet copy".
Done


http://gerrit.cloudera.org:8080/#/c/4834/2/src/kudu/tools/tool_action_remote_replica.cc
File src/kudu/tools/tool_action_remote_replica.cc:

Line 312:   req.set_caller_term(std::numeric_limits<int64_t>::max());
> AFAICT, this means that if the destination tserver already has this replica
That's the intent yeah, although now that I think about it, it may be a nice idea to keep this under a --force flag in case user accidentally specifies a destination server where replica is healthy. What do you suggest ?


http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 916:   const string& tablet_id = meta->tablet_id();
> You added the new one, but you didn't chain the old variant into it.
My bad, I probably didn't interpret the word 'chain' correctly earlier :), made the non-static variant to call static now.


-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [tools] Manual recovery tools (part 1)

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Dinesh Bhat has posted comments on this change.

Change subject: [tools] Manual recovery tools (part 1)
......................................................................


Patch Set 2:

> didn't look at any of this code, but just wanted to drop a note
 > saying I used the local_replica delete tool to successfully recover
 > from a situation where a bunch of tablet servers were failing to
 > start up due to messed up replicas. So, this is a useful tool!

Thanks for trying out Todd, do you know if these were large size tablets ? cuz that's something I didn't test explicitly and missing a test as well for that.

-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] [tools] Manual recovery tools (part 1)

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: [tools] Manual recovery tools (part 1)
......................................................................


Patch Set 2:

didn't look at any of this code, but just wanted to drop a note saying I used the local_replica delete tool to successfully recover from a situation where a bunch of tablet servers were failing to start up due to messed up replicas. So, this is a useful tool!

-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] [tools] Manual recovery tools (part 1)

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Dinesh Bhat has posted comments on this change.

Change subject: [tools] Manual recovery tools (part 1)
......................................................................


Patch Set 15:

(3 comments)

Also updated COMMIT_MSG to reflect the new flag added.

http://gerrit.cloudera.org:8080/#/c/4834/13/src/kudu/tools/tool_action_local_replica.cc
File src/kudu/tools/tool_action_local_replica.cc:

Line 80: DEFINE_bool(clean_unsafe, false,
> Hmm, --force_unsafe, I guess I would prefer either --force or --clean_unsaf
Done


Line 81:             "Delete the local replica completely, not leaving a tombstone. "
> How about: Delete the local replica completely, not leaving a tombstone. Th
Done


Line 270:     return Status::NotSupported(Substitute("Deletion of local replica $0 is "
> s/not supported/currently not supported/
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 15
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [tools] Manual recovery tools (part 1)

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Hello Mike Percy, Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4834

to look at the new patch set (#7).

Change subject: [tools] Manual recovery tools (part 1)
......................................................................

[tools] Manual recovery tools (part 1)

This change introduces two recovery tools:
1) 'kudu remote_replica copy' copies the given tablet onto
   a destination server irrespective of the state of the
   destination replica (--force option provided).
   This is a port from https://gerrit.cloudera.org/#/c/3582/
2) 'kudu local_replica delete' (KUDU-1618) deletes a local replica
   of a tablet when the tablet server can not come up due to a
   bad replica of a tablet.

Also added few tests exercising typical usage scenario for these tools.
Added trace to stderr under existing RunAction routines to throw more
info in the log if an action fails.

Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
---
M src/kudu/integration-tests/cluster_itest_util.cc
M src/kudu/integration-tests/cluster_itest_util.h
M src/kudu/integration-tests/delete_table-test.cc
M src/kudu/tools/kudu-tool-test.cc
M src/kudu/tools/tool_action_local_replica.cc
M src/kudu/tools/tool_action_remote_replica.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
8 files changed, 510 insertions(+), 61 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/34/4834/7
-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [tools] Manual recovery tools (part 1)

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Dinesh Bhat has posted comments on this change.

Change subject: [tools] Manual recovery tools (part 1)
......................................................................


Patch Set 5:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/tools/kudu-tool-test.cc
File src/kudu/tools/kudu-tool-test.cc:

PS1, Line 974: er_->tablet_ser
> Well, even a tablet with no data still has metadata that needs to be copied
Sorry, I think I didn't understand your question clearly before.  Test doesn't fail without those calls,  but how can we confirm that destination replica caught up with source replica after tablet-copy ? One way could be to check the OpIds on the Log has caught upto workload.batches_completed(). If we didn't want to rely on workload.batches_completed(), then we could ask the source peer for it's lastOpId and check if all servers reached consensus on that OpId. In the absence of any workload, that logIndex could be -1 or 0. You are right that we don't need the 'data blocks' on disk for the tablet copy to take effect, but here the data was used more to verify Log replication than for the presence of data on the disk. I updated the comment to reflect this logic and also reduced the rows to 10 (seemed sufficient)

PS: I took some of these ideas from an existing test (tablet_copy-itest IIRC).


http://gerrit.cloudera.org:8080/#/c/4834/2/src/kudu/tools/kudu-tool-test.cc
File src/kudu/tools/kudu-tool-test.cc:

PS2, Line 936: }
             : 
             : // Test 'kudu tablet copy' tool when the destination tablet server is online.
             : // 1. Test the copy tool when the destination replica is healthy
             : // 2. Test the copy tool when the 
> Need to update the comments I highlighted too. At least the very first line
Done


http://gerrit.cloudera.org:8080/#/c/4834/5/src/kudu/tools/kudu-tool-test.cc
File src/kudu/tools/kudu-tool-test.cc:

PS5, Line 1004:   string stdout;
              :   string stderr;
> Can remove these. Check the rest too.
Done


PS5, Line 1052:  Explicit RPC to
              :   // delete-with-retries
> What does this mean? DeleteTabletWithRetries() is a function; what does "ex
Please ignore this since this didn't work as I expected, I need to fix a race between catalog_manager issuing a DELETE_TOMBSTONED in the change_config eviction path and this test issuing the tablet-copy on DELETE_TOMBSTONED tablet. The race is caught here : http://104.196.14.100/job/kudu-gerrit/4202/BUILD_TYPE=RELEASE/testReport/junit/(root)/ToolTest/TestRemoteReplicaCopy/

I wonder if it's possible to consult master catalog manager to see if this replica has been evicted successfully from the config so that this test can safely issue a tablet copy to destination tserver which has the tablet tombstoned. As it is now, it turns out, it is not sufficient just to verify the state of the tablet on destination tserver @L1045.


PS5, Line 1101: replica data size to be visible.
> What does this mean?
bad wording I guess - changed it to "Generate some workload followed by a flush to grow the size of the tablet on disk."


http://gerrit.cloudera.org:8080/#/c/4834/2/src/kudu/tools/tool_action_remote_replica.cc
File src/kudu/tools/tool_action_remote_replica.cc:

Line 312:   req.set_caller_term(std::numeric_limits<int64_t>::max());
> Yeah, I think that may be good. This would be -1 or 0 normally, and max whe
Thanks, following it up in the coming patch along with test-flakiness fix.


-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [tools] Manual recovery tools (part 1)

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Hello Mike Percy, Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4834

to look at the new patch set (#14).

Change subject: [tools] Manual recovery tools (part 1)
......................................................................

[tools] Manual recovery tools (part 1)

This change introduces two recovery tools:
1) 'kudu remote_replica copy' copies the given tablet onto
   a destination server irrespective of the state of the
   destination replica (--force option provided).
   This is a port from https://gerrit.cloudera.org/#/c/3582/
2) 'kudu local_replica delete' (KUDU-1618) deletes a local replica
   of a tablet when the tablet server can not come up due to a
   bad replica of a tablet.

Also added few tests exercising typical usage scenario for these tools.
Added trace to stderr under existing RunAction routines to throw more
info in the log if an action fails.

Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
---
M src/kudu/integration-tests/cluster_itest_util.cc
M src/kudu/integration-tests/cluster_itest_util.h
M src/kudu/integration-tests/delete_table-test.cc
M src/kudu/tools/CMakeLists.txt
M src/kudu/tools/ksck_remote.cc
M src/kudu/tools/kudu-tool-test.cc
M src/kudu/tools/tool_action_common.cc
M src/kudu/tools/tool_action_local_replica.cc
M src/kudu/tools/tool_action_remote_replica.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
11 files changed, 456 insertions(+), 65 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/34/4834/14
-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 14
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [tools] Manual recovery tools (part 1)

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: [tools] Manual recovery tools (part 1)
......................................................................


Patch Set 10:

(2 comments)

Would you mind also adding a test that makes sure that attempting to copy over an existing (running) tablet, even in --force mode, fails?

Also, I wonder what the value of the --force flag is, since this is a manual tool maybe we should always force. This seems like a tool that you would only typically use --force with, so why bother requiring people to specify it.

http://gerrit.cloudera.org:8080/#/c/4834/6/src/kudu/tools/kudu-tool-test.cc
File src/kudu/tools/kudu-tool-test.cc:

Line 1149
> it didn't look buggy after going through Restart/Start/Shutdown paths, it's
It's bad that they have different semantics, because they implement the same interface.

Anyway, I think we can live with this as-is, as long as the test passes on both Linux and macOS. Since I don't have a Mac, I'll assume you have gotten that part working.


http://gerrit.cloudera.org:8080/#/c/4834/9/src/kudu/tools/tool_action_remote_replica.cc
File src/kudu/tools/tool_action_remote_replica.cc:

Line 331:         strings::Substitute("Remote server returned error code $0",
> Done, I am just curious how did we generate Code_Name from the proto file ?
it's generated and placed in build/latest/src/kudu/tserver/tserver.pb.cc


-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 10
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [tools] Manual recovery tools (part 1)

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: [tools] Manual recovery tools (part 1)
......................................................................


Patch Set 10:

> maybe we should always force

Actually, what I think we should do is attempt to send the actual committed_opid_index with the default request... let me figure out how we can do that.

-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 10
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] [tools] Manual recovery tools (part 1)

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: [tools] Manual recovery tools (part 1)
......................................................................


Patch Set 2:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/tools/kudu-tool-test.cc
File src/kudu/tools/kudu-tool-test.cc:

Line 941:   const string kTestDir = GetTestPath("test");
> Essentially, I wanted to do this:
Hmm. Instead, maybe you can change the various RunAction routines to always capture stdout/stderr, call SCOPED_TRACE() on them, and then decide whether to pass them on to the caller or not?


PS1, Line 974: tablet_server(k
> No, this was a stale comment leftover, I think I was doing a  WaitForServer
OK. Why do we need to wait for 1000 rows to be inserted, though?


http://gerrit.cloudera.org:8080/#/c/4834/2/src/kudu/tools/kudu-tool-test.cc
File src/kudu/tools/kudu-tool-test.cc:

Line 121:     STLDeleteValues(&ts_map_);
This would be more robust if placed in ToolTest's destructor.


PS2, Line 936: // Test 'kudu tablet copy' tool when the destination tablet server is online.
             : // 1. Test the copy tool when the destination replica is healthy
             : // 2. Test the copy tool when the destination replica is tombstoned
             : // 3. Test the copy tool when the destination replica is deleted
             : TEST_F(ToolTest, TestTabletCopy) {
Should now refer to "remote_replica copy" rather than "tablet copy".


http://gerrit.cloudera.org:8080/#/c/4834/2/src/kudu/tools/tool_action_remote_replica.cc
File src/kudu/tools/tool_action_remote_replica.cc:

Line 312:   req.set_caller_term(std::numeric_limits<int64_t>::max());
AFAICT, this means that if the destination tserver already has this replica, we'll always blow it away. Was this an intentional choice?


http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 916: 
> Done
You added the new one, but you didn't chain the old variant into it.


-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [tools] Manual recovery tools (part 1)

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: [tools] Manual recovery tools (part 1)
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4834/1/src/kudu/tools/tool_action_local_replica.cc
File src/kudu/tools/tool_action_local_replica.cc:

PS1, Line 276:   RETURN_NOT_OK(TSTabletManager::DeleteTabletData(
             :       meta, TabletDataState::TABLET_DATA_DELETED, boost::none));
> Hmmm.... current notion is that we use 'local_replica delete' when we could
> I am still trying to think of a situation where we want to tombstone a bad tablet.

Tombstoning should be the default action. Tombstoning actually removes all actual data and WALs so it's hard to imagine a lot of scenarios where a startup crash will result from such a stripped-down tablet: it would have to be a problem with the superblock or the consensus metadata.

If you don't tombstone, but instead manually delete, then it's possible to double-vote which can cause split-brain scenarios in Raft. So a non-tombstone delete is dangerous.

> I think one important piece as Todd pointed out in recovery doc is to be able to exactly identify which tablet is causing the crash.

Agreed.

> Having a tombstone option for a remote_replica may be useful.

Isn't that already implemented?


-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [tools] Manual recovery tools (part 1)

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Hello Mike Percy, Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4834

to look at the new patch set (#10).

Change subject: [tools] Manual recovery tools (part 1)
......................................................................

[tools] Manual recovery tools (part 1)

This change introduces two recovery tools:
1) 'kudu remote_replica copy' copies the given tablet onto
   a destination server irrespective of the state of the
   destination replica (--force option provided).
   This is a port from https://gerrit.cloudera.org/#/c/3582/
2) 'kudu local_replica delete' (KUDU-1618) deletes a local replica
   of a tablet when the tablet server can not come up due to a
   bad replica of a tablet.

Also added few tests exercising typical usage scenario for these tools.
Added trace to stderr under existing RunAction routines to throw more
info in the log if an action fails.

Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
---
M src/kudu/integration-tests/cluster_itest_util.cc
M src/kudu/integration-tests/cluster_itest_util.h
M src/kudu/integration-tests/delete_table-test.cc
M src/kudu/tools/kudu-tool-test.cc
M src/kudu/tools/tool_action_local_replica.cc
M src/kudu/tools/tool_action_remote_replica.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
8 files changed, 423 insertions(+), 61 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/34/4834/10
-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 10
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [tools] Manual recovery tools (part 1)

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Hello David Ribeiro Alves, Mike Percy, Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4834

to look at the new patch set (#3).

Change subject: [tools] Manual recovery tools (part 1)
......................................................................

[tools] Manual recovery tools (part 1)

This change introduces two recovery tools:
1) 'kudu tablet copy' copies the given tablet onto
   a destination server irrespective of the state of the
   destination replica. This is a port from
   https://gerrit.cloudera.org/#/c/3582/
2) 'kudu local_replica delete' (KUDU-1618) deletes a local replica
   of a tablet when the tablet server can not come up due to a
   bad replica of a tablet.

Also added few tests exercising typical usage scenario for these tools.

Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
---
M src/kudu/integration-tests/cluster_itest_util.cc
M src/kudu/integration-tests/cluster_itest_util.h
M src/kudu/integration-tests/delete_table-test.cc
M src/kudu/tools/kudu-tool-test.cc
M src/kudu/tools/tool_action_local_replica.cc
M src/kudu/tools/tool_action_remote_replica.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
8 files changed, 417 insertions(+), 58 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/34/4834/3
-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [tools] Manual recovery tools (part 1)

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: [tools] Manual recovery tools (part 1)
......................................................................


Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4834/6/src/kudu/tools/kudu-tool-test.cc
File src/kudu/tools/kudu-tool-test.cc:

Line 1149: }
> Yeah, I root-caused it to be a missing piece in socket teardown path for ma
Are you saying that committing this patch will break the macOS build? Seems like we need to wait for that other patch to happen first if that is the case.


-- 
To view, visit http://gerrit.cloudera.org:8080/4834
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I113a25e9b6c14f7c3814140917b61e35030b58d0
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes