You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "wangning (Code Review)" <ge...@cloudera.org> on 2020/08/03 08:19:02 UTC

[kudu-CR] [KUDU-1728] parallelize download blocks in tablet-copy-client

wangning has uploaded this change for review. ( http://gerrit.cloudera.org:8080/16274


Change subject: [KUDU-1728] parallelize download blocks in tablet-copy-client
......................................................................

[KUDU-1728] parallelize download blocks in tablet-copy-client

Parallelize the action of 'Download blocks' in tablet-copy-client.

Previsouly download blocks from tablet server is sequential
executed, thus actions which strongly relied on 'DownloadBlocks' like
recover from other tserver, cluster rebalance are slow.
Sometimes use only one thread to download those blocks are slow and bandwidth
is not the bottleneck.

Here I attached the simple benchmark, the result are averaged by 8 epoch
result.
The experiment is download all data from remote tserver to local tserver.

settings:
Remote machine: 8 cores 32g tserver limited to 8g, tserver version 1.10
Local machine: 8 cores 12g memory
Tserver has 3 x 500M tablets + 3 x 430M tablets + 3 x 420M tablets and
259 x 8M tablets.
All tablets were compacted well with dirskrowset height 1.0.

Here the number of tablet-copy-client threadpool and the number of
blocks-download threadpool are chosen as variable.
tablet-copy-client thread is write as tc.
block-download thread is write as bd.

Metrics of result is seconds with accuracy of 2 decimal points.

time     tc-1      tc-2      tc-4      tc-8

bd-1     76.91     48.50     34.27     32.39
bd-2     54.45     43.26     33.27     32.37
bd-4     48.36     37.58     33.31     32.30
bd-8     47.95     38.36     33.98     32.60

Change-Id: Id83abca7a38cf183d9c27d82bb8a022699079e0e
---
M src/kudu/tserver/tablet_copy_client.cc
M src/kudu/tserver/tablet_copy_client.h
2 files changed, 85 insertions(+), 59 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/74/16274/1
-- 
To view, visit http://gerrit.cloudera.org:8080/16274
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Id83abca7a38cf183d9c27d82bb8a022699079e0e
Gerrit-Change-Number: 16274
Gerrit-PatchSet: 1
Gerrit-Owner: wangning <19...@gmail.com>

[kudu-CR] KUDU-1728 parallelize download blocks in tablet-copy-client

Posted by "wangning (Code Review)" <ge...@cloudera.org>.
wangning has posted comments on this change. ( http://gerrit.cloudera.org:8080/16274 )

Change subject: KUDU-1728 parallelize download blocks in tablet-copy-client
......................................................................


Patch Set 8:

> Thank you very much for your contribution!

Thx for your patient for reviewing and correcting. I learned a lot in this commit which is more important than 'how to implement a function'.


-- 
To view, visit http://gerrit.cloudera.org:8080/16274
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id83abca7a38cf183d9c27d82bb8a022699079e0e
Gerrit-Change-Number: 16274
Gerrit-PatchSet: 8
Gerrit-Owner: wangning <19...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: wangning <19...@gmail.com>
Gerrit-Comment-Date: Thu, 10 Sep 2020 03:02:14 +0000
Gerrit-HasComments: No

[kudu-CR] [KUDU-1728] parallelize download blocks in tablet-copy-client

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/16274 )

Change subject: [KUDU-1728] parallelize download blocks in tablet-copy-client
......................................................................


Patch Set 6:

(15 comments)

http://gerrit.cloudera.org:8080/#/c/16274/6/src/kudu/tserver/tablet_copy_client.h
File src/kudu/tserver/tablet_copy_client.h:

http://gerrit.cloudera.org:8080/#/c/16274/6/src/kudu/tserver/tablet_copy_client.h@182
PS6, Line 182: Set rowset download to failed if status not ok, with lock protect.
How about:

Store rowset download failure status in 'end_status' if it hasn't been set as failed yet.  This method is thread-safe.


http://gerrit.cloudera.org:8080/#/c/16274/6/src/kudu/tserver/tablet_copy_client.h@204
PS6, Line 204:   // Download rowset for thread to call.
             :   // Any blocks failed to download would lead current and other thread to exit.
             :   // If all threads to download rowset, the end_status would be Status::OK,
             :   // or it would be the first failed status.
How about:

A task do download blocks of the specified rowset. In case of a failure, a thread notifies others via 'end_status', so they can abort downloading of their rowsets' blocks. If all threads succeed, 'end_status' is set to Status::OK().


http://gerrit.cloudera.org:8080/#/c/16274/6/src/kudu/tserver/tablet_copy_client.h@298
PS6, Line 298: blocks
data blocks


http://gerrit.cloudera.org:8080/#/c/16274/6/src/kudu/tserver/tablet_copy_client.h@301
PS6, Line 301: Protect add/create blocks, add rowset, write download status.
How about:

Protects adding/creating blocks, adding a rowset, updating rowset download status.


http://gerrit.cloudera.org:8080/#/c/16274/2/src/kudu/tserver/tablet_copy_client.cc
File src/kudu/tserver/tablet_copy_client.cc:

http://gerrit.cloudera.org:8080/#/c/16274/2/src/kudu/tserver/tablet_copy_client.cc@175
PS2, Line 175:   transaction_ = bm->NewCreationTransaction();
             :   if (tablet_copy_metrics_) {
             :     tablet_copy_metrics_->open_client_sessions->Increment();
             :   }
> It should be reclaimed immediately once tablet-copy-client finished its' wo
TabletCopyClient client is created by demand by the upper-level code.  That code could provide a pool to run the data block copying in parallel, but I agree with Bankim that encapsulating this just into TabletCopyClient is cleaner.

BTW, as for awaiting for completion of all download tasks related to a tablet, you can create a token per tablet with CONCURRENT execution mode and await for all tasks on the token to complete.  So here it would not be an obstacle.

From one side, introducing a dedicated thread pool for every tablet copy client isn't efficient in sharing resources: there might be small and large tablets, and copying M tablets using M pools of N threads each might be slower than copying those when having a single pool with (M*N) threads or even less.

From the other side, having just one pool might be a problem when one huge tablet schedules all its downloads first, and all other tablet copy sessions await for those to complete to make a progress even on tiny tablets which started copying just a bit later.

It seems the latter brings a bigger drawback than the former brings as a benefit.

With that, I think it's better to keep it as it's implemented now.


http://gerrit.cloudera.org:8080/#/c/16274/6/src/kudu/tserver/tablet_copy_client.cc
File src/kudu/tserver/tablet_copy_client.cc:

http://gerrit.cloudera.org:8080/#/c/16274/6/src/kudu/tserver/tablet_copy_client.cc@106
PS6, Line 106: num_threads_blocks_download
Rename this into

  tablet_copy_download_threads_num_per_session

to match the rest of the flags related to tablet copying.


http://gerrit.cloudera.org:8080/#/c/16274/6/src/kudu/tserver/tablet_copy_client.cc@106
PS6, Line 106: DEFINE_int32
I think it's worth adding a flag validator to make sure the specified value is greater than 0, otherwise some runtime error might occur.


http://gerrit.cloudera.org:8080/#/c/16274/6/src/kudu/tserver/tablet_copy_client.cc@107
PS6, Line 107: number of threads for downloading blocks of a tablet
How about:

Number of threads per tablet copy session for downloading tablet data blocks


http://gerrit.cloudera.org:8080/#/c/16274/6/src/kudu/tserver/tablet_copy_client.cc@194
PS6, Line 194:   blocks_download_pool_->Shutdown();
I think it would be safer to place this _before_ decrementing the metric since ThreadPool::Shutdown() awaits for all tasks to complete.  The process of downloading block might be stuck for some time, and that might lead to confusing readings of the tablet copy metrics in such a case.


http://gerrit.cloudera.org:8080/#/c/16274/6/src/kudu/tserver/tablet_copy_client.cc@414
PS6, Line 414: // Download all blocks file in parallel.
How about:

Download all the tablet's files. Data blocks are downloaded in parallel, where the concurrency is controlled by the --tablet_copy_download_threads_num_per_session flag.


http://gerrit.cloudera.org:8080/#/c/16274/6/src/kudu/tserver/tablet_copy_client.cc@624
PS6, Line 624:     if (!end_status->ok()) {
             :       return;
             :     }
             :     BlockIdPB new_block_id;
             :     s = DownloadAndRewriteBlock(src_col.block(), num_remote_blocks,
             :                                 block_count, &new_block_id);
             :     if (!s.ok()) {
             :       SetDownloadRowsetFailed(s, end_status);
             :       return;
             :     }
This piece looks like a common boilerplate in the code below.  Is it possible to separate this into a function or a lambda and reuse it below?


http://gerrit.cloudera.org:8080/#/c/16274/6/src/kudu/tserver/tablet_copy_client.cc@639
PS6, Line 639: end_status->ok()
Is it safe to probe for 'end_status' here when some other thread might be updating it concurrently at the same time?


http://gerrit.cloudera.org:8080/#/c/16274/6/src/kudu/tserver/tablet_copy_client.cc@701
PS6, Line 701:   int num_remote_blocks = CountRemoteBlocks();
nit: add 'const' ?


http://gerrit.cloudera.org:8080/#/c/16274/6/src/kudu/tserver/tablet_copy_client.cc@710
PS6, Line 710: Inside each parallel unit, the blocks of a rowset is download in sequential.
How about:

Each task downloads the blocks of the corresponding rowset sequentially.


http://gerrit.cloudera.org:8080/#/c/16274/6/src/kudu/tserver/tablet_copy_client.cc@790
PS6, Line 790: lock
nit: a lock



-- 
To view, visit http://gerrit.cloudera.org:8080/16274
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id83abca7a38cf183d9c27d82bb8a022699079e0e
Gerrit-Change-Number: 16274
Gerrit-PatchSet: 6
Gerrit-Owner: wangning <19...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: wangning <19...@gmail.com>
Gerrit-Comment-Date: Thu, 03 Sep 2020 16:34:05 +0000
Gerrit-HasComments: Yes

[kudu-CR] [KUDU-1728] parallelize download blocks in tablet-copy-client

Posted by "Bankim Bhavsar (Code Review)" <ge...@cloudera.org>.
Bankim Bhavsar has posted comments on this change. ( http://gerrit.cloudera.org:8080/16274 )

Change subject: [KUDU-1728] parallelize download blocks in tablet-copy-client
......................................................................


Patch Set 2:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/16274/2/src/kudu/tserver/tablet_copy_client.cc
File src/kudu/tserver/tablet_copy_client.cc:

http://gerrit.cloudera.org:8080/#/c/16274/2/src/kudu/tserver/tablet_copy_client.cc@104
PS2, Line 104: max number
Nit: Number instead?


http://gerrit.cloudera.org:8080/#/c/16274/2/src/kudu/tserver/tablet_copy_client.cc@175
PS2, Line 175:   ThreadPoolBuilder("blocks-download-pool")
             :     .set_max_threads(FLAGS_num_threads_blocks_copy)
             :     .set_min_threads(FLAGS_num_threads_blocks_copy)
             :     .Build(&blocks_download_pool_);
> Maybe, it's better to instantiate the pool for tablets' blocks copying some
I understand this is improvement is mainly for tablet copies.

TabletCopyClient() is also used by the tool to migrate to multi-master configuration. So having the threadpool within TabletCopyClient will allow the tool to benefit from parallelization without any changes to the caller. So an argument in favor of having the theadpool within TabletCopyClient class.


http://gerrit.cloudera.org:8080/#/c/16274/2/src/kudu/tserver/tablet_copy_client.cc@595
PS2, Line 595: [&]() mutable {
Nit: I find this function to be too big to be anonymous. It'd be better if it's a proper named private member function.


http://gerrit.cloudera.org:8080/#/c/16274/2/src/kudu/tserver/tablet_copy_client.cc@595
PS2, Line 595: Submit
I don't know much of this code but earlier rowsets were downloaded in sequence and written in sequence. Now the blocks are downloaded in parallel and it's possible that blocks are written out of sequence, can you add a comment explaining why that's okay?



-- 
To view, visit http://gerrit.cloudera.org:8080/16274
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id83abca7a38cf183d9c27d82bb8a022699079e0e
Gerrit-Change-Number: 16274
Gerrit-PatchSet: 2
Gerrit-Owner: wangning <19...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Tue, 04 Aug 2020 00:35:38 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1728 parallelize download blocks in tablet-copy-client

Posted by "wangning (Code Review)" <ge...@cloudera.org>.
Hello Alexey Serbin, Attila Bukor, Kudu Jenkins, Bankim Bhavsar, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16274

to look at the new patch set (#7).

Change subject: KUDU-1728 parallelize download blocks in tablet-copy-client
......................................................................

KUDU-1728 parallelize download blocks in tablet-copy-client

Parallelize the action of 'Download blocks' in tablet-copy-client.

Previsouly downloading blocks from tablet server is executed sequentially,
thus actions which related to 'DownloadBlocks' like
'recover from other tserver', 'cluster rebalance' may be slow sometimes.
And sometimes downloading the blocks is slow when only one thread is used
while bandwidth isn't the bottleneck.

This commit introduce FLAGS_num_threads_blocks_download to control the
number of threads for download blocks within a tablet-copy-client.

Here I attached the simple benchmark, the result are averaged by 8 epoch
result. Metrics of result is seconds with accuracy of 2 decimal points.

The experiment is download all data from remote tserver to local tserver.

Settings:
Remote machine: 8 cores 32g tserver limited to 8g, tserver version 1.10
Local machine: 8 cores 12g memory
Tserver has 3 x 500M tablets + 3 x 430M tablets + 3 x 420M tablets and
259 x 8M tablets.
All tablets were compacted well with diskrowset height 1.0.

Here I use two variable to controll the experiment.
numbers of tablet-copy-client thread, write as tc.
numbers of blocks-download thread within each tablet-copy-client thread,
(FLAGS_num_threads_blocks_download), write as bd.

The result 37.58 correspond to column 'tc-2', row 'bd-4' can be explained as,
It takes 37.58s to download all data from tserver with 2
tablet-copy-client threads and each tablet-copy-client thread has 4 threads to
download blocks.

All result in column tc-1 refer to behave before this patch.

time     tc-1      tc-2      tc-4      tc-8

bd-1     76.91     48.50     34.27     32.39
bd-2     54.45     43.26     33.27     32.37
bd-4     48.36     37.58     33.31     32.30
bd-8     47.95     38.36     33.98     32.60

Change-Id: Id83abca7a38cf183d9c27d82bb8a022699079e0e
---
M src/kudu/tserver/tablet_copy_client-test.cc
M src/kudu/tserver/tablet_copy_client.cc
M src/kudu/tserver/tablet_copy_client.h
3 files changed, 183 insertions(+), 63 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/74/16274/7
-- 
To view, visit http://gerrit.cloudera.org:8080/16274
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id83abca7a38cf183d9c27d82bb8a022699079e0e
Gerrit-Change-Number: 16274
Gerrit-PatchSet: 7
Gerrit-Owner: wangning <19...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: wangning <19...@gmail.com>

[kudu-CR] [KUDU-1728] parallelize download blocks in tablet-copy-client

Posted by "wangning (Code Review)" <ge...@cloudera.org>.
Hello Alexey Serbin, Attila Bukor, Kudu Jenkins, Bankim Bhavsar, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16274

to look at the new patch set (#6).

Change subject: [KUDU-1728] parallelize download blocks in tablet-copy-client
......................................................................

[KUDU-1728] parallelize download blocks in tablet-copy-client

Parallelize the action of 'Download blocks' in tablet-copy-client.

Previsouly downloading blocks from tablet server is executed sequentially,
thus actions which related to 'DownloadBlocks' like
'recover from other tserver', 'cluster rebalance' may be slow sometimes.
And sometimes downloading the blocks is slow when only one thread is used
while bandwidth isn't the bottleneck.

This commit introduce FLAGS_num_threads_blocks_download to control the
number of threads for download blocks within a tablet-copy-client.

Here I attached the simple benchmark, the result are averaged by 8 epoch
result. Metrics of result is seconds with accuracy of 2 decimal points.

The experiment is download all data from remote tserver to local tserver.

Settings:
Remote machine: 8 cores 32g tserver limited to 8g, tserver version 1.10
Local machine: 8 cores 12g memory
Tserver has 3 x 500M tablets + 3 x 430M tablets + 3 x 420M tablets and
259 x 8M tablets.
All tablets were compacted well with diskrowset height 1.0.

Here I use two variable to controll the experiment.
numbers of tablet-copy-client thread, write as tc.
numbers of blocks-download thread within each tablet-copy-client thread,
(FLAGS_num_threads_blocks_download), write as bd.

The result 37.58 correspond to column 'tc-2', row 'bd-4' can be explained as,
It takes 37.58s to download all data from tserver with 2
tablet-copy-client threads and each tablet-copy-client thread has 4 threads to
download blocks.

All result in column tc-1 refer to behave before this patch.

time     tc-1      tc-2      tc-4      tc-8

bd-1     76.91     48.50     34.27     32.39
bd-2     54.45     43.26     33.27     32.37
bd-4     48.36     37.58     33.31     32.30
bd-8     47.95     38.36     33.98     32.60

Change-Id: Id83abca7a38cf183d9c27d82bb8a022699079e0e
---
M src/kudu/tserver/tablet_copy_client-test.cc
M src/kudu/tserver/tablet_copy_client.cc
M src/kudu/tserver/tablet_copy_client.h
3 files changed, 181 insertions(+), 62 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/74/16274/6
-- 
To view, visit http://gerrit.cloudera.org:8080/16274
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id83abca7a38cf183d9c27d82bb8a022699079e0e
Gerrit-Change-Number: 16274
Gerrit-PatchSet: 6
Gerrit-Owner: wangning <19...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: wangning <19...@gmail.com>

[kudu-CR] [KUDU-1728] parallelize download blocks in tablet-copy-client

Posted by "Attila Bukor (Code Review)" <ge...@cloudera.org>.
Attila Bukor has posted comments on this change. ( http://gerrit.cloudera.org:8080/16274 )

Change subject: [KUDU-1728] parallelize download blocks in tablet-copy-client
......................................................................


Patch Set 2:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/16274/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16274/2//COMMIT_MSG@11
PS2, Line 11: Previsouly download blocks from tablet server is sequential
nit: Mixed past and present tenses in this sentence.


http://gerrit.cloudera.org:8080/#/c/16274/2//COMMIT_MSG@14
PS2, Line 14: Sometimes use only one thread to download those blocks are slow and bandwidth
nit: Maybe "sometimes downloading the blocks is slow when only one thread is used and bandwidth isn't the bottleneck" would be clearer? I'm not sure if that's what you meant here.


http://gerrit.cloudera.org:8080/#/c/16274/2//COMMIT_MSG@33
PS2, Line 33: Metrics of result is seconds with accuracy of 2 decimal points.
What were the numbers before this patch? Also, I'm not sure I understand the benefit of this patch, with tc-8 bd-x doesn't matter and bd-8 is slower than tc-8. Isn't it better to just use tc-8?


http://gerrit.cloudera.org:8080/#/c/16274/2/src/kudu/tserver/tablet_copy_client.h
File src/kudu/tserver/tablet_copy_client.h:

http://gerrit.cloudera.org:8080/#/c/16274/2/src/kudu/tserver/tablet_copy_client.h@285
PS2, Line 285:   simple_spinlock mutex_;
nit: this shouldn't be called mutex_ if it's actually a spinlock. Also, maybe a more descriptive name would be nice.


http://gerrit.cloudera.org:8080/#/c/16274/2/src/kudu/tserver/tablet_copy_client.cc
File src/kudu/tserver/tablet_copy_client.cc:

http://gerrit.cloudera.org:8080/#/c/16274/2/src/kudu/tserver/tablet_copy_client.cc@714
PS2, Line 714:                               downloaded_blocks_count + 1 , num_blocks));
nit: why not block_count->Load() + 1? downloaded_blocks_count is used only once.



-- 
To view, visit http://gerrit.cloudera.org:8080/16274
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id83abca7a38cf183d9c27d82bb8a022699079e0e
Gerrit-Change-Number: 16274
Gerrit-PatchSet: 2
Gerrit-Owner: wangning <19...@gmail.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Mon, 03 Aug 2020 10:12:38 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1728 parallelize download blocks in tablet-copy-client

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has removed a vote on this change.

Change subject: KUDU-1728 parallelize download blocks in tablet-copy-client
......................................................................


Removed Verified-1 by Kudu Jenkins (120)
-- 
To view, visit http://gerrit.cloudera.org:8080/16274
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: deleteVote
Gerrit-Change-Id: Id83abca7a38cf183d9c27d82bb8a022699079e0e
Gerrit-Change-Number: 16274
Gerrit-PatchSet: 7
Gerrit-Owner: wangning <19...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: wangning <19...@gmail.com>

[kudu-CR] [KUDU-1728] parallelize download blocks in tablet-copy-client

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/16274 )

Change subject: [KUDU-1728] parallelize download blocks in tablet-copy-client
......................................................................


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/16274/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16274/2//COMMIT_MSG@7
PS2, Line 7: parallelize download blocks
Updating the existing tests and adding new test scenarios would be great.  For example, what happens when copying of one block among 10 total fails?


http://gerrit.cloudera.org:8080/#/c/16274/2/src/kudu/tserver/tablet_copy_client.cc
File src/kudu/tserver/tablet_copy_client.cc:

http://gerrit.cloudera.org:8080/#/c/16274/2/src/kudu/tserver/tablet_copy_client.cc@175
PS2, Line 175:   ThreadPoolBuilder("blocks-download-pool")
             :     .set_max_threads(FLAGS_num_threads_blocks_copy)
             :     .set_min_threads(FLAGS_num_threads_blocks_copy)
             :     .Build(&blocks_download_pool_);
Maybe, it's better to instantiate the pool for tablets' blocks copying somewhere at the upper level (e.g., at TSTabletManager level)?  TabletCopyClient are instantiated by TSTabletManager::RunTabletCopy() method, which is run on a threadpool itself.  I think having one pool for all tablet block copying with min_threads set to 0 might be more resource-conserving if thinking about copying small tablets.



-- 
To view, visit http://gerrit.cloudera.org:8080/16274
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id83abca7a38cf183d9c27d82bb8a022699079e0e
Gerrit-Change-Number: 16274
Gerrit-PatchSet: 2
Gerrit-Owner: wangning <19...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Mon, 03 Aug 2020 17:28:46 +0000
Gerrit-HasComments: Yes

[kudu-CR] [KUDU-1728] parallelize download blocks in tablet-copy-client

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/16274 )

Change subject: [KUDU-1728] parallelize download blocks in tablet-copy-client
......................................................................


Patch Set 4:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/16274/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16274/2//COMMIT_MSG@7
PS2, Line 7: parallelize download blocks
> This commit changes the function 'DownloadBlocks' from sequential to parall
I think one of the new tests scenarios to try is to fail downloading just one block out of many (in random sequence, not just the very first or the very last) and verify that it's handled appropriately.  Or you found that's already covered?


http://gerrit.cloudera.org:8080/#/c/16274/2//COMMIT_MSG@9
PS2, Line 9: Download blocks
> As I learned from kudu and some LogFS related paper, WAL is only a small su
If from your experience using parallel downloads for WAL doesn't seem to be a priority, that sounds reasonable to me.  I was just curious whether it was a deliberate decision to not perform parallel copy for WAL files or it was simply overlooked.



-- 
To view, visit http://gerrit.cloudera.org:8080/16274
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id83abca7a38cf183d9c27d82bb8a022699079e0e
Gerrit-Change-Number: 16274
Gerrit-PatchSet: 4
Gerrit-Owner: wangning <19...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: wangning <19...@gmail.com>
Gerrit-Comment-Date: Mon, 10 Aug 2020 21:20:22 +0000
Gerrit-HasComments: Yes

[kudu-CR] [KUDU-1728] parallelize download blocks in tablet-copy-client

Posted by "wangning (Code Review)" <ge...@cloudera.org>.
wangning has posted comments on this change. ( http://gerrit.cloudera.org:8080/16274 )

Change subject: [KUDU-1728] parallelize download blocks in tablet-copy-client
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16274/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16274/2//COMMIT_MSG@7
PS2, Line 7: parallelize download blocks
> Updating the existing tests and adding new test scenarios would be great.  
This commit changes the function 'DownloadBlocks' from sequential to parallel. 
I tried spent hours to read the current tests about tablet_copy_client-test and the integration tests in order to figure out what have to be tested in this patch, but the scenarios like 'dir failed' and verify the data block are covered in existing test. 
So I still get very confused about what I can add for new scenario test, can I get some more hint if possible? thx



-- 
To view, visit http://gerrit.cloudera.org:8080/16274
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id83abca7a38cf183d9c27d82bb8a022699079e0e
Gerrit-Change-Number: 16274
Gerrit-PatchSet: 4
Gerrit-Owner: wangning <19...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: wangning <19...@gmail.com>
Gerrit-Comment-Date: Wed, 05 Aug 2020 10:26:41 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1728 parallelize download blocks in tablet-copy-client

Posted by "wangning (Code Review)" <ge...@cloudera.org>.
wangning has posted comments on this change. ( http://gerrit.cloudera.org:8080/16274 )

Change subject: KUDU-1728 parallelize download blocks in tablet-copy-client
......................................................................


Patch Set 7:

(12 comments)

http://gerrit.cloudera.org:8080/#/c/16274/6/src/kudu/tserver/tablet_copy_client.h
File src/kudu/tserver/tablet_copy_client.h:

http://gerrit.cloudera.org:8080/#/c/16274/6/src/kudu/tserver/tablet_copy_client.h@298
PS6, Line 298: 
> data blocks
Done


http://gerrit.cloudera.org:8080/#/c/16274/6/src/kudu/tserver/tablet_copy_client.h@301
PS6, Line 301: 
> How about:
Done


http://gerrit.cloudera.org:8080/#/c/16274/6/src/kudu/tserver/tablet_copy_client.cc
File src/kudu/tserver/tablet_copy_client.cc:

http://gerrit.cloudera.org:8080/#/c/16274/6/src/kudu/tserver/tablet_copy_client.cc@106
PS6, Line 106: tablet_copy_download_thread
> Rename this into
Done


http://gerrit.cloudera.org:8080/#/c/16274/6/src/kudu/tserver/tablet_copy_client.cc@106
PS6, Line 106: DEFINE_int32
> I think it's worth adding a flag validator to make sure the specified value
Done


http://gerrit.cloudera.org:8080/#/c/16274/6/src/kudu/tserver/tablet_copy_client.cc@107
PS6, Line 107: Number of threads per tablet copy session for downlo
> How about:
Done


http://gerrit.cloudera.org:8080/#/c/16274/6/src/kudu/tserver/tablet_copy_client.cc@194
PS6, Line 194:   if (tablet_copy_metrics_) {
> I think it would be safer to place this _before_ decrementing the metric si
Done


http://gerrit.cloudera.org:8080/#/c/16274/6/src/kudu/tserver/tablet_copy_client.cc@414
PS6, Line 414: tablet_replica_ = tablet_replica;
> How about:
Done


http://gerrit.cloudera.org:8080/#/c/16274/6/src/kudu/tserver/tablet_copy_client.cc@624
PS6, Line 624:     }
             :     ColumnDataPB* dst_col = dst_rowset->add_columns();
             :     *dst_col = src_col;
             :     *dst_col->mutable_block() = new_block_id;
             :   }
             :   for (const DeltaDataPB& src_redo : src_rowset.redo_deltas()) {
             :     BlockIdPB new_block_id;
             :     s = DownloadAndRewriteBlockIfEndStatusOK(src_redo.block(), num_remote_blocks,
             :                                              block_count, &new_block_id, end_status);
             :     i
> This piece looks like a common boilerplate in the code below.  Is it possib
Done


http://gerrit.cloudera.org:8080/#/c/16274/6/src/kudu/tserver/tablet_copy_client.cc@639
PS6, Line 639: 
> Is it safe to probe for 'end_status' here when some other thread might be u
Done


http://gerrit.cloudera.org:8080/#/c/16274/6/src/kudu/tserver/tablet_copy_client.cc@701
PS6, Line 701:   data_id.set_wal_segment_seqno(wal_segment_seqno);
> nit: add 'const' ?
Done


http://gerrit.cloudera.org:8080/#/c/16274/6/src/kudu/tserver/tablet_copy_client.cc@710
PS6, Line 710:                    Substitute("Unable to download WAL segment with seq. numb
> How about:
Done


http://gerrit.cloudera.org:8080/#/c/16274/6/src/kudu/tserver/tablet_copy_client.cc@790
PS6, Line 790: load
> nit: a lock
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/16274
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id83abca7a38cf183d9c27d82bb8a022699079e0e
Gerrit-Change-Number: 16274
Gerrit-PatchSet: 7
Gerrit-Owner: wangning <19...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: wangning <19...@gmail.com>
Gerrit-Comment-Date: Mon, 07 Sep 2020 07:22:54 +0000
Gerrit-HasComments: Yes

[kudu-CR] [KUDU-1728] parallelize download blocks in tablet-copy-client

Posted by "wangning (Code Review)" <ge...@cloudera.org>.
wangning has posted comments on this change. ( http://gerrit.cloudera.org:8080/16274 )

Change subject: [KUDU-1728] parallelize download blocks in tablet-copy-client
......................................................................


Patch Set 4:

> > (2 comments)
 > 
 > Hey @wangning!
 > 
 > Thank you for working on KUDU-1728.  Are you planning to iterate on
 > this patch further any time soon?

sorry about not reply and update on time for i was stuck in another thing (the cipher changed in kudu java client really took me a lot of time since there is no clue to troubleshoot it). i'm going to work on it in few days, actually this commit already worked for few days in our release environment, I can patch it very soon.


-- 
To view, visit http://gerrit.cloudera.org:8080/16274
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id83abca7a38cf183d9c27d82bb8a022699079e0e
Gerrit-Change-Number: 16274
Gerrit-PatchSet: 4
Gerrit-Owner: wangning <19...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: wangning <19...@gmail.com>
Gerrit-Comment-Date: Sat, 29 Aug 2020 08:19:50 +0000
Gerrit-HasComments: No

[kudu-CR] [KUDU-1728] parallelize download blocks in tablet-copy-client

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/16274 )

Change subject: [KUDU-1728] parallelize download blocks in tablet-copy-client
......................................................................


Patch Set 4:

> (2 comments)

Hey @wangning!

Thank you for working on KUDU-1728.  Are you planning to iterate on this patch further any time soon?


-- 
To view, visit http://gerrit.cloudera.org:8080/16274
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id83abca7a38cf183d9c27d82bb8a022699079e0e
Gerrit-Change-Number: 16274
Gerrit-PatchSet: 4
Gerrit-Owner: wangning <19...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: wangning <19...@gmail.com>
Gerrit-Comment-Date: Fri, 28 Aug 2020 17:40:48 +0000
Gerrit-HasComments: No

[kudu-CR] [KUDU-1728] parallelize download blocks in tablet-copy-client

Posted by "wangning (Code Review)" <ge...@cloudera.org>.
Hello Alexey Serbin, Attila Bukor, Kudu Jenkins, Bankim Bhavsar, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16274

to look at the new patch set (#3).

Change subject: [KUDU-1728] parallelize download blocks in tablet-copy-client
......................................................................

[KUDU-1728] parallelize download blocks in tablet-copy-client

Parallelize the action of 'Download blocks' in tablet-copy-client.

Previsouly downloading blocks from tablet server is executed sequentially,
thus actions which related to 'DownloadBlocks' like
'recover from other tserver', 'cluster rebalance' may be slow sometimes.
And sometimes downloading the blocks is slow when only one thread is used
while bandwidth isn't the bottleneck.

This commit introduce FLAGS_num_threads_blocks_download to control the
number of threads for download blocks within a tablet-copy-client.

Here I attached the simple benchmark, the result are averaged by 8 epoch
result. Metrics of result is seconds with accuracy of 2 decimal points.

The experiment is download all data from remote tserver to local tserver.

Settings:
Remote machine: 8 cores 32g tserver limited to 8g, tserver version 1.10
Local machine: 8 cores 12g memory
Tserver has 3 x 500M tablets + 3 x 430M tablets + 3 x 420M tablets and
259 x 8M tablets.
All tablets were compacted well with diskrowset height 1.0.

Here I use two variable to controll the experiment.
numbers of tablet-copy-client thread, write as tc.
numbers of blocks-download thread within each tablet-copy-client thread,
(FLAGS_num_threads_blocks_download), write as bd.

The result 37.58 correspond to column 'tc-2', row 'bd-4' can be explained as,
It takes 37.58s to download all data from tserver with 2
tablet-copy-client threads and each tablet-copy-client thread has 4 threads to
download blocks.

All result in column tc-1 refer to behave before this patch.

time     tc-1      tc-2      tc-4      tc-8

bd-1     76.91     48.50     34.27     32.39
bd-2     54.45     43.26     33.27     32.37
bd-4     48.36     37.58     33.31     32.30
bd-8     47.95     38.36     33.98     32.60

Change-Id: Id83abca7a38cf183d9c27d82bb8a022699079e0e
---
M src/kudu/tserver/tablet_copy_client-test.cc
M src/kudu/tserver/tablet_copy_client.cc
M src/kudu/tserver/tablet_copy_client.h
3 files changed, 174 insertions(+), 60 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/74/16274/3
-- 
To view, visit http://gerrit.cloudera.org:8080/16274
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id83abca7a38cf183d9c27d82bb8a022699079e0e
Gerrit-Change-Number: 16274
Gerrit-PatchSet: 3
Gerrit-Owner: wangning <19...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)

[kudu-CR] [KUDU-1728] parallelize download blocks in tablet-copy-client

Posted by "wangning (Code Review)" <ge...@cloudera.org>.
wangning has posted comments on this change. ( http://gerrit.cloudera.org:8080/16274 )

Change subject: [KUDU-1728] parallelize download blocks in tablet-copy-client
......................................................................


Patch Set 3:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/16274/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16274/2//COMMIT_MSG@9
PS2, Line 9: Download blocks
> What about downloading WAL segments?  Is it worth running it in parallel on
As I learned from kudu and some LogFS related paper, WAL is only a small subset of blocks (max up to 1024M per tablet) and it requires a lot sequential read/write. So when downloading WAL from a tablet, number of threads may not be the bottleneck, while sequential I/O and network bandwidth could be. 
But this are theoretical thought, I don't know if it is right. 
Please help point out if I misunderstood it. 
BTW I will also try to have extra experiment to verify if it's worth to do it.


http://gerrit.cloudera.org:8080/#/c/16274/2//COMMIT_MSG@11
PS2, Line 11: Previsouly downloading blocks from tablet server is executed sequentially,
> nit: Mixed past and present tenses in this sentence.
Done


http://gerrit.cloudera.org:8080/#/c/16274/2//COMMIT_MSG@14
PS2, Line 14: And sometimes downloading the blocks is slow when only one thread is used
> nit: Maybe "sometimes downloading the blocks is slow when only one thread i
Done


http://gerrit.cloudera.org:8080/#/c/16274/2//COMMIT_MSG@33
PS2, Line 33: numbers of tablet-copy-client thread, write as tc.
> What were the numbers before this patch? Also, I'm not sure I understand th
I tried to write more detail about it, please point out where I cannot explain clearly.


http://gerrit.cloudera.org:8080/#/c/16274/2/src/kudu/tserver/tablet_copy_client.h
File src/kudu/tserver/tablet_copy_client.h:

http://gerrit.cloudera.org:8080/#/c/16274/2/src/kudu/tserver/tablet_copy_client.h@282
PS2, Line 282: 
> nit: in parallel.
Done


http://gerrit.cloudera.org:8080/#/c/16274/2/src/kudu/tserver/tablet_copy_client.h@285
PS2, Line 285: std::string session_id_
> nit: add a comment describing what this lock primitive is used for
Done


http://gerrit.cloudera.org:8080/#/c/16274/2/src/kudu/tserver/tablet_copy_client.cc
File src/kudu/tserver/tablet_copy_client.cc:

http://gerrit.cloudera.org:8080/#/c/16274/2/src/kudu/tserver/tablet_copy_client.cc@104
PS2, Line 104: number of 
> Nit: Number instead?
Done


http://gerrit.cloudera.org:8080/#/c/16274/2/src/kudu/tserver/tablet_copy_client.cc@175
PS2, Line 175:   }
             :   ThreadPoolBuilder("blocks-download-pool-" + tablet_id)
             :     .set_max_threads(FLAGS_num_threads_blocks_download)
             :     .set_min_threads(FLAGS_num_thre
> I understand this is improvement is mainly for tablet copies.
It should be reclaimed immediately once tablet-copy-client finished its' work. 
And also, in the DownloadBlocks function, it should wait all blocks to download well before move on.


http://gerrit.cloudera.org:8080/#/c/16274/2/src/kudu/tserver/tablet_copy_client.cc@708
PS2, Line 708: 
> consider using std::atomic instead
Done


http://gerrit.cloudera.org:8080/#/c/16274/2/src/kudu/tserver/tablet_copy_client.cc@714
PS2, Line 714:   for (const RowSetDataPB& src_rowset : remote_superblock_->rowsets()) {
> nit: why not block_count->Load() + 1? downloaded_blocks_count is used only 
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/16274
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id83abca7a38cf183d9c27d82bb8a022699079e0e
Gerrit-Change-Number: 16274
Gerrit-PatchSet: 3
Gerrit-Owner: wangning <19...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: wangning <19...@gmail.com>
Gerrit-Comment-Date: Wed, 05 Aug 2020 08:13:11 +0000
Gerrit-HasComments: Yes

[kudu-CR] [KUDU-1728] parallelize download blocks in tablet-copy-client

Posted by "wangning (Code Review)" <ge...@cloudera.org>.
Hello Alexey Serbin, Attila Bukor, Kudu Jenkins, Bankim Bhavsar, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16274

to look at the new patch set (#5).

Change subject: [KUDU-1728] parallelize download blocks in tablet-copy-client
......................................................................

[KUDU-1728] parallelize download blocks in tablet-copy-client

Parallelize the action of 'Download blocks' in tablet-copy-client.

Previsouly downloading blocks from tablet server is executed sequentially,
thus actions which related to 'DownloadBlocks' like
'recover from other tserver', 'cluster rebalance' may be slow sometimes.
And sometimes downloading the blocks is slow when only one thread is used
while bandwidth isn't the bottleneck.

This commit introduce FLAGS_num_threads_blocks_download to control the
number of threads for download blocks within a tablet-copy-client.

Here I attached the simple benchmark, the result are averaged by 8 epoch
result. Metrics of result is seconds with accuracy of 2 decimal points.

The experiment is download all data from remote tserver to local tserver.

Settings:
Remote machine: 8 cores 32g tserver limited to 8g, tserver version 1.10
Local machine: 8 cores 12g memory
Tserver has 3 x 500M tablets + 3 x 430M tablets + 3 x 420M tablets and
259 x 8M tablets.
All tablets were compacted well with diskrowset height 1.0.

Here I use two variable to controll the experiment.
numbers of tablet-copy-client thread, write as tc.
numbers of blocks-download thread within each tablet-copy-client thread,
(FLAGS_num_threads_blocks_download), write as bd.

The result 37.58 correspond to column 'tc-2', row 'bd-4' can be explained as,
It takes 37.58s to download all data from tserver with 2
tablet-copy-client threads and each tablet-copy-client thread has 4 threads to
download blocks.

All result in column tc-1 refer to behave before this patch.

time     tc-1      tc-2      tc-4      tc-8

bd-1     76.91     48.50     34.27     32.39
bd-2     54.45     43.26     33.27     32.37
bd-4     48.36     37.58     33.31     32.30
bd-8     47.95     38.36     33.98     32.60

Change-Id: Id83abca7a38cf183d9c27d82bb8a022699079e0e
---
M src/kudu/tserver/tablet_copy_client-test.cc
M src/kudu/tserver/tablet_copy_client.cc
M src/kudu/tserver/tablet_copy_client.h
3 files changed, 167 insertions(+), 62 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/74/16274/5
-- 
To view, visit http://gerrit.cloudera.org:8080/16274
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id83abca7a38cf183d9c27d82bb8a022699079e0e
Gerrit-Change-Number: 16274
Gerrit-PatchSet: 5
Gerrit-Owner: wangning <19...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: wangning <19...@gmail.com>

[kudu-CR] [KUDU-1728] parallelize download blocks in tablet-copy-client

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/16274 )

Change subject: [KUDU-1728] parallelize download blocks in tablet-copy-client
......................................................................


Patch Set 2:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/16274/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16274/2//COMMIT_MSG@9
PS2, Line 9: Download blocks
What about downloading WAL segments?  Is it worth running it in parallel on the thread pool as well?


http://gerrit.cloudera.org:8080/#/c/16274/2/src/kudu/tserver/tablet_copy_client.h
File src/kudu/tserver/tablet_copy_client.h:

http://gerrit.cloudera.org:8080/#/c/16274/2/src/kudu/tserver/tablet_copy_client.h@282
PS2, Line 282: in side
nit: in parallel.

?


http://gerrit.cloudera.org:8080/#/c/16274/2/src/kudu/tserver/tablet_copy_client.h@285
PS2, Line 285: simple_spinlock mutex_;
nit: add a comment describing what this lock primitive is used for


http://gerrit.cloudera.org:8080/#/c/16274/2/src/kudu/tserver/tablet_copy_client.cc
File src/kudu/tserver/tablet_copy_client.cc:

http://gerrit.cloudera.org:8080/#/c/16274/2/src/kudu/tserver/tablet_copy_client.cc@655
PS2, Line 655: blocks_download_pool_->Wait();
How does it track per-thread errors (if any happen during the process) and report those to the caller of TabletCopyClient::DownloadBlocks()?


http://gerrit.cloudera.org:8080/#/c/16274/2/src/kudu/tserver/tablet_copy_client.cc@708
PS2, Line 708: AtomicInt
consider using std::atomic instead



-- 
To view, visit http://gerrit.cloudera.org:8080/16274
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id83abca7a38cf183d9c27d82bb8a022699079e0e
Gerrit-Change-Number: 16274
Gerrit-PatchSet: 2
Gerrit-Owner: wangning <19...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Mon, 03 Aug 2020 16:42:59 +0000
Gerrit-HasComments: Yes

[kudu-CR] [KUDU-1728] parallelize download blocks in tablet-copy-client

Posted by "wangning (Code Review)" <ge...@cloudera.org>.
Hello Alexey Serbin, Attila Bukor, Kudu Jenkins, Bankim Bhavsar, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16274

to look at the new patch set (#4).

Change subject: [KUDU-1728] parallelize download blocks in tablet-copy-client
......................................................................

[KUDU-1728] parallelize download blocks in tablet-copy-client

Parallelize the action of 'Download blocks' in tablet-copy-client.

Previsouly downloading blocks from tablet server is executed sequentially,
thus actions which related to 'DownloadBlocks' like
'recover from other tserver', 'cluster rebalance' may be slow sometimes.
And sometimes downloading the blocks is slow when only one thread is used
while bandwidth isn't the bottleneck.

This commit introduce FLAGS_num_threads_blocks_download to control the
number of threads for download blocks within a tablet-copy-client.

Here I attached the simple benchmark, the result are averaged by 8 epoch
result. Metrics of result is seconds with accuracy of 2 decimal points.

The experiment is download all data from remote tserver to local tserver.

Settings:
Remote machine: 8 cores 32g tserver limited to 8g, tserver version 1.10
Local machine: 8 cores 12g memory
Tserver has 3 x 500M tablets + 3 x 430M tablets + 3 x 420M tablets and
259 x 8M tablets.
All tablets were compacted well with diskrowset height 1.0.

Here I use two variable to controll the experiment.
numbers of tablet-copy-client thread, write as tc.
numbers of blocks-download thread within each tablet-copy-client thread,
(FLAGS_num_threads_blocks_download), write as bd.

The result 37.58 correspond to column 'tc-2', row 'bd-4' can be explained as,
It takes 37.58s to download all data from tserver with 2
tablet-copy-client threads and each tablet-copy-client thread has 4 threads to
download blocks.

All result in column tc-1 refer to behave before this patch.

time     tc-1      tc-2      tc-4      tc-8

bd-1     76.91     48.50     34.27     32.39
bd-2     54.45     43.26     33.27     32.37
bd-4     48.36     37.58     33.31     32.30
bd-8     47.95     38.36     33.98     32.60

Change-Id: Id83abca7a38cf183d9c27d82bb8a022699079e0e
---
M src/kudu/tserver/tablet_copy_client-test.cc
M src/kudu/tserver/tablet_copy_client.cc
M src/kudu/tserver/tablet_copy_client.h
3 files changed, 180 insertions(+), 61 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/74/16274/4
-- 
To view, visit http://gerrit.cloudera.org:8080/16274
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id83abca7a38cf183d9c27d82bb8a022699079e0e
Gerrit-Change-Number: 16274
Gerrit-PatchSet: 4
Gerrit-Owner: wangning <19...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: wangning <19...@gmail.com>

[kudu-CR] KUDU-1728 parallelize download blocks in tablet-copy-client

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/16274 )

Change subject: KUDU-1728 parallelize download blocks in tablet-copy-client
......................................................................

KUDU-1728 parallelize download blocks in tablet-copy-client

Parallelize the action of 'Download blocks' in tablet-copy-client.

Previsouly downloading blocks from tablet server is executed sequentially,
thus actions which related to 'DownloadBlocks' like
'recover from other tserver', 'cluster rebalance' may be slow sometimes.
And sometimes downloading the blocks is slow when only one thread is used
while bandwidth isn't the bottleneck.

This commit introduce FLAGS_num_threads_blocks_download to control the
number of threads for download blocks within a tablet-copy-client.

Here I attached the simple benchmark, the result are averaged by 8 epoch
result. Metrics of result is seconds with accuracy of 2 decimal points.

The experiment is download all data from remote tserver to local tserver.

Settings:
Remote machine: 8 cores 32g tserver limited to 8g, tserver version 1.10
Local machine: 8 cores 12g memory
Tserver has 3 x 500M tablets + 3 x 430M tablets + 3 x 420M tablets and
259 x 8M tablets.
All tablets were compacted well with diskrowset height 1.0.

Here I use two variable to controll the experiment.
numbers of tablet-copy-client thread, write as tc.
numbers of blocks-download thread within each tablet-copy-client thread,
(FLAGS_num_threads_blocks_download), write as bd.

The result 37.58 correspond to column 'tc-2', row 'bd-4' can be explained as,
It takes 37.58s to download all data from tserver with 2
tablet-copy-client threads and each tablet-copy-client thread has 4 threads to
download blocks.

All result in column tc-1 refer to behave before this patch.

time     tc-1      tc-2      tc-4      tc-8

bd-1     76.91     48.50     34.27     32.39
bd-2     54.45     43.26     33.27     32.37
bd-4     48.36     37.58     33.31     32.30
bd-8     47.95     38.36     33.98     32.60

Change-Id: Id83abca7a38cf183d9c27d82bb8a022699079e0e
Reviewed-on: http://gerrit.cloudera.org:8080/16274
Tested-by: Alexey Serbin <as...@cloudera.com>
Reviewed-by: Alexey Serbin <as...@cloudera.com>
---
M src/kudu/tserver/tablet_copy_client-test.cc
M src/kudu/tserver/tablet_copy_client.cc
M src/kudu/tserver/tablet_copy_client.h
3 files changed, 183 insertions(+), 63 deletions(-)

Approvals:
  Alexey Serbin: Looks good to me, approved; Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/16274
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Id83abca7a38cf183d9c27d82bb8a022699079e0e
Gerrit-Change-Number: 16274
Gerrit-PatchSet: 8
Gerrit-Owner: wangning <19...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: wangning <19...@gmail.com>

[kudu-CR] KUDU-1728 parallelize download blocks in tablet-copy-client

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/16274 )

Change subject: KUDU-1728 parallelize download blocks in tablet-copy-client
......................................................................


Patch Set 7: Verified+1

Unrelated test failure in RaftConsensusITest.MultiThreadedInsertWithFailovers (ASAN build)


-- 
To view, visit http://gerrit.cloudera.org:8080/16274
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id83abca7a38cf183d9c27d82bb8a022699079e0e
Gerrit-Change-Number: 16274
Gerrit-PatchSet: 7
Gerrit-Owner: wangning <19...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: wangning <19...@gmail.com>
Gerrit-Comment-Date: Thu, 10 Sep 2020 02:34:01 +0000
Gerrit-HasComments: No

[kudu-CR] [KUDU-1728] parallelize download blocks in tablet-copy-client

Posted by "wangning (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16274

to look at the new patch set (#2).

Change subject: [KUDU-1728] parallelize download blocks in tablet-copy-client
......................................................................

[KUDU-1728] parallelize download blocks in tablet-copy-client

Parallelize the action of 'Download blocks' in tablet-copy-client.

Previsouly download blocks from tablet server is sequential
executed, thus actions which strongly relied on 'DownloadBlocks' like
recover from other tserver, cluster rebalance are slow.
Sometimes use only one thread to download those blocks are slow and bandwidth
is not the bottleneck.

Here I attached the simple benchmark, the result are averaged by 8 epoch
result.
The experiment is download all data from remote tserver to local tserver.

settings:
Remote machine: 8 cores 32g tserver limited to 8g, tserver version 1.10
Local machine: 8 cores 12g memory
Tserver has 3 x 500M tablets + 3 x 430M tablets + 3 x 420M tablets and
259 x 8M tablets.
All tablets were compacted well with dirskrowset height 1.0.

Here the number of tablet-copy-client threadpool and the number of
blocks-download threadpool are chosen as variable.
tablet-copy-client thread is write as tc.
block-download thread is write as bd.

Metrics of result is seconds with accuracy of 2 decimal points.

time     tc-1      tc-2      tc-4      tc-8

bd-1     76.91     48.50     34.27     32.39
bd-2     54.45     43.26     33.27     32.37
bd-4     48.36     37.58     33.31     32.30
bd-8     47.95     38.36     33.98     32.60

Change-Id: Id83abca7a38cf183d9c27d82bb8a022699079e0e
---
M src/kudu/tserver/tablet_copy_client.cc
M src/kudu/tserver/tablet_copy_client.h
2 files changed, 86 insertions(+), 59 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/74/16274/2
-- 
To view, visit http://gerrit.cloudera.org:8080/16274
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id83abca7a38cf183d9c27d82bb8a022699079e0e
Gerrit-Change-Number: 16274
Gerrit-PatchSet: 2
Gerrit-Owner: wangning <19...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins (120)

[kudu-CR] KUDU-1728 parallelize download blocks in tablet-copy-client

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/16274 )

Change subject: KUDU-1728 parallelize download blocks in tablet-copy-client
......................................................................


Patch Set 7: Code-Review+2

Thank you very much for your contribution!


-- 
To view, visit http://gerrit.cloudera.org:8080/16274
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id83abca7a38cf183d9c27d82bb8a022699079e0e
Gerrit-Change-Number: 16274
Gerrit-PatchSet: 7
Gerrit-Owner: wangning <19...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: wangning <19...@gmail.com>
Gerrit-Comment-Date: Thu, 10 Sep 2020 02:34:20 +0000
Gerrit-HasComments: No