Posted to reviews@kudu.apache.org by "Will Berkeley (Code Review)" <ge...@cloudera.org> on 2019/05/24 21:01:14 UTC

[kudu-CR] [backup] KUDU-2786 Parallelize tables for backup and restore

Will Berkeley has uploaded this change for review. ( http://gerrit.cloudera.org:8080/13430 )


Change subject: [backup] KUDU-2786 Parallelize tables for backup and restore
......................................................................

[backup] KUDU-2786 Parallelize tables for backup and restore

This patch adds a hidden, experimental option to run backups and
restores parallel across tables. Managing resources across parallel
backups and restores is very difficult: the sizes of tables in terms of
number of tablets and size of tables can vary by orders of magnitude
across a cluster, and there are many resources which may be constrained
depending on many factors: CPU, memory, disk I/O, network, number of
executors available. This patch doesn't do resource management. It will
kick off the jobs in parallel, and it's up to Spark to manage the
resources of parallel jobs. Maybe this will work well, maybe it won't...
that's why this is just experimental.

I tested manually on a Spark cluster to verify that jobs are actually
run in parallel.

Change-Id: I79043b73bf4ecfa11b51f16a7f4369f93357029f

foo

Change-Id: Ib02f26fbfd6a714ad0797f8b5ed1eeeb8fd6e371

b

Change-Id: I02f0a818a6fa372ab3c696c11882284877ce207e
---
M java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala
M java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduRestore.scala
M java/kudu-backup/src/main/scala/org/apache/kudu/backup/Options.scala
M java/kudu-backup/src/test/scala/org/apache/kudu/backup/TestKuduBackup.scala
4 files changed, 69 insertions(+), 14 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/30/13430/1
-- 
To view, visit http://gerrit.cloudera.org:8080/13430
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I02f0a818a6fa372ab3c696c11882284877ce207e
Gerrit-Change-Number: 13430
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley <wd...@gmail.com>

[kudu-CR] [backup] KUDU-2786 Parallelize tables for backup and restore

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/13430 )

Change subject: [backup] KUDU-2786 Parallelize tables for backup and restore
......................................................................


Patch Set 2: Verified+1

Overriding flaky test



Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02f0a818a6fa372ab3c696c11882284877ce207e
Gerrit-Change-Number: 13430
Gerrit-PatchSet: 2
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Thu, 30 May 2019 17:23:26 +0000
Gerrit-HasComments: No

[kudu-CR] [backup] KUDU-2786 Parallelize tables for backup and restore

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/13430 )

Change subject: [backup] KUDU-2786 Parallelize tables for backup and restore
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/13430/1/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala
File java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala:

http://gerrit.cloudera.org:8080/#/c/13430/1/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala@128
PS1, Line 128:     val pool = new ForkJoinPool(options.numParallelBackups) // Need a clean-up reference.
> +1 to what Grant said. Separate jobs can fail separately.
Ah, I have never used the FJP. I guess I've been away from Java for too long. Thanks for the link.




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02f0a818a6fa372ab3c696c11882284877ce207e
Gerrit-Change-Number: 13430
Gerrit-PatchSet: 2
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Thu, 30 May 2019 17:21:04 +0000
Gerrit-HasComments: Yes

[kudu-CR] [backup] KUDU-2786 Parallelize tables for backup and restore

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has removed a vote on this change.

Change subject: [backup] KUDU-2786 Parallelize tables for backup and restore
......................................................................


Removed Verified-1 by Kudu Jenkins (120)

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: deleteVote
Gerrit-Change-Id: I02f0a818a6fa372ab3c696c11882284877ce207e
Gerrit-Change-Number: 13430
Gerrit-PatchSet: 2
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>

[kudu-CR] [backup] KUDU-2786 Parallelize tables for backup and restore

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/13430 )

Change subject: [backup] KUDU-2786 Parallelize tables for backup and restore
......................................................................


Patch Set 2: Code-Review+2



Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02f0a818a6fa372ab3c696c11882284877ce207e
Gerrit-Change-Number: 13430
Gerrit-PatchSet: 2
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Thu, 30 May 2019 17:22:35 +0000
Gerrit-HasComments: No

[kudu-CR] [backup] KUDU-2786 Parallelize tables for backup and restore

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/13430 )

Change subject: [backup] KUDU-2786 Parallelize tables for backup and restore
......................................................................


Patch Set 1:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/13430/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/13430/1//COMMIT_MSG@10
PS1, Line 10:  parallel
in parallel


http://gerrit.cloudera.org:8080/#/c/13430/1//COMMIT_MSG@11
PS1, Line 11: is very difficult
How would queueing and resource management of this be any different than doing a single backup or restore of a large table with hundreds or thousands of partitions?

The main thing I can think of is that one would have to allocate to Spark sufficient memory to handle backing up and restoring the widest table with the heaviest cells, so basically lowest common denominator == highest memory required. Anything else?


http://gerrit.cloudera.org:8080/#/c/13430/1//COMMIT_MSG@24
PS1, Line 24: 
            : foo
            : 
            : Change-Id: Ib02f26fbfd6a714ad0797f8b5ed1eeeb8fd6e371
            : 
            : b
            : 
            : Change-Id: I02f0a818a6fa372ab3c696c11882284877ce207e
nit: remove these remnants of a git squash


http://gerrit.cloudera.org:8080/#/c/13430/1/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala
File java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala:

http://gerrit.cloudera.org:8080/#/c/13430/1/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala@128
PS1, Line 128:     val pool = new ForkJoinPool(options.numParallelBackups) // Need a clean-up reference.
Can you talk a little about the tradeoffs involved in submitting parallel jobs vs adding support for running a single Spark job that handles multiple tables? The latter would seem more natural to me. I also wonder what the performance implications of fork() in the context of a driver running on YARN are, especially on RHEL 6.




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02f0a818a6fa372ab3c696c11882284877ce207e
Gerrit-Change-Number: 13430
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Comment-Date: Tue, 28 May 2019 23:23:08 +0000
Gerrit-HasComments: Yes

[kudu-CR] [backup] KUDU-2786 Parallelize tables for backup and restore

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/13430 )

Change subject: [backup] KUDU-2786 Parallelize tables for backup and restore
......................................................................


Patch Set 1:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/13430/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/13430/1//COMMIT_MSG@10
PS1, Line 10:  parallel
> in parallel
Done


http://gerrit.cloudera.org:8080/#/c/13430/1//COMMIT_MSG@11
PS1, Line 11: is very difficult
> How would queueing and resource management of this be any different than do
I'm not sure, but I know there's a lot I don't know about Spark, and I know we haven't tested this very much.


http://gerrit.cloudera.org:8080/#/c/13430/1//COMMIT_MSG@24
PS1, Line 24: 
            : foo
            : 
            : Change-Id: Ib02f26fbfd6a714ad0797f8b5ed1eeeb8fd6e371
            : 
            : b
            : 
            : Change-Id: I02f0a818a6fa372ab3c696c11882284877ce207e
> nit: remove these remnants of a git squash
:(


http://gerrit.cloudera.org:8080/#/c/13430/1/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala
File java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala:

http://gerrit.cloudera.org:8080/#/c/13430/1/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala@128
PS1, Line 128:     val pool = new ForkJoinPool(options.numParallelBackups) // Need a clean-up reference.
> Due to Data input and output format and layout assumptions it's easier to k
+1 to what Grant said. Separate jobs can fail separately.

Re: the ForkJoinPool, I don't think that the "Fork" here is the syscall fork. Of course the threads in the pool will be forked (or cloned) at some point, but I don't think the pool is forking for every task. See this for an explanation of why the pool is called a ForkJoinPool: http://tutorials.jenkov.com/java-util-concurrent/java-fork-and-join-forkjoinpool.html.

Also, this is the default type of ExecutorService used, not a specific choice by me. I needed to configure the parallelism explicitly, else I could have dispensed with configuring my own pool. That would have resulted in parallelism equal to the number of processors on the driver node, which doesn't have much to do with the parallelism one might want in the restore job.
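
The two points above (a pool sized explicitly instead of defaulting to the driver's processor count, and separate jobs failing separately) can be sketched as follows. This is only an illustration under stated assumptions, not code from the patch: `ParallelTablesSketch`, `backupTable`, `runAll`, and `numParallel` are hypothetical names, and the per-table work is a stand-in for the Spark job the real tool would launch.

```scala
import java.util.concurrent.ForkJoinPool
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import scala.util.{Success, Try}

object ParallelTablesSketch {
  // Hypothetical stand-in for one per-table job; the real tool would run a Spark job here.
  def backupTable(table: String): String = {
    require(table.nonEmpty, "table name must not be empty")
    s"backed up $table"
  }

  // Runs one task per table on a pool of `numParallel` worker threads and
  // returns each table's outcome, so one failure does not hide the others.
  def runAll(tables: Seq[String], numParallel: Int): Seq[(String, Try[String])] = {
    val pool = new ForkJoinPool(numParallel)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      val futures = tables.map { table =>
        // transform turns success or failure into a value, so the Future itself never fails.
        Future(backupTable(table)).transform(result => Success(table -> result))
      }
      Await.result(Future.sequence(futures), 1.minute)
    } finally {
      pool.shutdown() // release the pool's worker threads once all tables are done
    }
  }

  def main(args: Array[String]): Unit = {
    runAll(Seq("table_a", "", "table_c"), numParallel = 2).foreach {
      case (table, outcome) => println(s"'$table' -> $outcome")
    }
  }
}
```

In this sketch the empty table name fails on its own while the other two tables still complete, and the tasks are ordinary work items scheduled on the pool's threads rather than forked processes.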




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02f0a818a6fa372ab3c696c11882284877ce207e
Gerrit-Change-Number: 13430
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Wed, 29 May 2019 18:05:21 +0000
Gerrit-HasComments: Yes

[kudu-CR] [backup] KUDU-2786 Parallelize tables for backup and restore

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Hello Mike Percy, Kudu Jenkins, Grant Henke, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/13430

to look at the new patch set (#2).

Change subject: [backup] KUDU-2786 Parallelize tables for backup and restore
......................................................................

[backup] KUDU-2786 Parallelize tables for backup and restore

This patch adds a hidden, experimental option to run backups and
restores in parallel across tables. Managing resources across parallel
backups and restores is very difficult: the sizes of tables in terms of
number of tablets and size of tables can vary by orders of magnitude
across a cluster, and there are many resources which may be constrained
depending on many factors: CPU, memory, disk I/O, network, number of
executors available. This patch doesn't do resource management. It will
kick off the jobs in parallel, and it's up to Spark to manage the
resources of parallel jobs. Maybe this will work well, maybe it won't...
that's why this is just experimental.

I tested manually on a Spark cluster to verify that jobs are actually
run in parallel.

Change-Id: I02f0a818a6fa372ab3c696c11882284877ce207e
---
M java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala
M java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduRestore.scala
M java/kudu-backup/src/main/scala/org/apache/kudu/backup/Options.scala
M java/kudu-backup/src/test/scala/org/apache/kudu/backup/TestKuduBackup.scala
4 files changed, 69 insertions(+), 14 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/30/13430/2

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I02f0a818a6fa372ab3c696c11882284877ce207e
Gerrit-Change-Number: 13430
Gerrit-PatchSet: 2
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>

[kudu-CR] [backup] KUDU-2786 Parallelize tables for backup and restore

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/13430 )

Change subject: [backup] KUDU-2786 Parallelize tables for backup and restore
......................................................................

[backup] KUDU-2786 Parallelize tables for backup and restore

This patch adds a hidden, experimental option to run backups and
restores in parallel across tables. Managing resources across parallel
backups and restores is very difficult: the sizes of tables in terms of
number of tablets and size of tables can vary by orders of magnitude
across a cluster, and there are many resources which may be constrained
depending on many factors: CPU, memory, disk I/O, network, number of
executors available. This patch doesn't do resource management. It will
kick off the jobs in parallel, and it's up to Spark to manage the
resources of parallel jobs. Maybe this will work well, maybe it won't...
that's why this is just experimental.

I tested manually on a Spark cluster to verify that jobs are actually
run in parallel.

Change-Id: I02f0a818a6fa372ab3c696c11882284877ce207e
Reviewed-on: http://gerrit.cloudera.org:8080/13430
Reviewed-by: Mike Percy <mp...@apache.org>
Tested-by: Mike Percy <mp...@apache.org>
---
M java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala
M java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduRestore.scala
M java/kudu-backup/src/main/scala/org/apache/kudu/backup/Options.scala
M java/kudu-backup/src/test/scala/org/apache/kudu/backup/TestKuduBackup.scala
4 files changed, 69 insertions(+), 14 deletions(-)

Approvals:
  Mike Percy: Looks good to me, approved; Verified


Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I02f0a818a6fa372ab3c696c11882284877ce207e
Gerrit-Change-Number: 13430
Gerrit-PatchSet: 3
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>

[kudu-CR] [backup] KUDU-2786 Parallelize tables for backup and restore

Posted by "Grant Henke (Code Review)" <ge...@cloudera.org>.
Grant Henke has posted comments on this change. ( http://gerrit.cloudera.org:8080/13430 )

Change subject: [backup] KUDU-2786 Parallelize tables for backup and restore
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/13430/1/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala
File java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala:

http://gerrit.cloudera.org:8080/#/c/13430/1/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala@128
PS1, Line 128:     val pool = new ForkJoinPool(options.numParallelBackups) // Need a clean-up reference.
> Can you talk a little about the tradeoffs involved in submitting parallel j
Because of assumptions about the data input and output format and layout, it's easier to keep these isolated as separate Spark jobs. Keeping them separate also makes debugging and failure detection easier.




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02f0a818a6fa372ab3c696c11882284877ce207e
Gerrit-Change-Number: 13430
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Comment-Date: Wed, 29 May 2019 15:45:56 +0000
Gerrit-HasComments: Yes