You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Anonymous Coward (Code Review)" <ge...@cloudera.org> on 2022/01/28 02:52:15 UTC

[kudu-CR] Adding repartitioning logic along with coalesce logic to backup output

mkorad@gmail.com has uploaded this change for review. ( http://gerrit.cloudera.org:8080/18174


Change subject: Adding repartitioning logic along with coalesce logic to backup output
......................................................................

Adding repartitioning logic along with coalesce logic to backup output

Change-Id: I328cb7e41bca14b7b6d73eb7721a86fb86203201
---
M java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala
M java/kudu-backup/src/main/scala/org/apache/kudu/backup/Options.scala
M java/kudu-backup/src/test/scala/org/apache/kudu/backup/TestOptions.scala
3 files changed, 35 insertions(+), 3 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/74/18174/1
-- 
To view, visit http://gerrit.cloudera.org:8080/18174
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I328cb7e41bca14b7b6d73eb7721a86fb86203201
Gerrit-Change-Number: 18174
Gerrit-PatchSet: 1
Gerrit-Owner: Anonymous Coward <mk...@gmail.com>

[kudu-CR] Adding repartitioning logic along with coalesce logic to backup output

Posted by "Anonymous Coward (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/18174

to look at the new patch set (#2).

Change subject: Adding repartitioning logic along with coalesce logic to backup output
......................................................................

Adding repartitioning logic along with coalesce logic to backup output

We optionally use the coalesce and repartitions options in the BackUpKudu Spark command.
For every release we have to add this commit to our internal release. 
Request to get this commit in apache/kudu to avoid having to add this commit for every new kudu release

Adding repartition logic along with coalesce to output files
Both the above params are optional.
Coalesce takes precedence over repartition if both of them are defined.

Testing

sudo /mnt/services/spark/bin/run-transform-cluster-mode-on report-center-batch-driver --stack rcspark_envoy --executor-cores 8 --total-executor-cores 32 --executor-memory 55g --driver-memory 55g --conf spark.log4j.logger.org.apache.spark=WARN --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 --conf spark.speculation=false --class com.twilio.backup.BackupKuduTable /mnt/services/report-center-batch-indexer/appJar/spark-report-center-batch-indexer-shaded.jar --kuduMasterAddresses report-center-leader-5.us1.twilio.com,report-center-leader-4.us1.twilio.com,report-center-leader-3.us1.twilio.com,report-center-leader-2.us1.twilio.com,report-center-leader-1.us1.twilio.com --splitSizeBytes 1000000000 --scanRequestTimeoutMs 60000000 --coalesceOutputPartitions 32 --rootPath s3a://com.twilio.prod.warehouse/data/report-center/kudu-table-backup/ BillableItemUsageCategories
2022-01-27 08:07:01,015 - root - INFO:  TIMEOUT is None, status check interval 60, job file None and connection retry to spark REST API 5 and arguments to job ['--executor-cores', '8', '--total-executor-cores', '32', '--executor-memory', '55g', '--driver-memory', '55g', '--conf', 'spark.log4j.logger.org.apache.spark=WARN', '--conf', 'spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2', '--conf', 'spark.speculation=false', '--class', 'com.twilio.backup.BackupKuduTable', '/mnt/services/report-center-batch-indexer/appJar/spark-report-center-batch-indexer-shaded.jar', '--kuduMasterAddresses', 'report-center-leader-5.us1.twilio.com,report-center-leader-4.us1.twilio.com,report-center-leader-3.us1.twilio.com,report-center-leader-2.us1.twilio.com,report-center-leader-1.us1.twilio.com', '--splitSizeBytes', '1000000000', '--scanRequestTimeoutMs', '60000000', '--coalesceOutputPartitions', '32', '--rootPath', 's3a://com.twilio.prod.warehouse/data/report-center/kudu-table-backup/', 'BillableItemUsageCategories']
2022-01-27 08:07:01,772 - root - INFO:  Job submitted as driver-20220127080701-17528
2022-01-27 08:08:02,680 - root - INFO:  Job submission [driver-20220127080701-17528] alive with state RUNNING on worker-20220127062817-172.25.72.200-7078
2022-01-27 08:09:03,707 - root - INFO:  Job submission [driver-20220127080701-17528] completed (state: FINISHED)
Finished: SUCCESS


Change-Id: I328cb7e41bca14b7b6d73eb7721a86fb86203201
---
M java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala
M java/kudu-backup/src/main/scala/org/apache/kudu/backup/Options.scala
M java/kudu-backup/src/test/scala/org/apache/kudu/backup/TestOptions.scala
3 files changed, 35 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/74/18174/2
-- 
To view, visit http://gerrit.cloudera.org:8080/18174
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I328cb7e41bca14b7b6d73eb7721a86fb86203201
Gerrit-Change-Number: 18174
Gerrit-PatchSet: 2
Gerrit-Owner: Anonymous Coward <mk...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins (120)

[kudu-CR] Adding repartitioning logic along with coalesce logic to backup output

Posted by "Anonymous Coward (Code Review)" <ge...@cloudera.org>.
Hello Attila Bukor, Kudu Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/18174

to look at the new patch set (#6).

Change subject: Adding repartitioning logic along with coalesce logic to backup output
......................................................................

Adding repartitioning logic along with coalesce logic to backup output

Jira: https://issues.apache.org/jira/browse/KUDU-3309


Adding repartition logic along with coalesce to output files
Both the above parameters are optional.
Coalesce takes precedence over repartition if both of them are defined.
Repartition can be used to increase or decrease the number of output files.
Coalesce can be used to reduce the number of output files.


Testing

sudo /mnt/services/spark/bin/run-transform-cluster-mode-on report-center-batch-driver --stack rcspark_envoy --executor-cores 8 --total-executor-cores 32 --executor-memory 55g --driver-memory 55g --conf spark.log4j.logger.org.apache.spark=WARN --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 --conf spark.speculation=false --class com.twilio.backup.BackupKuduTable /mnt/services/report-center-batch-indexer/appJar/spark-report-center-batch-indexer-shaded.jar --kuduMasterAddresses report-center-leader-5.us1.twilio.com,report-center-leader-4.us1.twilio.com,report-center-leader-3.us1.twilio.com,report-center-leader-2.us1.twilio.com,report-center-leader-1.us1.twilio.com --splitSizeBytes 1000000000 --scanRequestTimeoutMs 60000000 --coalesceOutputPartitions 32 --rootPath s3a://com.twilio.prod.warehouse/data/report-center/kudu-table-backup/ BillableItemUsageCategories
2022-01-27 08:07:01,015 - root - INFO:  TIMEOUT is None, status check interval 60, job file None and connection retry to spark REST API 5 and arguments to job ['--executor-cores', '8', '--total-executor-cores', '32', '--executor-memory', '55g', '--driver-memory', '55g', '--conf', 'spark.log4j.logger.org.apache.spark=WARN', '--conf', 'spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2', '--conf', 'spark.speculation=false', '--class', 'com.twilio.backup.BackupKuduTable', '/mnt/services/report-center-batch-indexer/appJar/spark-report-center-batch-indexer-shaded.jar', '--kuduMasterAddresses', 'report-center-leader-5.us1.twilio.com,report-center-leader-4.us1.twilio.com,report-center-leader-3.us1.twilio.com,report-center-leader-2.us1.twilio.com,report-center-leader-1.us1.twilio.com', '--splitSizeBytes', '1000000000', '--scanRequestTimeoutMs', '60000000', '--coalesceOutputPartitions', '32', '--rootPath', 's3a://com.twilio.prod.warehouse/data/report-center/kudu-table-backup/', 'BillableItemUsageCategories']
2022-01-27 08:07:01,772 - root - INFO:  Job submitted as driver-20220127080701-17528
2022-01-27 08:08:02,680 - root - INFO:  Job submission [driver-20220127080701-17528] alive with state RUNNING on worker-20220127062817-172.25.72.200-7078
2022-01-27 08:09:03,707 - root - INFO:  Job submission [driver-20220127080701-17528] completed (state: FINISHED)
Finished: SUCCESS


Change-Id: I328cb7e41bca14b7b6d73eb7721a86fb86203201
---
M java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala
M java/kudu-backup/src/main/scala/org/apache/kudu/backup/Options.scala
M java/kudu-backup/src/test/scala/org/apache/kudu/backup/TestOptions.scala
3 files changed, 35 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/74/18174/6
-- 
To view, visit http://gerrit.cloudera.org:8080/18174
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I328cb7e41bca14b7b6d73eb7721a86fb86203201
Gerrit-Change-Number: 18174
Gerrit-PatchSet: 6
Gerrit-Owner: Anonymous Coward <mk...@twilio.com>
Gerrit-Reviewer: Anonymous Coward <mk...@twilio.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)

[kudu-CR] [java] KUDU-3309 Add repartition and coalesce options to backup output

Posted by "Anonymous Coward (Code Review)" <ge...@cloudera.org>.
mkorad@twilio.com has posted comments on this change. ( http://gerrit.cloudera.org:8080/18174 )

Change subject: [java] KUDU-3309 Add repartition and coalesce options to backup output
......................................................................


Patch Set 7:

> Patch Set 6:
> 
> (1 comment)
> 
> Thank you for updating addressing my comments. We tend to keep the code and the test in the same commit, can you squash these two commits and abandon the follow-up commit's review on gerrit to keep them together?

Squashed all the commits together as pointed out


-- 
To view, visit http://gerrit.cloudera.org:8080/18174
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I328cb7e41bca14b7b6d73eb7721a86fb86203201
Gerrit-Change-Number: 18174
Gerrit-PatchSet: 7
Gerrit-Owner: Anonymous Coward <mk...@twilio.com>
Gerrit-Reviewer: Anonymous Coward <mk...@twilio.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Thu, 03 Feb 2022 23:54:21 +0000
Gerrit-HasComments: No

[kudu-CR] [java] KUDU-3309 Add repartition and coalesce options to backup output

Posted by "Anonymous Coward (Code Review)" <ge...@cloudera.org>.
Hello Attila Bukor, Kudu Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/18174

to look at the new patch set (#7).

Change subject: [java] KUDU-3309 Add repartition and coalesce options to backup output
......................................................................

[java] KUDU-3309 Add repartition and coalesce options to backup output

Change-Id: I328cb7e41bca14b7b6d73eb7721a86fb86203201
---
M java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala
M java/kudu-backup/src/main/scala/org/apache/kudu/backup/Options.scala
M java/kudu-backup/src/test/scala/org/apache/kudu/backup/TestKuduBackup.scala
M java/kudu-backup/src/test/scala/org/apache/kudu/backup/TestOptions.scala
4 files changed, 55 insertions(+), 7 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/74/18174/7
-- 
To view, visit http://gerrit.cloudera.org:8080/18174
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I328cb7e41bca14b7b6d73eb7721a86fb86203201
Gerrit-Change-Number: 18174
Gerrit-PatchSet: 7
Gerrit-Owner: Anonymous Coward <mk...@twilio.com>
Gerrit-Reviewer: Anonymous Coward <mk...@twilio.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)

[kudu-CR] Adding repartitioning logic along with coalesce logic to backup output

Posted by "Anonymous Coward (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/18174

to look at the new patch set (#3).

Change subject: Adding repartitioning logic along with coalesce logic to backup output
......................................................................

Adding repartitioning logic along with coalesce logic to backup output

We optionally use the coalesce and repartitions options in the BackupKuduTable Spark command.
For every release we have to add this commit to our internal release. 
Request to get this commit in apache/kudu to avoid having to add this commit for every new kudu release

Adding repartition logic along with coalesce to output files
Both the above params are optional.
Coalesce takes precedence over repartition if both of them are defined.

Testing

sudo /mnt/services/spark/bin/run-transform-cluster-mode-on report-center-batch-driver --stack rcspark_envoy --executor-cores 8 --total-executor-cores 32 --executor-memory 55g --driver-memory 55g --conf spark.log4j.logger.org.apache.spark=WARN --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 --conf spark.speculation=false --class com.twilio.backup.BackupKuduTable /mnt/services/report-center-batch-indexer/appJar/spark-report-center-batch-indexer-shaded.jar --kuduMasterAddresses report-center-leader-5.us1.twilio.com,report-center-leader-4.us1.twilio.com,report-center-leader-3.us1.twilio.com,report-center-leader-2.us1.twilio.com,report-center-leader-1.us1.twilio.com --splitSizeBytes 1000000000 --scanRequestTimeoutMs 60000000 --coalesceOutputPartitions 32 --rootPath s3a://com.twilio.prod.warehouse/data/report-center/kudu-table-backup/ BillableItemUsageCategories
2022-01-27 08:07:01,015 - root - INFO:  TIMEOUT is None, status check interval 60, job file None and connection retry to spark REST API 5 and arguments to job ['--executor-cores', '8', '--total-executor-cores', '32', '--executor-memory', '55g', '--driver-memory', '55g', '--conf', 'spark.log4j.logger.org.apache.spark=WARN', '--conf', 'spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2', '--conf', 'spark.speculation=false', '--class', 'com.twilio.backup.BackupKuduTable', '/mnt/services/report-center-batch-indexer/appJar/spark-report-center-batch-indexer-shaded.jar', '--kuduMasterAddresses', 'report-center-leader-5.us1.twilio.com,report-center-leader-4.us1.twilio.com,report-center-leader-3.us1.twilio.com,report-center-leader-2.us1.twilio.com,report-center-leader-1.us1.twilio.com', '--splitSizeBytes', '1000000000', '--scanRequestTimeoutMs', '60000000', '--coalesceOutputPartitions', '32', '--rootPath', 's3a://com.twilio.prod.warehouse/data/report-center/kudu-table-backup/', 'BillableItemUsageCategories']
2022-01-27 08:07:01,772 - root - INFO:  Job submitted as driver-20220127080701-17528
2022-01-27 08:08:02,680 - root - INFO:  Job submission [driver-20220127080701-17528] alive with state RUNNING on worker-20220127062817-172.25.72.200-7078
2022-01-27 08:09:03,707 - root - INFO:  Job submission [driver-20220127080701-17528] completed (state: FINISHED)
Finished: SUCCESS


Change-Id: I328cb7e41bca14b7b6d73eb7721a86fb86203201
---
M java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala
M java/kudu-backup/src/main/scala/org/apache/kudu/backup/Options.scala
M java/kudu-backup/src/test/scala/org/apache/kudu/backup/TestOptions.scala
3 files changed, 35 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/74/18174/3
-- 
To view, visit http://gerrit.cloudera.org:8080/18174
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I328cb7e41bca14b7b6d73eb7721a86fb86203201
Gerrit-Change-Number: 18174
Gerrit-PatchSet: 3
Gerrit-Owner: Anonymous Coward <mk...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins (120)

[kudu-CR] Adding repartitioning logic along with coalesce logic to backup output

Posted by "Anonymous Coward (Code Review)" <ge...@cloudera.org>.
mkorad@twilio.com has posted comments on this change. ( http://gerrit.cloudera.org:8080/18174 )

Change subject: Adding repartitioning logic along with coalesce logic to backup output
......................................................................


Patch Set 6:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/18174/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18174/4//COMMIT_MSG@10
PS4, Line 10: 
> Ack
I have modified to description to add details about the coalesce and repartition options


http://gerrit.cloudera.org:8080/#/c/18174/4//COMMIT_MSG@17
PS4, Line 17: 
> Ack
Added assert in existing tests with the new options.


http://gerrit.cloudera.org:8080/#/c/18174/4/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala
File java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala:

http://gerrit.cloudera.org:8080/#/c/18174/4/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala@95
PS4, Line 95:     // Coalesce takes precedence over repartition.
> Maybe it should fail, or at least emit a warning if both are specified?
I think a warning would be good, added a warning if both coalesce and repartition are specified


http://gerrit.cloudera.org:8080/#/c/18174/4/java/kudu-backup/src/main/scala/org/apache/kudu/backup/Options.scala
File java/kudu-backup/src/main/scala/org/apache/kudu/backup/Options.scala:

http://gerrit.cloudera.org:8080/#/c/18174/4/java/kudu-backup/src/main/scala/org/apache/kudu/backup/Options.scala@174
PS4, Line 174:         .text("Sets the repartition output. Maximum number of output. Default: " + DefaultRepartitionOutput)
> nit: line too long
Fixed



-- 
To view, visit http://gerrit.cloudera.org:8080/18174
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I328cb7e41bca14b7b6d73eb7721a86fb86203201
Gerrit-Change-Number: 18174
Gerrit-PatchSet: 6
Gerrit-Owner: Anonymous Coward <mk...@twilio.com>
Gerrit-Reviewer: Anonymous Coward <mk...@twilio.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Tue, 01 Feb 2022 05:56:21 +0000
Gerrit-HasComments: Yes

[kudu-CR] Adding repartitioning logic along with coalesce logic to backup output

Posted by "Anonymous Coward (Code Review)" <ge...@cloudera.org>.
Hello Attila Bukor, Kudu Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/18174

to look at the new patch set (#5).

Change subject: Adding repartitioning logic along with coalesce logic to backup output
......................................................................

Adding repartitioning logic along with coalesce logic to backup output

Jira: https://issues.apache.org/jira/browse/KUDU-3309

We optionally use the coalesce and repartitions options in the BackupKuduTable Spark command.
For every release we have to add this commit to our internal release. 
Request to get this commit in apache/kudu to avoid having to add this commit for every new kudu release

Adding repartition logic along with coalesce to output files
Both the above parameterss are optional.
Coalesce takes precedence over repartition if both of them are defined.

Testing

sudo /mnt/services/spark/bin/run-transform-cluster-mode-on report-center-batch-driver --stack rcspark_envoy --executor-cores 8 --total-executor-cores 32 --executor-memory 55g --driver-memory 55g --conf spark.log4j.logger.org.apache.spark=WARN --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 --conf spark.speculation=false --class com.twilio.backup.BackupKuduTable /mnt/services/report-center-batch-indexer/appJar/spark-report-center-batch-indexer-shaded.jar --kuduMasterAddresses report-center-leader-5.us1.twilio.com,report-center-leader-4.us1.twilio.com,report-center-leader-3.us1.twilio.com,report-center-leader-2.us1.twilio.com,report-center-leader-1.us1.twilio.com --splitSizeBytes 1000000000 --scanRequestTimeoutMs 60000000 --coalesceOutputPartitions 32 --rootPath s3a://com.twilio.prod.warehouse/data/report-center/kudu-table-backup/ BillableItemUsageCategories
2022-01-27 08:07:01,015 - root - INFO:  TIMEOUT is None, status check interval 60, job file None and connection retry to spark REST API 5 and arguments to job ['--executor-cores', '8', '--total-executor-cores', '32', '--executor-memory', '55g', '--driver-memory', '55g', '--conf', 'spark.log4j.logger.org.apache.spark=WARN', '--conf', 'spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2', '--conf', 'spark.speculation=false', '--class', 'com.twilio.backup.BackupKuduTable', '/mnt/services/report-center-batch-indexer/appJar/spark-report-center-batch-indexer-shaded.jar', '--kuduMasterAddresses', 'report-center-leader-5.us1.twilio.com,report-center-leader-4.us1.twilio.com,report-center-leader-3.us1.twilio.com,report-center-leader-2.us1.twilio.com,report-center-leader-1.us1.twilio.com', '--splitSizeBytes', '1000000000', '--scanRequestTimeoutMs', '60000000', '--coalesceOutputPartitions', '32', '--rootPath', 's3a://com.twilio.prod.warehouse/data/report-center/kudu-table-backup/', 'BillableItemUsageCategories']
2022-01-27 08:07:01,772 - root - INFO:  Job submitted as driver-20220127080701-17528
2022-01-27 08:08:02,680 - root - INFO:  Job submission [driver-20220127080701-17528] alive with state RUNNING on worker-20220127062817-172.25.72.200-7078
2022-01-27 08:09:03,707 - root - INFO:  Job submission [driver-20220127080701-17528] completed (state: FINISHED)
Finished: SUCCESS


Change-Id: I328cb7e41bca14b7b6d73eb7721a86fb86203201
---
M java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala
M java/kudu-backup/src/main/scala/org/apache/kudu/backup/Options.scala
M java/kudu-backup/src/test/scala/org/apache/kudu/backup/TestOptions.scala
3 files changed, 35 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/74/18174/5
-- 
To view, visit http://gerrit.cloudera.org:8080/18174
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I328cb7e41bca14b7b6d73eb7721a86fb86203201
Gerrit-Change-Number: 18174
Gerrit-PatchSet: 5
Gerrit-Owner: Anonymous Coward <mk...@twilio.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)

[kudu-CR] Adding repartitioning logic along with coalesce logic to backup output

Posted by "Attila Bukor (Code Review)" <ge...@cloudera.org>.
Attila Bukor has posted comments on this change. ( http://gerrit.cloudera.org:8080/18174 )

Change subject: Adding repartitioning logic along with coalesce logic to backup output
......................................................................


Patch Set 6:

(1 comment)

Thank you for updating addressing my comments. We tend to keep the code and the test in the same commit, can you squash these two commits and abandon the follow-up commit's review on gerrit to keep them together?

http://gerrit.cloudera.org:8080/#/c/18174/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18174/4//COMMIT_MSG@7
PS4, Line 7: Adding repartitioning logic along with coalesce logic to backup output
> Thanks for the review, I had created a Jira, added it to the description
Thanks for creating the JIRA. Normally, the JIRA ID is simply added to the subject of the commit message, and also a tag is added. This would look like something like this:

"[java] KUDU-3309 Add repartition and coalesce to backup output"



-- 
To view, visit http://gerrit.cloudera.org:8080/18174
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I328cb7e41bca14b7b6d73eb7721a86fb86203201
Gerrit-Change-Number: 18174
Gerrit-PatchSet: 6
Gerrit-Owner: Anonymous Coward <mk...@twilio.com>
Gerrit-Reviewer: Anonymous Coward <mk...@twilio.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Tue, 01 Feb 2022 10:57:30 +0000
Gerrit-HasComments: Yes

[kudu-CR] Adding repartitioning logic along with coalesce logic to backup output

Posted by "Attila Bukor (Code Review)" <ge...@cloudera.org>.
Attila Bukor has posted comments on this change. ( http://gerrit.cloudera.org:8080/18174 )

Change subject: Adding repartitioning logic along with coalesce logic to backup output
......................................................................


Patch Set 4:

(5 comments)

Thank you for your contribution. It seems TestOption is failing, which should be fixed, and I also left a few comments.

http://gerrit.cloudera.org:8080/#/c/18174/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18174/4//COMMIT_MSG@7
PS4, Line 7: Adding repartitioning logic along with coalesce logic to backup output
As it's a new feature, I think it would be useful to have a JIRA created for this and its ID referenced here.


http://gerrit.cloudera.org:8080/#/c/18174/4//COMMIT_MSG@10
PS4, Line 10: For every release we have to add this commit to our internal release. 
This part doesn't seem useful in the commit message once it's merged. Maybe you could rephrase it to include the motivation of why these new options are useful and how they work.

style nit: lines are too long, the body should be wrapped at 72 lines (except for log lines), and there should be no trailing whitespaces.


http://gerrit.cloudera.org:8080/#/c/18174/4//COMMIT_MSG@17
PS4, Line 17: Testing
Can you add a test instead?


http://gerrit.cloudera.org:8080/#/c/18174/4/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala
File java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala:

http://gerrit.cloudera.org:8080/#/c/18174/4/java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala@95
PS4, Line 95:     // Coalesce takes precedence over repartition.
Maybe it should fail, or at least emit a warning if both are specified?


http://gerrit.cloudera.org:8080/#/c/18174/4/java/kudu-backup/src/main/scala/org/apache/kudu/backup/Options.scala
File java/kudu-backup/src/main/scala/org/apache/kudu/backup/Options.scala:

http://gerrit.cloudera.org:8080/#/c/18174/4/java/kudu-backup/src/main/scala/org/apache/kudu/backup/Options.scala@174
PS4, Line 174:         .text("Sets the repartition output. Maximum number of output. Default: " + DefaultRepartitionOutput)
nit: line too long



-- 
To view, visit http://gerrit.cloudera.org:8080/18174
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I328cb7e41bca14b7b6d73eb7721a86fb86203201
Gerrit-Change-Number: 18174
Gerrit-PatchSet: 4
Gerrit-Owner: Anonymous Coward <mk...@twilio.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Fri, 28 Jan 2022 14:51:24 +0000
Gerrit-HasComments: Yes

[kudu-CR] [java] KUDU-3309 Add repartition and coalesce options to backup output

Posted by "Attila Bukor (Code Review)" <ge...@cloudera.org>.
Attila Bukor has posted comments on this change. ( http://gerrit.cloudera.org:8080/18174 )

Change subject: [java] KUDU-3309 Add repartition and coalesce options to backup output
......................................................................


Patch Set 7:

> Patch Set 7:
> 
> > Patch Set 6:
> > 
> > (1 comment)
> > 
> > Thank you for updating addressing my comments. We tend to keep the code and the test in the same commit, can you squash these two commits and abandon the follow-up commit's review on gerrit to keep them together?
> 
> Squashed all the commits together as pointed out

I think there's been a misunderstanding. The point is that there should be a single commit (this), all follow-up fixes and test additions squashed into it, and the remaining commits abandoned on gerrit.


-- 
To view, visit http://gerrit.cloudera.org:8080/18174
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I328cb7e41bca14b7b6d73eb7721a86fb86203201
Gerrit-Change-Number: 18174
Gerrit-PatchSet: 7
Gerrit-Owner: Anonymous Coward <mk...@twilio.com>
Gerrit-Reviewer: Anonymous Coward <mk...@twilio.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Mon, 07 Feb 2022 12:28:16 +0000
Gerrit-HasComments: No

[kudu-CR] Adding repartitioning logic along with coalesce logic to backup output

Posted by "Anonymous Coward (Code Review)" <ge...@cloudera.org>.
mkorad@twilio.com has posted comments on this change. ( http://gerrit.cloudera.org:8080/18174 )

Change subject: Adding repartitioning logic along with coalesce logic to backup output
......................................................................


Patch Set 5:

(3 comments)

Thank you for the review. Working on addressing the comments

http://gerrit.cloudera.org:8080/#/c/18174/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18174/4//COMMIT_MSG@7
PS4, Line 7: Adding repartitioning logic along with coalesce logic to backup output
> As it's a new feature, I think it would be useful to have a JIRA created fo
Thanks for the review, I had created a Jira, added it to the description


http://gerrit.cloudera.org:8080/#/c/18174/4//COMMIT_MSG@10
PS4, Line 10: 
> This part doesn't seem useful in the commit message once it's merged. Maybe
Ack


http://gerrit.cloudera.org:8080/#/c/18174/4//COMMIT_MSG@17
PS4, Line 17: Coalesce takes precedence over repartition if both of them are defined.
> Can you add a test instead?
Ack



-- 
To view, visit http://gerrit.cloudera.org:8080/18174
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I328cb7e41bca14b7b6d73eb7721a86fb86203201
Gerrit-Change-Number: 18174
Gerrit-PatchSet: 5
Gerrit-Owner: Anonymous Coward <mk...@twilio.com>
Gerrit-Reviewer: Anonymous Coward <mk...@twilio.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Tue, 01 Feb 2022 04:49:38 +0000
Gerrit-HasComments: Yes

[kudu-CR] Adding repartitioning logic along with coalesce logic to backup output

Posted by "Anonymous Coward (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/18174

to look at the new patch set (#4).

Change subject: Adding repartitioning logic along with coalesce logic to backup output
......................................................................

Adding repartitioning logic along with coalesce logic to backup output

We optionally use the coalesce and repartitions options in the BackupKuduTable Spark command.
For every release we have to add this commit to our internal release. 
Request to get this commit in apache/kudu to avoid having to add this commit for every new kudu release

Adding repartition logic along with coalesce to output files
Both the above parameterss are optional.
Coalesce takes precedence over repartition if both of them are defined.

Testing

sudo /mnt/services/spark/bin/run-transform-cluster-mode-on report-center-batch-driver --stack rcspark_envoy --executor-cores 8 --total-executor-cores 32 --executor-memory 55g --driver-memory 55g --conf spark.log4j.logger.org.apache.spark=WARN --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 --conf spark.speculation=false --class com.twilio.backup.BackupKuduTable /mnt/services/report-center-batch-indexer/appJar/spark-report-center-batch-indexer-shaded.jar --kuduMasterAddresses report-center-leader-5.us1.twilio.com,report-center-leader-4.us1.twilio.com,report-center-leader-3.us1.twilio.com,report-center-leader-2.us1.twilio.com,report-center-leader-1.us1.twilio.com --splitSizeBytes 1000000000 --scanRequestTimeoutMs 60000000 --coalesceOutputPartitions 32 --rootPath s3a://com.twilio.prod.warehouse/data/report-center/kudu-table-backup/ BillableItemUsageCategories
2022-01-27 08:07:01,015 - root - INFO:  TIMEOUT is None, status check interval 60, job file None and connection retry to spark REST API 5 and arguments to job ['--executor-cores', '8', '--total-executor-cores', '32', '--executor-memory', '55g', '--driver-memory', '55g', '--conf', 'spark.log4j.logger.org.apache.spark=WARN', '--conf', 'spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2', '--conf', 'spark.speculation=false', '--class', 'com.twilio.backup.BackupKuduTable', '/mnt/services/report-center-batch-indexer/appJar/spark-report-center-batch-indexer-shaded.jar', '--kuduMasterAddresses', 'report-center-leader-5.us1.twilio.com,report-center-leader-4.us1.twilio.com,report-center-leader-3.us1.twilio.com,report-center-leader-2.us1.twilio.com,report-center-leader-1.us1.twilio.com', '--splitSizeBytes', '1000000000', '--scanRequestTimeoutMs', '60000000', '--coalesceOutputPartitions', '32', '--rootPath', 's3a://com.twilio.prod.warehouse/data/report-center/kudu-table-backup/', 'BillableItemUsageCategories']
2022-01-27 08:07:01,772 - root - INFO:  Job submitted as driver-20220127080701-17528
2022-01-27 08:08:02,680 - root - INFO:  Job submission [driver-20220127080701-17528] alive with state RUNNING on worker-20220127062817-172.25.72.200-7078
2022-01-27 08:09:03,707 - root - INFO:  Job submission [driver-20220127080701-17528] completed (state: FINISHED)
Finished: SUCCESS


Change-Id: I328cb7e41bca14b7b6d73eb7721a86fb86203201
---
M java/kudu-backup/src/main/scala/org/apache/kudu/backup/KuduBackup.scala
M java/kudu-backup/src/main/scala/org/apache/kudu/backup/Options.scala
M java/kudu-backup/src/test/scala/org/apache/kudu/backup/TestOptions.scala
3 files changed, 35 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/74/18174/4
-- 
To view, visit http://gerrit.cloudera.org:8080/18174
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I328cb7e41bca14b7b6d73eb7721a86fb86203201
Gerrit-Change-Number: 18174
Gerrit-PatchSet: 4
Gerrit-Owner: Anonymous Coward <mk...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins (120)