Posted to reviews@kudu.apache.org by "Mike Percy (Code Review)" <ge...@cloudera.org> on 2018/08/07 19:24:07 UTC

[kudu-CR] WIP: Create parallelized loader Spark job

Mike Percy has uploaded this change for review. ( http://gerrit.cloudera.org:8080/11147 )


Change subject: WIP: Create parallelized loader Spark job
......................................................................

WIP: Create parallelized loader Spark job

This patch adds a new tool called LoadRandomData that will load random
data into a table in a parallelized manner. Perhaps the tool should be
named something else so it would be straightforward to add a sequential
inserter that supports a range-partitioned table.

Marked WIP because it could use more testing.

Change-Id: I4434c9f5d709154037386b4c7be94045df162267
---
A java/kudu-client-tools/src/main/java/org/apache/kudu/mapreduce/tools/RandomDataGenerator.java
M java/kudu-spark-tools/build.gradle
M java/kudu-spark-tools/pom.xml
A java/kudu-spark-tools/src/main/scala/org/apache/kudu/spark/tools/LoadRandomData.scala
A java/kudu-spark-tools/src/main/scala/org/apache/kudu/spark/tools/LoadRandomDataOptions.scala
A java/kudu-spark-tools/src/test/scala/org/apache/kudu/spark/tools/LoadRandomDataTest.scala
6 files changed, 294 insertions(+), 0 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/47/11147/1
-- 
To view, visit http://gerrit.cloudera.org:8080/11147
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I4434c9f5d709154037386b4c7be94045df162267
Gerrit-Change-Number: 11147
Gerrit-PatchSet: 1
Gerrit-Owner: Mike Percy <mp...@apache.org>

[kudu-CR] WIP: Create parallelized loader Spark job

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.

Hello Kudu Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/11147

to look at the new patch set (#2).

Change subject: WIP: Create parallelized loader Spark job
......................................................................

WIP: Create parallelized loader Spark job

This patch adds a new tool called LoadRandomData that will load random
data into a table in a parallelized manner. Perhaps the tool should be
named something else so it would be straightforward to add a sequential
inserter that supports a range-partitioned table.

Marked WIP because it could use more testing, and because the
RandomDataGenerator was copy-pasted from /examples ... maybe it should
just be deleted from examples.

Change-Id: I4434c9f5d709154037386b4c7be94045df162267
---
A java/kudu-client-tools/src/main/java/org/apache/kudu/mapreduce/tools/RandomDataGenerator.java
M java/kudu-spark-tools/build.gradle
M java/kudu-spark-tools/pom.xml
A java/kudu-spark-tools/src/main/scala/org/apache/kudu/spark/tools/LoadRandomData.scala
A java/kudu-spark-tools/src/main/scala/org/apache/kudu/spark/tools/LoadRandomDataOptions.scala
A java/kudu-spark-tools/src/test/scala/org/apache/kudu/spark/tools/LoadRandomDataTest.scala
6 files changed, 293 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/47/11147/2

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4434c9f5d709154037386b4c7be94045df162267
Gerrit-Change-Number: 11147
Gerrit-PatchSet: 2
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Kudu Jenkins

[kudu-CR] Create parallelized loader Spark job

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.

Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/11147 )

Change subject: Create parallelized loader Spark job
......................................................................


Patch Set 4:

(10 comments)

Not sure if you meant to abandon this or not; seemed like that was an accident.

http://gerrit.cloudera.org:8080/#/c/11147/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/11147/4//COMMIT_MSG@14
PS4, Line 14: marke
marked


http://gerrit.cloudera.org:8080/#/c/11147/4/java/kudu-spark-tools/src/main/scala/org/apache/kudu/spark/tools/DistributedDataGenerator.scala
File java/kudu-spark-tools/src/main/scala/org/apache/kudu/spark/tools/DistributedDataGenerator.scala:

PS4: 
License header.


http://gerrit.cloudera.org:8080/#/c/11147/4/java/kudu-spark-tools/src/main/scala/org/apache/kudu/spark/tools/DistributedDataGenerator.scala@24
PS4, Line 24: case class TableOptions(
            :     numPartitions: Int,
            :     replicationFactor: Int,
            :     numColumns: Int,
            :     intColumnPercentage: Float)
Unused?


http://gerrit.cloudera.org:8080/#/c/11147/4/java/kudu-spark-tools/src/main/scala/org/apache/kudu/spark/tools/DistributedDataGenerator.scala@46
PS4, Line 46:     val kuduClient = new KuduClientBuilder(options.masterAddresses).build()
Why does this use the KuduClient directly instead of using KuduContext or something like that? Will this work with a Kerberized cluster?


http://gerrit.cloudera.org:8080/#/c/11147/4/java/kudu-spark-tools/src/main/scala/org/apache/kudu/spark/tools/DistributedDataGenerator.scala@75
PS4, Line 75: isServiceUnavailable
Is this really indicative of a collision?
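
For context on the question above: a primary-key collision would normally be reported as an "already present" row error, while "service unavailable" usually points at a transient condition. The Kudu Java client's Status class exposes isAlreadyPresent() alongside isServiceUnavailable(). A minimal sketch of that distinction follows; StatusCode, ErrorClassifier, and the retry grouping are hypothetical stand-ins for illustration, not the client's own types or the patch's code.

```scala
// Hypothetical stand-in for the status attached to a row error.
// This is NOT the Kudu client's Status class.
sealed trait StatusCode
case object AlreadyPresent extends StatusCode
case object ServiceUnavailable extends StatusCode
case object TimedOut extends StatusCode

object ErrorClassifier {
  // Only a duplicate primary key is treated as a collision.
  def isCollision(code: StatusCode): Boolean = code == AlreadyPresent

  // Transient conditions that a caller might retry instead of
  // counting them as collisions (an assumption for this sketch).
  def isRetriable(code: StatusCode): Boolean = code match {
    case ServiceUnavailable | TimedOut => true
    case AlreadyPresent => false
  }
}
```

With the real client, the equivalent check would presumably go through the row error's status rather than a stand-in type.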


http://gerrit.cloudera.org:8080/#/c/11147/4/java/kudu-spark-tools/src/main/scala/org/apache/kudu/spark/tools/DistributedDataGenerator.scala@83
PS4, Line 83:       rowsWritten += 1
The subtraction followed by the addition is a little weird. Maybe you can add the happy path into the if/else and increment rowsWritten there?
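
One way to read this suggestion is to increment each counter in exactly one branch, so nothing is added and then subtracted back. The sketch below assumes a hypothetical WriteResult type standing in for the outcome of applying a single insert. The patch's actual code is not shown in this thread, so every name here other than rowsWritten is an illustration only.

```scala
// Hypothetical outcome of applying one insert; not a Kudu type.
sealed trait WriteResult
case object Written extends WriteResult
case object Collision extends WriteResult
final case class Failed(message: String) extends WriteResult

final case class Counts(rowsWritten: Long, collisions: Long)

object RowCounting {
  // Each outcome touches exactly one counter, so there is no
  // compensating subtraction on the error path.
  def tally(results: Iterable[WriteResult]): Counts = {
    var rowsWritten = 0L
    var collisions = 0L
    results.foreach {
      case Written => rowsWritten += 1 // happy path
      case Collision => collisions += 1 // duplicate key, row not counted
      case Failed(message) =>
        throw new RuntimeException(s"write failed: $message")
    }
    Counts(rowsWritten, collisions)
  }
}
```

For example, RowCounting.tally(Seq(Written, Collision, Written)) yields Counts(2, 1).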


http://gerrit.cloudera.org:8080/#/c/11147/4/java/kudu-spark-tools/src/main/scala/org/apache/kudu/spark/tools/DistributedDataGenerator.scala@218
PS4, Line 218: Defaults to
Nit: Default: ...

(to be consistent with the other options here.)


http://gerrit.cloudera.org:8080/#/c/11147/4/java/kudu-spark-tools/src/test/scala/org/apache/kudu/spark/tools/DistributedDataGeneratorTest.scala
File java/kudu-spark-tools/src/test/scala/org/apache/kudu/spark/tools/DistributedDataGeneratorTest.scala:

PS4: 
License header.


http://gerrit.cloudera.org:8080/#/c/11147/4/java/kudu-spark-tools/src/test/scala/org/apache/kudu/spark/tools/DistributedDataGeneratorTest.scala@19
PS4, Line 19:   private val TABLE_SCHEMA: Schema = {
Can't you use your fancy schema generator here to get better coverage?


http://gerrit.cloudera.org:8080/#/c/11147/4/java/kudu-spark-tools/src/test/scala/org/apache/kudu/spark/tools/DistributedDataGeneratorTest.scala@32
PS4, Line 32:     .setNumReplicas(1)
Why this?




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4434c9f5d709154037386b4c7be94045df162267
Gerrit-Change-Number: 11147
Gerrit-PatchSet: 4
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Tue, 18 Dec 2018 00:32:59 +0000
Gerrit-HasComments: Yes

[kudu-CR] Create parallelized loader Spark job

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.

Hello Kudu Jenkins, Grant Henke, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/11147

to look at the new patch set (#3).

Change subject: Create parallelized loader Spark job
......................................................................

Create parallelized loader Spark job

This patch adds a new tool called DistributedDataGenerator that will
load random data into a table in a parallelized manner.

TODO: remove some copy/pasted code

Change-Id: I4434c9f5d709154037386b4c7be94045df162267
---
A java/kudu-client-tools/src/main/java/org/apache/kudu/mapreduce/tools/ColumnDataGenerator.java
A java/kudu-client-tools/src/main/java/org/apache/kudu/mapreduce/tools/RandomDataGenerator.java
A java/kudu-client-tools/src/main/java/org/apache/kudu/mapreduce/tools/SequentialDataGenerator.java
M java/kudu-spark-tools/build.gradle
M java/kudu-spark-tools/pom.xml
A java/kudu-spark-tools/src/main/scala/org/apache/kudu/spark/tools/DistributedDataGenerator.scala
A java/kudu-spark-tools/src/test/scala/org/apache/kudu/spark/tools/DistributedDataGeneratorTest.scala
7 files changed, 424 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/47/11147/3

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4434c9f5d709154037386b4c7be94045df162267
Gerrit-Change-Number: 11147
Gerrit-PatchSet: 3
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins

[kudu-CR] Create parallelized loader Spark job

Posted by "Grant Henke (Code Review)" <ge...@cloudera.org>.

Grant Henke has uploaded a new patch set (#4) to the change originally created by Mike Percy. ( http://gerrit.cloudera.org:8080/11147 )

Change subject: Create parallelized loader Spark job
......................................................................

Create parallelized loader Spark job

This patch adds a new DistributedDataGenerator tool
that can load random or sequential data into an existing
Kudu table.

This tool was written to help test the backup and restore
tools. It is currently marke private but could be made
public in the future.

Change-Id: I4434c9f5d709154037386b4c7be94045df162267
---
M java/kudu-spark-tools/build.gradle
A java/kudu-spark-tools/src/main/scala/org/apache/kudu/spark/tools/DistributedDataGenerator.scala
A java/kudu-spark-tools/src/test/scala/org/apache/kudu/spark/tools/DistributedDataGeneratorTest.scala
3 files changed, 293 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/47/11147/4

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4434c9f5d709154037386b4c7be94045df162267
Gerrit-Change-Number: 11147
Gerrit-PatchSet: 4
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)

[kudu-CR] Create parallelized loader Spark job

Posted by "Grant Henke (Code Review)" <ge...@cloudera.org>.

Grant Henke has abandoned this change. ( http://gerrit.cloudera.org:8080/11147 )

Change subject: Create parallelized loader Spark job
......................................................................


Abandoned

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: abandon
Gerrit-Change-Id: I4434c9f5d709154037386b4c7be94045df162267
Gerrit-Change-Number: 11147
Gerrit-PatchSet: 4
Gerrit-Owner: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)