Posted to reviews@kudu.apache.org by "Grant Henke (Code Review)" <ge...@cloudera.org> on 2018/12/14 17:10:00 UTC

[kudu-CR] KUDU-2640: Add Spark Structured Streaming Sink

Grant Henke has uploaded this change for review. ( http://gerrit.cloudera.org:8080/12087 )


Change subject: KUDU-2640: Add Spark Structured Streaming Sink
......................................................................

KUDU-2640: Add Spark Structured Streaming Sink

This patche adds a KuduSink and implements the
Spark StreamSinkProvider interface to support
structured streaming writes to Kudu.

These writes behave the same as writes performed
via the InsertableRelation interface.

Note: The StreamSinkProvider interface is
marked as experimental, but it has existed since
Spark 2.0.0 and is used by the Kafka integration.

Change-Id: I731e35f82c8cca7d911e4d879aa6853112132b17
---
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/StreamingTest.scala
2 files changed, 67 insertions(+), 37 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/87/12087/1
-- 
To view, visit http://gerrit.cloudera.org:8080/12087
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I731e35f82c8cca7d911e4d879aa6853112132b17
Gerrit-Change-Number: 12087
Gerrit-PatchSet: 1
Gerrit-Owner: Grant Henke <gr...@apache.org>

[kudu-CR] KUDU-2640: Add Spark Structured Streaming Sink

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12087 )

Change subject: KUDU-2640: Add Spark Structured Streaming Sink
......................................................................


Patch Set 2:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/12087/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12087/2//COMMIT_MSG@9
PS2, Line 9: patche
typo


http://gerrit.cloudera.org:8080/#/c/12087/2/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
File java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala:

http://gerrit.cloudera.org:8080/#/c/12087/2/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala@466
PS2, Line 466: batchId: Long
> May be obvious, but mind adding a small note on why we shouldn't use this? 
Yeah, a comment would be nice. I'm assuming this is for de-duplication in the case of Spark Streaming retrying a batch operation.


http://gerrit.cloudera.org:8080/#/c/12087/2/java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/StreamingTest.scala
File java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/StreamingTest.scala:

http://gerrit.cloudera.org:8080/#/c/12087/2/java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/StreamingTest.scala@38
PS2, Line 38:   def testKuduContextWithSparkStreaming() {
seems to be missing a structured streaming test using a streaming SQL query




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I731e35f82c8cca7d911e4d879aa6853112132b17
Gerrit-Change-Number: 12087
Gerrit-PatchSet: 2
Gerrit-Owner: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Comment-Date: Wed, 02 Jan 2019 19:54:11 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2640: Add Spark Structured Streaming Sink

Posted by "Grant Henke (Code Review)" <ge...@cloudera.org>.
Hello Mike Percy, Kudu Jenkins, Andrew Wong, Hao Hao, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12087

to look at the new patch set (#4).

Change subject: KUDU-2640: Add Spark Structured Streaming Sink
......................................................................

KUDU-2640: Add Spark Structured Streaming Sink

This patch adds a KuduSink and implements the
Spark StreamSinkProvider interface to support
structured streaming writes to Kudu.

These writes behave the same as writes performed
via the InsertableRelation interface.

Note: The StreamSinkProvider interface is
marked as experimental and unstable, but it has existed
since Spark 2.0.0 and is used by the Kafka integration.
Additionally, per SPARK-26415, there will be no
more Spark 2.x minor releases and Spark would
not break this API in a maintenance release. This
means the interface is effectively stable for all
of Spark 2.

Change-Id: I731e35f82c8cca7d911e4d879aa6853112132b17
---
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/StreamingTest.scala
2 files changed, 72 insertions(+), 37 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/87/12087/4

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I731e35f82c8cca7d911e4d879aa6853112132b17
Gerrit-Change-Number: 12087
Gerrit-PatchSet: 4
Gerrit-Owner: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>

[kudu-CR] KUDU-2640: Add Spark Structured Streaming Sink

Posted by "Grant Henke (Code Review)" <ge...@cloudera.org>.
Hello Mike Percy, Kudu Jenkins, Andrew Wong, Hao Hao, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12087

to look at the new patch set (#2).

Change subject: KUDU-2640: Add Spark Structured Streaming Sink
......................................................................

KUDU-2640: Add Spark Structured Streaming Sink

This patche adds a KuduSink and implements the
Spark StreamSinkProvider interface to support
structured streaming writes to Kudu.

These writes behave the same as writes performed
via the InsertableRelation interface.

Note: The StreamSinkProvider interface is
marked as experimental and unstable, but it has existed
since Spark 2.0.0 and is used by the Kafka integration.
Additionally, per SPARK-26415, there will be no
more Spark 2.x minor releases and Spark would
not break this API in a maintenance release. This
means the interface is effectively stable for all
of Spark 2.

Change-Id: I731e35f82c8cca7d911e4d879aa6853112132b17
---
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/StreamingTest.scala
2 files changed, 65 insertions(+), 37 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/87/12087/2

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I731e35f82c8cca7d911e4d879aa6853112132b17
Gerrit-Change-Number: 12087
Gerrit-PatchSet: 2
Gerrit-Owner: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>

[kudu-CR] KUDU-2640: Add Spark Structured Streaming Sink

Posted by "Grant Henke (Code Review)" <ge...@cloudera.org>.
Hello Mike Percy, Kudu Jenkins, Andrew Wong, Hao Hao, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12087

to look at the new patch set (#5).

Change subject: KUDU-2640: Add Spark Structured Streaming Sink
......................................................................

KUDU-2640: Add Spark Structured Streaming Sink

This patch adds a KuduSink and implements the
Spark StreamSinkProvider interface to support
structured streaming writes to Kudu.

These writes behave the same as writes performed
via the InsertableRelation interface.

Note: The StreamSinkProvider interface is
marked as experimental and unstable, but it has existed
since Spark 2.0.0 and is used by the Kafka integration.
Additionally, per SPARK-26415, there will be no
more Spark 2.x minor releases and Spark would
not break this API in a maintenance release. This
means the interface is effectively stable for all
of Spark 2.

Change-Id: I731e35f82c8cca7d911e4d879aa6853112132b17
---
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/StreamingTest.scala
2 files changed, 70 insertions(+), 37 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/87/12087/5

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I731e35f82c8cca7d911e4d879aa6853112132b17
Gerrit-Change-Number: 12087
Gerrit-PatchSet: 5
Gerrit-Owner: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>

[kudu-CR] KUDU-2640: Add Spark Structured Streaming Sink

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12087 )

Change subject: KUDU-2640: Add Spark Structured Streaming Sink
......................................................................


Patch Set 4: Code-Review+1



Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I731e35f82c8cca7d911e4d879aa6853112132b17
Gerrit-Change-Number: 12087
Gerrit-PatchSet: 4
Gerrit-Owner: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Comment-Date: Wed, 09 Jan 2019 22:03:53 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2640: Add Spark Structured Streaming Sink

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/12087 )

Change subject: KUDU-2640: Add Spark Structured Streaming Sink
......................................................................


Patch Set 5: Code-Review+2



Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I731e35f82c8cca7d911e4d879aa6853112132b17
Gerrit-Change-Number: 12087
Gerrit-PatchSet: 5
Gerrit-Owner: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Comment-Date: Fri, 11 Jan 2019 01:16:03 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2640: Add Spark Structured Streaming Sink

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/12087 )

Change subject: KUDU-2640: Add Spark Structured Streaming Sink
......................................................................


Patch Set 4:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/12087/2/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
File java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala:

http://gerrit.cloudera.org:8080/#/c/12087/2/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala@216
PS2, Line 216:   private def getOperationType(parameters: Map[String, String]): OperationType = {
             :     parameters.get(OPERATION).map(stringToOperationType).getOrElse(Upsert)
             :   }
> I didn't change this behavior. I just refactored it into the method from ab
Ah I missed the old L105. SGTM.


http://gerrit.cloudera.org:8080/#/c/12087/4/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
File java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala:

http://gerrit.cloudera.org:8080/#/c/12087/4/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala@449
PS4, Line 449:  * In order to preserve exactly once semantics a sink must be idempotent in the face of
             :  * multiple attempts to add the same batch.
             :  *
             :  * Insert ignore support (KUDU-1563) would be useful, but while that doesn't exist
             :  * using upsert will work. Delete ignore would also be useful.
We chatted about this offline. I think it'd be helpful to throw in some context about what Spark is doing, which would clarify why we don't need the batchId and how users should think about the KuduSink options (especially `operationType`). My attempt at a revised class-level doc:

"Sinks provide at-least-once semantics by retrying failed batches, and provide a `batchId` interface to implement exactly-once semantics. Since Kudu does not internally track batch IDs, this is ignored, and it is up to the user to specify an appropriate `operationType` to achieve the desired semantics when adding batches. The default `Upsert` allows KuduSink to handle duplicate data and such retries.

Insert ignore support (KUDU-1563) would be useful, but while that doesn't exist, using Upsert will work. Delete ignore would also be useful."
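
A minimal sketch of the retry behaviour that doc comment describes, using a toy in-memory table rather than the Kudu client. `ToyTable`, `RetryDemo`, and their methods are hypothetical names made up for illustration; the only point taken from the discussion is that replaying a batch with upsert leaves the table unchanged, while replaying it with plain insert does not succeed cleanly.

```scala
import scala.collection.mutable

// Toy stand-in for a table keyed by primary key. This is NOT the Kudu API.
final class ToyTable {
  private val rows = mutable.Map.empty[Int, String]

  // Insert rejects a duplicate key, so a replayed batch produces errors.
  def insert(key: Int, value: String): Either[String, Unit] =
    if (rows.contains(key)) Left(s"key $key already present")
    else { rows.update(key, value); Right(()) }

  // Upsert overwrites, so applying the same row twice changes nothing.
  def upsert(key: Int, value: String): Unit = rows.update(key, value)

  def snapshot: Map[Int, String] = rows.toMap
}

object RetryDemo {
  val batch: Seq[(Int, String)] = Seq(1 -> "a", 2 -> "b")

  // Apply the same batch `attempts` times with upsert, as a retrying sink might.
  def replayWithUpsert(attempts: Int): Map[Int, String] = {
    val table = new ToyTable
    (1 to attempts).foreach(_ => batch.foreach { case (k, v) => table.upsert(k, v) })
    table.snapshot
  }

  // Apply the batch once, then replay it with insert and collect the errors.
  def replayWithInsert(): Seq[String] = {
    val table = new ToyTable
    batch.foreach { case (k, v) => table.insert(k, v) }
    batch.flatMap { case (k, v) => table.insert(k, v).left.toOption }
  }
}
```

With upsert the table state is the same after one attempt or several, which is the idempotence the sink relies on when Spark retries a batch.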




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I731e35f82c8cca7d911e4d879aa6853112132b17
Gerrit-Change-Number: 12087
Gerrit-PatchSet: 4
Gerrit-Owner: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Comment-Date: Wed, 09 Jan 2019 23:55:37 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2640: Add Spark Structured Streaming Sink

Posted by "Grant Henke (Code Review)" <ge...@cloudera.org>.
Grant Henke has removed a vote on this change.

Change subject: KUDU-2640: Add Spark Structured Streaming Sink
......................................................................


Removed Verified-1 by Kudu Jenkins (120)

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: deleteVote
Gerrit-Change-Id: I731e35f82c8cca7d911e4d879aa6853112132b17
Gerrit-Change-Number: 12087
Gerrit-PatchSet: 4
Gerrit-Owner: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>

[kudu-CR] KUDU-2640: Add Spark Structured Streaming Sink

Posted by "Grant Henke (Code Review)" <ge...@cloudera.org>.
Grant Henke has posted comments on this change. ( http://gerrit.cloudera.org:8080/12087 )

Change subject: KUDU-2640: Add Spark Structured Streaming Sink
......................................................................


Patch Set 2:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/12087/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12087/2//COMMIT_MSG@9
PS2, Line 9: patche
> typo
Done


http://gerrit.cloudera.org:8080/#/c/12087/2/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
File java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala:

http://gerrit.cloudera.org:8080/#/c/12087/2/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala@216
PS2, Line 216:   private def getOperationType(parameters: Map[String, String]): OperationType = {
             :     parameters.get(OPERATION).map(stringToOperationType).getOrElse(Upsert)
             :   }
> Hrm, I get why this is the case for KuduSink, but should it be the case for
I didn't change this behavior. I just refactored it into the method from above. The reason upsert is the default is that an upsert is needed in order to correctly handle Spark retries. A better choice might be insert ignore, which is tracked by KUDU-1563.
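
A self-contained sketch of the defaulting logic in the quoted method, for readers without the source open. The `OperationType` hierarchy, the `"kudu.operation"` key, and the accepted strings below are assumptions made for illustration and may not match the connector; only the `get(...).map(...).getOrElse(Upsert)` shape comes from the quoted code.

```scala
// Hypothetical stand-ins for the connector's operation types.
sealed trait OperationType
case object Insert extends OperationType
case object Upsert extends OperationType
case object Update extends OperationType
case object Delete extends OperationType

object OperationOptions {
  // Assumed option key, for illustration only.
  val OPERATION = "kudu.operation"

  // Assumed string-to-type mapping; unknown values are rejected loudly.
  def stringToOperationType(opParam: String): OperationType =
    opParam.toLowerCase match {
      case "insert" => Insert
      case "upsert" => Upsert
      case "update" => Update
      case "delete" => Delete
      case other =>
        throw new IllegalArgumentException(s"Unsupported operation type '$other'")
    }

  // Same shape as the method under review: fall back to Upsert when unset.
  def getOperationType(parameters: Map[String, String]): OperationType =
    parameters.get(OPERATION).map(stringToOperationType).getOrElse(Upsert)
}
```

In this sketch a caller who wants strict inserts has to pass the option explicitly, for example `Map("kudu.operation" -> "insert")`; an empty map resolves to `Upsert`.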


http://gerrit.cloudera.org:8080/#/c/12087/2/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala@466
PS2, Line 466: batchId: Long
> May be obvious, but mind adding a small note on why we shouldn't use this? 
Done


http://gerrit.cloudera.org:8080/#/c/12087/2/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala@466
PS2, Line 466: batchId: Long
> Yeah, a comment would be nice. I'm assuming this is for de-duplication in t
As Mike said, the batchId is provided by Spark so that a sink can de-duplicate in the case of retries. Kudu doesn't have a way to leverage it currently. Today we use upsert, and in the future we could use insert ignore.
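
For contrast, a toy sketch of what batchId-based de-duplication could look like in a sink that is able to record the last batch it committed. `DedupingToySink` is a hypothetical class and takes a `Seq[String]` where Spark's `Sink.addBatch` takes a `DataFrame`; it is not how KuduSink works, since per this thread Kudu has no way to track batch IDs.

```scala
import scala.collection.mutable

// Toy sink that remembers the last committed batch and skips replays.
// NOT the Kudu sink: it only illustrates what batchId makes possible.
final class DedupingToySink {
  private var lastCommitted: Long = -1L
  private val buffer = mutable.ArrayBuffer.empty[String]

  // Same parameter order as Spark's Sink.addBatch(batchId, data),
  // with a plain Seq standing in for the DataFrame.
  def addBatch(batchId: Long, rows: Seq[String]): Unit =
    if (batchId > lastCommitted) {
      buffer ++= rows
      lastCommitted = batchId
    } // else: batch was already applied, so the replay is dropped

  def contents: Seq[String] = buffer.toList
}
```

A real implementation would also need to persist the last committed ID atomically with the data, which is the part a store without batch tracking cannot provide.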




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I731e35f82c8cca7d911e4d879aa6853112132b17
Gerrit-Change-Number: 12087
Gerrit-PatchSet: 2
Gerrit-Owner: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Comment-Date: Wed, 09 Jan 2019 20:50:35 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2640: Add Spark Structured Streaming Sink

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12087 )

Change subject: KUDU-2640: Add Spark Structured Streaming Sink
......................................................................


Patch Set 4:

lgtm but I'll let Andrew take another look for the +2 if he wants to



Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I731e35f82c8cca7d911e4d879aa6853112132b17
Gerrit-Change-Number: 12087
Gerrit-PatchSet: 4
Gerrit-Owner: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Comment-Date: Wed, 09 Jan 2019 22:04:27 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2640: Add Spark Structured Streaming Sink

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/12087 )

Change subject: KUDU-2640: Add Spark Structured Streaming Sink
......................................................................


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/12087/2/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
File java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala:

http://gerrit.cloudera.org:8080/#/c/12087/2/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala@216
PS2, Line 216:   private def getOperationType(parameters: Map[String, String]): OperationType = {
             :     parameters.get(OPERATION).map(stringToOperationType).getOrElse(Upsert)
             :   }
Hrm, I get why this is the case for KuduSink, but should it be the case for the source in general? Seems like it might make misconfiguring and subsequently incorrectly upserting rows pretty easy. Could we just add a default arg to KuduSink instead?


http://gerrit.cloudera.org:8080/#/c/12087/2/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala@466
PS2, Line 466: batchId: Long
May be obvious, but mind adding a small note on why we shouldn't use this? E.g. what it's used for in Spark and why we don't care.




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I731e35f82c8cca7d911e4d879aa6853112132b17
Gerrit-Change-Number: 12087
Gerrit-PatchSet: 2
Gerrit-Owner: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Comment-Date: Wed, 02 Jan 2019 18:25:47 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2640: Add Spark Structured Streaming Sink

Posted by "Grant Henke (Code Review)" <ge...@cloudera.org>.
Grant Henke has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/12087 )

Change subject: KUDU-2640: Add Spark Structured Streaming Sink
......................................................................

KUDU-2640: Add Spark Structured Streaming Sink

This patch adds a KuduSink and implements the
Spark StreamSinkProvider interface to support
structured streaming writes to Kudu.

These writes behave the same as writes performed
via the InsertableRelation interface.

Note: The StreamSinkProvider interface is
marked as experimental and unstable, but it has existed
since Spark 2.0.0 and is used by the Kafka integration.
Additionally, per SPARK-26415, there will be no
more Spark 2.x minor releases and Spark would
not break this API in a maintenance release. This
means the interface is effectively stable for all
of Spark 2.

Change-Id: I731e35f82c8cca7d911e4d879aa6853112132b17
Reviewed-on: http://gerrit.cloudera.org:8080/12087
Tested-by: Kudu Jenkins
Reviewed-by: Andrew Wong <aw...@cloudera.com>
---
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/StreamingTest.scala
2 files changed, 70 insertions(+), 37 deletions(-)

Approvals:
  Kudu Jenkins: Verified
  Andrew Wong: Looks good to me, approved


Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I731e35f82c8cca7d911e4d879aa6853112132b17
Gerrit-Change-Number: 12087
Gerrit-PatchSet: 6
Gerrit-Owner: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>

[kudu-CR] KUDU-2640: Add Spark Structured Streaming Sink

Posted by "Grant Henke (Code Review)" <ge...@cloudera.org>.
Hello Mike Percy, Kudu Jenkins, Andrew Wong, Hao Hao, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12087

to look at the new patch set (#3).

Change subject: KUDU-2640: Add Spark Structured Streaming Sink
......................................................................

KUDU-2640: Add Spark Structured Streaming Sink

This patch adds a KuduSink and implements the
Spark StreamSinkProvider interface to support
structured streaming writes to Kudu.

These writes behave the same as writes performed
via the InsertableRelation interface.

Note: The StreamSinkProvider interface is
marked as experimental and unstable, but it has existed
since Spark 2.0.0 and is used by the Kafka integration.
Additionally, per SPARK-26415, there will be no
more Spark 2.x minor releases and Spark would
not break this API in a maintenance release. This
means the interface is effectively stable for all
of Spark 2.

Change-Id: I731e35f82c8cca7d911e4d879aa6853112132b17
---
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/StreamingTest.scala
2 files changed, 72 insertions(+), 37 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/87/12087/3

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I731e35f82c8cca7d911e4d879aa6853112132b17
Gerrit-Change-Number: 12087
Gerrit-PatchSet: 3
Gerrit-Owner: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>

[kudu-CR] KUDU-2640: Add Spark Structured Streaming Sink

Posted by "Grant Henke (Code Review)" <ge...@cloudera.org>.
Grant Henke has posted comments on this change. ( http://gerrit.cloudera.org:8080/12087 )

Change subject: KUDU-2640: Add Spark Structured Streaming Sink
......................................................................


Patch Set 1:

FYI I have a pull request to the Spark project to mark the interface as Stable:
https://github.com/apache/spark/pull/23354

Once that gets merged we can safely merge this.



Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I731e35f82c8cca7d911e4d879aa6853112132b17
Gerrit-Change-Number: 12087
Gerrit-PatchSet: 1
Gerrit-Owner: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Wed, 19 Dec 2018 21:07:12 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2640: Add Spark Structured Streaming Sink

Posted by "Grant Henke (Code Review)" <ge...@cloudera.org>.
Grant Henke has posted comments on this change. ( http://gerrit.cloudera.org:8080/12087 )

Change subject: KUDU-2640: Add Spark Structured Streaming Sink
......................................................................


Patch Set 4: Verified+1

Unrelated test failure.



Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I731e35f82c8cca7d911e4d879aa6853112132b17
Gerrit-Change-Number: 12087
Gerrit-PatchSet: 4
Gerrit-Owner: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Comment-Date: Wed, 09 Jan 2019 22:39:37 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2640: Add Spark Structured Streaming Sink

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12087 )

Change subject: KUDU-2640: Add Spark Structured Streaming Sink
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12087/2/java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/StreamingTest.scala
File java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/StreamingTest.scala:

http://gerrit.cloudera.org:8080/#/c/12087/2/java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/StreamingTest.scala@38
PS2, Line 38:   def testKuduContextWithSparkStreaming() {
> seems to be missing a structured streaming test using a streaming SQL query
I think I misunderstood the purpose of this patch, feel free to ignore this comment.




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I731e35f82c8cca7d911e4d879aa6853112132b17
Gerrit-Change-Number: 12087
Gerrit-PatchSet: 2
Gerrit-Owner: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Comment-Date: Wed, 09 Jan 2019 20:49:40 +0000
Gerrit-HasComments: Yes