You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "yangz (Code Review)" <ge...@cloudera.org> on 2019/01/31 07:04:32 UTC

[kudu-CR] KUDU-2670: split more scanner and add concurrent

yangz has uploaded this change for review. ( http://gerrit.cloudera.org:8080/12323


Change subject: KUDU-2670: split more scanner and add concurrent
......................................................................

KUDU-2670: split more scanner and add concurrent

1. Add a concurrent scanner, can scan with more scanner for a table concurrent.

2. Add split scanner for java client, and used by spark.

Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AbstractKuduScannerBuilder.java
A java/kudu-client/src/main/java/org/apache/kudu/client/ConcurrentKuduScanner.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KeyEncoder.java
A java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduClient.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanner.java
M java/kudu-client/src/main/java/org/apache/kudu/client/PartitionPruner.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java
A java/kudu-client/src/test/java/org/apache/kudu/client/TestConcurrentScanner.java
A java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRangeRequest.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala
A java/kudu-test-utils/src/main/java/org/apache/kudu/test/SplitTestUtil.java
14 files changed, 939 insertions(+), 18 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/12323/1
-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 1
Gerrit-Owner: yangz <zh...@gmail.com>

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Yao Xu (Code Review)" <ge...@cloudera.org>.
Yao Xu has uploaded a new patch set (#24) to the change originally created by yangz. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................

KUDU-2670: Part 1: Build ScanToken by KeyRange

When reading data in a kudu table using spark,
if there is a large amount of data in the tablet,
reading the data takes a long time. The reason
is that KuduRDD uses a tablet to generate the
scanToken, so a spark task needs to process all
the data in a tablet.

We send SplitKeyRange RPC to TServer, split tablet's
primary key range into multiple primary key ranges
by size, and generate the scanToken by primary key ranges.

Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
A java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java
A java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRange.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/KuduTestSuite.scala
M java/kudu-test-utils/src/main/java/org/apache/kudu/test/ClientTestUtil.java
13 files changed, 793 insertions(+), 15 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/12323/24
-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 24
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Yao Xu (Code Review)" <ge...@cloudera.org>.
Yao Xu has posted comments on this change. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................


Patch Set 28:

I found that DefaultSourceTest.testScanWithKeyRange has a probability of failure, I am working on Debugging.


-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 28
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>
Gerrit-Comment-Date: Fri, 17 May 2019 06:45:50 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2670: split more scanner and add concurrent

Posted by "Yao Xu (Code Review)" <ge...@cloudera.org>.
Yao Xu has posted comments on this change. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: split more scanner and add concurrent
......................................................................


Patch Set 2:

Hi, grant. I will continue this work.


-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 2
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>
Gerrit-Comment-Date: Fri, 01 Mar 2019 04:39:50 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................


Patch Set 33: Verified+1

Overriding Jenkins, the one failure here was unrelated to the new code.


-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 33
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>
Gerrit-Comment-Date: Fri, 24 May 2019 05:15:52 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Yao Xu (Code Review)" <ge...@cloudera.org>.
Yao Xu has posted comments on this change. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................


Patch Set 10:

(22 comments)

Hi, guys.
I refactored part of the code for this patch (unittest and Client API) and supplement this part of the code document.

http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
File java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java:

http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@1204
PS6, Line 1204:     if (tablet == null) {
> Nit: Line too long.
Done


http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@1209
PS6, Line 1209:     final ServerInfo info = tablet.getReplicaSelectedServerInfo(keepAliveRequest.getReplicaSelection(),
> The wording here suggests that this value is advisory (i.e. the server migh
Done


http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@1211
PS6, Line 1211:     if (info == null) {
> Nit: "a {@code Deferred} object which yields the list of key ranges."
Done


http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@1213
PS6, Line 1213:     }
> Is there any use case for applications to call this API directly? If so, sh
Done


http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java
File java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java:

http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java@23
PS6, Line 23:  * Class used to represent primary key range in tablet.
> Should document that startPrimaryKey and endPrimaryKey represent encoded pr
Done


http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java
File java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java:

http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java@304
PS6, Line 304:       return this;
> Again, careful when referencing "split tablet"; it makes it sound as if the
Done


http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java@305
PS6, Line 305:     }
> If unset or <= 0 is more accurate I suppose.
Done


http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java@408
PS6, Line 408:           if (keyRanges.isEmpty()) {
> Nit: wrap
Done


http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java@428
PS6, Line 428:           }
> private
Done


http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java@437
PS6, Line 437:       } catch (Exception e) {
> I think we need to use a DedlineTracker to be sure the timeout represents t
Done


http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java
File java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java:

http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java@33
PS6, Line 33:  * RPC to split a tablet's primary key range into smaller ranges suitable for concurrent scanning.
> One of the confusing aspects of this feature is that tablets aren't actuall
Done


http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java@36
PS6, Line 36: class SplitKeyRangeRequest extends KuduRpc<SplitKeyRangeResponse> {
> I don't think this needs to be final since it's internal preventing extensi
Done


http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java@38
PS6, Line 38:   private final byte[] startPrimaryKey;
            :   private final byte[] endPrimaryKey;
            :   private final byte[] partitionKey;
            :   private final long splitSizeBytes;
> Document the meaning of these fields.
Done


http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java@43
PS6, Line 43:   /**
> Nit: unnecessary empty line here.
Done


http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRangeRequest.java
File java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRangeRequest.java:

http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRangeRequest.java@19
PS6, Line 19: 
> I don't mind a test of the request/response. More coverage is a great thing
Done


http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRangeRequest.java@34
PS6, Line 34: 
> Why do these flags need to be set?
Done


http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/test/java/org/apache/kudu/util/SplitTestUtil.java
File java/kudu-client/src/test/java/org/apache/kudu/util/SplitTestUtil.java:

PS6: 
> License header.
Done


http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/test/java/org/apache/kudu/util/SplitTestUtil.java@16
PS6, Line 16: 
> Could this be moved into the ClientTestUtil class for things that are gener
Done


http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/test/java/org/apache/kudu/util/SplitTestUtil.java@104
PS6, Line 104: 
> for (int i = 1; i < numbers.size() - 1; i += 2) {
Done


http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
File java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala:

PS6: 
> Could you write a spark test that uses this new feature?
Done


http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala
File java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala:

http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala@37
PS6, Line 37:  * @param splitSizeBytes Sets the target number of bytes per spark task. If set, tablet's
> Agreed with Grant, though we need to be careful when we talk about "splitti
Done


http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala@37
PS6, Line 37:  * @param splitSizeBytes Sets the target number of bytes per spark task. If set, tablet's
> Perhaps make these docs a bit more spark centric. Something along the lines
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 10
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>
Gerrit-Comment-Date: Sat, 23 Mar 2019 10:37:47 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Grant Henke (Code Review)" <ge...@cloudera.org>.
Grant Henke has posted comments on this change. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................


Patch Set 30:

This patch looks good to me, I think we just need to understand the test failures.


-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 30
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>
Gerrit-Comment-Date: Tue, 21 May 2019 15:17:23 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................


Patch Set 32:

> The reason for the problem should be that KeyRange is duplicate. 
> In the following log, the tablet(79e027fcb6bd4f67b85028ab7d08352c) primary key range is duplicate.
> 
> 13:07:31.784 [DEBUG - New I/O worker #158] (AsyncKuduClient.java:2039) Add key range [<start>, <end>), 40603, "79e027fcb6bd4f67b85028ab7d08352c" [0x80000032, <end>)
> 13:07:31.784 [DEBUG - New I/O worker #157] (AsyncKuduClient.java:2039) Add key range [<start>, <end>), 35286, "453294f2eb9341cfb4e59964404fc4e0" [<start>, 0x80000032)
> 13:07:31.785 [DEBUG - New I/O worker #158] (AsyncKuduClient.java:2039) Add key range [<start>, <end>), 40603, "79e027fcb6bd4f67b85028ab7d08352c" [0x80000032, <end>)

That sounds like it might be an interaction between split key range requests and the DuplicatingRowSet, a special kind of rowset that is used during a MRS flush. See Tablet::DoMergeCompactionOrFlush for details.

In any case, the problem is likely to be server-side, so I suggest you add some debugging to Tablet and RowSetInfo to figure out what's going on.


-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 32
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>
Gerrit-Comment-Date: Wed, 22 May 2019 17:15:55 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by splitSizeBytes

Posted by "Yao Xu (Code Review)" <ge...@cloudera.org>.
Yao Xu has uploaded a new patch set (#7) to the change originally created by yangz. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by splitSizeBytes
......................................................................

KUDU-2670: Part 1: Build ScanToken by splitSizeBytes

When reading data in a kudu table using spark,
if there is a large amount of data in the tablet,
reading the data takes a long time. The reason
is that KuduRDD uses a tablet to generate the
scanToken, so a spark task needs to process all
the data in a tablet.

We send SplitKeyRange RPC to TServer, split tablet
into multiple primary key ranges by size. And generate
the scanToken by tablet's key ranges.

Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
A java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java
A java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRange.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/KuduTestSuite.scala
M java/kudu-test-utils/src/main/java/org/apache/kudu/test/ClientTestUtil.java
13 files changed, 876 insertions(+), 19 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/12323/7
-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 7
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Yao Xu (Code Review)" <ge...@cloudera.org>.
Yao Xu has uploaded a new patch set (#15) to the change originally created by yangz. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................

KUDU-2670: Part 1: Build ScanToken by KeyRange

When reading data in a kudu table using spark,
if there is a large amount of data in the tablet,
reading the data takes a long time. The reason
is that KuduRDD uses a tablet to generate the
scanToken, so a spark task needs to process all
the data in a tablet.

We send SplitKeyRange RPC to TServer, split tablet's
primary key range into multiple primary key ranges
by size, and generate the scanToken by primary key ranges.

Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
A java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java
A java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRange.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/KuduTestSuite.scala
M java/kudu-test-utils/src/main/java/org/apache/kudu/test/ClientTestUtil.java
13 files changed, 787 insertions(+), 15 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/12323/15
-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 15
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by splitSizeBytes

Posted by "Yao Xu (Code Review)" <ge...@cloudera.org>.
Yao Xu has uploaded a new patch set (#4) to the change originally created by yangz. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by splitSizeBytes
......................................................................

KUDU-2670: Part 1: Build ScanToken by splitSizeBytes

When reading data in a kudu table using spark,
if there is a large amount of data in the tablet,
reading the data takes a long time. The reason
is that KuduRDD uses a tablet to generate the
scanToken, so a spark task needs to process all
the data in a tablet.

We send SplitKeyRange RPC to TServer, split tablet
into multiple primary key ranges by size. And generate
the scanToken by tablet's key ranges.

Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AbstractKuduScannerBuilder.java
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
A java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java
A java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRangeRequest.java
A java/kudu-client/src/test/java/org/apache/kudu/util/SplitTestUtil.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala
11 files changed, 567 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/12323/4
-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 4
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Yao Xu (Code Review)" <ge...@cloudera.org>.
Yao Xu has uploaded a new patch set (#33) to the change originally created by yangz. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................

KUDU-2670: Part 1: Build ScanToken by KeyRange

When reading data in a kudu table using spark,
if there is a large amount of data in the tablet,
reading the data takes a long time. The reason
is that KuduRDD uses a tablet to generate the
scanToken, so a spark task needs to process all
the data in a tablet.

We send SplitKeyRange RPC to TServer, split tablet's
primary key range into multiple primary key ranges
by size, and generate the scanToken by primary key ranges.

Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduScanner.java
A java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java
A java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRange.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/KuduTestSuite.scala
M java/kudu-test-utils/src/main/java/org/apache/kudu/test/ClientTestUtil.java
13 files changed, 743 insertions(+), 16 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/12323/33
-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 33
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Yao Xu (Code Review)" <ge...@cloudera.org>.
Yao Xu has uploaded a new patch set (#27) to the change originally created by yangz. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................

KUDU-2670: Part 1: Build ScanToken by KeyRange

When reading data in a kudu table using spark,
if there is a large amount of data in the tablet,
reading the data takes a long time. The reason
is that KuduRDD uses a tablet to generate the
scanToken, so a spark task needs to process all
the data in a tablet.

We send SplitKeyRange RPC to TServer, split tablet's
primary key range into multiple primary key ranges
by size, and generate the scanToken by primary key ranges.

Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
A java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java
A java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRange.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/KuduTestSuite.scala
M java/kudu-test-utils/src/main/java/org/apache/kudu/test/ClientTestUtil.java
12 files changed, 727 insertions(+), 16 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/12323/27
-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 27
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>

[kudu-CR] KUDU-2670: split more scanner and add concurrent

Posted by "Grant Henke (Code Review)" <ge...@cloudera.org>.
Grant Henke has posted comments on this change. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: split more scanner and add concurrent
......................................................................


Patch Set 2:

(15 comments)

Thank you for this contribution. I did a quick pass and added some high level feedback.

I think it would be useful to break this into 2 patches. One for the tablet splitting and another for the concurrent scanner.

http://gerrit.cloudera.org:8080/#/c/12323/2/java/kudu-client/src/main/java/org/apache/kudu/client/AbstractKuduScannerBuilder.java
File java/kudu-client/src/main/java/org/apache/kudu/client/AbstractKuduScannerBuilder.java:

http://gerrit.cloudera.org:8080/#/c/12323/2/java/kudu-client/src/main/java/org/apache/kudu/client/AbstractKuduScannerBuilder.java@341
PS2, Line 341:   public S lowerBoundPartitionKeyRaw(byte[] partitionKey) {
I don't think these need to be public.


http://gerrit.cloudera.org:8080/#/c/12323/2/java/kudu-client/src/main/java/org/apache/kudu/client/ConcurrentKuduScanner.java
File java/kudu-client/src/main/java/org/apache/kudu/client/ConcurrentKuduScanner.java:

http://gerrit.cloudera.org:8080/#/c/12323/2/java/kudu-client/src/main/java/org/apache/kudu/client/ConcurrentKuduScanner.java@32
PS2, Line 32: public class ConcurrentKuduScanner extends KuduScanner {
Can you add javadoc? An example of when and how this should be used would be helpful.

Does this need to extend KuduScanner? Given it wrap/contains many KuduScanners that doesn't seem right.


http://gerrit.cloudera.org:8080/#/c/12323/2/java/kudu-client/src/main/java/org/apache/kudu/client/ConcurrentKuduScanner.java@46
PS2, Line 46:   ConcurrentKuduScanner(List<KuduScanner> scanners,
Naming it Concurrent makes it sound like this class is thread safe. Instead it looks like this implementation wraps multiple scanners. Maybe we could name it MultiKuduScanner?


http://gerrit.cloudera.org:8080/#/c/12323/2/java/kudu-client/src/main/java/org/apache/kudu/client/ConcurrentKuduScanner.java@52
PS2, Line 52:     this.queue = new LinkedBlockingQueue<>(threadNumer * 4);
Why multiply by 4?


http://gerrit.cloudera.org:8080/#/c/12323/2/java/kudu-client/src/main/java/org/apache/kudu/client/ConcurrentKuduScanner.java@78
PS2, Line 78:               log.error("scanner error", ex);
Maybe info or debug log level?


http://gerrit.cloudera.org:8080/#/c/12323/2/java/kudu-client/src/main/java/org/apache/kudu/client/KeyEncoder.java
File java/kudu-client/src/main/java/org/apache/kudu/client/KeyEncoder.java:

http://gerrit.cloudera.org:8080/#/c/12323/2/java/kudu-client/src/main/java/org/apache/kudu/client/KeyEncoder.java@46
PS2, Line 46: public class KeyEncoder {
I don't think this needs to be public.


http://gerrit.cloudera.org:8080/#/c/12323/2/java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java
File java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java:

http://gerrit.cloudera.org:8080/#/c/12323/2/java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java@20
PS2, Line 20: public class KeyRange {
Could you add java doc?

Does this need to be public?


http://gerrit.cloudera.org:8080/#/c/12323/2/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java
File java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java:

http://gerrit.cloudera.org:8080/#/c/12323/2/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java@286
PS2, Line 286:     private long targetChunkSize = -1;
Could this name indicate that it will cause a tablet to be split? It would be good to also include the unit in the name. 
ex: splitSizeBytes


http://gerrit.cloudera.org:8080/#/c/12323/2/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java@420
PS2, Line 420:     static class DefaultTokenBuilder implements TokenBuilder {
Since this is private API we can use methods to separate this logic instead of a builder.


http://gerrit.cloudera.org:8080/#/c/12323/2/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java@460
PS2, Line 460:           SplitKeyRangeRequest request = new SplitKeyRangeRequest(table, startPrimaryKey,
I think the overall timeout needs to be considered here.


http://gerrit.cloudera.org:8080/#/c/12323/2/java/kudu-client/src/main/java/org/apache/kudu/client/PartitionPruner.java
File java/kudu-client/src/main/java/org/apache/kudu/client/PartitionPruner.java:

http://gerrit.cloudera.org:8080/#/c/12323/2/java/kudu-client/src/main/java/org/apache/kudu/client/PartitionPruner.java@a247
PS2, Line 247: 
Why was this removed?


http://gerrit.cloudera.org:8080/#/c/12323/2/java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java
File java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java:

http://gerrit.cloudera.org:8080/#/c/12323/2/java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java@30
PS2, Line 30: final class SplitKeyRangeRequest extends KuduRpc<SplitKeyRangeResponse> {
Can you add javadoc?


http://gerrit.cloudera.org:8080/#/c/12323/2/java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java
File java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java:

http://gerrit.cloudera.org:8080/#/c/12323/2/java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java@26
PS2, Line 26:    * Constructor with information common to all RPCs.
Think this is a copy paste from KuduRpcResponse. Can you update the javadoc.


http://gerrit.cloudera.org:8080/#/c/12323/2/java/kudu-client/src/test/java/org/apache/kudu/client/TestConcurrentScanner.java
File java/kudu-client/src/test/java/org/apache/kudu/client/TestConcurrentScanner.java:

http://gerrit.cloudera.org:8080/#/c/12323/2/java/kudu-client/src/test/java/org/apache/kudu/client/TestConcurrentScanner.java@20
PS2, Line 20:   
nit: All these red blocks show extra spaces.


http://gerrit.cloudera.org:8080/#/c/12323/2/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala
File java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala:

http://gerrit.cloudera.org:8080/#/c/12323/2/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala@51
PS2, Line 51:   val targetChunkSize = sc.getConf.getLong("spark.kudu.scanner.targetChunkSize", 1024 * 1024 * 512L)
This configuration can be added to KuduReadOptions and set in DefaultSource like "kudu.scanLocality".



-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 2
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Fri, 08 Feb 2019 22:11:25 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Yao Xu (Code Review)" <ge...@cloudera.org>.
Yao Xu has uploaded a new patch set (#9) to the change originally created by yangz. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................

KUDU-2670: Part 1: Build ScanToken by KeyRange

When reading data in a kudu table using spark,
if there is a large amount of data in the tablet,
reading the data takes a long time. The reason
is that KuduRDD uses a tablet to generate the
scanToken, so a spark task needs to process all
the data in a tablet.

We send SplitKeyRange RPC to TServer, split tablet's
primary key range into multiple primary key ranges
by size. And generate the scanToken by primary key ranges.

Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
A java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java
A java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRange.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/KuduTestSuite.scala
M java/kudu-test-utils/src/main/java/org/apache/kudu/test/ClientTestUtil.java
13 files changed, 877 insertions(+), 16 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/12323/9
-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 9
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Yao Xu (Code Review)" <ge...@cloudera.org>.
Yao Xu has uploaded a new patch set (#17) to the change originally created by yangz. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................

KUDU-2670: Part 1: Build ScanToken by KeyRange

When reading data in a kudu table using spark,
if there is a large amount of data in the tablet,
reading the data takes a long time. The reason
is that KuduRDD uses a tablet to generate the
scanToken, so a spark task needs to process all
the data in a tablet.

We send SplitKeyRange RPC to TServer, split tablet's
primary key range into multiple primary key ranges
by size, and generate the scanToken by primary key ranges.

Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
A java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java
A java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRange.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/KuduTestSuite.scala
M java/kudu-test-utils/src/main/java/org/apache/kudu/test/ClientTestUtil.java
13 files changed, 787 insertions(+), 15 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/12323/17
-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 17
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has removed a vote on this change.

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................


Removed Verified-1 by Kudu Jenkins (120)
-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: deleteVote
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 33
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Yao Xu (Code Review)" <ge...@cloudera.org>.
Yao Xu has uploaded a new patch set (#20) to the change originally created by yangz. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................

KUDU-2670: Part 1: Build ScanToken by KeyRange

When reading data in a kudu table using spark,
if there is a large amount of data in the tablet,
reading the data takes a long time. The reason
is that KuduRDD uses a tablet to generate the
scanToken, so a spark task needs to process all
the data in a tablet.

We send SplitKeyRange RPC to TServer, split tablet's
primary key range into multiple primary key ranges
by size, and generate the scanToken by primary key ranges.

Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
A java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java
A java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRange.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/KuduTestSuite.scala
M java/kudu-test-utils/src/main/java/org/apache/kudu/test/ClientTestUtil.java
13 files changed, 787 insertions(+), 15 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/12323/20
-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 20
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Yao Xu (Code Review)" <ge...@cloudera.org>.
Yao Xu has uploaded a new patch set (#19) to the change originally created by yangz. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................

KUDU-2670: Part 1: Build ScanToken by KeyRange

When reading data in a kudu table using spark,
if there is a large amount of data in the tablet,
reading the data takes a long time. The reason
is that KuduRDD uses a tablet to generate the
scanToken, so a spark task needs to process all
the data in a tablet.

We send SplitKeyRange RPC to TServer, split tablet's
primary key range into multiple primary key ranges
by size, and generate the scanToken by primary key ranges.

Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
A java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java
A java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRange.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/KuduTestSuite.scala
M java/kudu-test-utils/src/main/java/org/apache/kudu/test/ClientTestUtil.java
13 files changed, 787 insertions(+), 15 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/12323/19
-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 19
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by splitSizeBytes

Posted by "Yao Xu (Code Review)" <ge...@cloudera.org>.
Yao Xu has uploaded a new patch set (#3) to the change originally created by yangz. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by splitSizeBytes
......................................................................

KUDU-2670: Part 1: Build ScanToken by splitSizeBytes

When reading data in a kudu table using spark,
if there is a large amount of data in the tablet,
reading the data takes a long time. The reason
is that KuduRDD uses a tablet to generate the
scanToken, so a spark task needs to process all
the data in a tablet.

We send SplitKeyRange RPC to TServer, split tablet
into multiple primary key ranges by size. And generate
the scanToken by tablet's key ranges.

Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AbstractKuduScannerBuilder.java
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
A java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java
A java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRangeRequest.java
A java/kudu-client/src/test/java/org/apache/kudu/util/SplitTestUtil.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala
11 files changed, 567 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/12323/3
-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 3
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Yao Xu (Code Review)" <ge...@cloudera.org>.
Yao Xu has uploaded a new patch set (#13) to the change originally created by yangz. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................

KUDU-2670: Part 1: Build ScanToken by KeyRange

When reading data in a kudu table using spark,
if there is a large amount of data in the tablet,
reading the data takes a long time. The reason
is that KuduRDD uses a tablet to generate the
scanToken, so a spark task needs to process all
the data in a tablet.

We send SplitKeyRange RPC to TServer, split tablet's
primary key range into multiple primary key ranges
by size, and generate the scanToken by primary key ranges.

Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
A java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java
A java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRange.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/KuduTestSuite.scala
M java/kudu-test-utils/src/main/java/org/apache/kudu/test/ClientTestUtil.java
13 files changed, 793 insertions(+), 15 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/12323/13
-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 13
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Yao Xu (Code Review)" <ge...@cloudera.org>.
Yao Xu has uploaded a new patch set (#14) to the change originally created by yangz. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................

KUDU-2670: Part 1: Build ScanToken by KeyRange

When reading data in a kudu table using spark,
if there is a large amount of data in the tablet,
reading the data takes a long time. The reason
is that KuduRDD uses a tablet to generate the
scanToken, so a spark task needs to process all
the data in a tablet.

We send SplitKeyRange RPC to TServer, split tablet's
primary key range into multiple primary key ranges
by size, and generate the scanToken by primary key ranges.

Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
A java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java
A java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRange.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/KuduTestSuite.scala
M java/kudu-test-utils/src/main/java/org/apache/kudu/test/ClientTestUtil.java
13 files changed, 787 insertions(+), 15 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/12323/14
-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 14
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................


Patch Set 20:

(18 comments)

http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
File java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java:

http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@1942
PS20, Line 1942:    * @param startPrimaryKey start splitting at the beginning of the tablet
               :    * @param endPrimaryKey stop splitting at the end of the tablet
Same.


http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@1978
PS20, Line 1978:    * @param startPrimaryKey start splitting at the beginning of the tablet
               :    * @param endPrimaryKey stop splitting at the end of the tablet
See my other comments about this in KuduTable.


http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@1990
PS20, Line 1990:    * @throws Exception MasterErrorException if the table doesn't exist
Not your fault, but MasterErrorException hasn't existed since commit 0a792366e. Could you update this (and wherever you copied this from) to reflect that?


http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@2015
PS20, Line 2015:    *
Nit: unnecessary empty line here.


http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@2016
PS20, Line 2016:    * @param startPrimaryKey start splitting at the beginning of the tablet
               :    * @param endPrimaryKey stop splitting at the end of the tablet
Same.


http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@2080
PS20, Line 2080:                 }).join());
The join here means that this isn't an async call, and will likely muck with the internal machinery of the Kudu client. It should be possible to avoid it. Logically you're kicking off several independent asynchronous operations (getTableKeyRanges), converting each response into a list of KeyRanges, then combining the lists into one and returning that to the client. Deferred.group might be able to help here.

You may also find it helpful to declare a new Deferred locally, capture it in one of these lambdas, and signal it only when the entire operation is done. See ConnectToCluster for an example of that pattern.


http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java
File java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java:

http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java@35
PS20, Line 35:    * @param primaryKeyEnd the encoded priamry key wheere to stop in the key range (exclusive)
primary


http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java@38
PS20, Line 38:   public KeyRange(LocatedTablet tablet,
Doc 'tablet'. Based on toString() it looks like it could be null; when does that happen and what does it mean?


http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java
File java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java:

http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java@312
PS20, Line 312: It was used split tablet's 
It is used to split the tablet's


http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java@416
PS20, Line 416: keyRanges
Shouldn't this be newKeyRanges?

Seems like a pretty serious bug (L419 could throw an IndexOutOfRangeException); why didn't any unit tests catch this?


http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java@435
PS20, Line 435: keyRange.getPrimaryKeyStart()
Could use primaryKeyStart here (local declared on L432).

Same for primaryKeyEnd below.


http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java
File java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java:

http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java@226
PS20, Line 226: range
ranges


http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java@230
PS20, Line 230:    * @param primaryKeyStart start splitting at the beginning of the tablet
              :    * @param primaryKeyEnd stop splitting at the end of the tablet
This documentation is confusing: is this saying that if these are null, we will start at the beginning and/or stop at the end? If so, could you also add the meaning of the parameters when not null?


http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java
File java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java:

http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java@55
PS20, Line 55:    * @param timeoutMillis
Doc this one?


http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRange.java
File java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRange.java:

http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRange.java@20
PS20, Line 20: import static org.apache.kudu.test.ClientTestUtil.*;
Please expand this.


http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRange.java@121
PS20, Line 121:     // 3.2 Coverage, but no data. RPC return a key range for tablet's MRS.
If we flushed the MRS by sleeping on L59, how is this querying the MRS?


http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRange.java@123
PS20, Line 123:      primaryKeyEnd = KeyEncoder.encodePrimaryKey(partialRowEnd);
Nit: indentation.


http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
File java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala:

http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala@1043
PS20, Line 1043:     // Add a SparkListener to count the number of tasks that end.
               :     var actualNumTasks = 0
               :     val listener = new SparkListener {
               :       override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
               :         actualNumTasks += 1
               :       }
               :     }
               :     ss.sparkContext.addSparkListener(listener)
This has turned out to be a flaky test pattern. See https://gerrit.cloudera.org/c/13263/ for details and a workaround.



-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 20
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>
Gerrit-Comment-Date: Mon, 13 May 2019 18:55:43 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Yao Xu (Code Review)" <ge...@cloudera.org>.
Yao Xu has posted comments on this change. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................


Patch Set 23:

(18 comments)

http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
File java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java:

http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@1942
PS20, Line 1942:    * @param startPrimaryKey the primary key to begin splitting at (inclusive), pass null to
               :    *                        start splitting at the beginning of t
> Same.
Done


http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@1978
PS20, Line 1978:    * This method blocks until it gets all the key ranges.
               :    * @param table the table to get key ranges from
> See my other comments about this in KuduTable.
Done


http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@1990
PS20, Line 1990:    *                       larger or smaller than this value. If unset or <= 0, the key range
> Not your fault, but MasterErrorException hasn't existed since commit 0a7923
Done


http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@2015
PS20, Line 2015:    * Get all or some key range for a given table. This may query the master multiple times if there
> Nit: unnecessary empty line here.
Done


http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@2016
PS20, Line 2016:    * are a lot of tablets, and query each tablet to split the tablet's primary key range into
               :    * smaller ranges. This doesn't change the layout of the tablet
> Same.
Done


http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@2080
PS20, Line 2080:                   }
> The join here means that this isn't an async call, and will likely muck wit
This is a good idea, I have not used Deferred.group before. :)


http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java
File java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java:

http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java@35
PS20, Line 35:    * @param primaryKeyStart the encoded primary key where to start in the key range (inclusive)
> primary
Done


http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java@38
PS20, Line 38:    */
> Doc 'tablet'. Based on toString() it looks like it could be null; when does
It cannot be null, we add doc for this


http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java
File java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java:

http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java@312
PS20, Line 312: It is used to split tablet'
> It is used to split the tablet's
Done


http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java@416
PS20, Line 416: newKeyRan
> Shouldn't this be newKeyRanges?
 > 
 > Seems like a pretty serious bug (L419 could throw an
 > IndexOutOfRangeException); why didn't any unit tests catch this?
Yes, it should be use newKeyRanges. But the current implementation, newKeyRange will not be empty (getTabletsKeyRanges returns at least one).


http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java@435
PS20, Line 435: 
> Could use primaryKeyStart here (local declared on L432).
Done


http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java
File java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java:

http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java@226
PS20, Line 226: range
> ranges
Done


http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java@230
PS20, Line 230:    * @param primaryKeyStart the primary key to begin splitting at (inclusive), pass null to
              :    *                        start splitting at the beginning of t
> This documentation is confusing: is this saying that if these are null, we 
I'm so sorry, I misunderstood your previous comments.


http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java
File java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java:

http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java@55
PS20, Line 55:    * @param timeoutMillis the timeout of the request in milliseconds
> Doc this one?
Done


http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRange.java
File java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRange.java:

http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRange.java@20
PS20, Line 20: import static org.apache.kudu.test.ClientTestUtil.createTableWithOneThousandRows;
> Please expand this.
Done


http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRange.java@121
PS20, Line 121:     // TODO: Response should be return empty. But this does not affect scan result.
> If we flushed the MRS by sleeping on L59, how is this querying the
 > MRS?

The test case is to check if the DRS's data is not in [primaryKeyStart, primaryKeyEnd), the rpc must return a key range. Because the MRS's data may be in the  [primaryKeyStart, primaryKeyEnd). 

I think it would be better to add some doc.


http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRange.java@123
PS20, Line 123:     assertEquals(tablet1.toString(), keyRanges.get(0).getTablet().toString());
> Nit: indentation.
Done


http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
File java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala:

http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala@1043
PS20, Line 1043:       val t = "scanWithKeyRangeTest"
               :       sqlContext.read.options(kuduOptions).format("kudu").load.createOrReplaceTempView(t)
               :       val results = sqlContext.sql(s"SELECT * FROM $t").collectAsList()
               :       assert(results.size() == rowCount * 100)
               :     }
               :     assert(actualNumTasks >= 2)
               :   }
               : }
> This has turned out to be a flaky test pattern. See https://gerrit.cloudera
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 23
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>
Gerrit-Comment-Date: Tue, 14 May 2019 15:21:58 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by splitSizeBytes

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by splitSizeBytes
......................................................................


Patch Set 6:

(15 comments)

http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
File java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java:

http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@1208
PS6, Line 1208:    * @param partitionKey the partition key of the tablet to find
Not clear why this is necessary. As far as I can tell it's for proper retrying support of the RPC; could you clarify in the comment?


http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@1209
PS6, Line 1209:    * @param splitSizeBytes the number of bytes to try to return in each key range.
The wording here suggests that this value is advisory (i.e. the server might return smaller or larger ranges). If that's the case, could you explicitly state that?


http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@1211
PS6, Line 1211:    * @return a deferred object that indicates the completion of the request.
Nit: "a {@code Deferred} object which yields the list of key ranges."


http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@1213
PS6, Line 1213:   Deferred<SplitKeyRangeResponse> getTabletKeyRanges(final KuduTable table,
Is there any use case for applications to call this API directly? If so, should there be a synchronous equivalent in KuduClient?


http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java
File java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java:

http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java@23
PS6, Line 23:  * A KeyRange describes the range in Tablet.
Should document that startPrimaryKey and endPrimaryKey represent encoded primary keys.


http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java@31
PS6, Line 31:   public KeyRange(byte[] startPrimaryKey, byte[] endPrimaryKey, Long bytes) {
Not clear what 'bytes' means here.


http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java
File java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java:

http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java@304
PS6, Line 304:      * Sets a size of split tablet to use when building the list of scan tokens.
Again, careful when referencing "split tablet"; it makes it sound as if the tablet is being physically split.


http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java@408
PS6, Line 408:         return splitSizeBytes > 0 ? buildTokensBySplit(tablets, proto) : buildTokensByDefault(tablets, proto);
Nit: wrap


http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java@428
PS6, Line 428:     public List<KuduScanToken> buildTokensBySplit(List<LocatedTablet> tablets,
private


http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java
File java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java:

http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java@33
PS6, Line 33:  * RPC to split tablet into multiple primary key ranges by size.
One of the confusing aspects of this feature is that tablets aren't actually being "split" in the sense that after the RPC finishes, there are two tablets instead of one. We need to be careful when documenting that, to avoid sending the wrong message.

In this case, you could talk about how the tserver responds to the RPC by splitting a tablet's primary key range into smaller ranges suitable for concurrent scanning.


http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java@38
PS6, Line 38:   private final byte[] startPrimaryKey;
            :   private final byte[] endPrimaryKey;
            :   private final byte[] partitionKey;
            :   private final Long splitSizeBytes;
Document the meaning of these fields.


http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java@43
PS6, Line 43: 
Nit: unnecessary empty line here.


http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/test/java/org/apache/kudu/util/SplitTestUtil.java
File java/kudu-client/src/test/java/org/apache/kudu/util/SplitTestUtil.java:

PS6: 
License header.


http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/test/java/org/apache/kudu/util/SplitTestUtil.java@104
PS6, Line 104:     for (int i=1; i<numbers.size() - 1;i+=2){
for (int i = 1; i < numbers.size() - 1; i += 2) {


http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala
File java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala:

http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala@37
PS6, Line 37:  * @param splitSizeBytes Set the number of bytes used to split tablet.
> Perhaps make these docs a bit more spark centric. Something along the lines
Agreed with Grant, though we need to be careful when we talk about "splitting tablets"; we want to emphasize that the tablets _themselves_ are not being split. Only the range of primary keys to be scanned is being split.



-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 6
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>
Gerrit-Comment-Date: Tue, 05 Mar 2019 23:59:17 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Yao Xu (Code Review)" <ge...@cloudera.org>.
Yao Xu has uploaded a new patch set (#11) to the change originally created by yangz. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................

KUDU-2670: Part 1: Build ScanToken by KeyRange

When reading data in a kudu table using spark,
if there is a large amount of data in the tablet,
reading the data takes a long time. The reason
is that KuduRDD uses a tablet to generate the
scanToken, so a spark task needs to process all
the data in a tablet.

We send SplitKeyRange RPC to TServer, split tablet's
primary key range into multiple primary key ranges
by size, and generate the scanToken by primary key ranges.

Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
A java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java
A java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRange.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/KuduTestSuite.scala
M java/kudu-test-utils/src/main/java/org/apache/kudu/test/ClientTestUtil.java
13 files changed, 789 insertions(+), 14 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/12323/11
-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 11
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Yao Xu (Code Review)" <ge...@cloudera.org>.
Yao Xu has posted comments on this change. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................


Patch Set 33:

> > The reason for the problem should be that KeyRange is duplicate.
 > > In the following log, the tablet(79e027fcb6bd4f67b85028ab7d08352c)
 > primary key range is duplicate.
 > >
 > > 13:07:31.784 [DEBUG - New I/O worker #158] (AsyncKuduClient.java:2039)
 > Add key range [<start>, <end>), 40603, "79e027fcb6bd4f67b85028ab7d08352c"
 > [0x80000032, <end>)
 > > 13:07:31.784 [DEBUG - New I/O worker #157] (AsyncKuduClient.java:2039)
 > Add key range [<start>, <end>), 35286, "453294f2eb9341cfb4e59964404fc4e0"
 > [<start>, 0x80000032)
 > > 13:07:31.785 [DEBUG - New I/O worker #158] (AsyncKuduClient.java:2039)
 > Add key range [<start>, <end>), 40603, "79e027fcb6bd4f67b85028ab7d08352c"
 > [0x80000032, <end>)
 > 
 > That sounds like it might be an interaction between split key range
 > requests and the DuplicatingRowSet, a special kind of rowset that
 > is used during a MRS flush. See Tablet::DoMergeCompactionOrFlush
 > for details.
 > 
 > In any case, the problem is likely to be server-side, so I suggest
 > you add some debugging to Tablet and RowSetInfo to figure out
 > what's going on.

I found the root cause of this problem. Because Deferred.group no guarantee on the order of the results in the deferred list returned, but we rely on the return order when we call PartitionPruner.removePartitionKeyRange, This will cause some KeyRange to be created repeatedly. It's my fault. :( 

Thank you again for your help. :)
Btw, dist-test seems to be a good test tools. Is there any doc for it?


-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 33
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>
Gerrit-Comment-Date: Thu, 23 May 2019 12:01:25 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................

KUDU-2670: Part 1: Build ScanToken by KeyRange

When reading data in a kudu table using spark,
if there is a large amount of data in the tablet,
reading the data takes a long time. The reason
is that KuduRDD uses a tablet to generate the
scanToken, so a spark task needs to process all
the data in a tablet.

We send SplitKeyRange RPC to TServer, split tablet's
primary key range into multiple primary key ranges
by size, and generate the scanToken by primary key ranges.

Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Reviewed-on: http://gerrit.cloudera.org:8080/12323
Reviewed-by: Adar Dembo <ad...@cloudera.com>
Tested-by: Adar Dembo <ad...@cloudera.com>
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduScanner.java
A java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java
A java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRange.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/KuduTestSuite.scala
M java/kudu-test-utils/src/main/java/org/apache/kudu/test/ClientTestUtil.java
13 files changed, 743 insertions(+), 16 deletions(-)

Approvals:
  Adar Dembo: Looks good to me, approved; Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 34
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Yao Xu (Code Review)" <ge...@cloudera.org>.
Yao Xu has uploaded a new patch set (#22) to the change originally created by yangz. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................

KUDU-2670: Part 1: Build ScanToken by KeyRange

When reading data in a kudu table using spark,
if there is a large amount of data in the tablet,
reading the data takes a long time. The reason
is that KuduRDD uses a tablet to generate the
scanToken, so a spark task needs to process all
the data in a tablet.

We send SplitKeyRange RPC to TServer, split tablet's
primary key range into multiple primary key ranges
by size, and generate the scanToken by primary key ranges.

Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
A java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java
A java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRange.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/KuduTestSuite.scala
M java/kudu-test-utils/src/main/java/org/apache/kudu/test/ClientTestUtil.java
13 files changed, 797 insertions(+), 15 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/12323/22
-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 22
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................


Patch Set 23: Code-Review+1

(1 comment)

Grant, could you take another look?

http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java
File java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java:

http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java@38
PS20, Line 38:    */
> It cannot be null, we add doc for this
Instead of the doc, could you add a Preconditions.checkNotNull() call to the constructor? Also, please update toString() to not check that the tablet is null, since it's enforced in the constructor.



-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 23
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>
Gerrit-Comment-Date: Tue, 14 May 2019 22:54:28 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Grant Henke (Code Review)" <ge...@cloudera.org>.
Grant Henke has posted comments on this change. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................


Patch Set 24:

(7 comments)

Thanks for all the updates. It's looking really good. I have a few more small comments/questions.

http://gerrit.cloudera.org:8080/#/c/12323/24/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
File java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java:

http://gerrit.cloudera.org:8080/#/c/12323/24/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@1996
PS24, Line 1996:   List<KeyRange> syncGetTableKeyRanges(KuduTable table,
It's not common to have synchronous methods in the AsyncKuduClient. Given this is just a call to getTableKeyRanges with a following .join() call and it's only used once in KuduTable, maybe you could just call getTableKeyRanges(...).join() in KuduTable.


http://gerrit.cloudera.org:8080/#/c/12323/24/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java
File java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java:

http://gerrit.cloudera.org:8080/#/c/12323/24/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java@294
PS24, Line 294:     private long splitSizeBytes = -1;
Nit: define a constant for the special -1 value.


http://gerrit.cloudera.org:8080/#/c/12323/24/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java@408
PS24, Line 408:           List<KeyRange> newKeyRanges = table.getTabletsKeyRanges(
Should we still use the old style when splitSizeBytes = -1?


http://gerrit.cloudera.org:8080/#/c/12323/24/java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java
File java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java:

http://gerrit.cloudera.org:8080/#/c/12323/24/java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java@245
PS24, Line 245:   public List<KeyRange> getTabletsKeyRanges(byte[] primaryKeyStart,
I am not sure this should live here. Or at least is shouldn't be public given it returns a private KeyRange. You could call client.getTableKeyRanges(...).join() in the ScanToken code.


http://gerrit.cloudera.org:8080/#/c/12323/24/java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java
File java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java:

http://gerrit.cloudera.org:8080/#/c/12323/24/java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java@37
PS24, Line 37: 
Perhaps add the constant for the -1 value mentioned in KuduScanToken here.


http://gerrit.cloudera.org:8080/#/c/12323/24/java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java@109
PS24, Line 109:     SplitKeyRangeResponse response = new SplitKeyRangeResponse(
See my comment on SplitKeyRangeResponse.


http://gerrit.cloudera.org:8080/#/c/12323/24/java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java
File java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java:

http://gerrit.cloudera.org:8080/#/c/12323/24/java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java@33
PS24, Line 33:   SplitKeyRangeResponse(long elapsedMillis, String tsUUID, List<KeyRangePB> keyRanges) {
Can we translate KeyRangePB to KeyRange before passing into this constructor? This is usually done via methods added to ProtobufHelper and placed in a deserialize method on the request. See GetTableSchemaRequest.deserialize for an example.



-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 24
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>
Gerrit-Comment-Date: Thu, 16 May 2019 14:06:51 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Yao Xu (Code Review)" <ge...@cloudera.org>.
Yao Xu has posted comments on this change. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................


Patch Set 30:

Sorry, the problem has not been reproduced in local.
The error message should be that some tablets have been read twice. 
I added some debug logs to output KeyRange information.


-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 30
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>
Gerrit-Comment-Date: Fri, 17 May 2019 10:41:39 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by splitSizeBytes

Posted by "Yao Xu (Code Review)" <ge...@cloudera.org>.
Yao Xu has uploaded a new patch set (#5) to the change originally created by yangz. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by splitSizeBytes
......................................................................

KUDU-2670: Part 1: Build ScanToken by splitSizeBytes

When reading data in a kudu table using spark,
if there is a large amount of data in the tablet,
reading the data takes a long time. The reason
is that KuduRDD uses a tablet to generate the
scanToken, so a spark task needs to process all
the data in a tablet.

We send SplitKeyRange RPC to TServer, split tablet
into multiple primary key ranges by size. And generate
the scanToken by tablet's key ranges.

Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
---
D java/kudu-client/src/main/java/org/apache/kudu/client/AbstractKuduScannerBuilder.java
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
A java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java
A java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRangeRequest.java
A java/kudu-client/src/test/java/org/apache/kudu/util/SplitTestUtil.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala
11 files changed, 565 insertions(+), 369 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/12323/5
-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 5
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................


Patch Set 34:

BTW the source for dist-test is open if you want to run your own instantiation of it: https://github.com/cloudera/dist_test (and all the client code necessary to submit Kudu jobs to an instance is in the kudu repo). The only thing that's private are the credentials to the Cloudera-hosted farm of machines


-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 34
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>
Gerrit-Comment-Date: Fri, 24 May 2019 06:38:47 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Yao Xu (Code Review)" <ge...@cloudera.org>.
Yao Xu has posted comments on this change. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................


Patch Set 32:

(1 comment)

Thanks Adar for the test result. I added the information of the primary key range in AsyncKuduScanner.toString. The reason for the problem should be that KeyRange is duplicate. 
In the following log, the tablet(79e027fcb6bd4f67b85028ab7d08352c) primary key range is duplicate.

13:07:31.784 [DEBUG - New I/O worker #158] (AsyncKuduClient.java:2039) Add key range [<start>, <end>), 40603, "79e027fcb6bd4f67b85028ab7d08352c" [0x80000032, <end>)
13:07:31.784 [DEBUG - New I/O worker #157] (AsyncKuduClient.java:2039) Add key range [<start>, <end>), 35286, "453294f2eb9341cfb4e59964404fc4e0" [<start>, 0x80000032)
13:07:31.785 [DEBUG - New I/O worker #158] (AsyncKuduClient.java:2039) Add key range [<start>, <end>), 40603, "79e027fcb6bd4f67b85028ab7d08352c" [0x80000032, <end>)

http://gerrit.cloudera.org:8080/#/c/12323/30/java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
File java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala:

http://gerrit.cloudera.org:8080/#/c/12323/30/java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala@1046
PS30, Line 1046:       assertEquals(rowCount * 100, results.size())
> Could you use assertEquals here?
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 32
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>
Gerrit-Comment-Date: Wed, 22 May 2019 13:35:48 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................


Patch Set 10:

(15 comments)

http://gerrit.cloudera.org:8080/#/c/12323/10//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12323/10//COMMIT_MSG@18
PS10, Line 18: by size. And generate 
Nit: "by size, and generate..."


http://gerrit.cloudera.org:8080/#/c/12323/10/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
File java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java:

http://gerrit.cloudera.org:8080/#/c/12323/10/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@1937
PS10, Line 1937: for
Nit: replace with 'to'


http://gerrit.cloudera.org:8080/#/c/12323/10/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@1942
PS10, Line 1942:    *                        start at the beginning
Nit: replace with "start splitting at the beginning of the tablet"


http://gerrit.cloudera.org:8080/#/c/12323/10/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@1944
PS10, Line 1944:    *                      get all the key ranges
Nit: replace with "stop splitting at the end of the tablet"


http://gerrit.cloudera.org:8080/#/c/12323/10/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@1947
PS10, Line 1947: the size of key range
Nit: "a key range"


http://gerrit.cloudera.org:8080/#/c/12323/10/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@1949
PS10, Line 1949: a master lookup
Odd to see this here. Are you sure you didn't mean "RPC that prompted the split key range request"?


http://gerrit.cloudera.org:8080/#/c/12323/10/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@1974
PS10, Line 1974:    * Get all or some key range for a given table. This may query the master multiple times if there
               :    * are a lot of tablets. And query the tablet server where each tablet is located, split the
               :    * [startPrimaryKey, endPrimaryKey) in the tablet into smaller ranges.
Is this comment correct for this method? Seems like it was copied from getTableKeyRanges where it is appropriate.


http://gerrit.cloudera.org:8080/#/c/12323/10/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@1991
PS10, Line 1991: If unset or <= 0, the key range
               :    *                       includes all the data of the Tablet.
What's the use case for this? It means we're not calling split key range at all; couldn't the caller trivially construct the list of key ranges without our help?


http://gerrit.cloudera.org:8080/#/c/12323/10/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@2015
PS10, Line 2015:   private Deferred<List<KeyRange>> loopGetTableKeyRanges(final KuduTable table,
Doc this too.


http://gerrit.cloudera.org:8080/#/c/12323/10/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@2033
PS10, Line 2033:     String tableId = table.getTableId();
Nit: just use table.getTableId() directly on L2041.


http://gerrit.cloudera.org:8080/#/c/12323/10/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@2049
PS10, Line 2049:           KeyRange keyRange = new KeyRange(startPrimaryKey, endPartitionKey, -1);
Shouldn't this use endPrimaryKey not endPartitionKey? Are we missing test coverage that would surfaced this error?


http://gerrit.cloudera.org:8080/#/c/12323/10/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@2159
PS10, Line 2159:   Deferred<List<KeyRange>> getTableKeyRanges(final KuduTable table,
These new methods duplicate a lot from locateTable and friends. Can we share code between them?


http://gerrit.cloudera.org:8080/#/c/12323/10/java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java
File java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java:

http://gerrit.cloudera.org:8080/#/c/12323/10/java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java@47
PS10, Line 47:    * Gets the start primary key.
You can omit this sentence for simple getters and just use the @return.


http://gerrit.cloudera.org:8080/#/c/12323/10/java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java@62
PS10, Line 62:   /**
             :    * Sets the tablet which the key range is in
             :    * @param tablet
             :    */
             :   public void setTablet(LocatedTablet tablet) {
             :     this.tablet = tablet;
             :   }
Can we omit this and provide the tablet in the constructor?

Seems like the reason this can't be in the constructor is the double duty that the class plays:
1. User-visible return value from the new public split key range methods.
2. Intermediate deserialized output stored in SplitKeyRangeResponse, where the tablet isn't yet known.

Perhaps we can find a different way to handle #2 and use KeyRange only for #1?


http://gerrit.cloudera.org:8080/#/c/12323/10/java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRange.java
File java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRange.java:

http://gerrit.cloudera.org:8080/#/c/12323/10/java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRange.java@26
PS10, Line 26: import org.junit.Assert;
If you statically import the Assert functions you care about, you can make the test more terse below.



-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 10
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>
Gerrit-Comment-Date: Sun, 24 Mar 2019 00:10:41 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Yao Xu (Code Review)" <ge...@cloudera.org>.
Yao Xu has uploaded a new patch set (#12) to the change originally created by yangz. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................

KUDU-2670: Part 1: Build ScanToken by KeyRange

When reading data in a kudu table using spark,
if there is a large amount of data in the tablet,
reading the data takes a long time. The reason
is that KuduRDD uses a tablet to generate the
scanToken, so a spark task needs to process all
the data in a tablet.

We send SplitKeyRange RPC to TServer, split tablet's
primary key range into multiple primary key ranges
by size, and generate the scanToken by primary key ranges.

Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
A java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java
A java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRange.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/KuduTestSuite.scala
M java/kudu-test-utils/src/main/java/org/apache/kudu/test/ClientTestUtil.java
13 files changed, 783 insertions(+), 15 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/12323/12
-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 12
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Yao Xu (Code Review)" <ge...@cloudera.org>.
Yao Xu has uploaded a new patch set (#32) to the change originally created by yangz. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................

KUDU-2670: Part 1: Build ScanToken by KeyRange

When reading data in a kudu table using spark,
if there is a large amount of data in the tablet,
reading the data takes a long time. The reason
is that KuduRDD uses a tablet to generate the
scanToken, so a spark task needs to process all
the data in a tablet.

We send SplitKeyRange RPC to TServer, split tablet's
primary key range into multiple primary key ranges
by size, and generate the scanToken by primary key ranges.

Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduScanner.java
A java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java
A java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRange.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/KuduTestSuite.scala
M java/kudu-test-utils/src/main/java/org/apache/kudu/test/ClientTestUtil.java
13 files changed, 750 insertions(+), 16 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/12323/32
-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 32
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................


Patch Set 30:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/12323/30/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
File java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java:

http://gerrit.cloudera.org:8080/#/c/12323/30/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@2039
PS30, Line 2039:                       LOG.debug("Add key range {}", newRange);
> Is the test failure issue potentially a real issue, or just a test issue?
Yeah we need to get to the bottom of this. I didn't see this debug output anywhere in the test failure. The failure was:

1) testScanWithKeyRange(org.apache.kudu.spark.kudu.DefaultSourceTest)
org.scalatest.junit.JUnitTestFailedError: [[50,50,50,50.0,50,false,50,50.0,[B@6a4d57b7,2019-05-17 05:54:47.22
5,50,50,50,50], [51,51,null,51.0,51,true,51,51.0,[B@a36ff0b,2019-05-17 05:54:47.258,51,51,51,51], [52,52,52,5
2.0,52,false,52,52.0,[B@1ecf784f,2019-05-17 05:54:47.269,52,52,52,52], [53,53,null,53.0,53,true,53,53.0,[B@34
4769b9,2019-05-17 05:54:47.283,53,53,53,53], ...

That output isn't particularly useful either. Specifically, the [B@... refers to the address of a byte array and should be replaced with something more useful.

I looped DefaultSourceTest 1000 times. You can see the results here: http://dist-test.cloudera.org/job?job_id=adar.1558472309.107999

Looks like:
- 461 tests passed
- 142 tests failed
- 397 tests were canceled

I haven't looked at all of the failures, but the few I looked at all failed in the same way: the new test got 1950 rows instead of 1000.


http://gerrit.cloudera.org:8080/#/c/12323/30/java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
File java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala:

http://gerrit.cloudera.org:8080/#/c/12323/30/java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala@1046
PS30, Line 1046:       assert(results.size() == rowCount * 100)
Could you use assertEquals here?



-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 30
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>
Gerrit-Comment-Date: Tue, 21 May 2019 21:20:19 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Yao Xu (Code Review)" <ge...@cloudera.org>.
Yao Xu has uploaded a new patch set (#18) to the change originally created by yangz. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................

KUDU-2670: Part 1: Build ScanToken by KeyRange

When reading data in a kudu table using spark,
if there is a large amount of data in the tablet,
reading the data takes a long time. The reason
is that KuduRDD uses a tablet to generate the
scanToken, so a spark task needs to process all
the data in a tablet.

We send SplitKeyRange RPC to TServer, split tablet's
primary key range into multiple primary key ranges
by size, and generate the scanToken by primary key ranges.

Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
A java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java
A java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRange.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/KuduTestSuite.scala
M java/kudu-test-utils/src/main/java/org/apache/kudu/test/ClientTestUtil.java
13 files changed, 787 insertions(+), 15 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/12323/18
-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 18
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Yao Xu (Code Review)" <ge...@cloudera.org>.
Yao Xu has uploaded a new patch set (#25) to the change originally created by yangz. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................

KUDU-2670: Part 1: Build ScanToken by KeyRange

When reading data in a kudu table using spark,
if there is a large amount of data in the tablet,
reading the data takes a long time. The reason
is that KuduRDD uses a tablet to generate the
scanToken, so a spark task needs to process all
the data in a tablet.

We send SplitKeyRange RPC to TServer, split tablet's
primary key range into multiple primary key ranges
by size, and generate the scanToken by primary key ranges.

Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
A java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java
A java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRange.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/KuduTestSuite.scala
M java/kudu-test-utils/src/main/java/org/apache/kudu/test/ClientTestUtil.java
13 files changed, 759 insertions(+), 16 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/12323/25
-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 25
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................


Patch Set 24: Code-Review+1


-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 24
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>
Gerrit-Comment-Date: Wed, 15 May 2019 04:54:13 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Yao Xu (Code Review)" <ge...@cloudera.org>.
Yao Xu has uploaded a new patch set (#23) to the change originally created by yangz. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................

KUDU-2670: Part 1: Build ScanToken by KeyRange

When reading data in a kudu table using spark,
if there is a large amount of data in the tablet,
reading the data takes a long time. The reason
is that KuduRDD uses a tablet to generate the
scanToken, so a spark task needs to process all
the data in a tablet.

We send SplitKeyRange RPC to TServer, split tablet's
primary key range into multiple primary key ranges
by size, and generate the scanToken by primary key ranges.

Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
A java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java
A java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRange.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/KuduTestSuite.scala
M java/kudu-test-utils/src/main/java/org/apache/kudu/test/ClientTestUtil.java
13 files changed, 798 insertions(+), 15 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/12323/23
-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 23
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Yao Xu (Code Review)" <ge...@cloudera.org>.
Yao Xu has uploaded a new patch set (#10) to the change originally created by yangz. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................

KUDU-2670: Part 1: Build ScanToken by KeyRange

When reading data in a kudu table using spark,
if there is a large amount of data in the tablet,
reading the data takes a long time. The reason
is that KuduRDD uses a tablet to generate the
scanToken, so a spark task needs to process all
the data in a tablet.

We send SplitKeyRange RPC to TServer, split tablet's
primary key range into multiple primary key ranges
by size. And generate the scanToken by primary key ranges.

Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
A java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java
A java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRange.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/KuduTestSuite.scala
M java/kudu-test-utils/src/main/java/org/apache/kudu/test/ClientTestUtil.java
13 files changed, 877 insertions(+), 16 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/12323/10
-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 10
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by splitSizeBytes

Posted by "Yao Xu (Code Review)" <ge...@cloudera.org>.
Yao Xu has posted comments on this change. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by splitSizeBytes
......................................................................


Patch Set 5:

(9 comments)

Hi, grant. I update this patch. 
I broke the previous code into 2 patches.
PART 1:build scanToken by splitSizeBytes
PART 2:  MultiKuduScanner

http://gerrit.cloudera.org:8080/#/c/12323/2/java/kudu-client/src/main/java/org/apache/kudu/client/KeyEncoder.java
File java/kudu-client/src/main/java/org/apache/kudu/client/KeyEncoder.java:

http://gerrit.cloudera.org:8080/#/c/12323/2/java/kudu-client/src/main/java/org/apache/kudu/client/KeyEncoder.java@46
PS2, Line 46: class KeyEncoder {
> I don't think this needs to be public.
Done


http://gerrit.cloudera.org:8080/#/c/12323/2/java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java
File java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java:

http://gerrit.cloudera.org:8080/#/c/12323/2/java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java@20
PS2, Line 20: import org.apache.yetus.audience.InterfaceAudience;
> Could you add java doc?
Done


http://gerrit.cloudera.org:8080/#/c/12323/2/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java
File java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java:

http://gerrit.cloudera.org:8080/#/c/12323/2/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java@286
PS2, Line 286:     private long splitSizeBytes = -1;
> Could this name indicate that it will cause a tablet to be split? It would 
Done


http://gerrit.cloudera.org:8080/#/c/12323/2/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java@420
PS2, Line 420:             UnsafeByteOperations.unsafeWrap(tablet.getPartition().getPartitionKeyStart()));
> Since this is private API we can use methods to separate this logic instead
Done


http://gerrit.cloudera.org:8080/#/c/12323/2/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java@460
PS2, Line 460: 
> I think the overall timeout needs to be considered here.
Done


http://gerrit.cloudera.org:8080/#/c/12323/2/java/kudu-client/src/main/java/org/apache/kudu/client/PartitionPruner.java
File java/kudu-client/src/main/java/org/apache/kudu/client/PartitionPruner.java:

http://gerrit.cloudera.org:8080/#/c/12323/2/java/kudu-client/src/main/java/org/apache/kudu/client/PartitionPruner.java@a247
PS2, Line 247: 
> Why was this removed?
Done


http://gerrit.cloudera.org:8080/#/c/12323/2/java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java
File java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java:

http://gerrit.cloudera.org:8080/#/c/12323/2/java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java@30
PS2, Line 30: import org.apache.kudu.util.Pair;
> Can you add javadoc?
Done


http://gerrit.cloudera.org:8080/#/c/12323/2/java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java
File java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java:

http://gerrit.cloudera.org:8080/#/c/12323/2/java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java@26
PS2, Line 26:  */
> Think this is a copy paste from KuduRpcResponse. Can you update the javadoc
Done


http://gerrit.cloudera.org:8080/#/c/12323/2/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala
File java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala:

http://gerrit.cloudera.org:8080/#/c/12323/2/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala@51
PS2, Line 51:   private val keepAlivePeriodMs = options.keepAlivePeriodMs
> This configuration can be added to KuduReadOptions and set in DefaultSource
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 5
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>
Gerrit-Comment-Date: Fri, 01 Mar 2019 10:35:44 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by splitSizeBytes

Posted by "Grant Henke (Code Review)" <ge...@cloudera.org>.
Grant Henke has posted comments on this change. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by splitSizeBytes
......................................................................


Patch Set 6:

(9 comments)

Thank you for the update Yao, it's looking good. I posted a few small nits and a couple high level suggestions.

http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
File java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java:

http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@1204
PS6, Line 1204:    * Returns a deferred containing key ranges of the tablet which covers the partition key in the table.
Nit: Line too long.


http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java
File java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java:

http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java@305
PS6, Line 305:      * If unset, don't split the tablet.
If unset or <= 0 is more accurate I suppose.


http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java@437
PS6, Line 437:                 tablet.getPartition().partitionKeyStart, splitSizeBytes, timeout));
I think we need to use a DedlineTracker to be sure the timeout represents the entire process of getting the scan tokens. As is I think the SplitKeyRangeRequest can take the full timeout along with the  table.getTabletsLocations call. There was some refactoring recently related to this so you may want to rebase on master.


http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java
File java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java:

http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java@36
PS6, Line 36: final class SplitKeyRangeRequest extends KuduRpc<SplitKeyRangeResponse> {
I don't think this needs to be final since it's internal preventing extension isn't a big deal.


http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRangeRequest.java
File java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRangeRequest.java:

http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRangeRequest.java@19
PS6, Line 19: public class TestSplitKeyRangeRequest {
I don't mind a test of the request/response. More coverage is a great thing. But I think a test of the public API usage is most important. 

In this case I think something that uses the ScanToken API with splits and then reading the data to verify the data was split.


http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRangeRequest.java@34
PS6, Line 34:       "--tablet_transaction_memory_limit_mb=1",
Why do these flags need to be set?


http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/test/java/org/apache/kudu/util/SplitTestUtil.java
File java/kudu-client/src/test/java/org/apache/kudu/util/SplitTestUtil.java:

http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-client/src/test/java/org/apache/kudu/util/SplitTestUtil.java@16
PS6, Line 16: public class SplitTestUtil {
Could this be moved into the ClientTestUtil class for things that are generally useful? There is also an existing DataGenerator class that can be used to generate the rows.

Things that are specific to the TestSplitKeyRangeRequest only can be moved to the test class.


http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
File java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala:

PS6: 
Could you write a spark test that uses this new feature?


http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala
File java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala:

http://gerrit.cloudera.org:8080/#/c/12323/6/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala@37
PS6, Line 37:  * @param splitSizeBytes Set the number of bytes used to split tablet.
Perhaps make these docs a bit more spark centric. Something along the lines of "Sets the target number of bytes per spark task. If set, tablets will be split to generate uniform task sizes instead of the default of 1 task per tablet."



-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 6
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>
Gerrit-Comment-Date: Mon, 04 Mar 2019 16:50:11 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Yao Xu (Code Review)" <ge...@cloudera.org>.
Yao Xu has uploaded a new patch set (#21) to the change originally created by yangz. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................

KUDU-2670: Part 1: Build ScanToken by KeyRange

When reading data in a kudu table using spark,
if there is a large amount of data in the tablet,
reading the data takes a long time. The reason
is that KuduRDD uses a tablet to generate the
scanToken, so a spark task needs to process all
the data in a tablet.

We send SplitKeyRange RPC to TServer, split tablet's
primary key range into multiple primary key ranges
by size, and generate the scanToken by primary key ranges.

Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
A java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java
A java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRange.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/KuduTestSuite.scala
M java/kudu-test-utils/src/main/java/org/apache/kudu/test/ClientTestUtil.java
13 files changed, 791 insertions(+), 15 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/12323/21
-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 21
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by splitSizeBytes

Posted by "Yao Xu (Code Review)" <ge...@cloudera.org>.
Yao Xu has uploaded a new patch set (#6) to the change originally created by yangz. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by splitSizeBytes
......................................................................

KUDU-2670: Part 1: Build ScanToken by splitSizeBytes

When reading data in a kudu table using spark,
if there is a large amount of data in the tablet,
reading the data takes a long time. The reason
is that KuduRDD uses a tablet to generate the
scanToken, so a spark task needs to process all
the data in a tablet.

We send SplitKeyRange RPC to TServer, split tablet
into multiple primary key ranges by size. And generate
the scanToken by tablet's key ranges.

Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
A java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java
A java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRangeRequest.java
A java/kudu-client/src/test/java/org/apache/kudu/util/SplitTestUtil.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala
10 files changed, 565 insertions(+), 6 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/12323/6
-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 6
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>

[kudu-CR] KUDU-2670: split more scanner and add concurrent

Posted by "yangz (Code Review)" <ge...@cloudera.org>.
yangz has posted comments on this change. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: split more scanner and add concurrent
......................................................................


Patch Set 2:

Thanks for reply. I am out of the job now. But Xu Yao at our team will continue it. He will reply for this patch soon.


-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 2
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: yangz <zh...@gmail.com>
Gerrit-Comment-Date: Fri, 01 Mar 2019 04:10:26 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Yao Xu (Code Review)" <ge...@cloudera.org>.
Yao Xu has posted comments on this change. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................


Patch Set 26:

(6 comments)

Thanks for review.

http://gerrit.cloudera.org:8080/#/c/12323/24/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
File java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java:

http://gerrit.cloudera.org:8080/#/c/12323/24/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@1996
PS24, Line 1996:                                              final byte[] endPrimaryKey,
> It's not common to have synchronous methods in the AsyncKuduClient.
 > Given this is just a call to getTableKeyRanges with a following
 > .join() call and it's only used once in KuduTable, maybe you could
 > just call getTableKeyRanges(...).join() in KuduTable.

Ok, I deleted it. Btw, maybe syncLocateTable should also needs to be modified. :)


http://gerrit.cloudera.org:8080/#/c/12323/24/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java
File java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java:

http://gerrit.cloudera.org:8080/#/c/12323/24/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java@294
PS24, Line 294:     private long timeout;
> Nit: define a constant for the special -1 value.
Done


http://gerrit.cloudera.org:8080/#/c/12323/24/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java@408
PS24, Line 408:         List<KeyRange> keyRanges = new ArrayList<>();
> Should we still use the old style when splitSizeBytes = -1?

Yep, a scan token is created for each tablet to be scanned by default.


http://gerrit.cloudera.org:8080/#/c/12323/24/java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java
File java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java:

http://gerrit.cloudera.org:8080/#/c/12323/24/java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java@245
PS24, Line 245:   public List<KeyRange> getTabletsKeyRanges(byte[] primaryKeyStart,
> I am not sure this should live here. Or at least is shouldn't be
 > public given it returns a private KeyRange. You could call
 > client.getTableKeyRanges(...).join() in the ScanToken code.

Sorry, I didn't consider the KuduTable API. I think it is better not to let the user directly KeyRange, so I deleted this interface.


http://gerrit.cloudera.org:8080/#/c/12323/24/java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java
File java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java:

http://gerrit.cloudera.org:8080/#/c/12323/24/java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java@37
PS24, Line 37: 
> Perhaps add the constant for the -1 value mentioned in
 > KuduScanToken here.

I added this constant in KuduScanToken. Maybe add some doc for splitSizeBytes <= 0 in KeyRange and SplitKeyRangeRequest.


http://gerrit.cloudera.org:8080/#/c/12323/24/java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java
File java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java:

http://gerrit.cloudera.org:8080/#/c/12323/24/java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java@33
PS24, Line 33:   SplitKeyRangeResponse(long elapsedMillis, String tsUUID, List<KeyRangePB> keyRanges) {
> Can we translate KeyRangePB to KeyRange before passing into this
 > constructor? This is usually done via methods added to
 > ProtobufHelper and placed in a deserialize method on the request.
 > See GetTableSchemaRequest.deserialize for an example.

There is a problem here, KeyRange's constructor needs a LocatedTablet. But getting this value here is more difficult, I didn't think of a good way to do it.



-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 26
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>
Gerrit-Comment-Date: Fri, 17 May 2019 04:34:37 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Yao Xu (Code Review)" <ge...@cloudera.org>.
Yao Xu has posted comments on this change. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................


Patch Set 24:

(1 comment)

Add tablet not null check in KeyRanage's constructor, and fix some compiled warnings.

http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java
File java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java:

http://gerrit.cloudera.org:8080/#/c/12323/20/java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java@38
PS20, Line 38:    * @param dataSizeBytes the estimated data size of the key range.
> Instead of the doc, could you add a Preconditions.checkNotNull() call to th
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 24
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>
Gerrit-Comment-Date: Wed, 15 May 2019 03:35:40 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................


Patch Set 33: Code-Review+2

> I found the root cause of this problem. Because Deferred.group no guarantee on the order of the results in the deferred list returned, but we rely on the return order when we call PartitionPruner.removePartitionKeyRange, This will cause some KeyRange to be created repeatedly. It's my fault. :( 

Nice, glad you were able to get to the bottom of it.

> Btw, dist-test seems to be a good test tools. Is there any doc for it?

Unfortunately there aren't because it has some significant limitations:
1. Submitted tests must be built on el6 or Ubuntu 14 hosts.
2. There's no sandboxing or resource controls in place, so a malicious user can use dist-test to e.g. mine bitcoin. For that reason, we haven't shared the credentials outside Cloudera.

We should work towards making it available to the broader community, but for now if you want something looped with dist-test, don't hesitate to ask a Cloudera engineer who works on Kudu.

BTW, I looped DefaultSourceTest with the fix 1000 times. It passed 999/1000, with the one failure being unrelated.


-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 33
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>
Gerrit-Comment-Date: Fri, 24 May 2019 05:15:25 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Yao Xu (Code Review)" <ge...@cloudera.org>.
Yao Xu has uploaded a new patch set (#31) to the change originally created by yangz. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................

KUDU-2670: Part 1: Build ScanToken by KeyRange

When reading data in a kudu table using spark,
if there is a large amount of data in the tablet,
reading the data takes a long time. The reason
is that KuduRDD uses a tablet to generate the
scanToken, so a spark task needs to process all
the data in a tablet.

We send SplitKeyRange RPC to TServer, split tablet's
primary key range into multiple primary key ranges
by size, and generate the scanToken by primary key ranges.

Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduScanner.java
A java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java
A java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRange.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/KuduTestSuite.scala
M java/kudu-test-utils/src/main/java/org/apache/kudu/test/ClientTestUtil.java
13 files changed, 740 insertions(+), 16 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/12323/31
-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 31
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Grant Henke (Code Review)" <ge...@cloudera.org>.
Grant Henke has posted comments on this change. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................


Patch Set 29:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12323/30/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
File java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java:

http://gerrit.cloudera.org:8080/#/c/12323/30/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@2039
PS30, Line 2039:                       ranges.add(newRange);
Is the test failure issue potentially a real issue, or just a test issue?

Do you think we should submit this patch with this log and the suspected testing issue?



-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 29
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>
Gerrit-Comment-Date: Tue, 21 May 2019 14:56:15 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2670: split more scanner and add concurrent

Posted by "Grant Henke (Code Review)" <ge...@cloudera.org>.
Grant Henke has posted comments on this change. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: split more scanner and add concurrent
......................................................................


Patch Set 2:

This contribution is very much appreciated. 

Are you interested in addressing the review feedback on this patch? If not we can let others from the community pick up where it leaves off as well. I just wanted to be sure you weren't working on following up.


-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 2
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Wed, 27 Feb 2019 02:46:02 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................


Patch Set 26:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12323/24/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
File java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java:

http://gerrit.cloudera.org:8080/#/c/12323/24/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@1996
PS24, Line 1996:                                              final byte[] endPrimaryKey,
> It's not common to have synchronous methods in the AsyncKuduClient. Given t
I think this was modeled after syncLocateTable but I agree that that's also a poor fit for AsyncKuduClient.



-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 26
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>
Gerrit-Comment-Date: Fri, 17 May 2019 04:09:48 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Yao Xu (Code Review)" <ge...@cloudera.org>.
Yao Xu has uploaded a new patch set (#16) to the change originally created by yangz. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................

KUDU-2670: Part 1: Build ScanToken by KeyRange

When reading data in a kudu table using spark,
if there is a large amount of data in the tablet,
reading the data takes a long time. The reason
is that KuduRDD uses a tablet to generate the
scanToken, so a spark task needs to process all
the data in a tablet.

We send SplitKeyRange RPC to TServer, split tablet's
primary key range into multiple primary key ranges
by size, and generate the scanToken by primary key ranges.

Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
A java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java
A java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRange.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/KuduTestSuite.scala
M java/kudu-test-utils/src/main/java/org/apache/kudu/test/ClientTestUtil.java
13 files changed, 787 insertions(+), 15 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/12323/16
-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 16
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Yao Xu (Code Review)" <ge...@cloudera.org>.
Yao Xu has uploaded a new patch set (#30) to the change originally created by yangz. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................

KUDU-2670: Part 1: Build ScanToken by KeyRange

When reading data in a kudu table using spark,
if there is a large amount of data in the tablet,
reading the data takes a long time. The reason
is that KuduRDD uses a tablet to generate the
scanToken, so a spark task needs to process all
the data in a tablet.

We send SplitKeyRange RPC to TServer, split tablet's
primary key range into multiple primary key ranges
by size, and generate the scanToken by primary key ranges.

Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
A java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java
A java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRange.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/KuduTestSuite.scala
M java/kudu-test-utils/src/main/java/org/apache/kudu/test/ClientTestUtil.java
12 files changed, 738 insertions(+), 16 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/12323/30
-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 30
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Yao Xu (Code Review)" <ge...@cloudera.org>.
Yao Xu has posted comments on this change. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................


Patch Set 20:

(12 comments)

Apologize for updating the patch late. I have checked the error case in the unit test, it should not be caused by this patch.

http://gerrit.cloudera.org:8080/#/c/12323/10//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12323/10//COMMIT_MSG@18
PS10, Line 18: by size, and generate 
> Nit: "by size, and generate..."
Done


http://gerrit.cloudera.org:8080/#/c/12323/10/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
File java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java:

http://gerrit.cloudera.org:8080/#/c/12323/10/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@1937
PS10, Line 1937: 
> Nit: replace with 'to'
Done


http://gerrit.cloudera.org:8080/#/c/12323/10/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@1942
PS10, Line 1942:    * @param startPrimaryKey start splitting at the beginning of the tablet
> Nit: replace with "start splitting at the beginning of the tablet"
Done


http://gerrit.cloudera.org:8080/#/c/12323/10/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@1944
PS10, Line 1944:    * @param partitionKey the partition key of the tablet to find
> Nit: replace with "stop splitting at the end of the tablet"
Done


http://gerrit.cloudera.org:8080/#/c/12323/10/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@1947
PS10, Line 1947: 
> Nit: "a key range"
Done


http://gerrit.cloudera.org:8080/#/c/12323/10/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@1949
PS10, Line 1949: ess
> Odd to see this here. Are you sure you didn't mean "RPC that prompted the s
Done


http://gerrit.cloudera.org:8080/#/c/12323/10/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@1974
PS10, Line 1974:    * are a lot of tablets, and query each tablet to split the tablet's primary key range into
               :    * smaller ranges. This doesn't change the layout of the tablet.
               :    * This method blocks until it gets all the key ranges.
> Is this comment correct for this method? Seems like it was copied
 > from getTableKeyRanges where it is appropriate.

Yes, this method is the synchronized method for getTableKeyRanges, so the comment look the same.


http://gerrit.cloudera.org:8080/#/c/12323/10/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@2049
PS10, Line 2049:           }
> Shouldn't this use endPrimaryKey not endPartitionKey? Are we
 > missing test coverage that would surfaced this error?

I added some unit tests for this case.


http://gerrit.cloudera.org:8080/#/c/12323/10/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@2159
PS10, Line 2159:   }
> These new methods duplicate a lot from locateTable and friends. Can
 > we share code between them?

Sorry, I didn't find a way to share this code. But I've made improvements in code readability.


http://gerrit.cloudera.org:8080/#/c/12323/10/java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java
File java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java:

http://gerrit.cloudera.org:8080/#/c/12323/10/java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java@47
PS10, Line 47: 
> You can omit this sentence for simple getters and just use the @return.
Done


http://gerrit.cloudera.org:8080/#/c/12323/10/java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java@62
PS10, Line 62:   /**
             :    * @return the located tablet
             :    */
             :   public LocatedTablet getTablet() {
             :     return tablet;
             :   }
             : 
> Can we omit this and provide the tablet in the constructor?
Done


http://gerrit.cloudera.org:8080/#/c/12323/10/java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRange.java
File java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRange.java:

http://gerrit.cloudera.org:8080/#/c/12323/10/java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRange.java@26
PS10, Line 26: import org.apache.kudu.test.KuduTestHarness;
> If you statically import the Assert functions you care about, you can make 
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 20
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>
Gerrit-Comment-Date: Mon, 13 May 2019 05:52:16 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by splitSizeBytes

Posted by "Yao Xu (Code Review)" <ge...@cloudera.org>.
Yao Xu has posted comments on this change. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by splitSizeBytes
......................................................................


Patch Set 6:

Hi, grant and adar. Thanks for review.

I will update this patch according to your comments.


-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 6
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>
Gerrit-Comment-Date: Wed, 06 Mar 2019 03:38:55 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Yao Xu (Code Review)" <ge...@cloudera.org>.
Yao Xu has uploaded a new patch set (#28) to the change originally created by yangz. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................

KUDU-2670: Part 1: Build ScanToken by KeyRange

When reading data in a kudu table using spark,
if there is a large amount of data in the tablet,
reading the data takes a long time. The reason
is that KuduRDD uses a tablet to generate the
scanToken, so a spark task needs to process all
the data in a tablet.

We send SplitKeyRange RPC to TServer, split tablet's
primary key range into multiple primary key ranges
by size, and generate the scanToken by primary key ranges.

Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
A java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java
A java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRange.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/KuduTestSuite.scala
M java/kudu-test-utils/src/main/java/org/apache/kudu/test/ClientTestUtil.java
12 files changed, 736 insertions(+), 16 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/12323/28
-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 28
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Yao Xu (Code Review)" <ge...@cloudera.org>.
Yao Xu has uploaded a new patch set (#8) to the change originally created by yangz. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................

KUDU-2670: Part 1: Build ScanToken by KeyRange

When reading data in a kudu table using spark,
if there is a large amount of data in the tablet,
reading the data takes a long time. The reason
is that KuduRDD uses a tablet to generate the
scanToken, so a spark task needs to process all
the data in a tablet.

We send SplitKeyRange RPC to TServer, split tablet's
primary key range into multiple primary key ranges
by size. And generate the scanToken by primary key ranges.

Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
A java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java
A java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRange.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/KuduTestSuite.scala
M java/kudu-test-utils/src/main/java/org/apache/kudu/test/ClientTestUtil.java
13 files changed, 873 insertions(+), 16 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/12323/8
-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 8
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>

[kudu-CR] KUDU-2670: Part 1: Build ScanToken by KeyRange

Posted by "Yao Xu (Code Review)" <ge...@cloudera.org>.
Yao Xu has uploaded a new patch set (#29) to the change originally created by yangz. ( http://gerrit.cloudera.org:8080/12323 )

Change subject: KUDU-2670: Part 1: Build ScanToken by KeyRange
......................................................................

KUDU-2670: Part 1: Build ScanToken by KeyRange

When reading data in a kudu table using spark,
if there is a large amount of data in the tablet,
reading the data takes a long time. The reason
is that KuduRDD uses a tablet to generate the
scanToken, so a spark task needs to process all
the data in a tablet.

We send SplitKeyRange RPC to TServer, split tablet's
primary key range into multiple primary key ranges
by size, and generate the scanToken by primary key ranges.

Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
A java/kudu-client/src/main/java/org/apache/kudu/client/KeyRange.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeRequest.java
A java/kudu-client/src/main/java/org/apache/kudu/client/SplitKeyRangeResponse.java
A java/kudu-client/src/test/java/org/apache/kudu/client/TestSplitKeyRange.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/KuduTestSuite.scala
M java/kudu-test-utils/src/main/java/org/apache/kudu/test/ClientTestUtil.java
12 files changed, 739 insertions(+), 16 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/12323/29
-- 
To view, visit http://gerrit.cloudera.org:8080/12323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0502f5d64569e8b1d45e88de3cb36aa2e01234d0
Gerrit-Change-Number: 12323
Gerrit-PatchSet: 29
Gerrit-Owner: yangz <zh...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yao Xu <oc...@gmail.com>
Gerrit-Reviewer: yangz <zh...@gmail.com>