You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "ZhangYao (Code Review)" <ge...@cloudera.org> on 2019/08/20 15:15:12 UTC

[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

ZhangYao has uploaded this change for review. ( http://gerrit.cloudera.org:8080/14107


Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................

KUDU-2921: Exposing the table statistics to spark relation.

Exposing current table statistics to spark via rpc from client to master.

Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
A java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsRequest.java
A java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsResponse.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduClient.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java
M java/kudu-client/src/test/java/org/apache/kudu/client/TestKuduTable.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/master.proto
M src/kudu/master/master_service.cc
M src/kudu/master/master_service.h
12 files changed, 244 insertions(+), 0 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/07/14107/1
-- 
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 1
Gerrit-Owner: ZhangYao <tr...@gmail.com>

[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

Posted by "ZhangYao (Code Review)" <ge...@cloudera.org>.
ZhangYao has posted comments on this change. ( http://gerrit.cloudera.org:8080/14107 )

Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................


Patch Set 7:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/14107/6/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
File java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala:

http://gerrit.cloudera.org:8080/#/c/14107/6/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala@33
PS6, Line 33: import org.slf4j.Logger
> Nit: Single import per line
Done


http://gerrit.cloudera.org:8080/#/c/14107/6/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala@288
PS6, Line 288:   override def sizeInBytes: Long = estimatedSize
> Is there a jira tracking the improvement for factoring predicates and proje
Ok, I will create one. And TODO added here.


http://gerrit.cloudera.org:8080/#/c/14107/6/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala
File java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala:

http://gerrit.cloudera.org:8080/#/c/14107/6/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala@35
PS6, Line 35:  *                          server to ensure that scanners do not time out
> Please move the documentation for sizeFactor here.
Done


http://gerrit.cloudera.org:8080/#/c/14107/6/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala@49
PS6, Line 49:     scanRequestTimeoutMs: Option[Long] = None,
> I am not 100% sure how this configuration is really useful/helpful. Given t
Yeah, the usage of factor needs user to know the detail about Relation and the spark plan maybe it's not suitable and hard to guide.


http://gerrit.cloudera.org:8080/#/c/14107/6/java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
File java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala:

http://gerrit.cloudera.org:8080/#/c/14107/6/java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala@32
PS6, Line 32: import org.apache.kudu.client.CreateTableOptions
> Nit: One import per line is preferred. Here and below.
Done


http://gerrit.cloudera.org:8080/#/c/14107/6/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

http://gerrit.cloudera.org:8080/#/c/14107/6/src/kudu/master/catalog_manager.cc@271
PS6, Line 271: DEFINE_bool(mock_table_metrics_for_testing, false,
> Do these flags need to be marked as hidden?
Yeah, Adar said the same thing. According to the comment about Flag tags,  tagged as unsafe will exclude the flag automatically even if it wasn't marked as 'hidden'.  I think using hidden maybe more clear. Done.



-- 
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 7
Gerrit-Owner: ZhangYao <tr...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: ZhangYao <tr...@gmail.com>
Gerrit-Comment-Date: Fri, 06 Sep 2019 05:38:48 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

Posted by "ZhangYao (Code Review)" <ge...@cloudera.org>.
ZhangYao has posted comments on this change. ( http://gerrit.cloudera.org:8080/14107 )

Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................


Patch Set 5:

> > 1. For java client when using new RPC:
 > > new java client -> old server -> leads to execption
 > 
 > OK, because the exception is only seen when invoking the new APIs.
 > It's not visible in any existing APIs, right?
 Right.

 > > 2. For spark client:
 > > new spark client depends on new java client.
 > > new spark client -> old server -> leads to execption
 > 
 > This is what concerns me. Under what circumstances is there an
 > exception? When invoking some new API? Or in existing data paths?
 > 
 > If the latter, we need to address it. We take great care to ensure
 > Kudu is both forward and backward compatible, whether it be across
 > the wire (i.e. between server and client), or across a library
 > boundary (i.e. between client and application). It wouldn't be OK
 > for someone running with the new Spark integration against an old
 > server to have a degraded experience in _existing_ functionality.
 > If it's brand new functionality (i.e. a new API in the client),
 > that's fine.

I got it. In previous patch because when creating relation it will call getStatistic automatically so it will cause exception with old server . I used  a config to decide whether to use getStatistics or not. That's for new spark client with old server user if he doesn't open that conf, it can work well.


-- 
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 5
Gerrit-Owner: ZhangYao <tr...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: ZhangYao <tr...@gmail.com>
Gerrit-Comment-Date: Fri, 30 Aug 2019 06:26:15 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/14107 )

Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................


Patch Set 10: Verified+1

Yeah I don't think those failures are related to your change. Let's override them.


-- 
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 10
Gerrit-Owner: ZhangYao <tr...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: ZhangYao <tr...@gmail.com>
Gerrit-Comment-Date: Tue, 10 Sep 2019 03:54:14 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/14107 )

Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................


Patch Set 6: Code-Review+1

Thanks for addressing my concerns.

It'd be nice if Hao/Andrew reviewed the authz piece, and I'd also like Grant to review the Spark changes.


-- 
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 6
Gerrit-Owner: ZhangYao <tr...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: ZhangYao <tr...@gmail.com>
Gerrit-Comment-Date: Tue, 03 Sep 2019 16:53:58 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

Posted by "ZhangYao (Code Review)" <ge...@cloudera.org>.
ZhangYao has posted comments on this change. ( http://gerrit.cloudera.org:8080/14107 )

Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................


Patch Set 4:

(3 comments)

> (4 comments)
 > 
 > How do the new Java APIs react when running against an old server
 > that doesn't support the new RPC? What about the Spark client?

1. For java client when using new RPC:
new java client -> old server -> leads to execption
new java client -> new server -> works normally
old java client -> old server -> works normally
old java client -> new server -> works normally

2. For spark client:
new spark client depends on new java client.
new spark client -> old server -> leads to execption
new spark client -> new server -> works normally

old spark client can works with both old and new spark client
old spark client -> old server -> works normally
old spark client -> new server -> works normally

So new spark client needs new java clients needs new server.

http://gerrit.cloudera.org:8080/#/c/14107/3/java/kudu-client/src/main/java/org/apache/kudu/client/KuduTableStatistics.java
File java/kudu-client/src/main/java/org/apache/kudu/client/KuduTableStatistics.java:

http://gerrit.cloudera.org:8080/#/c/14107/3/java/kudu-client/src/main/java/org/apache/kudu/client/KuduTableStatistics.java@52
PS3, Line 52:    * Get the table's live row count, this statistic is pre-replication.
> Should note pre-replication here too.
Done


http://gerrit.cloudera.org:8080/#/c/14107/3/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
File java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala:

http://gerrit.cloudera.org:8080/#/c/14107/3/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala@270
PS3, Line 270: It
             :    * is always better to overestimate this size than underestimate,
> Why? Could you expand the comment?
Done


http://gerrit.cloudera.org:8080/#/c/14107/3/src/kudu/master/sentry_authz_provider.cc
File src/kudu/master/sentry_authz_provider.cc:

http://gerrit.cloudera.org:8080/#/c/14107/3/src/kudu/master/sentry_authz_provider.cc@258
PS3, Line 258:   // Table statistics contain data and thus requires the 'SELECT ON TABLE' privilege.
             :   return Authorize(SentryAu
> "Table statistics contain data and thus requires the 'SELECT ON TABLE' priv
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 4
Gerrit-Owner: ZhangYao <tr...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: ZhangYao <tr...@gmail.com>
Gerrit-Comment-Date: Thu, 29 Aug 2019 07:00:34 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

Posted by "ZhangYao (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Kudu Jenkins, Andrew Wong, Adar Dembo, Grant Henke, Hao Hao, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14107

to look at the new patch set (#3).

Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................

KUDU-2921: Exposing the table statistics to spark relation.

Exposing current table statistics to spark via rpc from client to master.

Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
A java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsRequest.java
A java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsResponse.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduClient.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java
A java/kudu-client/src/main/java/org/apache/kudu/client/KuduTableStatistics.java
M java/kudu-client/src/test/java/org/apache/kudu/client/TestKuduTable.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
M src/kudu/master/authz_provider.h
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/default_authz_provider.h
M src/kudu/master/master-test.cc
M src/kudu/master/master.proto
M src/kudu/master/master_service.cc
M src/kudu/master/master_service.h
M src/kudu/master/sentry_authz_provider-test.cc
M src/kudu/master/sentry_authz_provider.cc
M src/kudu/master/sentry_authz_provider.h
21 files changed, 537 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/07/14107/3
-- 
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 3
Gerrit-Owner: ZhangYao <tr...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: ZhangYao <tr...@gmail.com>

[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/14107 )

Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................


Patch Set 9: Code-Review+1

(2 comments)

http://gerrit.cloudera.org:8080/#/c/14107/8/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
File java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala:

http://gerrit.cloudera.org:8080/#/c/14107/8/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala@281
PS8, Line 281: overestimate this size than underestimate
> The estimation of relation size is the more accurate the better, if we don'
I see, thanks for the explanation! I agree, having some kind of measurement for row size seems desirable in that case.

Seems like the missing piece might be per-column statistics, i.e. per column min/max, histograms, etc, which would be able to provide accurate estimates or upper bounds on the row sizes, and may allow us to give estimates with projections and predicates. Definitely room for improvements, but I think this patch is still an improvement over nothing :)


http://gerrit.cloudera.org:8080/#/c/14107/9/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
File java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala:

http://gerrit.cloudera.org:8080/#/c/14107/9/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala@71
PS9, Line 71:   val SIZE_FACTOR = "kudu.sizeFactor"
Not used?



-- 
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 9
Gerrit-Owner: ZhangYao <tr...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: ZhangYao <tr...@gmail.com>
Gerrit-Comment-Date: Mon, 09 Sep 2019 18:38:12 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

Posted by "ZhangYao (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Kudu Jenkins, Andrew Wong, Adar Dembo, Grant Henke, Hao Hao, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14107

to look at the new patch set (#4).

Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................

KUDU-2921: Exposing the table statistics to spark relation.

Exposing current table statistics to spark via rpc from client to master.

Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
A java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsRequest.java
A java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsResponse.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduClient.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java
A java/kudu-client/src/main/java/org/apache/kudu/client/KuduTableStatistics.java
M java/kudu-client/src/test/java/org/apache/kudu/client/TestKuduTable.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
M src/kudu/master/authz_provider.h
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/default_authz_provider.h
M src/kudu/master/master-test.cc
M src/kudu/master/master.proto
M src/kudu/master/master_service.cc
M src/kudu/master/master_service.h
M src/kudu/master/sentry_authz_provider-test.cc
M src/kudu/master/sentry_authz_provider.cc
M src/kudu/master/sentry_authz_provider.h
21 files changed, 538 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/07/14107/4
-- 
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 4
Gerrit-Owner: ZhangYao <tr...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: ZhangYao <tr...@gmail.com>

[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

Posted by "ZhangYao (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Kudu Jenkins, Andrew Wong, Adar Dembo, Grant Henke, Hao Hao, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14107

to look at the new patch set (#10).

Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................

KUDU-2921: Exposing the table statistics to spark relation.

Exposing current table statistics to spark via rpc from client to master.

Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
A java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsRequest.java
A java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsResponse.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduClient.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java
A java/kudu-client/src/main/java/org/apache/kudu/client/KuduTableStatistics.java
M java/kudu-client/src/test/java/org/apache/kudu/client/TestKuduTable.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
M src/kudu/master/authz_provider.h
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/default_authz_provider.h
M src/kudu/master/master-test.cc
M src/kudu/master/master.proto
M src/kudu/master/master_service.cc
M src/kudu/master/master_service.h
M src/kudu/master/sentry_authz_provider-test.cc
M src/kudu/master/sentry_authz_provider.cc
M src/kudu/master/sentry_authz_provider.h
20 files changed, 516 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/07/14107/10
-- 
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 10
Gerrit-Owner: ZhangYao <tr...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: ZhangYao <tr...@gmail.com>

[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/14107 )

Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14107/1/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

http://gerrit.cloudera.org:8080/#/c/14107/1/src/kudu/master/catalog_manager.cc@2906
PS1, Line 2906:       return SetupError(authz_provider_->AuthorizeGetTableMetadata(table_name, username),
I'm not sure these required permissions are correct for protecting this endpoint. Other systems like Impala require "ALL ON TABLE", whereas this is more like "ANY ON TABLE".

https://www.cloudera.com/documentation/enterprise/latest/topics/impala_authorization.html

We may need to extend the AuthzProvider interface to support this. It would also be good to get Hao Hao's input, since she is very knowledgable in this area.



-- 
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 1
Gerrit-Owner: ZhangYao <tr...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Tue, 20 Aug 2019 18:01:38 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

Posted by "ZhangYao (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Kudu Jenkins, Andrew Wong, Adar Dembo, Grant Henke, Hao Hao, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14107

to look at the new patch set (#6).

Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................

KUDU-2921: Exposing the table statistics to spark relation.

Exposing current table statistics to spark via rpc from client to master.

Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
A java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsRequest.java
A java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsResponse.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduClient.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java
A java/kudu-client/src/main/java/org/apache/kudu/client/KuduTableStatistics.java
M java/kudu-client/src/test/java/org/apache/kudu/client/TestKuduTable.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
M src/kudu/master/authz_provider.h
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/default_authz_provider.h
M src/kudu/master/master-test.cc
M src/kudu/master/master.proto
M src/kudu/master/master_service.cc
M src/kudu/master/master_service.h
M src/kudu/master/sentry_authz_provider-test.cc
M src/kudu/master/sentry_authz_provider.cc
M src/kudu/master/sentry_authz_provider.h
21 files changed, 543 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/07/14107/6
-- 
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 6
Gerrit-Owner: ZhangYao <tr...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: ZhangYao <tr...@gmail.com>

[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

Posted by "ZhangYao (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Kudu Jenkins, Andrew Wong, Adar Dembo, Grant Henke, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14107

to look at the new patch set (#2).

Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................

KUDU-2921: Exposing the table statistics to spark relation.

Exposing current table statistics to spark via rpc from client to master.

Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
A java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsRequest.java
A java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsResponse.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduClient.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java
A java/kudu-client/src/main/java/org/apache/kudu/client/KuduTableStatistics.java
M java/kudu-client/src/test/java/org/apache/kudu/client/TestKuduTable.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/master-test.cc
M src/kudu/master/master.proto
M src/kudu/master/master_service.cc
M src/kudu/master/master_service.h
15 files changed, 414 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/07/14107/2
-- 
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 2
Gerrit-Owner: ZhangYao <tr...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/14107 )

Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................


Patch Set 4:

> 1. For java client when using new RPC:
> new java client -> old server -> leads to execption

OK, because the exception is only seen when invoking the new APIs. It's not visible in any existing APIs, right?

> 2. For spark client:
> new spark client depends on new java client.
> new spark client -> old server -> leads to execption

This is what concerns me. Under what circumstances is there an exception? When invoking some new API? Or in existing data paths?

If the latter, we need to address it. We take great care to ensure Kudu is both forward and backward compatible, whether it be across the wire (i.e. between server and client), or across a library boundary (i.e. between client and application). It wouldn't be OK for someone running with the new Spark integration against an old server to have a degraded experience in _existing_ functionality. If it's brand new functionality (i.e. a new API in the client), that's fine.


-- 
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 4
Gerrit-Owner: ZhangYao <tr...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: ZhangYao <tr...@gmail.com>
Gerrit-Comment-Date: Fri, 30 Aug 2019 00:30:18 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

Posted by "ZhangYao (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Kudu Jenkins, Andrew Wong, Adar Dembo, Grant Henke, Hao Hao, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14107

to look at the new patch set (#7).

Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................

KUDU-2921: Exposing the table statistics to spark relation.

Exposing current table statistics to spark via rpc from client to master.

Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
A java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsRequest.java
A java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsResponse.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduClient.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java
A java/kudu-client/src/main/java/org/apache/kudu/client/KuduTableStatistics.java
M java/kudu-client/src/test/java/org/apache/kudu/client/TestKuduTable.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
M src/kudu/master/authz_provider.h
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/default_authz_provider.h
M src/kudu/master/master-test.cc
M src/kudu/master/master.proto
M src/kudu/master/master_service.cc
M src/kudu/master/master_service.h
M src/kudu/master/sentry_authz_provider-test.cc
M src/kudu/master/sentry_authz_provider.cc
M src/kudu/master/sentry_authz_provider.h
20 files changed, 520 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/07/14107/7
-- 
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 7
Gerrit-Owner: ZhangYao <tr...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: ZhangYao <tr...@gmail.com>

[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

Posted by "Grant Henke (Code Review)" <ge...@cloudera.org>.
Grant Henke has posted comments on this change. ( http://gerrit.cloudera.org:8080/14107 )

Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................


Patch Set 9: Code-Review+1

LGTM. I will leave open to let others do a final review.


-- 
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 9
Gerrit-Owner: ZhangYao <tr...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: ZhangYao <tr...@gmail.com>
Gerrit-Comment-Date: Mon, 09 Sep 2019 15:08:59 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/14107 )

Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................


Patch Set 3:

(4 comments)

How do the new Java APIs react when running against an old server that doesn't support the new RPC? What about the Spark client?

http://gerrit.cloudera.org:8080/#/c/14107/3/java/kudu-client/src/main/java/org/apache/kudu/client/KuduTableStatistics.java
File java/kudu-client/src/main/java/org/apache/kudu/client/KuduTableStatistics.java:

http://gerrit.cloudera.org:8080/#/c/14107/3/java/kudu-client/src/main/java/org/apache/kudu/client/KuduTableStatistics.java@52
PS3, Line 52:    * Get the table's live row count.
Should note pre-replication here too.


http://gerrit.cloudera.org:8080/#/c/14107/3/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
File java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala:

http://gerrit.cloudera.org:8080/#/c/14107/3/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala@270
PS3, Line 270: It
             :    * is always better to overestimate this size than underestimate.
Why? Could you expand the comment?


http://gerrit.cloudera.org:8080/#/c/14107/2/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

http://gerrit.cloudera.org:8080/#/c/14107/2/src/kudu/master/catalog_manager.cc@271
PS2, Line 271: DEFINE_bool(mock_table_metrics_for_testing, false,
             :             "Whether to enable mock table metrics for testing.");
             : TAG_FLAG(mock_table_metrics_for_testing, runtime);
             : TAG_FLAG(mock_table_metrics_for_testing, unsafe);
             : 
             : DEFINE_int64(on_disk_size_for_testing, 0,
             :              "Mock the on disk size of metrics for testing.");
             : TAG_FLAG(on_disk_size_for_testing, runtime);
             : TAG_FLAG(on_disk_size_for_testing, unsafe);
             : 
             : DEFINE_int64(live_row_count_for_testing, 0,
             :              "Mock the live row count of metrics for testing.");
             : TAG_FLAG(live_row_count_for_testing, runtime);
             : TAG_FLAG(live_row_count_for_testing, unsafe);
> According to the comment about Flag tags, it seems tagged as unsafe will ex
I see. OK, I guess it's fine then.


http://gerrit.cloudera.org:8080/#/c/14107/3/src/kudu/master/sentry_authz_provider.cc
File src/kudu/master/sentry_authz_provider.cc:

http://gerrit.cloudera.org:8080/#/c/14107/3/src/kudu/master/sentry_authz_provider.cc@258
PS3, Line 258:   // Table statistics contains data requires scan privilege so it require
             :   // SELECT privilege here.
"Table statistics contain data and thus requires the 'SELECT ON TABLE' privilege."



-- 
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 3
Gerrit-Owner: ZhangYao <tr...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: ZhangYao <tr...@gmail.com>
Gerrit-Comment-Date: Wed, 28 Aug 2019 18:59:52 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

Posted by "ZhangYao (Code Review)" <ge...@cloudera.org>.
ZhangYao has posted comments on this change. ( http://gerrit.cloudera.org:8080/14107 )

Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................


Patch Set 6:

> >  > > 2. For spark client:
 > >  > > new spark client depends on new java client.
 > >  > > new spark client -> old server -> leads to execption
 > >  >
 > >  > This is what concerns me. Under what circumstances is there an
 > >  > exception? When invoking some new API? Or in existing data
 > paths?
 > >  >
 > >  > If the latter, we need to address it. We take great care to
 > ensure
 > >  > Kudu is both forward and backward compatible, whether it be
 > across
 > >  > the wire (i.e. between server and client), or across a library
 > >  > boundary (i.e. between client and application). It wouldn't be
 > OK
 > >  > for someone running with the new Spark integration against an
 > old
 > >  > server to have a degraded experience in _existing_
 > functionality.
 > >  > If it's brand new functionality (i.e. a new API in the
 > client),
 > >  > that's fine.
 > >
 > > I got it. In previous patch because when creating relation it
 > will call getStatistic automatically so it will cause exception
 > with old server . I used  a config to decide whether to use
 > getStatistics or not. That's for new spark client with old server
 > user if he doesn't open that conf, it can work well.
 > 
 > Instead of introducing a new config, could we just try the new API,
 > and if it throws an exception, chain to the superclass's
 > implementation? Maybe log something too?

Done


-- 
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 6
Gerrit-Owner: ZhangYao <tr...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: ZhangYao <tr...@gmail.com>
Gerrit-Comment-Date: Sun, 01 Sep 2019 08:16:32 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

Posted by "ZhangYao (Code Review)" <ge...@cloudera.org>.
ZhangYao has posted comments on this change. ( http://gerrit.cloudera.org:8080/14107 )

Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................


Patch Set 3:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/14107/2/java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsRequest.java
File java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsRequest.java:

PS2: 
> Copyright header.
Done


http://gerrit.cloudera.org:8080/#/c/14107/2/java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsRequest.java@47
PS2, Line 47:             Master.TableIdentifierPB.newBuilder().setTableName(name).build();
> Nit: indent
Done


http://gerrit.cloudera.org:8080/#/c/14107/2/java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsResponse.java
File java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsResponse.java:

PS2: 
> Copyright header.
Done


http://gerrit.cloudera.org:8080/#/c/14107/2/java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsResponse.java@28
PS2, Line 28: 
> Should mention that this is pre-replication (i.e. based on the cumulative o
Done


http://gerrit.cloudera.org:8080/#/c/14107/2/java/kudu-client/src/main/java/org/apache/kudu/client/KuduTableStatistics.java
File java/kudu-client/src/main/java/org/apache/kudu/client/KuduTableStatistics.java:

PS2: 
> Copyright header.
Done


http://gerrit.cloudera.org:8080/#/c/14107/2/java/kudu-client/src/main/java/org/apache/kudu/client/KuduTableStatistics.java@38
PS2, Line 38:                       long liveRowCount) {
> Should be getLiveRowCount
Done


http://gerrit.cloudera.org:8080/#/c/14107/2/java/kudu-client/src/test/java/org/apache/kudu/client/TestKuduTable.java
File java/kudu-client/src/test/java/org/apache/kudu/client/TestKuduTable.java:

http://gerrit.cloudera.org:8080/#/c/14107/2/java/kudu-client/src/test/java/org/apache/kudu/client/TestKuduTable.java@802
PS2, Line 802:       assertTrue(currentStatistics.getLiveRowCount() >= prevStatistics.getLiveRowCount());
             :       assertTrue(currentStatistics.getLiveRowCount() <= i + 1);
> Break this up into two separate asserts so that if one condition fails, it'
Done


http://gerrit.cloudera.org:8080/#/c/14107/2/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
File java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala:

http://gerrit.cloudera.org:8080/#/c/14107/2/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala@263
PS2, Line 263: 
> What's the purpose of this scalar if its value is 1?
This sizeFactor is use to counteract the compression of disk size. Here the value is experimental and it do nothing with value 1. I try to use this to make sure the sizeInBytes below is overestimate.

Because the sizeInBytes is encoded size so it is hard to guarantee overestimate maybe it is better to add it as a config.


http://gerrit.cloudera.org:8080/#/c/14107/2/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala@267
PS2, Line 267: 
> Why is this needed?
Done


http://gerrit.cloudera.org:8080/#/c/14107/2/src/kudu/master/catalog_manager.h
File src/kudu/master/catalog_manager.h:

http://gerrit.cloudera.org:8080/#/c/14107/2/src/kudu/master/catalog_manager.h@636
PS2, Line 636: If 'user' is provided, checks if the user is
             :   // authorized to get such statistics.
> This was copy-pasted from ListTables but isn't correct for GetTableStatisti
Done, and according to HaoHao's comment I add new authorization check.


http://gerrit.cloudera.org:8080/#/c/14107/2/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

http://gerrit.cloudera.org:8080/#/c/14107/2/src/kudu/master/catalog_manager.cc@271
PS2, Line 271: DEFINE_bool(mock_table_metrics_for_testing, false,
             :             "Whether to enable mock table metrics for testing.");
             : TAG_FLAG(mock_table_metrics_for_testing, runtime);
             : TAG_FLAG(mock_table_metrics_for_testing, unsafe);
             : 
             : DEFINE_int64(on_disk_size_for_testing, 0,
             :              "Mock the on disk size of metrics for testing.");
             : TAG_FLAG(on_disk_size_for_testing, runtime);
             : TAG_FLAG(on_disk_size_for_testing, unsafe);
             : 
             : DEFINE_int64(live_row_count_for_testing, 0,
             :              "Mock the live row count of metrics for testing.");
             : TAG_FLAG(live_row_count_for_testing, runtime);
             : TAG_FLAG(live_row_count_for_testing, unsafe);
> If these are for tests, tag them as hidden so they don't show up in docs.
According to the comment about Flag tags, it seems tagged as unsafe will exclude the flag automatically even if it wasn't marked as 'hidden'. Is that unsafe enough?



-- 
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 3
Gerrit-Owner: ZhangYao <tr...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: ZhangYao <tr...@gmail.com>
Gerrit-Comment-Date: Wed, 28 Aug 2019 03:32:46 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/14107 )

Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................


Patch Set 10: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 10
Gerrit-Owner: ZhangYao <tr...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: ZhangYao <tr...@gmail.com>
Gerrit-Comment-Date: Tue, 10 Sep 2019 02:41:59 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

Posted by "ZhangYao (Code Review)" <ge...@cloudera.org>.
ZhangYao has posted comments on this change. ( http://gerrit.cloudera.org:8080/14107 )

Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................


Patch Set 10:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/14107/8/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
File java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala:

http://gerrit.cloudera.org:8080/#/c/14107/8/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala@281
PS8, Line 281: e execution plan such as broadcasting a v
> I see, thanks for the explanation! I agree, having some kind of measurement
Yeah :)


http://gerrit.cloudera.org:8080/#/c/14107/9/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
File java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala:

http://gerrit.cloudera.org:8080/#/c/14107/9/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala@71
PS9, Line 71: 
> Not used?
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 10
Gerrit-Owner: ZhangYao <tr...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: ZhangYao <tr...@gmail.com>
Gerrit-Comment-Date: Tue, 10 Sep 2019 02:30:52 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/14107 )

Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................

KUDU-2921: Exposing the table statistics to spark relation.

Exposing current table statistics to spark via rpc from client to master.

Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Reviewed-on: http://gerrit.cloudera.org:8080/14107
Reviewed-by: Adar Dembo <ad...@cloudera.com>
Tested-by: Adar Dembo <ad...@cloudera.com>
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
A java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsRequest.java
A java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsResponse.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduClient.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java
A java/kudu-client/src/main/java/org/apache/kudu/client/KuduTableStatistics.java
M java/kudu-client/src/test/java/org/apache/kudu/client/TestKuduTable.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
M src/kudu/master/authz_provider.h
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/default_authz_provider.h
M src/kudu/master/master-test.cc
M src/kudu/master/master.proto
M src/kudu/master/master_service.cc
M src/kudu/master/master_service.h
M src/kudu/master/sentry_authz_provider-test.cc
M src/kudu/master/sentry_authz_provider.cc
M src/kudu/master/sentry_authz_provider.h
20 files changed, 516 insertions(+), 0 deletions(-)

Approvals:
  Adar Dembo: Looks good to me, approved; Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 11
Gerrit-Owner: ZhangYao <tr...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: ZhangYao <tr...@gmail.com>

[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

Posted by "Hao Hao (Code Review)" <ge...@cloudera.org>.
Hao Hao has posted comments on this change. ( http://gerrit.cloudera.org:8080/14107 )

Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................


Patch Set 2:

(1 comment)

> Patch Set 2:
> 
> > Patch Set 2:
> > 
> > > Actually, I have considered to put those statistics into existing message such as getShema which will be used when opening table. But as for the table statistics may contain more and more entries it may be better to be a independent rpc. What's more, because this statistics may change when client trying to write, so I think it will be useful to support a rpc for client to get the statistics change. Although for query plan it will only use once and save one round trip if we put it into existing rpc but it will be heavy if we want to get statistic change in other use cases. 
> > 
> > How are the stats useful for a writing client (i.e. an Impala or Spark executor) vs. the query planner?
> > 
> > What are the other use cases you're thinking about? If they're hypothetical, we could add the stats to GetTableLocations now, and also expose them via new RPC later, when those other use cases materialize.
> > 
> > >  > If doing this, let's make sure to make it optional, since the
> > >  > statistics could become larger later (eg with things like
> > >  > histograms, NDVs per column, etc)
> > > 
> > > Will make it optional if do that :)
> > 
> > Agreed.
> 
> Here are some stats that systems use:
> - size in bytes
> - number of rows
> - per-column stats
>   - distinct count
>   - min/max/histogram
> 
> The per-column stats seem like they wouldn't be a good fit for the GetSchema endpoint, since they may end up returning user data.
> 
> So I'm hesitant about adding stats in with GetSchema; from an authorization point of view, getting the number of rows in a table, for instance, currently requires some scan privileges. If we tie it into the GetSchema RPC, this is no longer the case.
> 
> Some info about Spark:
> https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-CatalogStatistics.html
> https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-ColumnStat.html

Agree.

http://gerrit.cloudera.org:8080/#/c/14107/1/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

http://gerrit.cloudera.org:8080/#/c/14107/1/src/kudu/master/catalog_manager.cc@2906
PS1, Line 2906:       table->set_name(table_name);
> Actually, this metrics doesn't contained in tablemetadata and  I think it m
AFAIU, Impala requires "ALTER and SELECT on Table", because compute stats actually trigger SELECT queries and alteration on the table. Here we are only using pre-computed stats associated with the table, so arguably  "METADATA on Table" may be sufficient. However, as the stats contain data (number of rows) that requires scan privileges, to be more restrictive, it is better to at least requires "SELECT on Table" here.



-- 
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 2
Gerrit-Owner: ZhangYao <tr...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: ZhangYao <tr...@gmail.com>
Gerrit-Comment-Date: Fri, 23 Aug 2019 19:27:06 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

Posted by "ZhangYao (Code Review)" <ge...@cloudera.org>.
ZhangYao has posted comments on this change. ( http://gerrit.cloudera.org:8080/14107 )

Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................


Patch Set 10:

I ran those failure test offline multi times, it works well. I think those failures maybe not caused by my function :)


-- 
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 10
Gerrit-Owner: ZhangYao <tr...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: ZhangYao <tr...@gmail.com>
Gerrit-Comment-Date: Tue, 10 Sep 2019 03:41:59 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

Posted by "ZhangYao (Code Review)" <ge...@cloudera.org>.
ZhangYao has posted comments on this change. ( http://gerrit.cloudera.org:8080/14107 )

Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................


Patch Set 8:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/14107/7/src/kudu/master/sentry_authz_provider-test.cc
File src/kudu/master/sentry_authz_provider-test.cc:

http://gerrit.cloudera.org:8080/#/c/14107/7/src/kudu/master/sentry_authz_provider-test.cc@822
PS7, Line 822: // Authorize get table statistics on a user with proper privileges.
             :   TSentryPrivilege privilege = GetDatabasePrivilege("db", "SELECT");
             :   ASSERT_OK(AlterRoleGrantPrivilege(privilege));
             :   ASSERT_OK(sentry_authz_provider_->AuthorizeGetTableStatistics("db.table", kTestUser));
             : }
             : 
             : TEST_F(SentryAuthzProviderTest, TestAuthorizeGetTableMetadata) {
             :   // Don't authorize getting metadata on a table for a user without required
> The second test case is a bit redundant as once granting SELECT on DB it sh
Done


http://gerrit.cloudera.org:8080/#/c/14107/7/src/kudu/master/sentry_authz_provider.cc
File src/kudu/master/sentry_authz_provider.cc:

http://gerrit.cloudera.org:8080/#/c/14107/7/src/kudu/master/sentry_authz_provider.cc@258
PS7, Line 258: Statistics contain data (e.g.
> nit: contain data (e.g. number of rows) that requires the 'SELECT ON TABLE'
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 8
Gerrit-Owner: ZhangYao <tr...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: ZhangYao <tr...@gmail.com>
Gerrit-Comment-Date: Sat, 07 Sep 2019 14:59:23 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/14107 )

Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................


Patch Set 1:

Did you consider adding the statistics to the existing GetTableLocations RPC response instead of creating a brand new RPC? Query planners and job drivers already use it to open tables, so it'd mean one less round trip to get the information.


-- 
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 1
Gerrit-Owner: ZhangYao <tr...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Tue, 20 Aug 2019 16:16:09 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

Posted by "ZhangYao (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Kudu Jenkins, Andrew Wong, Adar Dembo, Grant Henke, Hao Hao, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14107

to look at the new patch set (#9).

Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................

KUDU-2921: Exposing the table statistics to spark relation.

Exposing current table statistics to spark via rpc from client to master.

Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
A java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsRequest.java
A java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsResponse.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduClient.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java
A java/kudu-client/src/main/java/org/apache/kudu/client/KuduTableStatistics.java
M java/kudu-client/src/test/java/org/apache/kudu/client/TestKuduTable.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
M src/kudu/master/authz_provider.h
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/default_authz_provider.h
M src/kudu/master/master-test.cc
M src/kudu/master/master.proto
M src/kudu/master/master_service.cc
M src/kudu/master/master_service.h
M src/kudu/master/sentry_authz_provider-test.cc
M src/kudu/master/sentry_authz_provider.cc
M src/kudu/master/sentry_authz_provider.h
20 files changed, 517 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/07/14107/9
-- 
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 9
Gerrit-Owner: ZhangYao <tr...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: ZhangYao <tr...@gmail.com>

[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

Posted by "ZhangYao (Code Review)" <ge...@cloudera.org>.
ZhangYao has posted comments on this change. ( http://gerrit.cloudera.org:8080/14107 )

Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................


Patch Set 2:

(11 comments)

> Did you consider adding the statistics to the existing
 > GetTableLocations RPC response instead of creating a brand new RPC?
 > Query planners and job drivers already use it to open tables, so
 > it'd mean one less round trip to get the information.
Actually, I have considered to put those statistics into existing message such as getShema which will be used when opening table. But as for the table statistics may contain more and more entries it may be better to be a independent rpc. What's more, because this statistics may change when client trying to write, so I think it will be useful to support a rpc for client to get the statistics change. Although for query plan it will only use once and save one round trip if we put it into existing rpc but it will be heavy if we want to get statistic change in other use cases. 

 > > Patch Set 1:
 > >
 > > Did you consider adding the statistics to the existing
 > GetTableLocations RPC response instead of creating a brand new RPC?
 > Query planners and job drivers already use it to open tables, so
 > it'd mean one less round trip to get the information.
 > 
 > If doing this, let's make sure to make it optional, since the
 > statistics could become larger later (eg with things like
 > histograms, NDVs per column, etc)

Will make it optional if do that :)

http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
File java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java:

http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@816
PS1, Line 816:     GetTableStatisticsRequest rpc = new GetTableStatisticsRequest(this.masterTable,
> Are you planning to expose this in the c++ client too?
Yes, will do this :) maybe next part.


http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsRequest.java
File java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsRequest.java:

http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsRequest.java@11
PS1, Line 11: class GetTableStatisticsRequest extends KuduRpc<GetTableStatisticsResponse> {
> This can be package private.
Done


http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsResponse.java
File java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsResponse.java:

http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsResponse.java@8
PS1, Line 8:   private final long onDiskSize;
> If you see my other comment on defining a KuduTableStatistics object, then 
Done


http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/main/java/org/apache/kudu/client/KuduClient.java
File java/kudu-client/src/main/java/org/apache/kudu/client/KuduClient.java:

http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/main/java/org/apache/kudu/client/KuduClient.java@227
PS1, Line 227:   public KuduTableStatistics getTableStatistics(String name) throws KuduException {
> Does it make sense for this method to be on the KuduTable object? The api w
I have implemented this method in KuduTable. Done.


http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/test/java/org/apache/kudu/client/TestKuduTable.java
File java/kudu-client/src/test/java/org/apache/kudu/client/TestKuduTable.java:

http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/test/java/org/apache/kudu/client/TestKuduTable.java@811
PS1, Line 811: 
> Is there a more deterministic way to ensure this happens? Using sleep often
Currently I don't find a way to trigger the heartbeat between tserver and master from java client side(This may need new rpc to implement the control). Here I use update_tablet_stats_interval_ms and heartbeat_interval_ms to control the heartbeat frequency and wait  enought time for the statistics to be collected on master. If the heartbeat and statistics work correctly it supposed to be right. 
Whatever it still flaky somehow. Anyone have other advices?


http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/test/java/org/apache/kudu/client/TestKuduTable.java@814
PS1, Line 814:     Thread.sleep(200 * 6);
> Can you test some edge cases? Maybe call this with 0 rows, and insert rows 
0 rows has been test in this test. As for test inserting row increasingly, it can do but the statistics are suppose to lagging so maybe test the monotonically increasing of statistics and the final accuracy :)


http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
File java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala:

http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala@263
PS1, Line 263:   private val sizeFactor: Float = 1.0F
> What's the purpose of the size factor here?
Actually, because our on_disk_size is the size after compression so the estimation of size will be too small.  As for spark plan, sizeInBytes is always used to help selecting the suitable join method and according to its comment it is better to overestimate size than underestimate. This sizeFactor is the param to adjust the  estimation but its current value(1.0) is experimental.


http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala@267
PS1, Line 267:   override def sizeInBytes: Long = (math.max(onDiskSize, 0) * sizeFactor).toLong
> Shouldn't this size take the projection and predicates into consideration? 
1. It supposed to do that but currently we don't have column statistics. We only have table statistics so it's hard to take the userSchema into consideration. Maybe we can roughly cut the size with how many columns we selected but it may underestimate the relation size. Here I use the total on_disk_size is conservative but safe.

If we use live row size to estimate the sizeInByte we can have more accurate estimation because we can cut those columns we doesn't select. However, we can not know the size of binary/string column(It has max_cell_size_bytes as 64kb in default but it's too large to do the estimation). 

2. Will do it on test, Done :)


http://gerrit.cloudera.org:8080/#/c/14107/1/src/kudu/master/catalog_manager.h
File src/kudu/master/catalog_manager.h:

http://gerrit.cloudera.org:8080/#/c/14107/1/src/kudu/master/catalog_manager.h@638
PS1, Line 638:   Status GetTableStatistics(const GetTableStatisticsRequestPB* req,
> Can you add a test for this?
This may be tested in master_service.cc because the test in master_service.cc will call this functions.


http://gerrit.cloudera.org:8080/#/c/14107/1/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

http://gerrit.cloudera.org:8080/#/c/14107/1/src/kudu/master/catalog_manager.cc@2906
PS1, Line 2906:       table->set_name(table_name);
> I'm not sure these required permissions are correct for protecting this end
Actually, this metrics doesn't contained in tablemetadata and  I think it may just like metadata somehow so I use this  authorization. I invited hao hao to review this :)


http://gerrit.cloudera.org:8080/#/c/14107/1/src/kudu/master/master_service.cc
File src/kudu/master/master_service.cc:

http://gerrit.cloudera.org:8080/#/c/14107/1/src/kudu/master/master_service.cc@415
PS1, Line 415: void MasterServiceImpl::GetTableStatistics(const GetTableStatisticsRequestPB* req,
> Can you add a test for this? Maybe in master-test.cc
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 2
Gerrit-Owner: ZhangYao <tr...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: ZhangYao <tr...@gmail.com>
Gerrit-Comment-Date: Thu, 22 Aug 2019 08:01:11 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

Posted by "ZhangYao (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Kudu Jenkins, Andrew Wong, Adar Dembo, Grant Henke, Hao Hao, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14107

to look at the new patch set (#5).

Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................

KUDU-2921: Exposing the table statistics to spark relation.

Exposing current table statistics to spark via rpc from client to master.

Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
A java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsRequest.java
A java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsResponse.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduClient.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java
A java/kudu-client/src/main/java/org/apache/kudu/client/KuduTableStatistics.java
M java/kudu-client/src/test/java/org/apache/kudu/client/TestKuduTable.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
M src/kudu/master/authz_provider.h
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/default_authz_provider.h
M src/kudu/master/master-test.cc
M src/kudu/master/master.proto
M src/kudu/master/master_service.cc
M src/kudu/master/master_service.h
M src/kudu/master/sentry_authz_provider-test.cc
M src/kudu/master/sentry_authz_provider.cc
M src/kudu/master/sentry_authz_provider.h
21 files changed, 558 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/07/14107/5
-- 
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 5
Gerrit-Owner: ZhangYao <tr...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: ZhangYao <tr...@gmail.com>

[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

Posted by "ZhangYao (Code Review)" <ge...@cloudera.org>.
ZhangYao has posted comments on this change. ( http://gerrit.cloudera.org:8080/14107 )

Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................


Patch Set 2:

> > Actually, I have considered to put those statistics into existing
 > message such as getShema which will be used when opening table. But
 > as for the table statistics may contain more and more entries it
 > may be better to be a independent rpc. What's more, because this
 > statistics may change when client trying to write, so I think it
 > will be useful to support a rpc for client to get the statistics
 > change. Although for query plan it will only use once and save one
 > round trip if we put it into existing rpc but it will be heavy if
 > we want to get statistic change in other use cases.
 > 
 > How are the stats useful for a writing client (i.e. an Impala or
 > Spark executor) vs. the query planner?
 > 
 > What are the other use cases you're thinking about? If they're
 > hypothetical, we could add the stats to GetTableLocations now, and
 > also expose them via new RPC later, when those other use cases
 > materialize.
 > 
 > >  > If doing this, let's make sure to make it optional, since the
 > >  > statistics could become larger later (eg with things like
 > >  > histograms, NDVs per column, etc)
 > >
 > > Will make it optional if do that :)
 > 
 > Agreed.

Yeah, they're hypothetical. If we want it put into 

 > > Actually, I have considered to put those statistics into existing
 > message such as getShema which will be used when opening table. But
 > as for the table statistics may contain more and more entries it
 > may be better to be a independent rpc. What's more, because this
 > statistics may change when client trying to write, so I think it
 > will be useful to support a rpc for client to get the statistics
 > change. Although for query plan it will only use once and save one
 > round trip if we put it into existing rpc but it will be heavy if
 > we want to get statistic change in other use cases.
 > 
 > How are the stats useful for a writing client (i.e. an Impala or
 > Spark executor) vs. the query planner?
 > 
 > What are the other use cases you're thinking about? If they're
 > hypothetical, we could add the stats to GetTableLocations now, and
 > also expose them via new RPC later, when those other use cases
 > materialize.
 > 
 > >  > If doing this, let's make sure to make it optional, since the
 > >  > statistics could become larger later (eg with things like
 > >  > histograms, NDVs per column, etc)
 > >
 > > Will make it optional if do that :)
 > 
 > Agreed.

Yeah, agree with you :). They are hypothetical. If put it into GetTableLocations I want to make sure that it will be called before spark plan use sizeInBytes. Currently in KuduRelation, I find GetTableLocations is called in buildScan when really scan data in kudu, but sizeInBytes is used before real scan. So maybe put it into GetTableLocations we will get empty sizeInBytes. And for the same client it has LocationCache, so if we use the same client to read between write, it may not update the statistics. But I still need to make sure about that :)

 > > Patch Set 2:
 > >
 > > > Actually, I have considered to put those statistics into
 > existing message such as getShema which will be used when opening
 > table. But as for the table statistics may contain more and more
 > entries it may be better to be a independent rpc. What's more,
 > because this statistics may change when client trying to write, so
 > I think it will be useful to support a rpc for client to get the
 > statistics change. Although for query plan it will only use once
 > and save one round trip if we put it into existing rpc but it will
 > be heavy if we want to get statistic change in other use cases.
 > >
 > > How are the stats useful for a writing client (i.e. an Impala or
 > Spark executor) vs. the query planner?
 > >
 > > What are the other use cases you're thinking about? If they're
 > hypothetical, we could add the stats to GetTableLocations now, and
 > also expose them via new RPC later, when those other use cases
 > materialize.
 > >
 > > >  > If doing this, let's make sure to make it optional, since
 > the
 > > >  > statistics could become larger later (eg with things like
 > > >  > histograms, NDVs per column, etc)
 > > >
 > > > Will make it optional if do that :)
 > >
 > > Agreed.
 > 
 > Here are some stats that systems use:
 > - size in bytes
 > - number of rows
 > - per-column stats
 > - distinct count
 > - min/max/histogram
 > 
 > The per-column stats seem like they wouldn't be a good fit for the
 > GetSchema endpoint, since they may end up returning user data.
 > 
 > So I'm hesitant about adding stats in with GetSchema; from an
 > authorization point of view, getting the number of rows in a table,
 > for instance, currently requires some scan privileges. If we tie it
 > into the GetSchema RPC, this is no longer the case.
 > 
 > Some info about Spark:
 > https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-CatalogStatistics.html
 > https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-ColumnStat.html

Agree with you and get that, I do think is not suitable to put it into GetSchema. :)


-- 
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 2
Gerrit-Owner: ZhangYao <tr...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: ZhangYao <tr...@gmail.com>
Gerrit-Comment-Date: Sat, 24 Aug 2019 17:55:52 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

Posted by "Hao Hao (Code Review)" <ge...@cloudera.org>.
Hao Hao has posted comments on this change. ( http://gerrit.cloudera.org:8080/14107 )

Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................


Patch Set 7: Code-Review+1

(2 comments)

http://gerrit.cloudera.org:8080/#/c/14107/7/src/kudu/master/sentry_authz_provider-test.cc
File src/kudu/master/sentry_authz_provider-test.cc:

http://gerrit.cloudera.org:8080/#/c/14107/7/src/kudu/master/sentry_authz_provider-test.cc@822
PS7, Line 822: // Authorize the SELECT privilege of the database should imply the table.
             :   TSentryPrivilege privilege = GetDatabasePrivilege("db", "SELECT");
             :   ASSERT_OK(AlterRoleGrantPrivilege(privilege));
             :   ASSERT_OK(sentry_authz_provider_->AuthorizeGetTableStatistics("db.table", kTestUser));
             : 
             :   // Authorize the SELECT privilege of the table.
             :   ASSERT_OK(AlterRoleGrantPrivilege(GetTablePrivilege("db", "table", "SELECT")));
             :   ASSERT_OK(sentry_authz_provider_->AuthorizeGetTableStatistics("db.table", kTestUser));
The second test case is a bit redundant as once granting SELECT on DB it should imply SELECT on Table. You can just keep either one of these two test cases.


http://gerrit.cloudera.org:8080/#/c/14107/7/src/kudu/master/sentry_authz_provider.cc
File src/kudu/master/sentry_authz_provider.cc:

http://gerrit.cloudera.org:8080/#/c/14107/7/src/kudu/master/sentry_authz_provider.cc@258
PS7, Line 258: Table statistics contain data
nit: contain data (e.g. number of rows) that requires the 'SELECT ON TABLE' privilege.



-- 
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 7
Gerrit-Owner: ZhangYao <tr...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: ZhangYao <tr...@gmail.com>
Gerrit-Comment-Date: Fri, 06 Sep 2019 21:28:31 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/14107 )

Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................


Patch Set 2:

> Actually, I have considered to put those statistics into existing message such as getShema which will be used when opening table. But as for the table statistics may contain more and more entries it may be better to be a independent rpc. What's more, because this statistics may change when client trying to write, so I think it will be useful to support a rpc for client to get the statistics change. Although for query plan it will only use once and save one round trip if we put it into existing rpc but it will be heavy if we want to get statistic change in other use cases. 

How are the stats useful for a writing client (i.e. an Impala or Spark executor) vs. the query planner?

What are the other use cases you're thinking about? If they're hypothetical, we could add the stats to GetTableLocations now, and also expose them via new RPC later, when those other use cases materialize.

>  > If doing this, let's make sure to make it optional, since the
>  > statistics could become larger later (eg with things like
>  > histograms, NDVs per column, etc)
> 
> Will make it optional if do that :)

Agreed.


-- 
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 2
Gerrit-Owner: ZhangYao <tr...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: ZhangYao <tr...@gmail.com>
Gerrit-Comment-Date: Thu, 22 Aug 2019 18:58:36 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/14107 )

Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................


Patch Set 2:

> Patch Set 2:
> 
> > Actually, I have considered to put those statistics into existing message such as getShema which will be used when opening table. But as for the table statistics may contain more and more entries it may be better to be a independent rpc. What's more, because this statistics may change when client trying to write, so I think it will be useful to support a rpc for client to get the statistics change. Although for query plan it will only use once and save one round trip if we put it into existing rpc but it will be heavy if we want to get statistic change in other use cases. 
> 
> How are the stats useful for a writing client (i.e. an Impala or Spark executor) vs. the query planner?
> 
> What are the other use cases you're thinking about? If they're hypothetical, we could add the stats to GetTableLocations now, and also expose them via new RPC later, when those other use cases materialize.
> 
> >  > If doing this, let's make sure to make it optional, since the
> >  > statistics could become larger later (eg with things like
> >  > histograms, NDVs per column, etc)
> > 
> > Will make it optional if do that :)
> 
> Agreed.

Here are some stats that systems use:
- size in bytes
- number of rows
- per-column stats
  - distinct count
  - min/max/histogram

The per-column stats seem like they wouldn't be a good fit for the GetSchema endpoint, since they may end up returning user data.

So I'm hesitant about adding stats in with GetSchema; from an authorization point of view, getting the number of rows in a table, for instance, currently requires some scan privileges. If we tie it into the GetSchema RPC, this is no longer the case.

Some info about Spark:
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-CatalogStatistics.html
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-ColumnStat.html


-- 
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 2
Gerrit-Owner: ZhangYao <tr...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: ZhangYao <tr...@gmail.com>
Gerrit-Comment-Date: Fri, 23 Aug 2019 06:52:43 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/14107 )

Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................


Patch Set 2:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/14107/2/java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsRequest.java
File java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsRequest.java:

PS2: 
Copyright header.


http://gerrit.cloudera.org:8080/#/c/14107/2/java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsRequest.java@47
PS2, Line 47:                                                String tsUUID) throws KuduException {
Nit: indent


http://gerrit.cloudera.org:8080/#/c/14107/2/java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsResponse.java
File java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsResponse.java:

PS2: 
Copyright header.


http://gerrit.cloudera.org:8080/#/c/14107/2/java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsResponse.java@28
PS2, Line 28:    * Get the table's on disk size.
Should mention that this is pre-replication (i.e. based on the cumulative on disk size of all LEADER replicas) for both.


http://gerrit.cloudera.org:8080/#/c/14107/2/java/kudu-client/src/main/java/org/apache/kudu/client/KuduTableStatistics.java
File java/kudu-client/src/main/java/org/apache/kudu/client/KuduTableStatistics.java:

PS2: 
Copyright header.


http://gerrit.cloudera.org:8080/#/c/14107/2/java/kudu-client/src/main/java/org/apache/kudu/client/KuduTableStatistics.java@38
PS2, Line 38:   public long getliveRowCount() {
Should be getLiveRowCount


http://gerrit.cloudera.org:8080/#/c/14107/2/java/kudu-client/src/test/java/org/apache/kudu/client/TestKuduTable.java
File java/kudu-client/src/test/java/org/apache/kudu/client/TestKuduTable.java:

http://gerrit.cloudera.org:8080/#/c/14107/2/java/kudu-client/src/test/java/org/apache/kudu/client/TestKuduTable.java@802
PS2, Line 802:       assertTrue(currentStatistics.getliveRowCount() >= prevStatistics.getliveRowCount() &&
             :                  currentStatistics.getliveRowCount() <= i + 1);
Break this up into two separate asserts so that if one condition fails, it's clear which it is.

Same for L817-818.


http://gerrit.cloudera.org:8080/#/c/14107/2/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
File java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala:

http://gerrit.cloudera.org:8080/#/c/14107/2/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala@263
PS2, Line 263:   private val sizeFactor: Float = 1.0F
What's the purpose of this scalar if its value is 1?


http://gerrit.cloudera.org:8080/#/c/14107/2/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala@267
PS2, Line 267: math.max(onDiskSize, 0)
Why is this needed?

Also, please add a comment explaining the effect of this override.


http://gerrit.cloudera.org:8080/#/c/14107/2/src/kudu/master/catalog_manager.h
File src/kudu/master/catalog_manager.h:

http://gerrit.cloudera.org:8080/#/c/14107/2/src/kudu/master/catalog_manager.h@636
PS2, Line 636: If 'user' is provided, only lists those that
             :   // the given user is authorized to see.
This was copy-pasted from ListTables but isn't correct for GetTableStatistics.


http://gerrit.cloudera.org:8080/#/c/14107/2/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

http://gerrit.cloudera.org:8080/#/c/14107/2/src/kudu/master/catalog_manager.cc@271
PS2, Line 271: DEFINE_bool(mock_table_metrics_for_testing, false,
             :             "Whether to enable mock table metrics for testing.");
             : TAG_FLAG(mock_table_metrics_for_testing, runtime);
             : TAG_FLAG(mock_table_metrics_for_testing, unsafe);
             : 
             : DEFINE_int64(on_disk_size_for_testing, 0,
             :              "Mock the on disk size of metrics for testing.");
             : TAG_FLAG(on_disk_size_for_testing, runtime);
             : TAG_FLAG(on_disk_size_for_testing, unsafe);
             : 
             : DEFINE_int64(live_row_count_for_testing, 0,
             :              "Mock the live row count of metrics for testing.");
             : TAG_FLAG(live_row_count_for_testing, runtime);
             : TAG_FLAG(live_row_count_for_testing, unsafe);
If these are for tests, tag them as hidden so they don't show up in docs.



-- 
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 2
Gerrit-Owner: ZhangYao <tr...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: ZhangYao <tr...@gmail.com>
Gerrit-Comment-Date: Mon, 26 Aug 2019 05:10:29 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

Posted by "Grant Henke (Code Review)" <ge...@cloudera.org>.
Grant Henke has posted comments on this change. ( http://gerrit.cloudera.org:8080/14107 )

Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................


Patch Set 1:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
File java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java:

http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java@816
PS1, Line 816:   public Deferred<GetTableStatisticsResponse> getTableStatistics(String name) {
Are you planning to expose this in the c++ client too?


http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsRequest.java
File java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsRequest.java:

http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsRequest.java@11
PS1, Line 11: public class GetTableStatisticsRequest extends KuduRpc<GetTableStatisticsResponse> {
This can be package private.


http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsResponse.java
File java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsResponse.java:

http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsResponse.java@8
PS1, Line 8: public class GetTableStatisticsResponse extends KuduRpcResponse {
If you see my other comment on defining a KuduTableStatistics object, then this could be package private.


http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/main/java/org/apache/kudu/client/KuduClient.java
File java/kudu-client/src/main/java/org/apache/kudu/client/KuduClient.java:

http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/main/java/org/apache/kudu/client/KuduClient.java@227
PS1, Line 227:   public GetTableStatisticsResponse getTableStatistics(String name) throws KuduException {
Does it make sense for this method to be on the KuduTable object? The api would then be something like `table.getStatistics`.

I think the return object here should be a `KuduTableStatistics` object. It will look similar to the GetTableStatisticsResponse object, but decouples the RPC concerns from the domain objects. This pattern can be seen with `KuduClient.openTable` which returns KuduTable instead of GetKuduSchemaResponse.


http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/test/java/org/apache/kudu/client/TestKuduTable.java
File java/kudu-client/src/test/java/org/apache/kudu/client/TestKuduTable.java:

http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/test/java/org/apache/kudu/client/TestKuduTable.java@811
PS1, Line 811:     Thread.sleep(500 * 9);
Is there a more deterministic way to ensure this happens? Using sleep often leads to flaky tests.


http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-client/src/test/java/org/apache/kudu/client/TestKuduTable.java@814
PS1, Line 814:     statistics = table.getTableStatistics();
Can you test some edge cases? Maybe call this with 0 rows, and insert rows again and check that the value has increased?


http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
File java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala:

http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala@263
PS1, Line 263:   private val sizeFactor: Float = 1.0F
What's the purpose of the size factor here?


http://gerrit.cloudera.org:8080/#/c/14107/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala@267
PS1, Line 267:   override def sizeInBytes: Long = (math.max(onDiskSize, 0) * sizeFactor).toLong
Shouldn't this size take the projection and predicates into consideration? I suppose this would be the worst case size of the relation as is.

Is there a way we can test that spark is receiving and using this value correctly here?


http://gerrit.cloudera.org:8080/#/c/14107/1/src/kudu/master/catalog_manager.h
File src/kudu/master/catalog_manager.h:

http://gerrit.cloudera.org:8080/#/c/14107/1/src/kudu/master/catalog_manager.h@638
PS1, Line 638:   Status GetTableStatistics(const GetTableStatisticsRequestPB* req,
Can you add a test for this?


http://gerrit.cloudera.org:8080/#/c/14107/1/src/kudu/master/master_service.cc
File src/kudu/master/master_service.cc:

http://gerrit.cloudera.org:8080/#/c/14107/1/src/kudu/master/master_service.cc@415
PS1, Line 415: void MasterServiceImpl::GetTableStatistics(const GetTableStatisticsRequestPB* req,
Can you add a test for this? Maybe in master-test.cc



-- 
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 1
Gerrit-Owner: ZhangYao <tr...@gmail.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Tue, 20 Aug 2019 15:55:33 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

Posted by "ZhangYao (Code Review)" <ge...@cloudera.org>.
ZhangYao has posted comments on this change. ( http://gerrit.cloudera.org:8080/14107 )

Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................


Patch Set 8:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/14107/8/java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsRequest.java
File java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsRequest.java:

http://gerrit.cloudera.org:8080/#/c/14107/8/java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsRequest.java@45
PS8, Line 45:             Master.GetTableStatisticsRequestPB.newBuilder();
> nit: to be consistent with the rest of the line-wraps, multi-line code shou
Done


http://gerrit.cloudera.org:8080/#/c/14107/8/java/kudu-client/src/test/java/org/apache/kudu/client/TestKuduTable.java
File java/kudu-client/src/test/java/org/apache/kudu/client/TestKuduTable.java:

http://gerrit.cloudera.org:8080/#/c/14107/8/java/kudu-client/src/test/java/org/apache/kudu/client/TestKuduTable.java@785
PS8, Line 785:           "--update_tablet_stats_interval_ms=" + 200,
> nit: Why not just "--update_tablet_states_interval_ms=200"? Same below?
Done


http://gerrit.cloudera.org:8080/#/c/14107/8/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
File java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala:

http://gerrit.cloudera.org:8080/#/c/14107/8/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala@281
PS8, Line 281: overestimate this size than underestimate
> It's worth noting that the on-disk size is encoded and compressed. Is that 
The estimation of relation size is the more accurate the better, if we don't have accurate value, overestimate is better that underestimate.
Because we don't take projection and predicates into consideration of size estimation and the size we records contains meta and log so I suppose it is overestimate.
I thinks use (row count * row size) is more accurate, but we don't know the average row size and if the schema contains String/Binary we don't know it's size.


http://gerrit.cloudera.org:8080/#/c/14107/8/src/kudu/master/master-test.cc
File src/kudu/master/master-test.cc:

http://gerrit.cloudera.org:8080/#/c/14107/8/src/kudu/master/master-test.cc@1790
PS8, Line 1790: resp.on_disk_size(), 0);
              :   ASSERT_EQ(resp.live_row_count(), 0);
> nit: the expected value should be on the left-hand-side of the comma, other
Done


http://gerrit.cloudera.org:8080/#/c/14107/8/src/kudu/master/master-test.cc@1799
PS8, Line 1799: resp.on_disk_size(), FLAGS_on_disk_size_for_testing);
              :   ASSERT_EQ(resp.live_row_count(), FLAGS_live_row_count_for_testing
> Same here.
Done


http://gerrit.cloudera.org:8080/#/c/14107/8/src/kudu/master/master_service.cc
File src/kudu/master/master_service.cc:

http://gerrit.cloudera.org:8080/#/c/14107/8/src/kudu/master/master_service.cc@422
PS8, Line 422:         
> nit: got too many spaces here.
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 8
Gerrit-Owner: ZhangYao <tr...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: ZhangYao <tr...@gmail.com>
Gerrit-Comment-Date: Sun, 08 Sep 2019 15:56:51 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/14107 )

Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................


Patch Set 8:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/14107/8/java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsRequest.java
File java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsRequest.java:

http://gerrit.cloudera.org:8080/#/c/14107/8/java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsRequest.java@45
PS8, Line 45:             Master.GetTableStatisticsRequestPB.newBuilder();
nit: to be consistent with the rest of the line-wraps, multi-line code should be four spaces here, and below, and at L66


http://gerrit.cloudera.org:8080/#/c/14107/8/java/kudu-client/src/test/java/org/apache/kudu/client/TestKuduTable.java
File java/kudu-client/src/test/java/org/apache/kudu/client/TestKuduTable.java:

http://gerrit.cloudera.org:8080/#/c/14107/8/java/kudu-client/src/test/java/org/apache/kudu/client/TestKuduTable.java@785
PS8, Line 785:           "--update_tablet_stats_interval_ms=" + 200,
nit: Why not just "--update_tablet_states_interval_ms=200"? Same below?

nit: Also this seems like a lot of spaces. Could you make this consistent with the other `flags` usage?


http://gerrit.cloudera.org:8080/#/c/14107/8/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
File java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala:

http://gerrit.cloudera.org:8080/#/c/14107/8/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala@281
PS8, Line 281: overestimate this size than underestimate
It's worth noting that the on-disk size is encoded and compressed. Is that problematic? Or is the correct order-of-magnitude more important than the exact value?


http://gerrit.cloudera.org:8080/#/c/14107/8/src/kudu/master/master-test.cc
File src/kudu/master/master-test.cc:

http://gerrit.cloudera.org:8080/#/c/14107/8/src/kudu/master/master-test.cc@1790
PS8, Line 1790: resp.on_disk_size(), 0);
              :   ASSERT_EQ(resp.live_row_count(), 0);
nit: the expected value should be on the left-hand-side of the comma, otherwise it will print a confusing message if this assertion ever fails.


http://gerrit.cloudera.org:8080/#/c/14107/8/src/kudu/master/master-test.cc@1799
PS8, Line 1799: resp.on_disk_size(), FLAGS_on_disk_size_for_testing);
              :   ASSERT_EQ(resp.live_row_count(), FLAGS_live_row_count_for_testing
Same here.


http://gerrit.cloudera.org:8080/#/c/14107/8/src/kudu/master/master_service.cc
File src/kudu/master/master_service.cc:

http://gerrit.cloudera.org:8080/#/c/14107/8/src/kudu/master/master_service.cc@422
PS8, Line 422:         
nit: got too many spaces here.



-- 
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 8
Gerrit-Owner: ZhangYao <tr...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: ZhangYao <tr...@gmail.com>
Gerrit-Comment-Date: Sun, 08 Sep 2019 05:07:26 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

Posted by "Grant Henke (Code Review)" <ge...@cloudera.org>.
Grant Henke has posted comments on this change. ( http://gerrit.cloudera.org:8080/14107 )

Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................


Patch Set 6:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/14107/6/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
File java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala:

http://gerrit.cloudera.org:8080/#/c/14107/6/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala@33
PS6, Line 33: import org.slf4j.{Logger, LoggerFactory}
Nit: Single import per line


http://gerrit.cloudera.org:8080/#/c/14107/6/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala@288
PS6, Line 288:   override def sizeInBytes: Long = estimatedSize
Is there a jira tracking the improvement for factoring predicates and projections into the estimate? That would be useful to have marked here as a TODO.


http://gerrit.cloudera.org:8080/#/c/14107/6/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala
File java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala:

http://gerrit.cloudera.org:8080/#/c/14107/6/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala@35
PS6, Line 35:  *                          server to ensure that scanners do not time out
Please move the documentation for sizeFactor here.


http://gerrit.cloudera.org:8080/#/c/14107/6/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduReadOptions.scala@49
PS6, Line 49:     sizeFactor: Float = defaultSizeFactor,
I am not 100% sure how this configuration is really useful/helpful. Given this is public API and it's not totally clear how a user would calculate and set this value I think we could just use bytes for now and open a follow up jira to improve the statistics to include the real size, or even estimate based on row size and row count.

If there is good guidance on how to configure this value, we  should include it in the documentation above.


http://gerrit.cloudera.org:8080/#/c/14107/6/java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
File java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala:

http://gerrit.cloudera.org:8080/#/c/14107/6/java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala@32
PS6, Line 32: import org.apache.kudu.client.{CreateTableOptions, KuduTableStatistics}
Nit: One import per line is preferred. Here and below.


http://gerrit.cloudera.org:8080/#/c/14107/6/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

http://gerrit.cloudera.org:8080/#/c/14107/6/src/kudu/master/catalog_manager.cc@271
PS6, Line 271: DEFINE_bool(mock_table_metrics_for_testing, false,
Do these flags need to be marked as hidden?



-- 
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 6
Gerrit-Owner: ZhangYao <tr...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: ZhangYao <tr...@gmail.com>
Gerrit-Comment-Date: Thu, 05 Sep 2019 18:21:42 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has removed a vote on this change.

Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................


Removed Verified-1 by Kudu Jenkins (120)
-- 
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: deleteVote
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 10
Gerrit-Owner: ZhangYao <tr...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: ZhangYao <tr...@gmail.com>

[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/14107 )

Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................


Patch Set 1:

> Patch Set 1:
> 
> Did you consider adding the statistics to the existing GetTableLocations RPC response instead of creating a brand new RPC? Query planners and job drivers already use it to open tables, so it'd mean one less round trip to get the information.

If doing this, let's make sure to make it optional, since the statistics could become larger later (eg with things like histograms, NDVs per column, etc)


-- 
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 1
Gerrit-Owner: ZhangYao <tr...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Tue, 20 Aug 2019 17:28:53 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/14107 )

Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................


Patch Set 5:

>  > > 2. For spark client:
>  > > new spark client depends on new java client.
>  > > new spark client -> old server -> leads to execption
>  > 
>  > This is what concerns me. Under what circumstances is there an
>  > exception? When invoking some new API? Or in existing data paths?
>  > 
>  > If the latter, we need to address it. We take great care to ensure
>  > Kudu is both forward and backward compatible, whether it be across
>  > the wire (i.e. between server and client), or across a library
>  > boundary (i.e. between client and application). It wouldn't be OK
>  > for someone running with the new Spark integration against an old
>  > server to have a degraded experience in _existing_ functionality.
>  > If it's brand new functionality (i.e. a new API in the client),
>  > that's fine.
> 
> I got it. In previous patch because when creating relation it will call getStatistic automatically so it will cause exception with old server . I used  a config to decide whether to use getStatistics or not. That's for new spark client with old server user if he doesn't open that conf, it can work well.

Instead of introducing a new config, could we just try the new API, and if it throws an exception, chain to the superclass's implementation? Maybe log something too?


-- 
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 5
Gerrit-Owner: ZhangYao <tr...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: ZhangYao <tr...@gmail.com>
Gerrit-Comment-Date: Fri, 30 Aug 2019 22:00:47 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

Posted by "ZhangYao (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Kudu Jenkins, Andrew Wong, Adar Dembo, Grant Henke, Hao Hao, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14107

to look at the new patch set (#8).

Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................

KUDU-2921: Exposing the table statistics to spark relation.

Exposing current table statistics to spark via rpc from client to master.

Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
A java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsRequest.java
A java/kudu-client/src/main/java/org/apache/kudu/client/GetTableStatisticsResponse.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduClient.java
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduTable.java
A java/kudu-client/src/main/java/org/apache/kudu/client/KuduTableStatistics.java
M java/kudu-client/src/test/java/org/apache/kudu/client/TestKuduTable.java
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala
M java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/DefaultSourceTest.scala
M src/kudu/master/authz_provider.h
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/default_authz_provider.h
M src/kudu/master/master-test.cc
M src/kudu/master/master.proto
M src/kudu/master/master_service.cc
M src/kudu/master/master_service.h
M src/kudu/master/sentry_authz_provider-test.cc
M src/kudu/master/sentry_authz_provider.cc
M src/kudu/master/sentry_authz_provider.h
20 files changed, 517 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/07/14107/8
-- 
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 8
Gerrit-Owner: ZhangYao <tr...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: ZhangYao <tr...@gmail.com>