You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Will Berkeley (Code Review)" <ge...@cloudera.org> on 2019/01/23 19:58:09 UTC

[kudu-CR] [spark] Add write duration histograms

Will Berkeley has uploaded this change for review. ( http://gerrit.cloudera.org:8080/12261


Change subject: [spark] Add write duration histograms
......................................................................

[spark] Add write duration histograms

This adds an additional accumulator metrics to KuduContext writes: write
duration histograms. These histograms will show up on the webui and in
the driver logs, so it's easier to track how much time is spent writing
to Kudu in Spark stages and tasks. Log messages on the driver look like:

19/01/23 11:13:34 INFO kudu.KuduContext: completed insert ops: duration histogram: 25.0%: 14ms, 25.0%: 14ms, 75.0%: 17ms, 75.0%: 17ms, 75.0%: 17ms, 100.0%: 66ms, 100.0%: 66ms

The funny repeated values are an artifact of having a cluster with only
3 executors executing 4 tasks. Log messages on executors look like

19/01/23 11:13:34 INFO kudu.KuduContext: applied 69 inserts to table 'impala::default.aaa' in 14ms

HdrHistograms need to be shipped between executors and the driver, so
their (serialized) size is relevant. Spark users differ in how they
serialize, so I didn't put much effort into estimating the serialized
size, but based on the conservative formula in [1] the in-memory size of
a histogram with 3 significant value digits and storing longs is 4MiB or
so. That only happens if the histogram is storing values from 1 to the
max trackable long value, which is Long.MAX / 2. More realistically,
the values in the duration histogram should be at most 86400 * 1000, the
number of milliseconds in a day, and usually much, much smaller. For
that range of values, the max footprint is 1MiB. That should be a safe
amount of data to ship about semi-frequently along with all the Kudu
data (and I'm not counting potential compression as part of
serialization).

[1]: https://github.com/HdrHistogram/HdrHistogram#footprint-estimation

Change-Id: I0fd4d380b08bd7d7d5c1e65b79cffb44a9b9d433
---
M java/gradle/dependencies.gradle
M java/kudu-spark/build.gradle
A java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/HdrHistogramAccumulator.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala
4 files changed, 114 insertions(+), 1 deletion(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/61/12261/1
-- 
To view, visit http://gerrit.cloudera.org:8080/12261
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I0fd4d380b08bd7d7d5c1e65b79cffb44a9b9d433
Gerrit-Change-Number: 12261
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley <wd...@gmail.com>

[kudu-CR] [spark] Add write duration histograms

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/12261 )

Change subject: [spark] Add write duration histograms
......................................................................


Patch Set 2: Verified+1

But by some old python stuff.


-- 
To view, visit http://gerrit.cloudera.org:8080/12261
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fd4d380b08bd7d7d5c1e65b79cffb44a9b9d433
Gerrit-Change-Number: 12261
Gerrit-PatchSet: 2
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Tue, 05 Feb 2019 21:18:45 +0000
Gerrit-HasComments: No

[kudu-CR] [spark] Add write duration histograms

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/12261 )

Change subject: [spark] Add write duration histograms
......................................................................


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/12261/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12261/1//COMMIT_MSG@14
PS1, Line 14: 25.0%: 14ms, 25.0%: 14ms
> Why does it have information on every bin duplicated in the output?  Is it 
Because there are so few values. It's because of the HdrHistogram's implementation and I don't think it's worth doing anything about because it's an edge case and the numbers are still correct.


http://gerrit.cloudera.org:8080/#/c/12261/1//COMMIT_MSG@21
PS1, Line 21: need to be shipped between executors and the driver, so
            : their (serialized) size is relevant
> How often does that happen?  Does it depend on the granularity of the histo
When tasks end and on executor heartbeat. Executor heartbeat is every 10s (from https://spark.apache.org/docs/latest/configuration.html). So it's not so often.



-- 
To view, visit http://gerrit.cloudera.org:8080/12261
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fd4d380b08bd7d7d5c1e65b79cffb44a9b9d433
Gerrit-Change-Number: 12261
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Thu, 31 Jan 2019 18:30:07 +0000
Gerrit-HasComments: Yes

[kudu-CR] [spark] Add write duration histograms

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/12261 )

Change subject: [spark] Add write duration histograms
......................................................................


Patch Set 1: Code-Review+1

(1 comment)

LGTM, but I'll defer to Grant since I don't think I fluent in Scala.

http://gerrit.cloudera.org:8080/#/c/12261/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12261/1//COMMIT_MSG@21
PS1, Line 21: need to be shipped between executors and the driver, so
            : their (serialized) size is relevant
> When tasks end and on executor heartbeat. Executor heartbeat is every 10s (
Thanks for the clarification.



-- 
To view, visit http://gerrit.cloudera.org:8080/12261
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fd4d380b08bd7d7d5c1e65b79cffb44a9b9d433
Gerrit-Change-Number: 12261
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Mon, 04 Feb 2019 22:29:17 +0000
Gerrit-HasComments: Yes

[kudu-CR] [spark] Add write duration histograms

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/12261 )

Change subject: [spark] Add write duration histograms
......................................................................


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/12261/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/HdrHistogramAccumulator.scala
File java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/HdrHistogramAccumulator.scala:

http://gerrit.cloudera.org:8080/#/c/12261/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/HdrHistogramAccumulator.scala@63
PS1, Line 63:       other.asInstanceOf[HdrHistogramAccumulator].histogram.inner_histogram)
> Can you use `other.value` here in place of `.asInstanceOf[HdrHistogramAccum
Done


http://gerrit.cloudera.org:8080/#/c/12261/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/HdrHistogramAccumulator.scala@80
PS1, Line 80:     extends Serializable {
> Is SynchronizedHistogram serializable?
SynchornizedHistogram extends Histogram extends AbstractHistogram extends Serializable.



-- 
To view, visit http://gerrit.cloudera.org:8080/12261
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fd4d380b08bd7d7d5c1e65b79cffb44a9b9d433
Gerrit-Change-Number: 12261
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Tue, 05 Feb 2019 19:24:16 +0000
Gerrit-HasComments: Yes

[kudu-CR] [spark] Add write duration histograms

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/12261 )

Change subject: [spark] Add write duration histograms
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12261/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/HdrHistogramAccumulator.scala
File java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/HdrHistogramAccumulator.scala:

http://gerrit.cloudera.org:8080/#/c/12261/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/HdrHistogramAccumulator.scala@66
PS1, Line 66:   override def value: HistogramWrapper = histogram
Would it be possible just to sub-class the HistogramWrapper and override/add the necessary methods?



-- 
To view, visit http://gerrit.cloudera.org:8080/12261
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fd4d380b08bd7d7d5c1e65b79cffb44a9b9d433
Gerrit-Change-Number: 12261
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Mon, 28 Jan 2019 22:54:31 +0000
Gerrit-HasComments: Yes

[kudu-CR] [spark] Add write duration histograms

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Will Berkeley has removed a vote on this change.

Change subject: [spark] Add write duration histograms
......................................................................


Removed Verified-1 by Kudu Jenkins (120)
-- 
To view, visit http://gerrit.cloudera.org:8080/12261
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: deleteVote
Gerrit-Change-Id: I0fd4d380b08bd7d7d5c1e65b79cffb44a9b9d433
Gerrit-Change-Number: 12261
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)

[kudu-CR] [spark] Add write duration histograms

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/12261 )

Change subject: [spark] Add write duration histograms
......................................................................


Patch Set 1: Verified+1

Python pandas issue fix not rebased on, and an unrelated C++ flake.


-- 
To view, visit http://gerrit.cloudera.org:8080/12261
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fd4d380b08bd7d7d5c1e65b79cffb44a9b9d433
Gerrit-Change-Number: 12261
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Wed, 23 Jan 2019 21:40:37 +0000
Gerrit-HasComments: No

[kudu-CR] [spark] Add write duration histograms

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/12261 )

Change subject: [spark] Add write duration histograms
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12261/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/HdrHistogramAccumulator.scala
File java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/HdrHistogramAccumulator.scala:

http://gerrit.cloudera.org:8080/#/c/12261/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/HdrHistogramAccumulator.scala@66
PS1, Line 66:   override def value: HistogramWrapper = histogram
> Would it be possible just to sub-class the HistogramWrapper and override/ad
I looked into this. Yes, it's possible to subclass SynchronizedHistogram and try to use that. But I found that to do so I need to reimplement copy, which meant I need to reimplement a second constructor, and so on, to the point where it's not worth it vs. the wrapper, especially given it's tricky because it's a Scala class overriding a Java one.



-- 
To view, visit http://gerrit.cloudera.org:8080/12261
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fd4d380b08bd7d7d5c1e65b79cffb44a9b9d433
Gerrit-Change-Number: 12261
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Tue, 29 Jan 2019 17:37:20 +0000
Gerrit-HasComments: Yes

[kudu-CR] [spark] Add write duration histograms

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Will Berkeley has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/12261 )

Change subject: [spark] Add write duration histograms
......................................................................

[spark] Add write duration histograms

This adds an additional accumulator metrics to KuduContext writes: write
duration histograms. These histograms will show up on the webui and in
the driver logs, so it's easier to track how much time is spent writing
to Kudu in Spark stages and tasks. Log messages on the driver look like:

19/01/23 11:13:34 INFO kudu.KuduContext: completed insert ops: duration histogram: 25.0%: 14ms, 25.0%: 14ms, 75.0%: 17ms, 75.0%: 17ms, 75.0%: 17ms, 100.0%: 66ms, 100.0%: 66ms

The funny repeated values are an artifact of having a cluster with only
3 executors executing 4 tasks. Log messages on executors look like

19/01/23 11:13:34 INFO kudu.KuduContext: applied 69 inserts to table 'impala::default.aaa' in 14ms

HdrHistograms need to be shipped between executors and the driver, so
their (serialized) size is relevant. Spark users differ in how they
serialize, so I didn't put much effort into estimating the serialized
size, but based on the conservative formula in [1] the in-memory size of
a histogram with 3 significant value digits and storing longs is 4MiB or
so. That only happens if the histogram is storing values from 1 to the
max trackable long value, which is Long.MAX / 2. More realistically,
the values in the duration histogram should be at most 86400 * 1000, the
number of milliseconds in a day, and usually much, much smaller. For
that range of values, the max footprint is 1MiB. That should be a safe
amount of data to ship about semi-frequently along with all the Kudu
data (and I'm not counting potential compression as part of
serialization).

[1]: https://github.com/HdrHistogram/HdrHistogram#footprint-estimation

Change-Id: I0fd4d380b08bd7d7d5c1e65b79cffb44a9b9d433
Reviewed-on: http://gerrit.cloudera.org:8080/12261
Reviewed-by: Grant Henke <gr...@apache.org>
Tested-by: Will Berkeley <wd...@gmail.com>
---
M java/gradle/dependencies.gradle
M java/kudu-spark/build.gradle
A java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/HdrHistogramAccumulator.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala
4 files changed, 113 insertions(+), 1 deletion(-)

Approvals:
  Grant Henke: Looks good to me, approved
  Will Berkeley: Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/12261
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I0fd4d380b08bd7d7d5c1e65b79cffb44a9b9d433
Gerrit-Change-Number: 12261
Gerrit-PatchSet: 3
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>

[kudu-CR] [spark] Add write duration histograms

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/12261 )

Change subject: [spark] Add write duration histograms
......................................................................


Patch Set 1:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/12261/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12261/1//COMMIT_MSG@14
PS1, Line 14: 25.0%: 14ms, 25.0%: 14ms
Why does it have information on every bin duplicated in the output?  Is it intended?


http://gerrit.cloudera.org:8080/#/c/12261/1//COMMIT_MSG@21
PS1, Line 21: need to be shipped between executors and the driver, so
            : their (serialized) size is relevant
How often does that happen?  Does it depend on the granularity of the histogram or something else?


http://gerrit.cloudera.org:8080/#/c/12261/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/HdrHistogramAccumulator.scala
File java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/HdrHistogramAccumulator.scala:

http://gerrit.cloudera.org:8080/#/c/12261/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/HdrHistogramAccumulator.scala@66
PS1, Line 66:   override def value: HistogramWrapper = histogram
> I looked into this. Yes, it's possible to subclass SynchronizedHistogram an
Thank you for clarifying on this!

In that case, the original approach looks good enough to me.



-- 
To view, visit http://gerrit.cloudera.org:8080/12261
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fd4d380b08bd7d7d5c1e65b79cffb44a9b9d433
Gerrit-Change-Number: 12261
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Tue, 29 Jan 2019 19:05:48 +0000
Gerrit-HasComments: Yes

[kudu-CR] [spark] Add write duration histograms

Posted by "Grant Henke (Code Review)" <ge...@cloudera.org>.
Grant Henke has posted comments on this change. ( http://gerrit.cloudera.org:8080/12261 )

Change subject: [spark] Add write duration histograms
......................................................................


Patch Set 2: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/12261
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fd4d380b08bd7d7d5c1e65b79cffb44a9b9d433
Gerrit-Change-Number: 12261
Gerrit-PatchSet: 2
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Tue, 05 Feb 2019 19:29:00 +0000
Gerrit-HasComments: No

[kudu-CR] [spark] Add write duration histograms

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Will Berkeley has removed a vote on this change.

Change subject: [spark] Add write duration histograms
......................................................................


Removed Verified-1 by Kudu Jenkins (120)
-- 
To view, visit http://gerrit.cloudera.org:8080/12261
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: deleteVote
Gerrit-Change-Id: I0fd4d380b08bd7d7d5c1e65b79cffb44a9b9d433
Gerrit-Change-Number: 12261
Gerrit-PatchSet: 2
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>

[kudu-CR] [spark] Add write duration histograms

Posted by "Grant Henke (Code Review)" <ge...@cloudera.org>.
Grant Henke has posted comments on this change. ( http://gerrit.cloudera.org:8080/12261 )

Change subject: [spark] Add write duration histograms
......................................................................


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/12261/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/HdrHistogramAccumulator.scala
File java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/HdrHistogramAccumulator.scala:

http://gerrit.cloudera.org:8080/#/c/12261/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/HdrHistogramAccumulator.scala@63
PS1, Line 63:       other.asInstanceOf[HdrHistogramAccumulator].histogram.inner_histogram)
Can you use `other.value` here in place of `.asInstanceOf[HdrHistogramAccumulator].histogram`?


http://gerrit.cloudera.org:8080/#/c/12261/1/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/HdrHistogramAccumulator.scala@80
PS1, Line 80:     extends Serializable {
Is SynchronizedHistogram serializable?



-- 
To view, visit http://gerrit.cloudera.org:8080/12261
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0fd4d380b08bd7d7d5c1e65b79cffb44a9b9d433
Gerrit-Change-Number: 12261
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Tue, 05 Feb 2019 19:06:39 +0000
Gerrit-HasComments: Yes

[kudu-CR] [spark] Add write duration histograms

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Hello Alexey Serbin, Kudu Jenkins, Grant Henke, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12261

to look at the new patch set (#2).

Change subject: [spark] Add write duration histograms
......................................................................

[spark] Add write duration histograms

This adds an additional accumulator metrics to KuduContext writes: write
duration histograms. These histograms will show up on the webui and in
the driver logs, so it's easier to track how much time is spent writing
to Kudu in Spark stages and tasks. Log messages on the driver look like:

19/01/23 11:13:34 INFO kudu.KuduContext: completed insert ops: duration histogram: 25.0%: 14ms, 25.0%: 14ms, 75.0%: 17ms, 75.0%: 17ms, 75.0%: 17ms, 100.0%: 66ms, 100.0%: 66ms

The funny repeated values are an artifact of having a cluster with only
3 executors executing 4 tasks. Log messages on executors look like

19/01/23 11:13:34 INFO kudu.KuduContext: applied 69 inserts to table 'impala::default.aaa' in 14ms

HdrHistograms need to be shipped between executors and the driver, so
their (serialized) size is relevant. Spark users differ in how they
serialize, so I didn't put much effort into estimating the serialized
size, but based on the conservative formula in [1] the in-memory size of
a histogram with 3 significant value digits and storing longs is 4MiB or
so. That only happens if the histogram is storing values from 1 to the
max trackable long value, which is Long.MAX / 2. More realistically,
the values in the duration histogram should be at most 86400 * 1000, the
number of milliseconds in a day, and usually much, much smaller. For
that range of values, the max footprint is 1MiB. That should be a safe
amount of data to ship about semi-frequently along with all the Kudu
data (and I'm not counting potential compression as part of
serialization).

[1]: https://github.com/HdrHistogram/HdrHistogram#footprint-estimation

Change-Id: I0fd4d380b08bd7d7d5c1e65b79cffb44a9b9d433
---
M java/gradle/dependencies.gradle
M java/kudu-spark/build.gradle
A java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/HdrHistogramAccumulator.scala
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala
4 files changed, 113 insertions(+), 1 deletion(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/61/12261/2
-- 
To view, visit http://gerrit.cloudera.org:8080/12261
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0fd4d380b08bd7d7d5c1e65b79cffb44a9b9d433
Gerrit-Change-Number: 12261
Gerrit-PatchSet: 2
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>