You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@impala.apache.org by "Sailesh Mukil (Code Review)" <ge...@cloudera.org> on 2016/09/11 03:09:31 UTC

[Impala-ASF-CR] IMPALA-3823: Add timer to measure Parquet footer reads

Sailesh Mukil has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/4371

Change subject: IMPALA-3823: Add timer to measure Parquet footer reads
......................................................................

IMPALA-3823: Add timer to measure Parquet footer reads

It's been observed that Parquet footer reads perform poorly especially
when reading from S3. This patch adds a timer "FooterProcessingTimer"
which keeps a track of the average time each split of each scan node
spends in reading and processing the parquet footer.

Added a new utility counter called MinMaxAvgValueCounter which keeps
a track of the min, max and average values seen so far from a set of
values. This counter is used to calculate the min, max and average
time taken to scan and process Parquet footers per query per node.
This is also displayed in the RuntimeProfile.

The RuntimeProfile has also been updated to keep a track of, display
and move this new MinMaxAvgValueCounter between nodes through Thrift.

A test has been added to test that this counter works fine when there
are multiple blocks to scan per node.

Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/util/runtime-profile-counters.h
M be/src/util/runtime-profile.cc
M be/src/util/runtime-profile.h
M common/thrift/RuntimeProfile.thrift
M tests/query_test/test_scanners.py
7 files changed, 216 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/71/4371/1
-- 
To view, visit http://gerrit.cloudera.org:8080/4371
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>

[Impala-ASF-CR] IMPALA-3823: Add timer to measure Parquet footer reads

Posted by "Matthew Jacobs (Code Review)" <ge...@cloudera.org>.
Matthew Jacobs has posted comments on this change.

Change subject: IMPALA-3823: Add timer to measure Parquet footer reads
......................................................................


Patch Set 9:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/4371/9/be/src/util/runtime-profile-counters.h
File be/src/util/runtime-profile-counters.h:

PS9, Line 256: SpinLock lock_;
comment on what this protects


http://gerrit.cloudera.org:8080/#/c/4371/9/be/src/util/runtime-profile.cc
File be/src/util/runtime-profile.cc:

PS9, Line 1019:   counter->name = name;
Was there a reason not to take lock_ here?


http://gerrit.cloudera.org:8080/#/c/4371/9/common/thrift/RuntimeProfile.thrift
File common/thrift/RuntimeProfile.thrift:

PS9, Line 94: min max average value
update


-- 
To view, visit http://gerrit.cloudera.org:8080/4371
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3823: Add timer to measure Parquet footer reads

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-3823: Add timer to measure Parquet footer reads
......................................................................


Patch Set 11: Code-Review+2

(1 comment)

Carry +2

http://gerrit.cloudera.org:8080/#/c/4371/10/be/src/util/runtime-profile-counters.h
File be/src/util/runtime-profile-counters.h:

Line 209:   SummaryStatsCounter(TUnit::type unit, int32_t total_num_values,
> I'm not sure what this comment means -- this is a constructor, so how can t
Oops, that was meant for SetStats(). I put that here by mistake. I've removed this line here now.


-- 
To view, visit http://gerrit.cloudera.org:8080/4371
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
Gerrit-PatchSet: 11
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3823: Add timer to measure Parquet footer reads

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has posted comments on this change.

Change subject: IMPALA-3823: Add timer to measure Parquet footer reads
......................................................................


Patch Set 8: Code-Review+1

(11 comments)

http://gerrit.cloudera.org:8080/#/c/4371/8/be/src/util/runtime-profile-counters.h
File be/src/util/runtime-profile-counters.h:

PS8, Line 204: Contrary to 
"Unlike"


PS8, Line 204: a 
remove


PS8, Line 206: The average is stored in the 'value_' member of the base class.
Referring to member variables in a class description is usually a sign the text is leaking implementation details. Do you mean to say that value() returns the average?


Line 209:   SummaryStatsCounter(TUnit::type unit, int32_t total_num_values,
Comment what this constructor is used for (is it when merging two SSCounters together?)


PS8, Line 216: total_num_values
can total_num_values be 0?


PS8, Line 247: This keeps track of t
Remove - just "The total..." is sufficient.


PS8, Line 255: SpinLock counter_lock_;
any reason not to call this lock_ ?


http://gerrit.cloudera.org:8080/#/c/4371/8/be/src/util/runtime-profile.cc
File be/src/util/runtime-profile.cc:

PS8, Line 664: /* Print all SummaryStatsCounters as following:
             :     <Name>: (Avg: <value> ; Min: <min_value> ; Max: <max_value> ;
             :     Number of samples: <total>) */
Comment style?


PS8, Line 669: are
is


PS8, Line 1019: { min_ = new_value; }
remove parentheses for single-line if statements.


http://gerrit.cloudera.org:8080/#/c/4371/8/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

PS8, Line 317: \
remove


-- 
To view, visit http://gerrit.cloudera.org:8080/4371
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3823: Add timer to measure Parquet footer reads

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-3823: Add timer to measure Parquet footer reads
......................................................................


Patch Set 6:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/4371/6/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 192:   Status footer_status;
> why was this moved to a separate line?
Sorry, I previously introduced a code block with "{ }" and moved this so that it wouldn't go out of scope. But I've since removed it, so it's no longer needed. Done.


http://gerrit.cloudera.org:8080/#/c/4371/6/be/src/util/runtime-profile-counters.h
File be/src/util/runtime-profile-counters.h:

PS6, Line 49: TIMER
> is there anything time-specific about this counter?  why not ADD_SUMMARY_ST
It's the same reason as why we have ADD_COUNTER() and ADD_TIMER(); they both are the same, but ADD_TIMER() uses the default unit as TUnit::TIME_NS.


PS6, Line 205: AveragedCounter
> do we plan to rename that? would be nice.
It technically is an AveragedCounter, so I'm out of ideas for a better name. Do you have any suggestions?


PS6, Line 206: untimeProfile::Counter
> what does the base class' value_ hold?
It holds the average. I've added that in a comment now.


PS6, Line 221: INT64_MAX
> we usually use numeric_limits
Done


Line 228:   int32_t total_num_values() const { return total_num_values_; }
> what's the thread safety guarantees of our counters?  why can these be unlo
Since it's only a read, I figured even if the CPU reorders the read instruction for that variable's address to happen a few cycles earlier than intended, it doesn't matter, as how is it any different than having an atomic 'min_.Load()' where the value of 'min_' changes just after the Load() returns with a value.

What do you think?


PS6, Line 240: This function 
> delete
Done


http://gerrit.cloudera.org:8080/#/c/4371/6/be/src/util/runtime-profile.h
File be/src/util/runtime-profile.h:

PS6, Line 159: min/max average
> this sounds like a "min-average" "max-average" or something. how about just
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/4371
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3823: Add timer to measure Parquet footer reads

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-3823: Add timer to measure Parquet footer reads
......................................................................


Patch Set 3:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/4371/3/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

PS3, Line 430: avg_timer_
> let's update the variable names too. e.g. something like process_footer_tim
Done


http://gerrit.cloudera.org:8080/#/c/4371/3/be/src/util/runtime-profile-counters.h
File be/src/util/runtime-profile-counters.h:

PS3, Line 240: Overwrite
> "SetStats()" to be more consistent with some of the other fns, e.g. SetSamp
Done


http://gerrit.cloudera.org:8080/#/c/4371/3/be/src/util/runtime-profile.cc
File be/src/util/runtime-profile.cc:

PS3, Line 668: - 
> I don't see this dash in the example above? Is it intentional? I see the ev
The dash comes before this, I added that now. I've made the changes according to how you suggested. It makes more sense that way.


-- 
To view, visit http://gerrit.cloudera.org:8080/4371
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3823: Add timer to measure Parquet footer reads

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-3823: Add timer to measure Parquet footer reads
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4371/2/be/src/util/runtime-profile.cc
File be/src/util/runtime-profile.cc:

Line 1008:     current_max_ = new_counter->value();
why is this safe? couldn't current_sum_ become 0 at an arbtrary time (if a counter's value is negative)?  we should just initialize current_min_ to be the maximum value and current_max_ to be the minimum value, for the type.


-- 
To view, visit http://gerrit.cloudera.org:8080/4371
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3823: Add timer to measure Parquet footer reads

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Sailesh Mukil has uploaded a new patch set (#4).

Change subject: IMPALA-3823: Add timer to measure Parquet footer reads
......................................................................

IMPALA-3823: Add timer to measure Parquet footer reads

It's been observed that Parquet footer reads perform poorly especially
when reading from S3. This patch adds a timer "FooterProcessingTimer"
which keeps a track of the average time each split of each scan node
spends in reading and processing the parquet footer.

Added a new utility counter called SummaryStatsCounter which keeps
track of the min, max and average values seen so far from a set of
values. This counter is used to calculate the min, max and average
time taken to scan and process Parquet footers per query per node.

The RuntimeProfile has also been updated to keep a track of, display
and serialize this new counter to thrift.

A test has been added to test that this counter works fine when there
are multiple blocks to scan per node.

Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/util/runtime-profile-counters.h
M be/src/util/runtime-profile.cc
M be/src/util/runtime-profile.h
M common/thrift/RuntimeProfile.thrift
M tests/query_test/test_scanners.py
7 files changed, 230 insertions(+), 17 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/71/4371/4
-- 
To view, visit http://gerrit.cloudera.org:8080/4371
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>

[Impala-ASF-CR] IMPALA-3823: Add timer to measure Parquet footer reads

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-3823: Add timer to measure Parquet footer reads
......................................................................


Patch Set 8:

I can look at this once you address Henry's comments.

-- 
To view, visit http://gerrit.cloudera.org:8080/4371
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-3823: Add timer to measure Parquet footer reads

Posted by "Matthew Jacobs (Code Review)" <ge...@cloudera.org>.
Matthew Jacobs has posted comments on this change.

Change subject: IMPALA-3823: Add timer to measure Parquet footer reads
......................................................................


Patch Set 4:

(15 comments)

I was hoping to get to +1 this pass but there's a bit more still...

Most importantly there's a bug in computing max and I think we need unit tests.

http://gerrit.cloudera.org:8080/#/c/4371/4/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

PS4, Line 196: RuntimeProfile::Counter single_footer_process_timer(TUnit::TIME_NS);
this is unnecessary, SCOPED_TIMER below just creates a ScopedTimer which creates a MonotonicStopWatch and adds the result of it to this counter, so you might as well just use MonotonicStopWatch directly.


http://gerrit.cloudera.org:8080/#/c/4371/4/be/src/util/runtime-profile-counters.h
File be/src/util/runtime-profile-counters.h:

PS4, Line 49: ADD_MIN_MAX_AVG_VALUE_TIMER
please rename


PS4, Line 70: ADD_MIN_MAX_AVG_VALUE_TIMER
same


PS4, Line 147: .
can you add:
This is used to aggregate child counters.


PS4, Line 204: /// Contrary to the AveragedCounter, this only keeps a track of values and not 'Counter'
             : /// objects themselves.
this is a bit misleading because it doesn't really keep values, it keeps the statistics about the values.

How about:
Contrary to the AveragedCoutner, this keeps statistics about raw values whereas AveragedCounter is used for averaging child counters.


PS4, Line 249:  /// Current sums of values seen so far and min/max values seen so far.
             :   int64_t current_min_;
             :   int64_t current_max_;
             :   int64_t current_sum_;
current is unnecessary


http://gerrit.cloudera.org:8080/#/c/4371/4/be/src/util/runtime-profile.cc
File be/src/util/runtime-profile.cc:

PS4, Line 301:       const TSummaryStatsCounter& c = node.summary_stats_counters[i];
             :       SummaryStatsCounterMap::iterator it = summary_stats_map_.find(c.name);
             :       if (it == summary_stats_map_.end()) {
             :         summary_stats_map_[c.name] =
             :             pool_->Add(new SummaryStatsCounter(
             :                 c.unit, c.total_num_values, c.min_value, c.max_value, c.sum));
             :       } else {
             :         it->second->SetStats(c);
             :       }
it's hard to know if this gets exercised in your test case.

Can you add a test in runtime-profile-test.cc that sets up a stats counter and goes through a merge?


PS4, Line 667:  
super nit: extra space


Line 668:     for (const SummaryStatsCounterMap::value_type& v: summary_stats_map_) {
I think we need a separate case to handle when total_num_values is 0 because avg/min/max don't make sense on 0 samples, plus, min and max will print out INT_MAX/INT_MIN which is an implementation detail.

I think this would benefit from a test case in runtime-profile-test.cc as well


PS4, Line 669: (
this brace isn't closed at the end


PS4, Line 676: v.second->unit()
this isn't in the same Counter's units (which is time for the use case you added), this is just a raw count. I think you should just print this total_num_values without pretty printer.


PS4, Line 1017:  else if
this isn't an else if, you need to execute both branches

this will be wrong when there is just 1 sample.


http://gerrit.cloudera.org:8080/#/c/4371/4/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

PS4, Line 336: MinMaxAvgValueCounter
update


PS4, Line 337: averages the time taken to
             :     # scan and process the Parquet footer across scan ranges and also records the min and
             :     # the max time.
keeps statistics for the time taken to scan and Process the Parquet footer across scan ranges


PS4, Line 340:     if ranges_per_node > 1:
             :       num_equal = 0
             :       for min_max_time in footer_processing_time_list:
             :         num_equal += (min_max_time[0] == min_max_time[1])
             :       # The chance that the times taken by different scan ranges to process the footer,
             :       # are equal, is very unlikely. However, it could happen which could make the min and
             :       # max time to process a footer equal. Assert that at a time, no more than 1 node has
             :       # equal min and max time.
             :       assert num_equal < 2
following up with my previous comment, I still think this is weird and was hiding at least 1 bug (the max wasn't set when there was only 1 sample). I'm not really sure what this verification gets us.

It's fine if you can't compare it to the total time, but please at least check that if ranges_per_node is 1 or you see that the number of samples is 1, then min = max = avg != 0. That's easy since you don't have to do any string parsing.

That said, I do think we should have some tests that do some better validation, e.g. unit tests in runtime-profile-test.cc. Then we don't need to worry about parsing times here and we get some more clear coverage.


-- 
To view, visit http://gerrit.cloudera.org:8080/4371
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3823: Add timer to measure Parquet footer reads

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Sailesh Mukil has uploaded a new patch set (#2).

Change subject: IMPALA-3823: Add timer to measure Parquet footer reads
......................................................................

IMPALA-3823: Add timer to measure Parquet footer reads

It's been observed that Parquet footer reads perform poorly especially
when reading from S3. This patch adds a timer "FooterProcessingTimer"
which keeps a track of the average time each split of each scan node
spends in reading and processing the parquet footer.

Added a new utility counter called MinMaxAvgValueCounter which keeps
a track of the min, max and average values seen so far from a set of
values. This counter is used to calculate the min, max and average
time taken to scan and process Parquet footers per query per node.
This is also displayed in the RuntimeProfile.

The RuntimeProfile has also been updated to keep a track of, display
and move this new MinMaxAvgValueCounter between nodes through Thrift.

A test has been added to test that this counter works fine when there
are multiple blocks to scan per node.

Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/util/runtime-profile-counters.h
M be/src/util/runtime-profile.cc
M be/src/util/runtime-profile.h
M common/thrift/RuntimeProfile.thrift
M tests/query_test/test_scanners.py
7 files changed, 216 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/71/4371/2
-- 
To view, visit http://gerrit.cloudera.org:8080/4371
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>

[Impala-ASF-CR] IMPALA-3823: Add timer to measure Parquet footer reads

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has posted comments on this change.

Change subject: IMPALA-3823: Add timer to measure Parquet footer reads
......................................................................


Patch Set 6:

> Do we have any counters where we track running sum and number of samples and compute the average by hand (or by the user)

Good idea to check - but we should bear in mind that average is usually very uninformative for latency distributions. Where we try to measure average we usually want to measure percentiles (and we have a histogram class for that purpose).

Sailesh - can you at least leave a TODO to consider replacing average measurements with histograms? Might be possible in this patch; I don't know how hard ToThrift() would be.

-- 
To view, visit http://gerrit.cloudera.org:8080/4371
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-3823: Add timer to measure Parquet footer reads

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-3823: Add timer to measure Parquet footer reads
......................................................................


Patch Set 8:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/4371/8/be/src/util/runtime-profile-counters.h
File be/src/util/runtime-profile-counters.h:

PS8, Line 204: Contrary to 
> "Unlike"
Done


PS8, Line 204: a 
> remove
Done


PS8, Line 206: The average is stored in the 'value_' member of the base class.
> Referring to member variables in a class description is usually a sign the 
Yes, I've changed it to say value() instead.


Line 209:   SummaryStatsCounter(TUnit::type unit, int32_t total_num_values,
> Comment what this constructor is used for (is it when merging two SSCounter
It's for replacing the existing counter with new values.


PS8, Line 216: total_num_values
> can total_num_values be 0?
I don't think it happens in practice, but I've made a change to handle that case too.


PS8, Line 247: This keeps track of t
> Remove - just "The total..." is sufficient.
Done


PS8, Line 255: SpinLock counter_lock_;
> any reason not to call this lock_ ?
Nope, changed it.


http://gerrit.cloudera.org:8080/#/c/4371/8/be/src/util/runtime-profile.cc
File be/src/util/runtime-profile.cc:

PS8, Line 664: /* Print all SummaryStatsCounters as following:
             :     <Name>: (Avg: <value> ; Min: <min_value> ; Max: <max_value> ;
             :     Number of samples: <total>) */
> Comment style?
Done


PS8, Line 669: are
> is
Done


PS8, Line 1019: { min_ = new_value; }
> remove parentheses for single-line if statements.
Done


http://gerrit.cloudera.org:8080/#/c/4371/8/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

PS8, Line 317: \
> remove
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/4371
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3823: Add timer to measure Parquet footer reads

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-3823: Add timer to measure Parquet footer reads
......................................................................


Patch Set 3:

(18 comments)

http://gerrit.cloudera.org:8080/#/c/4371/2//COMMIT_MSG
Commit Message:

PS2, Line 15: tr
> remove a
Done


PS2, Line 18: 
> remove, this is implied if it is a runtime profile counter
Done


PS2, Line 21: 
> serialize ... to thrift
Done


http://gerrit.cloudera.org:8080/#/c/4371/2/be/src/util/runtime-profile-counters.h
File be/src/util/runtime-profile-counters.h:

PS2, Line 214:  {
> I'd be nice to have a simpler name rather than enumerate the stats this kee
Done


PS2, Line 223:    current_sum_(0) {
             :   }
> space between these
Done


PS2, Line 226: 64_t min_value() const { return current_min_; }
             :   int64_t max_value(
> initialize min and max as well
Done


PS2, Line 230:   /// seen so far.
             :   void UpdateCounter(int64_t new_value);
             : 
> 1 line
Done


PS2, Line 234:   /// Set() and Add() should not be used.
             :   virtual void Set(double value) { DCHECK(false); }
             :   v
> 1 line
Done


PS2, Line 238: 
             :   /// This functio
> this can be one sentence
Done


PS2, Line 240: t TSummaryStatsCount
> This is a bit confusing since it starts to look like the AveragedCounter wh
Yes that's right. The only reason I kept it as a Counter instead of a value was to compare if the units of both the values were equal.

But if you foresee it being used even without a counter as a source, it makes sense to just take a value here.

Changed it.


PS2, Line 244:  private:
             :   /// This keeps track of the total number of values seen so far.
             :   int32_t total_num_values_;
             : 
             :   /// Current sums of values seen so far and min/max values seen so far.
             :   int64_t current_min_;
             :   int64_t current_max_;
             :   int64_t current_sum_;
             : 
             :   SpinLock counter_lock_;
             : };
> remove extra space, each fn can be 1 line. I know this is what happens abov
Done


http://gerrit.cloudera.org:8080/#/c/4371/2/be/src/util/runtime-profile.cc
File be/src/util/runtime-profile.cc:

PS2, Line 303: if (it == summary_stats_map_.end()) {
> if this counter already existed, shouldn't the values be updated?
Yes, I initially thought that counters will always have the same value when they arrive through reports, however, that needn't be true for long running queries.

I've added a method Overwrite() to take care of this.


Line 661:   }
> can you add a comment with how this looks printed out?
Done


PS2, Line 794: 
             :   {
> vector can be constructed with a size parameter to do this in 1 step.
Done


Line 1008:   ++total_num_values_;
> why is this safe? couldn't current_sum_ become 0 at an arbtrary time (if a 
Done


http://gerrit.cloudera.org:8080/#/c/4371/2/common/thrift/RuntimeProfile.thrift
File common/thrift/RuntimeProfile.thrift:

PS2, Line 56: TSummaryStatsCounter {
> TSummaryStatsCounter
Done


PS2, Line 59: sum
> sum?
Done


http://gerrit.cloudera.org:8080/#/c/4371/2/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

PS2, Line 343:       # The chance that the times taken by different scan ranges to process the footer,
             :       # are equal, is very unlikely. However, it could happen which could make the min and
             :       # max time to process a footer equal. Assert that at a time, no more than 1 node has
             :       # equal min and max time.
             :       assert num_equal < 2
> this seems brittle...
The reason I didn't do that is because the times from the runtime profile are listed as such: "330.23ms, 1s23ms, 22ns", etc.

There's no good way to compare them without passing them through a helper function to standardize their values before comparison. However, I thought that adding that helper function might clutter this file.

If you still think it's worth it, I can go ahead and do it.


-- 
To view, visit http://gerrit.cloudera.org:8080/4371
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3823: Add timer to measure Parquet footer reads

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-3823: Add timer to measure Parquet footer reads
......................................................................


Patch Set 4:

(15 comments)

http://gerrit.cloudera.org:8080/#/c/4371/4/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

PS4, Line 196: RuntimeProfile::Counter single_footer_process_timer(TUnit::TIME_NS);
> this is unnecessary, SCOPED_TIMER below just creates a ScopedTimer which cr
Done


http://gerrit.cloudera.org:8080/#/c/4371/4/be/src/util/runtime-profile-counters.h
File be/src/util/runtime-profile-counters.h:

PS4, Line 49: ADD_MIN_MAX_AVG_VALUE_TIMER
> please rename
Sorry, forgot to get this one. Done.


PS4, Line 70: ADD_MIN_MAX_AVG_VALUE_TIMER
> same
Done


PS4, Line 147: .
> can you add:
When you say 'child counters', do you mean it in the terms of the child counters of counters in the RuntimeProfile? If so, I think it wouldn't be best to assume how it would be used (even though that's the only use for it now).

Or do you mean every counter that the AveragedCounter keeps a track of is what we are calling a 'child counter'?


PS4, Line 204: /// Contrary to the AveragedCounter, this only keeps a track of values and not 'Counter'
             : /// objects themselves.
> this is a bit misleading because it doesn't really keep values, it keeps th
Good call. I rephrased it, but I left out the bit about 'child counters'. I'll add it if we reach consensus on the above comment.


PS4, Line 249:  /// Current sums of values seen so far and min/max values seen so far.
             :   int64_t current_min_;
             :   int64_t current_max_;
             :   int64_t current_sum_;
> current is unnecessary
Done


http://gerrit.cloudera.org:8080/#/c/4371/4/be/src/util/runtime-profile.cc
File be/src/util/runtime-profile.cc:

PS4, Line 301:       const TSummaryStatsCounter& c = node.summary_stats_counters[i];
             :       SummaryStatsCounterMap::iterator it = summary_stats_map_.find(c.name);
             :       if (it == summary_stats_map_.end()) {
             :         summary_stats_map_[c.name] =
             :             pool_->Add(new SummaryStatsCounter(
             :                 c.unit, c.total_num_values, c.min_value, c.max_value, c.sum));
             :       } else {
             :         it->second->SetStats(c);
             :       }
> it's hard to know if this gets exercised in your test case.
Done


PS4, Line 667:  
> super nit: extra space
Done


Line 668:     for (const SummaryStatsCounterMap::value_type& v: summary_stats_map_) {
> I think we need a separate case to handle when total_num_values is 0 becaus
Done


PS4, Line 669: (
> this brace isn't closed at the end
Done


PS4, Line 676: v.second->unit()
> this isn't in the same Counter's units (which is time for the use case you 
Oops, yes. Done.


PS4, Line 1017:  else if
> this isn't an else if, you need to execute both branches
Done


http://gerrit.cloudera.org:8080/#/c/4371/4/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

PS4, Line 336: MinMaxAvgValueCounter
> update
Done


PS4, Line 337: averages the time taken to
             :     # scan and process the Parquet footer across scan ranges and also records the min and
             :     # the max time.
> keeps statistics for the time taken to scan and Process the Parquet footer 
Done


PS4, Line 340:     if ranges_per_node > 1:
             :       num_equal = 0
             :       for min_max_time in footer_processing_time_list:
             :         num_equal += (min_max_time[0] == min_max_time[1])
             :       # The chance that the times taken by different scan ranges to process the footer,
             :       # are equal, is very unlikely. However, it could happen which could make the min and
             :       # max time to process a footer equal. Assert that at a time, no more than 1 node has
             :       # equal min and max time.
             :       assert num_equal < 2
> following up with my previous comment, I still think this is weird and was 
I've removed the multi scan range test and just added a simple verify assert to see if all the values are the same if we have only one range per node.

I've added BE unit tests to verify that the counter works properly, so we don't need to do that again here.


-- 
To view, visit http://gerrit.cloudera.org:8080/4371
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3823: Add timer to measure Parquet footer reads

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has posted comments on this change.

Change subject: IMPALA-3823: Add timer to measure Parquet footer reads
......................................................................


Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4371/6/be/src/util/runtime-profile-counters.h
File be/src/util/runtime-profile-counters.h:

Line 228:   int32_t total_num_values() const { return total_num_values_; }
> Since it's only a read, I figured even if the CPU reorders the read instruc
Without reader synchronization you can probably have the following sequence:

Previously, counter.min = counter.max = 0

Writer:
  counter.min = V
  counter.max = V

Reader:
  mn = counter->min_value() // == V
  mx = counter->max_value() // == 0
  DCHECK_LE(mn, mx) << "Fails"

So you would need to synchronize with the writer to ensure that an update appears atomic. 

TSO helps you avoid certain kinds of ordering bugs - but remember that others are trying to port this to PPC. You have, in RuntimeProfile::SummaryStatsCounter::UpdateCounter():

  ++total_num_values_;
  sum_ += new_value;

If you did an unsynchronized read of sum and total_num_values on a non-TSO architecture, you could easily get total_num_values == 0 but sum > 0.


-- 
To view, visit http://gerrit.cloudera.org:8080/4371
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3823: Add timer to measure Parquet footer reads

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Sailesh Mukil has uploaded a new patch set (#5).

Change subject: IMPALA-3823: Add timer to measure Parquet footer reads
......................................................................

IMPALA-3823: Add timer to measure Parquet footer reads

It's been observed that Parquet footer reads perform poorly especially
when reading from S3. This patch adds a timer "FooterProcessingTimer"
which keeps a track of the average time each split of each scan node
spends in reading and processing the parquet footer.

Added a new utility counter called SummaryStatsCounter which keeps
track of the min, max and average values seen so far from a set of
values. This counter is used to calculate the min, max and average
time taken to scan and process Parquet footers per query per node.

The RuntimeProfile has also been updated to keep a track of, display
and serialize this new counter to thrift.

BE tests have been added to verify that this counter works fine.

Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/util/runtime-profile-counters.h
M be/src/util/runtime-profile-test.cc
M be/src/util/runtime-profile.cc
M be/src/util/runtime-profile.h
M common/thrift/RuntimeProfile.thrift
M tests/query_test/test_scanners.py
8 files changed, 288 insertions(+), 16 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/71/4371/5
-- 
To view, visit http://gerrit.cloudera.org:8080/4371
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>

[Impala-ASF-CR] IMPALA-3823: Add timer to measure Parquet footer reads

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-3823: Add timer to measure Parquet footer reads
......................................................................


Patch Set 9:

After you address Matt's comments I can take a final look.

-- 
To view, visit http://gerrit.cloudera.org:8080/4371
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-3823: Add timer to measure Parquet footer reads

Posted by "Matthew Jacobs (Code Review)" <ge...@cloudera.org>.
Matthew Jacobs has posted comments on this change.

Change subject: IMPALA-3823: Add timer to measure Parquet footer reads
......................................................................


Patch Set 4:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/4371/4/be/src/util/runtime-profile-counters.h
File be/src/util/runtime-profile-counters.h:

PS4, Line 147: .
> When you say 'child counters', do you mean it in the terms of the child cou
Yeah, ignore this. I was trying to further differentiate this from the new counter.

In the future we can think about cleaning up all the different kinds of Counters we have.


http://gerrit.cloudera.org:8080/#/c/4371/5/be/src/util/runtime-profile.cc
File be/src/util/runtime-profile.cc:

PS5, Line 1020: }
              : 
              : voi
1 line


PS5, Line 1023:   lock_guard<SpinLock> l(counter_lock_);
              :   unit_ = counter.unit;
              :   c
1 line


http://gerrit.cloudera.org:8080/#/c/4371/5/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

Line 338:     # scan and process the Parquet footer across scan ranges and also records the min and
can you add a TODO for handling the case when there is more than one range per node? You can add it requires parsing the times and there is unit test coverage.


-- 
To view, visit http://gerrit.cloudera.org:8080/4371
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3823: Add timer to measure Parquet footer reads

Posted by "Matthew Jacobs (Code Review)" <ge...@cloudera.org>.
Matthew Jacobs has posted comments on this change.

Change subject: IMPALA-3823: Add timer to measure Parquet footer reads
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4371/2/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

PS2, Line 343:       # The chance that the times taken by different scan ranges to process the footer,
             :       # are equal, is very unlikely. However, it could happen which could make the min and
             :       # max time to process a footer equal. Assert that at a time, no more than 1 node has
             :       # equal min and max time.
             :       assert num_equal < 2
this seems brittle...

how about checking that:

0 <= min <= avg <= max < total time (e.g. scanner thread wall clock time?)


-- 
To view, visit http://gerrit.cloudera.org:8080/4371
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3823: Add timer to measure Parquet footer reads

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Sailesh Mukil has uploaded a new patch set (#3).

Change subject: IMPALA-3823: Add timer to measure Parquet footer reads
......................................................................

IMPALA-3823: Add timer to measure Parquet footer reads

It's been observed that Parquet footer reads perform poorly especially
when reading from S3. This patch adds a timer "FooterProcessingTimer"
which keeps a track of the average time each split of each scan node
spends in reading and processing the parquet footer.

Added a new utility counter called SummaryStatsCounter which keeps
track of the min, max and average values seen so far from a set of
values. This counter is used to calculate the min, max and average
time taken to scan and process Parquet footers per query per node.

The RuntimeProfile has also been updated to keep a track of, display
and serialize this new counter to thrift.

A test has been added to test that this counter works fine when there
are multiple blocks to scan per node.

Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/util/runtime-profile-counters.h
M be/src/util/runtime-profile.cc
M be/src/util/runtime-profile.h
M common/thrift/RuntimeProfile.thrift
M tests/query_test/test_scanners.py
7 files changed, 225 insertions(+), 17 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/71/4371/3
-- 
To view, visit http://gerrit.cloudera.org:8080/4371
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>

[Impala-ASF-CR] IMPALA-3823: Add timer to measure Parquet footer reads

Posted by "Matthew Jacobs (Code Review)" <ge...@cloudera.org>.
Matthew Jacobs has posted comments on this change.

Change subject: IMPALA-3823: Add timer to measure Parquet footer reads
......................................................................


Patch Set 2:

(16 comments)

Nice! I think this counter may be useful more generally, though I think we should rename it (and variable names based on it).

http://gerrit.cloudera.org:8080/#/c/4371/2//COMMIT_MSG
Commit Message:

PS2, Line 15: a 
remove a


PS2, Line 18: This is also displayed in the RuntimeProfile.
remove, this is implied if it is a runtime profile counter


PS2, Line 21: move this new MinMaxAvgValueCounter between nodes through Thrift
serialize ... to thrift


http://gerrit.cloudera.org:8080/#/c/4371/2/be/src/util/runtime-profile-counters.h
File be/src/util/runtime-profile-counters.h:

PS2, Line 214: MinMaxAvgValueCounter
I'd be nice to have a simpler name rather than enumerate the stats this keeps, e.g. SummaryStatsCounter


PS2, Line 223: }
             :   M
space between these


PS2, Line 226: total_num_values_(0),
             :      current_sum_(0)
initialize min and max as well


PS2, Line 230:   int64_t min_value() const {
             :       return current_min_;
             :   }
1 line


PS2, Line 234:   int64_t max_value() const {
             :       return current_max_;
             :   }
1 line


PS2, Line 238:  /// Update current_sum_ with the new value. And also update the min and the max number
             :   /// seen so far.
this can be one sentence


PS2, Line 240: Counter* new_counter
This is a bit confusing since it starts to look like the AveragedCounter which actually keeps Counter*s. I think this should just take the raw values since it doesn't really matter whether or not the new values come from another counter or not. It'd also make this more useful more generally if the source isn't another counter.


PS2, Line 244:   virtual void Set(double value) {
             :     DCHECK(false);
             :   }
             : 
             :   virtual void Set(int64_t value) {
             :     DCHECK(false);
             :   }
             : 
             :   virtual void Add(int64_t delta) {
             :     DCHECK(false);
             :   }
remove extra space, each fn can be 1 line. I know this is what happens above in AveragedCounter, and it'd be nice to remove the extra space there too.


http://gerrit.cloudera.org:8080/#/c/4371/2/be/src/util/runtime-profile.cc
File be/src/util/runtime-profile.cc:

PS2, Line 303: if (it == min_max_avg_value_counter_map_.end()) {
if this counter already existed, shouldn't the values be updated?


Line 661:   {
can you add a comment with how this looks printed out?


PS2, Line 794: vector<TMinMaxAvgValueCounter>());
             :       node.min_max_avg_value_counters.resize(min_max_avg_value_counter_map_.size());
vector can be constructed with a size parameter to do this in 1 step.


http://gerrit.cloudera.org:8080/#/c/4371/2/common/thrift/RuntimeProfile.thrift
File common/thrift/RuntimeProfile.thrift:

PS2, Line 56: TMinMaxAvgValueCounter
TSummaryStatsCounter


PS2, Line 59: value
sum?


-- 
To view, visit http://gerrit.cloudera.org:8080/4371
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3823: Add timer to measure Parquet footer reads

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Hello Matthew Jacobs,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4371

to look at the new patch set (#8).

Change subject: IMPALA-3823: Add timer to measure Parquet footer reads
......................................................................

IMPALA-3823: Add timer to measure Parquet footer reads

It's been observed that Parquet footer reads perform poorly especially
when reading from S3. This patch adds a timer "FooterProcessingTimer"
which keeps a track of the average time each split of each scan node
spends in reading and processing the parquet footer.

Added a new utility counter called SummaryStatsCounter which keeps
track of the min, max and average values seen so far from a set of
values. This counter is used to calculate the min, max and average
time taken to scan and process Parquet footers per query per node.

The RuntimeProfile has also been updated to keep a track of, display
and serialize this new counter to thrift.

BE tests have been added to verify that this counter works fine.

Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/util/runtime-profile-counters.h
M be/src/util/runtime-profile-test.cc
M be/src/util/runtime-profile.cc
M be/src/util/runtime-profile.h
M common/thrift/RuntimeProfile.thrift
M tests/query_test/test_scanners.py
8 files changed, 301 insertions(+), 16 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/71/4371/8
-- 
To view, visit http://gerrit.cloudera.org:8080/4371
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>

[Impala-ASF-CR] IMPALA-3823: Add timer to measure Parquet footer reads

Posted by "Matthew Jacobs (Code Review)" <ge...@cloudera.org>.
Matthew Jacobs has posted comments on this change.

Change subject: IMPALA-3823: Add timer to measure Parquet footer reads
......................................................................


Patch Set 10: Code-Review+1

-- 
To view, visit http://gerrit.cloudera.org:8080/4371
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
Gerrit-PatchSet: 10
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-3823: Add timer to measure Parquet footer reads

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-3823: Add timer to measure Parquet footer reads
......................................................................


Patch Set 10: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4371/10/be/src/util/runtime-profile-counters.h
File be/src/util/runtime-profile-counters.h:

Line 209:   // This replaces the existing counter with the new values that are passed as arguments.
I'm not sure what this comment means -- this is a constructor, so how can the counter be existing already?


-- 
To view, visit http://gerrit.cloudera.org:8080/4371
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
Gerrit-PatchSet: 10
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3823: Add timer to measure Parquet footer reads

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-3823: Add timer to measure Parquet footer reads
......................................................................


Patch Set 9: Code-Review+1

Carry +1.

-- 
To view, visit http://gerrit.cloudera.org:8080/4371
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-3823: Add timer to measure Parquet footer reads

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Hello Henry Robinson, Matthew Jacobs,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4371

to look at the new patch set (#10).

Change subject: IMPALA-3823: Add timer to measure Parquet footer reads
......................................................................

IMPALA-3823: Add timer to measure Parquet footer reads

It's been observed that Parquet footer reads perform poorly especially
when reading from S3. This patch adds a timer "FooterProcessingTimer"
which keeps a track of the average time each split of each scan node
spends in reading and processing the parquet footer.

Added a new utility counter called SummaryStatsCounter which keeps
track of the min, max and average values seen so far from a set of
values. This counter is used to calculate the min, max and average
time taken to scan and process Parquet footers per query per node.

The RuntimeProfile has also been updated to keep a track of, display
and serialize this new counter to thrift.

BE tests have been added to verify that this counter works fine.

Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/util/runtime-profile-counters.h
M be/src/util/runtime-profile-test.cc
M be/src/util/runtime-profile.cc
M be/src/util/runtime-profile.h
M common/thrift/RuntimeProfile.thrift
M tests/query_test/test_scanners.py
8 files changed, 304 insertions(+), 16 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/71/4371/10
-- 
To view, visit http://gerrit.cloudera.org:8080/4371
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
Gerrit-PatchSet: 10
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>

[Impala-ASF-CR] IMPALA-3823: Add timer to measure Parquet footer reads

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-3823: Add timer to measure Parquet footer reads
......................................................................


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4371/6/be/src/util/runtime-profile-counters.h
File be/src/util/runtime-profile-counters.h:

Line 228:   int64_t MaxValue();
> Without reader synchronization you can probably have the following sequence
Yes, that makes sense. I've changed it to grab the spinlock before returning the value.


-- 
To view, visit http://gerrit.cloudera.org:8080/4371
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3823: Add timer to measure Parquet footer reads

Posted by "Marcel Kornacker (Code Review)" <ge...@cloudera.org>.
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-3823: Add timer to measure Parquet footer reads
......................................................................


Patch Set 11: Verified+1

-- 
To view, visit http://gerrit.cloudera.org:8080/4371
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
Gerrit-PatchSet: 11
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-3823: Add timer to measure Parquet footer reads

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-3823: Add timer to measure Parquet footer reads
......................................................................


Patch Set 6: Code-Review+1

(3 comments)

Carry +1

http://gerrit.cloudera.org:8080/#/c/4371/5/be/src/util/runtime-profile.cc
File be/src/util/runtime-profile.cc:

PS5, Line 1020:   if (new_value < min_) { min_ = new_value; }
              :   if (new_value > max_) { max_ = new_value; }
              : }
> 1 line
Done


PS5, Line 1023: 
              : void RuntimeProfile::SummaryStatsCounter::SetStats(const TSummaryStatsCounter& counter) {
              :   l
> 1 line
Done


http://gerrit.cloudera.org:8080/#/c/4371/5/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

Line 338:     # and average values as the same since we have only one sample (i.e. only one range)
> can you add a TODO for handling the case when there is more than one range 
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/4371
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3823: Add timer to measure Parquet footer reads

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Hello Matthew Jacobs,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4371

to look at the new patch set (#6).

Change subject: IMPALA-3823: Add timer to measure Parquet footer reads
......................................................................

IMPALA-3823: Add timer to measure Parquet footer reads

It's been observed that Parquet footer reads perform poorly especially
when reading from S3. This patch adds a timer "FooterProcessingTimer"
which keeps a track of the average time each split of each scan node
spends in reading and processing the parquet footer.

Added a new utility counter called SummaryStatsCounter which keeps
track of the min, max and average values seen so far from a set of
values. This counter is used to calculate the min, max and average
time taken to scan and process Parquet footers per query per node.

The RuntimeProfile has also been updated to keep a track of, display
and serialize this new counter to thrift.

BE tests have been added to verify that this counter works fine.

Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/util/runtime-profile-counters.h
M be/src/util/runtime-profile-test.cc
M be/src/util/runtime-profile.cc
M be/src/util/runtime-profile.h
M common/thrift/RuntimeProfile.thrift
M tests/query_test/test_scanners.py
8 files changed, 286 insertions(+), 16 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/71/4371/6
-- 
To view, visit http://gerrit.cloudera.org:8080/4371
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>

[Impala-ASF-CR] IMPALA-3823: Add timer to measure Parquet footer reads

Posted by "Matthew Jacobs (Code Review)" <ge...@cloudera.org>.
Matthew Jacobs has posted comments on this change.

Change subject: IMPALA-3823: Add timer to measure Parquet footer reads
......................................................................


Patch Set 3:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/4371/3/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

PS3, Line 430: avg_timer_
let's update the variable names too. e.g. something like process_footer_timer_stats_ or such


http://gerrit.cloudera.org:8080/#/c/4371/2/be/src/util/runtime-profile-counters.h
File be/src/util/runtime-profile-counters.h:

PS2, Line 240: t TSummaryStatsCount
> Yes that's right. The only reason I kept it as a Counter instead of a value
Thanks

I think it's more useful to remove the restriction. Even in hdfs-parquet-scanner where you're using this, there is a Counter constructed just for it's timer, it's not included in a profile. That code might as well have just used a timing function explicitly. (I'm not saying it's worth changing that code, just that there's no dependency on this taking Counters.)

Also, I don't think we get much from checking the Counter's unit. We don't really set counter units carefully in many cases (e.g. we use UNIT all over the place instead of anything useful), and ultimately it's up to the user of this interface to make sure what they're providing here makes sense.


http://gerrit.cloudera.org:8080/#/c/4371/3/be/src/util/runtime-profile-counters.h
File be/src/util/runtime-profile-counters.h:

PS3, Line 240: Overwrite
"SetStats()" to be more consistent with some of the other fns, e.g. SetSamples(). If I saw Overwrite() I'd wonder why there wasn't just a copy constructor or something like that.


http://gerrit.cloudera.org:8080/#/c/4371/2/be/src/util/runtime-profile.cc
File be/src/util/runtime-profile.cc:

Line 661:   }
> Done
thanks!


http://gerrit.cloudera.org:8080/#/c/4371/3/be/src/util/runtime-profile.cc
File be/src/util/runtime-profile.cc:

PS3, Line 668: - 
I don't see this dash in the example above? Is it intentional? I see the event markers under an event sequence make a list, though those are items of a specific timeline. I don't see why that'd be the case here.

Also value() is an average, right? I think it'd be good to have that specified.


It'd be nice to include the number of samples as well. Also, avg isn't really all that different from the other stats, so maybe just print them all the same way, e.g.:

" <name>:  avg: <avgval>, min: <minval>, max: <maxval>, num_samples: <numval>"


-- 
To view, visit http://gerrit.cloudera.org:8080/4371
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3823: Add timer to measure Parquet footer reads

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Hello Henry Robinson, Matthew Jacobs, Dan Hecht,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4371

to look at the new patch set (#11).

Change subject: IMPALA-3823: Add timer to measure Parquet footer reads
......................................................................

IMPALA-3823: Add timer to measure Parquet footer reads

It's been observed that Parquet footer reads perform poorly especially
when reading from S3. This patch adds a timer "FooterProcessingTimer"
which keeps a track of the average time each split of each scan node
spends in reading and processing the parquet footer.

Added a new utility counter called SummaryStatsCounter which keeps
track of the min, max and average values seen so far from a set of
values. This counter is used to calculate the min, max and average
time taken to scan and process Parquet footers per query per node.

The RuntimeProfile has also been updated to keep a track of, display
and serialize this new counter to thrift.

BE tests have been added to verify that this counter works fine.

Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/util/runtime-profile-counters.h
M be/src/util/runtime-profile-test.cc
M be/src/util/runtime-profile.cc
M be/src/util/runtime-profile.h
M common/thrift/RuntimeProfile.thrift
M tests/query_test/test_scanners.py
8 files changed, 303 insertions(+), 16 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/71/4371/11
-- 
To view, visit http://gerrit.cloudera.org:8080/4371
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
Gerrit-PatchSet: 11
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>

[Impala-ASF-CR] IMPALA-3823: Add timer to measure Parquet footer reads

Posted by "Matthew Jacobs (Code Review)" <ge...@cloudera.org>.
Matthew Jacobs has posted comments on this change.

Change subject: IMPALA-3823: Add timer to measure Parquet footer reads
......................................................................


Patch Set 5: Code-Review+1

-- 
To view, visit http://gerrit.cloudera.org:8080/4371
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-3823: Add timer to measure Parquet footer reads

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Hello Matthew Jacobs,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4371

to look at the new patch set (#7).

Change subject: IMPALA-3823: Add timer to measure Parquet footer reads
......................................................................

IMPALA-3823: Add timer to measure Parquet footer reads

It's been observed that Parquet footer reads perform poorly especially
when reading from S3. This patch adds a timer "FooterProcessingTimer"
which keeps a track of the average time each split of each scan node
spends in reading and processing the parquet footer.

Added a new utility counter called SummaryStatsCounter which keeps
track of the min, max and average values seen so far from a set of
values. This counter is used to calculate the min, max and average
time taken to scan and process Parquet footers per query per node.

The RuntimeProfile has also been updated to keep a track of, display
and serialize this new counter to thrift.

BE tests have been added to verify that this counter works fine.

Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/util/runtime-profile-counters.h
M be/src/util/runtime-profile-test.cc
M be/src/util/runtime-profile.cc
M be/src/util/runtime-profile.h
M common/thrift/RuntimeProfile.thrift
M tests/query_test/test_scanners.py
8 files changed, 286 insertions(+), 16 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/71/4371/7
-- 
To view, visit http://gerrit.cloudera.org:8080/4371
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>

[Impala-ASF-CR] IMPALA-3823: Add timer to measure Parquet footer reads

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-3823: Add timer to measure Parquet footer reads
......................................................................


Patch Set 6:

(8 comments)

Do we have any counters where we track running sum and number of samples and compute the average by hand (or by the user)? i.e. any other existing places we should use this counter?

http://gerrit.cloudera.org:8080/#/c/4371/6/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 192:   Status footer_status;
why was this moved to a separate line?


http://gerrit.cloudera.org:8080/#/c/4371/6/be/src/util/runtime-profile-counters.h
File be/src/util/runtime-profile-counters.h:

PS6, Line 49: TIMER
is there anything time-specific about this counter?  why not ADD_SUMMARY_STATS_COUNTER(profile, name, unit)?


PS6, Line 205: AveragedCounter
do we plan to rename that? would be nice.


PS6, Line 206: untimeProfile::Counter
what does the base class' value_ hold?


PS6, Line 221: INT64_MAX
we usually use numeric_limits


Line 228:   int32_t total_num_values() const { return total_num_values_; }
what's the thread safety guarantees of our counters?  why can these be unlocked/non-atomic?


PS6, Line 240: This function 
delete


http://gerrit.cloudera.org:8080/#/c/4371/6/be/src/util/runtime-profile.h
File be/src/util/runtime-profile.h:

PS6, Line 159: min/max average
this sounds like a "min-average" "max-average" or something. how about just: Adds a counter that tracks min, max and average ... 
or something like that.


-- 
To view, visit http://gerrit.cloudera.org:8080/4371
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3823: Add timer to measure Parquet footer reads

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-3823: Add timer to measure Parquet footer reads
......................................................................


Patch Set 10: Code-Review+1

(3 comments)

Sorry for the delay. I've made the changes.
Carry +1

http://gerrit.cloudera.org:8080/#/c/4371/9/be/src/util/runtime-profile-counters.h
File be/src/util/runtime-profile-counters.h:

PS9, Line 256: // Protects min
> comment on what this protects
Done


http://gerrit.cloudera.org:8080/#/c/4371/9/be/src/util/runtime-profile.cc
File be/src/util/runtime-profile.cc:

PS9, Line 1019:   lock_guard<SpinLock> 
> Was there a reason not to take lock_ here?
No, that was a mistake. Added it now.


http://gerrit.cloudera.org:8080/#/c/4371/9/common/thrift/RuntimeProfile.thrift
File common/thrift/RuntimeProfile.thrift:

PS9, Line 94: summary stats counter
> update
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/4371
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
Gerrit-PatchSet: 10
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3823: Add timer to measure Parquet footer reads

Posted by "Marcel Kornacker (Code Review)" <ge...@cloudera.org>.
Marcel Kornacker has submitted this change and it was merged.

Change subject: IMPALA-3823: Add timer to measure Parquet footer reads
......................................................................


IMPALA-3823: Add timer to measure Parquet footer reads

It's been observed that Parquet footer reads perform poorly especially
when reading from S3. This patch adds a timer "FooterProcessingTimer"
which keeps a track of the average time each split of each scan node
spends in reading and processing the parquet footer.

Added a new utility counter called SummaryStatsCounter which keeps
track of the min, max and average values seen so far from a set of
values. This counter is used to calculate the min, max and average
time taken to scan and process Parquet footers per query per node.

The RuntimeProfile has also been updated to keep a track of, display
and serialize this new counter to thrift.

BE tests have been added to verify that this counter works fine.

Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
Reviewed-on: http://gerrit.cloudera.org:8080/4371
Reviewed-by: Sailesh Mukil <sa...@cloudera.com>
Tested-by: Marcel Kornacker <ma...@cloudera.com>
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/util/runtime-profile-counters.h
M be/src/util/runtime-profile-test.cc
M be/src/util/runtime-profile.cc
M be/src/util/runtime-profile.h
M common/thrift/RuntimeProfile.thrift
M tests/query_test/test_scanners.py
8 files changed, 303 insertions(+), 16 deletions(-)

Approvals:
  Marcel Kornacker: Verified
  Sailesh Mukil: Looks good to me, approved



-- 
To view, visit http://gerrit.cloudera.org:8080/4371
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
Gerrit-PatchSet: 12
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>

[Impala-ASF-CR] IMPALA-3823: Add timer to measure Parquet footer reads

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Hello Henry Robinson, Matthew Jacobs,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4371

to look at the new patch set (#9).

Change subject: IMPALA-3823: Add timer to measure Parquet footer reads
......................................................................

IMPALA-3823: Add timer to measure Parquet footer reads

It's been observed that Parquet footer reads perform poorly especially
when reading from S3. This patch adds a timer "FooterProcessingTimer"
which keeps a track of the average time each split of each scan node
spends in reading and processing the parquet footer.

Added a new utility counter called SummaryStatsCounter which keeps
track of the min, max and average values seen so far from a set of
values. This counter is used to calculate the min, max and average
time taken to scan and process Parquet footers per query per node.

The RuntimeProfile has also been updated to keep a track of, display
and serialize this new counter to thrift.

BE tests have been added to verify that this counter works fine.

Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/util/runtime-profile-counters.h
M be/src/util/runtime-profile-test.cc
M be/src/util/runtime-profile.cc
M be/src/util/runtime-profile.h
M common/thrift/RuntimeProfile.thrift
M tests/query_test/test_scanners.py
8 files changed, 302 insertions(+), 16 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/71/4371/9
-- 
To view, visit http://gerrit.cloudera.org:8080/4371
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icf87bad90037dd0cea63b10c537382ec0f980cbf
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>