You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Philip Zeyliger (Code Review)" <ge...@cloudera.org> on 2019/01/10 17:05:18 UTC

[Impala-ASF-CR] IMPALA-7694: Add host resource usage metrics to profile

Philip Zeyliger has posted comments on this change. ( http://gerrit.cloudera.org:8080/12069 )

Change subject: IMPALA-7694: Add host resource usage metrics to profile
......................................................................


Patch Set 8:

(25 comments)

Thanks for the update! I want to do another pass, but I think I've got enough comments that it's useful to click send now. Overall, I think the structure looks great.

http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/runtime/coordinator-backend-state.cc
File be/src/runtime/coordinator-backend-state.cc:

http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/runtime/coordinator-backend-state.cc@57
PS8, Line 57:     const vector<FragmentStats*>& fragment_stats, RuntimeProfile* resource_profile,
The other places we use resource_profile, it's a variable name for TBackendResourceProfile. I think it's ok with this name, but I had to think about it. I don't have better suggestions, so feel free to ignore.


http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/runtime/query-state.cc@142
PS8, Line 142:   ExecEnv* exec_env = ExecEnv::GetInstance();
Can you move this to line 146 since it's only used in there?


http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/runtime/query-state.cc@148
PS8, Line 148:         return exec_env->system_state_info()->GetCpuUsageRatios().user;
Should you capture

exec_env->system_state_info()

rather than

exec_env?

I don't know that it matters.


http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/runtime/runtime-state.cc
File be/src/runtime/runtime-state.cc:

http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/runtime/runtime-state.cc@76
PS8, Line 76:     profile_(RuntimeProfile::Create(obj_pool(), "<fragment instance>")),
Is removing the id from the second argument here intentional?


http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/pretty-printer.h
File be/src/util/pretty-printer.h:

http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/pretty-printer.h@139
PS8, Line 139:       case TUnit::BASE_POINTS: {
             :         DCHECK_LE(value, 10000);
             :         ss << (value / 100);
             :         break;
             :       }
What's this end up looking like? Do we want to have some form of double precision here?


http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile-counters.h
File be/src/util/runtime-profile-counters.h:

http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile-counters.h@34
PS8, Line 34: namespace impala {
You wrote a really lovely summary of the class structures in runtime-profile.h. Perhaps add a note here to point to that file?


http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile-counters.h@397
PS8, Line 397:   const int64_t* GetSamplesTest(int* num_samples, int* period) {
I don't mind, but we definitely have an occasional pattern of using friend classes to let tests reach into stuff they shouldn't, e.g.:

friend class LlvmCodeGenTest_HashTest_Test;


http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile-test.cc
File be/src/util/runtime-profile-test.cc:

http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile-test.cc@907
PS8, Line 907:   // Make sure the values are gone.
             :   counter->GetSamplesTest(&num_samples, &period);
             :   EXPECT_EQ(num_samples, 0);
This isn't racy because "StopPeriodicCounters()" has been called, right?


http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile.cc
File be/src/util/runtime-profile.cc:

http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile.cc@780
PS8, Line 780:       const int64_t* samples = counter->GetSamplesLocked(&num, &period);
If GetSamplesLocked() mutates counter, should we be doing that in a PrettyPrint()?


http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile.cc@784
PS8, Line 784:         int step = std::max((num - 1) / 32, 1);
If num = 128, we should have step = 2. But, here, we have step = (128-1)/32 = 3.

If I'm right, perhaps:

int step = 1;
if (num > 64) {
  step = ceil(num/64.0)
  // same as?
  // step = 1 + (num - 1)/64
}


http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile.cc@793
PS8, Line 793:           stream << " (thrift profile has " << num << " values, showing " <<
nit:

Perhaps simply "showing X of Y values" (elide "thrift profile")


http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile.cc@1243
PS8, Line 1243:   // Mutable
I agree that it's unintuitive that a GetSamples() function mutates the structure. Is that what you're trying to say here? Maybe elaborate on this comment or maybe we can find a better method name. (I don't have any great ideas; feel free to ignore.)


http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile.cc@1290
PS8, Line 1290:       ChunkedTimeSeriesCounter* c = dynamic_cast<ChunkedTimeSeriesCounter*>(it.second);
Are we cool with dynamic_cast in this code base? I found only one or two uses. https://google.github.io/styleguide/cppguide.html#Run-Time_Type_Information__RTTI_ dislikes it.

// When you upcast (that is, cast a pointer from type Foo to type
// SuperclassOfFoo), it's fine to use implicit_cast<>, since upcasts
// always succeed.  When you downcast (that is, cast a pointer from
// type Foo to type SubclassOfFoo), static_cast<> isn't safe, because
// how do you know the pointer is really of type SubclassOfFoo?  It
// could be a bare Foo, or of type DifferentSubclassOfFoo.  Thus,
// when you downcast, you should use this macro.  In debug mode, we
// use dynamic_cast<> to double-check the downcast is legal (we die
// if it's not).  In normal mode, we do the efficient static_cast<>
// instead.  Thus, it's important to test in debug mode to make sure
// the cast is legal!
//    This is the only place in the code we should use dynamic_cast<>.
// In particular, you SHOULDN'T be using dynamic_cast<> in order to
// do RTTI (eg code like this:
//    if (dynamic_cast<Subclass1>(foo)) HandleASubclass1Object(foo);
//    if (dynamic_cast<Subclass2>(foo)) HandleASubclass2Object(foo);
// You should design the code some other way not to need this.

template<typename To, typename From>     // use like this: down_cast<T*>(foo);
inline To down_cast(From* f) {                   // so we only accept pointers


http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/system-state-info.h
File be/src/util/system-state-info.h:

http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/system-state-info.h@70
PS8, Line 70:   enum PROC_STAT_CPU_FIELDS {
            :     CPU_USER,
            :     CPU_NICE,
            :     CPU_SYSTEM,
            :     CPU_IDLE,
            :     CPU_IOWAIT
            :   };
            : 
            :   typedef std::array<int64_t, NUM_CPU_FIELDS> CpuValues;
I assume we have a reason to not do struct CpuValues { int64_t cpu_user } but rather have this enum driven? If so, let's write it down. If not, maybe a struct is better?


http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/system-state-info.cc
File be/src/util/system-state-info.cc:

http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/system-state-info.cc@33
PS8, Line 33:   memset(&cpu_ratios_, 0, sizeof(cpu_ratios_));
Can this have been

cpu_ratios_ { 0 } in the constructor initializer list?

(Note that I'm far from an expert on this stuff.)


http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/system-state-info.cc@88
PS8, Line 88:   int64_t cur_sum = accumulate(cur.begin(), cur.end(), 0);
I get what you're doing, but I think

CpuValues.total() is clearer if you go down the route of making it a struct.


http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/system-state-info.cc@95
PS8, Line 95:     memset(&cpu_ratios_, 0, sizeof(cpu_ratios_));
I don't get this case (or the comment). Is this just:

"If no time has passed, the ratio is zero (to avoid dividing by zero)". Should we have simply not read the value to begin with?


http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/system-state-info.cc@99
PS8, Line 99:   // Convert each ratio to base points.
I think these are called "basis points", or at least I couldn't find a base point reference.


http://gerrit.cloudera.org:8080/#/c/12069/8/bin/plot_profile_resource_usage.py
File bin/plot_profile_resource_usage.py:

http://gerrit.cloudera.org:8080/#/c/12069/8/bin/plot_profile_resource_usage.py@136
PS8, Line 136:   if len(space_separated) == 3:
             :     ts = int(space_separated[0])
             :     print datetime.datetime.fromtimestamp(ts / 1000.0).isoformat(), space_separated[1]
             :     base64_encoded = space_separated[2]
             :   elif len(space_separated) == 1:
             :     base64_encoded = space_separated[0]
             :   else:
             :     raise Exception("Unexpected line: " + line)
Not a blocker, but you could move this and bin/parse-thrift-profile.py both into lib/py and share code. Can be done separately.


http://gerrit.cloudera.org:8080/#/c/12069/8/common/thrift/Metrics.thrift
File common/thrift/Metrics.thrift:

http://gerrit.cloudera.org:8080/#/c/12069/8/common/thrift/Metrics.thrift@34
PS8, Line 34:   BASE_POINTS,
I think this is BASIS_POINTS.

https://www.merriam-webster.com/dictionary/basis%20point


http://gerrit.cloudera.org:8080/#/c/12069/8/tests/query_test/test_observability.py
File tests/query_test/test_observability.py:

http://gerrit.cloudera.org:8080/#/c/12069/8/tests/query_test/test_observability.py@365
PS8, Line 365:     for query_opts in [None, {'resource_trace_ratio': 1.0}]:
query_opts is not used? I think you cut and pasted this from the test below...

(Now, setting it to 0.0 and using it makes sense.)


http://gerrit.cloudera.org:8080/#/c/12069/8/tests/query_test/test_observability.py@368
PS8, Line 368:       expected_strs = ["CpuIoWaitPercentage (500.000ms):",
I would argue these are "unexpected", and, if these are not expected, maybe we can drop the duration or regexpify it?


http://gerrit.cloudera.org:8080/#/c/12069/8/tests/query_test/test_observability.py@417
PS8, Line 417:   @pytest.mark.execute_serially
Why serial? If you run it in parallel, the fact that it's long is not nearly as bothersome, because it's implicitly divided by 16 or whatever.


http://gerrit.cloudera.org:8080/#/c/12069/8/tests/query_test/test_observability.py@419
PS8, Line 419:     """Tests that the resolution of a CPU usage counter increases once they exceed 64
This is effectively a unit test for PrettyPrinting the profile, yes? If we wanted to test that function, which maybe I found a bug in, we could probably do it more explicitly.


http://gerrit.cloudera.org:8080/#/c/12069/8/tests/query_test/test_observability.py@442
PS8, Line 442:       assert len(string_values) <= 64
you can also assert len(string_values) >= 33, because if it's 32 or above, we shouldn't have sampled down.



-- 
To view, visit http://gerrit.cloudera.org:8080/12069
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3aedc20c553ab8d7ed50f72a1a936eba151487d9
Gerrit-Change-Number: 12069
Gerrit-PatchSet: 8
Gerrit-Owner: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <ph...@cloudera.com>
Gerrit-Comment-Date: Thu, 10 Jan 2019 17:05:18 +0000
Gerrit-HasComments: Yes