You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Alexey Serbin (Code Review)" <ge...@cloudera.org> on 2016/09/01 00:17:05 UTC

[kudu-CR] KUDU-456 Implement AUTO FLUSH BACKGROUND flush mode

Alexey Serbin has posted comments on this change.

Change subject: KUDU-456 Implement AUTO_FLUSH_BACKGROUND flush mode
......................................................................


Patch Set 25:

(44 comments)

Thank you for review.  I'm planning to post updated patch soon.

http://gerrit.cloudera.org:8080/#/c/3952/25/src/kudu/client/client-test.cc
File src/kudu/client/client-test.cc:

PS25, Line 182:   // Must be public to use as a thread closure.
> They can be static functions, though, right? I don't see them accessing any
Good observation: yes, now they can be static methods of this class.  Previously they could not.

But I don't want make them just static functions -- they need access to private members of KuduSession, and I don't like idea of declaring many friend functions for the KuduSession class.


PS25, Line 187: Microseconds
> How did you choose 10 us? It seems low to me (i.e. maybe we can get away wi
The idea was to monitor those values as closely as possible.  Without resorting to invasive changes to KuduSession::Data or a separate class which would inherit from KuduSession::Data to monitor the maximum, I had nothing better than this.

Why do you think 10 us is too low?


PS25, Line 199:   
> Nit: extra space here.
Done


PS25, Line 553: oprations
> Nit: operations
Done


PS25, Line 556: WaitForAutoFlushBackground
> Hmm, when this returns, there's no actual guarantee that a batch flushed, r
Exactly, that's the idea.  When guaranteed flush is needed, it's necessary to use KuduSession::Flush() call.


PS25, Line 558: MAX_WAIT_MS
> kMaxWaitMs
Done


Line 1207:         << "unexpected status: " << result.ToString();
> There's a better technique for handling situations like these:
OK, thanks.  Will use.


PS25, Line 2153: BUFFER_SIZE
> kBufferSizeBytes (camel case, k prefix, and append the unit)
Done


PS25, Line 2186: // Applying a big operation which does not fit into the buffer should
               : // return an error with session running in any supported flush mode.
> Doesn't this test make the second half of TestApplyTooMuchWithoutFlushing r
Nope.  This is different: this test verifies that an attempt to apply a _single_ operation which does not fit the buffer produces an error right away with no buffered data prior to that.

This test exercises different code path (not the same path as the second half of the TestApplyTooMuchWithoutFlushing test).


PS25, Line 2197: for (auto mode: FLUSH_MODES)
> Nit: auto mode : FLUSH_MODES
Done


PS25, Line 2215:     STLDeleteElements(&errors);
> Nit: use an ElementDeleter and declare it right after declaring 'errors'. T
Done


PS25, Line 2258:   WaitForAutoFlushBackground(session, FLUSH_INTERVAL_MS);
               :   // There should be some pending ops: the automatic background flush
               :   // should not be active after switch to the MANUAL_FLUSH mode.
> I see. How expensive is this in terms of test runtime? You've dropped the f
About 500 ms.  Is that too much?


Line 2304:     *result = t_end - t_begin;
> Have you looked at the StopWatch class? May be easier for this kind of thin
Will do, thanks!


PS25, Line 2335: // Check that call to Flush()/FlushAsync() is no-op if there isn't current
               : // batcher for the session.
> Doesn't this test make TestEmptyBatch redundant?
good observation -- I'll remove that.


PS25, Line 2348:   ASSERT_OK(sync0.Wait());
               :   // The same with the synchronous flush.
               :   ASSERT_OK(session->Flush());
               :   ASSERT_FALSE(session->HasPendingOperations());
               :   ASSERT_EQ(0, session->CountPendingErrors());
> How does this verify that flush was a no-op? Seems to me you need to find s
This is to test that no error happens when trying to call flush and current batcher is not there.  The comment for the test is misleading -- will fix.


PS25, Line 2391: kudu::Thread::Create
> Now that we're building with C++11, we prefer to use std::thread for test t
OK, that's sounds good.  I thought I had to use KuduThreads, otherwise I would use std::thread from the beginning.

Just another instance when addressing my doubts by direct asking  would help.  It seems I need to communicate more on issues like this.


Line 2428:   // AUTO_FLUSH_BACKGROUND should be faster than AUTO_FLUSH_SYNC.
> Hmm, is this expected to be true 100% of all test runs? Even with debug ins
Due to the nature of the things and chosen parameters, the difference is in order of magnitude.  Therefore, no flakiness is expected.  Besides, if using debug/TSAN/ASAN builds, both versions become slower, so they are expected to keep the speed ratio.

Yes, I did that check locally: ran it using shell while true; do ... || break; done cycle.  It looked stable (more than 4096 iterations in the release mode and 1024 in debug one).  Will do with dist-test.py as well.


Line 2429:   EXPECT_GT(t_diff_afs.ToMilliseconds(), t_diff_afb.ToMilliseconds());
> Can't we just EXPECT_GT on the MonoDeltas directly now?
Good observation!  Yes, now we can compare them directly.


PS25, Line 2451:   // AUTO_FLUSH_BACKGROUND should be faster than AUTO_FLUSH_SYNC.
               :   EXPECT_GT(t_diff_afs.ToMilliseconds(), t_diff_afb.ToMilliseconds());
> Same issues and questions as in the above test.
Yep, this test hasn't shown flakiness so far.

Yes, it's possible to compare MonoDelta directly now -- will update.


PS25, Line 2456: // applying a bunch of small rows without a flush should not result in
               : // an error, even with low limit on the buffer space.  This is because
               : // Session::Apply() blocks and waits for buffer space to become available.
> For this test, you may want to scan once you're done writing to make sure t
It's a good idea!  Actually, it would be nice to do that for every test related to batching while performing inserts via Apply() :)  But in this context, this test looks as the best place for that since if we expect an error like that, then it most likely to trigger here.

OK, I'll add that check.


Line 2488: // applying a bunch of rows every of which is so big in size that
> Nit: every one of which
Done


PS25, Line 2489: does
> Nit: do
Done


PS25, Line 2510: 3
> Why does i start at 3 and not 0?
Starting from i == 3: this is the lowest i when
((i - 1) * BUFFER_SIZE / i) has a value greater
than BUFFER_SIZE / 2.  We want a pair of operations
that both don't fit into the buffer,
but every single one does.

I will add the comment.


PS25, Line 2574:   ASSERT_TRUE(session->HasPendingOperations());
> This is the first test where I've noticed this but others may be affected t
Yes, I also have this concern.  The principal solution would be delaying/stopping the Messenger's task queue for a while or throttling the target tablet server to 0 operations-per-second, checking for the pre-conditions,
and then restoring the original state of affairs.  However, I did not find the way to do that.  If you know how to do that, please let me know.  Probably, that would be a good to-do project that will help to make many testcases more robust.

I ran multiple iterations of this test on my laptop and at the test machine: no flakiness observed.   Let's address this in separate changelist.


PS25, Line 2612:     const MonoTime t_begin(MonoTime::Now());
               :     ASSERT_OK(session->Apply(insert.release()));
               :     const MonoTime t_end(MonoTime::Now());
> Consider StopWatch for this.
OK, will do.  Thanks!


PS25, Line 2627:     // It's supposed that sending an operation to the server at least
               :     // order of magnitude less than the periodic flush interval.
> Nit: "Sending an operation to the server is supposed to be at least an orde
Actually, after some consideration I understood we can get rid of this assumption.


PS25, Line 2630:     EXPECT_GT(FLUSH_INTERVAL_MS / 10, t_diff.ToMilliseconds());
> Again, worried about flakiness due to testing timing like this.
Yes, I share that concern.  I ran the tests in cycle many times (1000, that's for sure).  This has shown no flakiness so far.

I addressed the concern via increasing the flush timeout and using Flush() while finalizing the test.


PS25, Line 2634: It's supposed the flushed operations
               :   // reach the tablet server faster than the max-wait flush interval.
> Nit: "The flushed operations should reach the tablet server faster than the
After some consideration I found it's better to call Flush() here: the only thing we want to ensure in this test is that the second Apply() does not take too long.  The functionality of the background flusher is checked in the TestAutoFlushBackgroundFlushTimeout test.


PS25, Line 2636:   WaitForAutoFlushBackground(session, FLUSH_INTERVAL_MS);
> This will be at least a 10s pause, right? If so, can you condition the enti
Actually, it should have been session->Flush() call here.  Fixed.


PS25, Line 2645: accomodate
> Nit: accommodate
Done


PS25, Line 2653: gurantees
> Nit: guarantees
Done


PS25, Line 2663: TestAutoFlushBackgroundApplyBlocks
> Lots of opportunities for flakiness here due to timing measurements, please
The comparable measurements are set to be different in order of magnitude.  I ran this test many times with TSAN, no issues appeared.

Having a provision to block processing of Messenger's queue would allow to have bullet-proof guarantee for stability of this test.  Let's do that in a separate changelist.


PS25, Line 2714: It's supposed the flushed operations
               :   // reach the tablet server faster than wait_timeout_ms.
> Nit: reword as per the above.
Actually, it should have been session->Flush() instead of WaitForAutoFlushBackground(): fixed.


http://gerrit.cloudera.org:8080/#/c/3952/25/src/kudu/client/client.h
File src/kudu/client/client.h:

PS25, Line 1351: successfull
> successful
Done


PS25, Line 1352: left
> be left
Done


http://gerrit.cloudera.org:8080/#/c/3952/25/src/kudu/client/session-internal.cc
File src/kudu/client/session-internal.cc:

PS25, Line 106: PRCs
> RPCs
Done


PS25, Line 150: PRCs
> RPCs
Done


PS25, Line 174: PRCs
> RPCs
Done


PS25, Line 208: PRCs
> RPCs
Done


http://gerrit.cloudera.org:8080/#/c/3952/25/src/kudu/client/session-internal.h
File src/kudu/client/session-internal.h:

PS25, Line 102: NoLock
> Kudu convention is to use an "Unlocked" suffix for methods like these.
Done


PS25, Line 144: This method is used by tests only.
> Kudu convention is to add the suffix "ForTests" for test-only methods like 
Done


PS25, Line 151: is here to represent
> Nit: just "represents"
Done


PS25, Line 152: // the first argument in expressions like FlushCurrentBatcher(1, cbk):
              :   // this is the watermark corresponding to 1 byte of data.
> Yes, but what is the effect? What is special about choosing a value of 1?
Oops, it seems I missed explaining the essence.

The effect is that using this watermark allows to flush the batcher only if it's non-empty.  This watermark level is the minimum possible for a non-empty batcher, so any non-empty batcher will be flushed if calling FlushCurrentBatcher() using this watermark.

I'll will add the comment.


PS25, Line 181:   // Note that this mutex should not be acquired if the thread is already
              :   // holding a Batcher's lock. This must be acquired first.
> How is such sequencing possible? Isn't the batcher's lock internal to the b
That's came from the prior lock_ description -- I just moved it here.  Frankly, after reading this I was also puzzled.

As I understand, this is outdated.  This comment is already stale in the original file.  As I see from git history, this came here from client.h during the 'pimplesation' of the KuduSession class.  It might come from old times when there was a separate batcher lock or the Batcher did not have its own locks.

Will remove.


-- 
To view, visit http://gerrit.cloudera.org:8080/3952
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I34905c30b3aad96f53cf7a1822b1cde6d25f33a8
Gerrit-PatchSet: 25
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes