Posted to issues@hbase.apache.org by "Konstantin Ryakhovskiy (JIRA)" <ji...@apache.org> on 2016/06/28 07:28:57 UTC

[jira] [Comment Edited] (HBASE-14422) Fix TestFastFailWithoutTestUtil

    [ https://issues.apache.org/jira/browse/HBASE-14422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352383#comment-15352383 ] 

Konstantin Ryakhovskiy edited comment on HBASE-14422 at 6/28/16 7:28 AM:
-------------------------------------------------------------------------

I checked out master, reverted commit e4bf77e2de54ab6ea17b95dc116af9abf24a332d, and modified one line to allow the code to compile.

Thread 1 (T-1) is in retry mode.
Thread 2 (T-2) is in fast-fail mode.

When the mode is fast-fail, the counter "done" gets incremented (by T-2); therefore, at some point T-1 should stop calling latch.await():
if (done.get() <= 1) 
  latches2[priviRetryCounter.get()].await();

T-2 increments the counter only when it is in fast-fail mode:
boolean pffe = false;
if (!isPriviThreadLocal.get().get()) 
  pffe = !((FastFailInterceptorContext)context).isRetryDespiteFastFailMode();
...
if (!isPriviThreadLocal.get().get()) {
  if (pffe) done.incrementAndGet();

The problem is in PreemptiveFastFailInterceptor#inFastFailMode():
return (fInfo != null && 
  EnvironmentEdgeManager.currentTime() >
  (fInfo.timeOfFirstFailureMilliSec + this.fastFailThresholdMilliSec));

With some "unlikely" timing, T-2 is in retry mode instead of fast-fail mode and the counter "done" is not incremented:
context.isRetryDespiteFastFailMode() returns true for T-2, which should never happen.
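
The time dependence of that check can be sketched in isolation. The following is a minimal, hypothetical stand-in, not the HBase class itself: the class name FastFailThresholdSketch, the shape of FailureInfo, and the clock passed in as a parameter are all assumptions made for illustration. Only the comparison is taken from the excerpt above.

```java
// Hypothetical stand-in for the quoted check; not the real PreemptiveFastFailInterceptor.
final class FastFailThresholdSketch {

    /** Assumed minimal shape of the per-server failure record. */
    static final class FailureInfo {
        final long timeOfFirstFailureMilliSec;

        FailureInfo(long timeOfFirstFailureMilliSec) {
            this.timeOfFirstFailureMilliSec = timeOfFirstFailureMilliSec;
        }
    }

    /** Same comparison as the quoted return statement, with the current time passed in. */
    static boolean inFastFailMode(FailureInfo fInfo, long nowMilliSec,
                                  long fastFailThresholdMilliSec) {
        return fInfo != null
            && nowMilliSec > fInfo.timeOfFirstFailureMilliSec + fastFailThresholdMilliSec;
    }

    public static void main(String[] args) {
        FailureInfo fInfo = new FailureInfo(1_000L);
        long threshold = 100L;
        // Exactly at the threshold: the strict '>' means this is not yet fast-fail mode.
        System.out.println(inFastFailMode(fInfo, 1_100L, threshold)); // prints false
        // One millisecond later the same call reports fast-fail mode.
        System.out.println(inFastFailMode(fInfo, 1_101L, threshold)); // prints true
    }
}
```

In this sketch the result depends only on when the call happens relative to the first failure, so a thread that evaluates the check before the threshold has elapsed is not treated as being in fast-fail mode.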

Can I just remove the check before incrementing the "done" counter,
if (pffe) ... ?
Increasing PAUSE_TIME might not help: it will decrease the probability of the heisenbug but will not remove it.
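
To make the effect of the "if (pffe)" guard concrete, here is a single-threaded sketch under stated assumptions. The class DoneCounterSketch and its method names are hypothetical, and the sketch models only the counter and the "done.get() <= 1" condition from the excerpts above, not the real test's threads or latches.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of the "done" counter; not code from TestFastFailWithoutTestUtil.
final class DoneCounterSketch {

    /** Guarded increment, as in the quoted excerpt: count only calls where pffe is true. */
    static int guarded(boolean[] pffePerCall) {
        AtomicInteger done = new AtomicInteger();
        for (boolean pffe : pffePerCall) {
            if (pffe) {
                done.incrementAndGet();
            }
        }
        return done.get();
    }

    /** Unguarded increment: count every call, whatever the value of pffe. */
    static int unguarded(boolean[] pffePerCall) {
        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < pffePerCall.length; i++) {
            done.incrementAndGet();
        }
        return done.get();
    }

    /** The waiter's condition from the excerpt: await the latch while done <= 1. */
    static boolean wouldAwait(int done) {
        return done <= 1;
    }

    public static void main(String[] args) {
        // Two calls by T-2 that both ended up in retry mode (pffe == false).
        boolean[] calls = {false, false};
        System.out.println(wouldAwait(guarded(calls)));   // prints true
        System.out.println(wouldAwait(unguarded(calls))); // prints false
    }
}
```

In this model the guarded counter never passes 1 when T-2 misses fast-fail mode, so the waiter keeps awaiting. The sketch does not show whether dropping the guard weakens what the test is meant to verify.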


was (Author: ryakhovskiy.k):
I checked out master, reverted commit e4bf77e2de54ab6ea17b95dc116af9abf24a332d, and modified one line to allow the code to compile.

Thread 1 (T-1) is in retry mode.
Thread 2 (T-2) is in fast-fail mode.

When the mode is fast-fail, the counter "done" gets incremented (by T-2); therefore, at some point T-1 should stop calling latch.await():
if (done.get() <= 1) 
  latches2[priviRetryCounter.get()].await();

T-2 increments the counter only when it is in fast-fail mode:
boolean pffe = false;
if (!isPriviThreadLocal.get().get()) 
  pffe = !((FastFailInterceptorContext)context).isRetryDespiteFastFailMode();
...
if (!isPriviThreadLocal.get().get()) {
  if (pffe) done.incrementAndGet();

The problem is in PreemptiveFastFailInterceptor#inFastFailMode():
return (fInfo != null && 
  EnvironmentEdgeManager.currentTime() >
  (fInfo.timeOfFirstFailureMilliSec + this.fastFailThresholdMilliSec));

With some "unlikely" timing, T-2 is in retry mode instead of fast-fail mode and the counter "done" is not incremented:
context.isRetryDespiteFastFailMode() returns true for T-2, which should never happen.

Can I just remove the check before incrementing the "done" counter,
if (pffe) ... ?
Decreasing fastFailThresholdMilliSec might not help: it will decrease the possibility of the heisenbug but will not remove it.

> Fix TestFastFailWithoutTestUtil
> -------------------------------
>
>                 Key: HBASE-14422
>                 URL: https://issues.apache.org/jira/browse/HBASE-14422
>             Project: HBase
>          Issue Type: Task
>          Components: test
>            Reporter: stack
>            Priority: Minor
>              Labels: beginner
>
> TestFastFailWithoutTestUtil has a unit test, testInterceptorIntercept50Times. Usually it passes, but on occasion the latching between thread 1 and thread 2 goes awry and the test hangs. It depends on the hardware, but it seems to happen about one in four runs here on an internal rig.
> HBASE-14421 changed the wait-on-latch to time out, do a thread dump, and let the test keep going.
> This issue is about digging in to figure out why the test hangs on the latches and then fixing it so the test doesn't need the latch timeout. Hopefully the thread dump helps.
> This one could be hard to fix since it is not easy to reproduce. Marking it beginner anyway.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)