Posted to dev@qpid.apache.org by Kevin van den Bekerom <k....@sig.eu> on 2016/04/08 14:27:18 UTC

Masters Thesis on False Positives in Test Failures

Dear Developers of the Apache Qpid project,



My name is Kevin van den Bekerom and I am currently doing my Master's
research on the topic of false alarms in test code. I would like to ask
for the Qpid development team's input in categorizing test code bugs.



My research is based on a recent paper by Arash et al. (
http://salt.ece.ubc.ca/publications/docs/icsme15.pdf). They conducted an
empirical study categorizing "test code bugs" in Apache software projects
into types such as semantic, flaky, and environmental bugs. A "test code
bug" is a failing test where the System Under Test is correct but the
test code is incorrect. To identify test code bugs they looked at issues
in JIRA and checked whether the fixing commit touched only the test code.
Only fixed issues were counted and categorised.



My goal is to replicate their results using a different approach, i.e.
asking the developers who were involved in an issue and/or its fix how
they would categorise it. For the Qpid project they counted 152 test code
bugs, so insight into false positives can be very relevant for your
project. Note that the authors inspected only a sample of the identified
test code bugs individually.


I would like to ask for the Qpid team's participation in categorizing the
various test bugs. I will provide a list of JIRA IDs identified as test
code bugs, together with an initial list of categories (assertion fault,
obsolete assertion, test dependency, etc.) and short explanations to aid
the categorisation process. I believe the developers who worked on an
issue are the ones most capable of categorizing it. Please let me know if
this project looks interesting to you and whether you are willing to help
me out.



As a next step I will look for common patterns in the identified test
code bugs; my aim is to extend static source code analysis techniques so
that they are also suited to finding test code bugs. I am of course very
happy to share my findings with the team.



Hope to hear from you!



With kind regards,

-- 
*Kevin van den Bekerom* | Intern

+31 6 21 33 93 85 | kvandenbekerom@sig.eu
Software Improvement Group | www.sig.eu

Re: Masters Thesis on False Positives in Test Failures

Posted by Alan Conway <ac...@redhat.com>.
On Sat, 2016-04-09 at 16:57 +0100, Rob Godfrey wrote:
> Hi Kevin,
> 
> On 8 April 2016 at 13:27, Kevin van den Bekerom <k.vandenbekerom@sig.eu>
> wrote:
> 
> > [... original message quoted in full; snipped ...]
> 
> The paper you reference seems to be looking only at Java code; are you
> similarly restricting your research, or looking across all languages? I
> ask mainly because Qpid comprises many different components written in a
> number of different languages (and the developers are somewhat disjoint
> sets). I'm certainly willing to see if I can find some time to look at
> the JIRAs you list that affect the Java client/broker components, but I
> wouldn't be able to offer any opinion on the C++ code (for example).


I'm more in the C++/C/Python/Go bits of the code, but I have a bug
category and accompanying rant that may be of interest.

Bug category: Using sleep() in test code.

<rant>

The following line in test code IS ALWAYS A BUG:

   sleep(arbitrary_fixed_interval_that_works_for_me)

Don't Do It. Ever.

The #1 cause of flaky tests in distributed systems is timing bugs: a
test incorrectly assumes some component is ready and fails sporadically
when it isn't. 

The #1 WRONG way to work around timing bugs is to add a sleep() in the
hope the component will be ready when you wake up. That DOESN'T FIX THE
PROBLEM. You have to redesign your test to wait till the component is
*actually ready*. Adding a sleep does one of two very bad things (usually
both):

1. The sleep is sometimes too short, so the problem remains but is even
more elusive and harder to find and fix.

2. The sleep is usually too long and needlessly slows down the entire
test suite. This wastes every developer's time and eventually they stop
running the entire suite regularly during their work, or even on
commit. This is the road to Hell.

But sure, what's a one-second sleep() here and there? Well, it is several
seconds per test x 100s-1000s of tests per project x dozens of
developers x many changes on a private dev branch per hour.

I once worked on a project where the accumulated "second here and
there" amounted to a test suite that took over 12 hours to run. "I
think we can ship, haven't those all been failing since the last
release?"

Often there is an efficient way to be notified that the component is
ready; customers will have the same problem, so there really should be.
In the worst case you can spin with exponential backoff, waiting for
something to appear in a log file or retrying some operation on the
component. sleep() is allowed here because the duration is *not*
arbitrary: it should start *very* small and can grow *very* big, avoiding
both the too-short and too-long problems above.
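
To make that concrete, here is a minimal sketch of such a backoff wait in
Java. It is illustrative only, not taken from Qpid's test code, and the
class and method names are made up:

    import java.util.function.BooleanSupplier;

    public final class WaitFor
    {
        // Polls a readiness check, starting with a very short pause and
        // doubling it, until the check passes or the deadline expires.
        public static boolean becomesTrue(BooleanSupplier check, long maxWaitMillis)
                throws InterruptedException
        {
            long deadline = System.currentTimeMillis() + maxWaitMillis;
            long pause = 1; // start *very* small
            while (!check.getAsBoolean())
            {
                if (System.currentTimeMillis() >= deadline)
                {
                    return false; // let the test fail with a clear timeout
                }
                Thread.sleep(pause); // not arbitrary: it grows each iteration
                pause = Math.min(pause * 2, 1000);
            }
            return true;
        }
    }

A test would then do something like
assertTrue("Broker never became ready",
WaitFor.becomesTrue(broker::isListening, 30000)), where
broker::isListening stands in for whatever real readiness check the
component offers.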

Just Say No to sleep() in tests.

</rant>
> [... remainder of quoted message snipped ...]

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@qpid.apache.org
For additional commands, e-mail: dev-help@qpid.apache.org


Re: Masters Thesis on False Positives in Test Failures

Posted by Rob Godfrey <ro...@gmail.com>.
Hi Kevin,

On 8 April 2016 at 13:27, Kevin van den Bekerom <k....@sig.eu>
wrote:

> [... original message quoted in full; snipped ...]
>
The paper you reference seems to be looking only at Java code; are you
similarly restricting your research, or looking across all languages? I
ask mainly because Qpid comprises many different components written in a
number of different languages (and the developers are somewhat disjoint
sets). I'm certainly willing to see if I can find some time to look at
the JIRAs you list that affect the Java client/broker components, but I
wouldn't be able to offer any opinion on the C++ code (for example).


>
> As a next step I will look for common patterns in the identified test
> code bugs; my aim is to extend static source code analysis techniques so
> that they are also suited to finding test code bugs. I am of course very
> happy to share my findings with the team.
>
>
Historically the Java system test codebase has had a large number of
"flaky" tests...  This is partly due to how failure is detected when you
are testing an asynchronous messaging system.  If the test is to check that
under the set of test conditions "a message is delivered" then failure
(i.e. no message is delivered) can only be established by setting a timeout
and saying "if no message has been delivered in X seconds then consider it
a failure"... and on a slow (contended) CI machine, assumptions about a
reasonable timeout value may be invalid.
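
For illustration, here is a minimal sketch of that pattern using the JMS
API; the timeout value, system property name and class name below are
placeholders, not actual Qpid test conventions:

    import javax.jms.JMSException;
    import javax.jms.Message;
    import javax.jms.MessageConsumer;
    import static org.junit.Assert.assertNotNull;

    public class ReceiveTimeoutExample
    {
        // Let a slow (contended) CI machine override the default timeout.
        private static final long RECEIVE_TIMEOUT =
                Long.getLong("test.receive.timeout", 10000L);

        void assertMessageDelivered(MessageConsumer consumer) throws JMSException
        {
            // receive(timeout) returns null if nothing arrives in time; that
            // null is the only way to conclude "no message was delivered".
            Message message = consumer.receive(RECEIVE_TIMEOUT);
            assertNotNull("No message received within " + RECEIVE_TIMEOUT + "ms",
                          message);
        }
    }

Making the timeout overridable is one way to cope with the problem
described above: on a slow (contended) CI machine the value can be raised
without touching the test code.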

Cheers,
Rob


