Posted to dev@flink.apache.org by Dawid Wysakowicz <dw...@apache.org> on 2021/04/26 07:54:04 UTC

[DISCUSS] Using timeouts in JUnit tests

Hi devs!

I wanted to bring up something that was discussed in a few independent
groups of people in the past days. I'd like to revisit using timeouts in
our JUnit tests. The suggestion would be to not use them anymore. The
problem with timeouts is that we get no thread dump or stack traces of
the system as it hangs. If we did not use a timeout, the CI runner
would catch the hang and create a thread dump, which is often a
great starting point for debugging.

This problem was spotted e.g. while debugging FLINK-22416[1]. In
the past, thread dumps were not always taken for hanging tests, but this
changed quite recently with FLINK-21346[2]. I am happy to hear your
opinions on it. If there are no objections, I would like to add the
suggestion to the Coding Guidelines[3].

Best,

Dawid


[1] https://issues.apache.org/jira/browse/FLINK-22416

[2] https://issues.apache.org/jira/browse/FLINK-21346

[3]
https://flink.apache.org/contributing/code-style-and-quality-java.html#java-language-features-and-libraries
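For illustration, the thread dump the CI runner captures is essentially what `Thread.getAllStackTraces()` returns for the whole JVM, including the non-test threads that a per-test JUnit timeout never shows. A minimal, hypothetical sketch (not Flink code; class and thread names are invented) of producing such a dump while a thread hangs:

```java
import java.util.Map;
import java.util.concurrent.CountDownLatch;

public class ThreadDumpSketch {

    /** Renders a stack trace for every live thread, as a CI watchdog would. */
    static String fullThreadDump() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry :
                Thread.getAllStackTraces().entrySet()) {
            sb.append('"').append(entry.getKey().getName()).append("\" ")
              .append(entry.getKey().getState()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                sb.append("\tat ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        CountDownLatch started = new CountDownLatch(1);
        // Simulate a hanging test thread: it blocks forever on a latch
        // that is never counted down.
        Thread hung = new Thread(() -> {
            started.countDown();
            try {
                new CountDownLatch(1).await();
            } catch (InterruptedException ignored) {
            }
        }, "hanging-test-thread");
        hung.setDaemon(true);
        hung.start();
        started.await();
        Thread.sleep(100); // give the thread time to block on await()
        // The dump shows where the hanging thread is stuck -- the
        // information lost when a JUnit timeout simply kills the test.
        System.out.println(fullThreadDump());
    }
}
```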



Re: [DISCUSS] Using timeouts in JUnit tests

Posted by Till Rohrmann <tr...@apache.org>.
Assuming that not many tests deadlock, I think it should be fine to simply
let the build process deadlock. Even if multiple tests fail consistently,
one would see them one after another. That way we wouldn't have to
build extra tooling. Moreover, the behaviour would be consistent on
a local machine, because the same test would also deadlock there.

Decreasing the build/test time is, in my opinion, more relevant for tests
which actually pass but run too slowly, and is hence more of an
orthogonal discussion.

Cheers,
Till

On Mon, Apr 26, 2021 at 8:25 PM Robert Metzger <rm...@apache.org> wrote:

> [...]

Re: [DISCUSS] Using timeouts in JUnit tests

Posted by Robert Metzger <rm...@apache.org>.
I was actually recently wondering if we shouldn't rather use timeouts more
aggressively in JUnit.
There was recently a case where a number of tests accidentally ran for 5
minutes, because a timeout had been increased to 5 minutes.
If we had a global limit of 1 minute per test, we would have caught this
case (and we would encourage people to be careful with CI time). If we are
going to add some custom timeout infrastructure to JUnit (in Java, not in
the CI bash scripts ;) ), it should be fairly straightforward to print the
current stack traces in case of a timeout.
Another benefit of solving this problem at the JUnit level is that the
behavior would be the same in all environments (for example, when running
the tests locally).
The final benefit would be that we would get a list of all tests that are
timing out (from that module), instead of having the build stall at a random
test.
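The kind of JUnit-level infrastructure described here can be sketched in a few lines of plain Java. The sketch below is hypothetical (the names and structure are not an existing Flink utility): it runs a test body under a time limit and, on timeout, prints every thread's stack trace before failing, so the diagnostic information is not lost:

```java
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class DumpingTimeout {

    /**
     * Runs {@code body} with a time limit; on timeout, prints every thread's
     * stack trace (the information a plain JUnit timeout discards) and rethrows.
     */
    static void runWithTimeout(Runnable body, long millis) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            executor.submit(body).get(millis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            System.err.println("Test timed out after " + millis + " ms; thread dump:");
            for (Map.Entry<Thread, StackTraceElement[]> entry :
                    Thread.getAllStackTraces().entrySet()) {
                System.err.println('"' + entry.getKey().getName() + '"');
                for (StackTraceElement frame : entry.getValue()) {
                    System.err.println("\tat " + frame);
                }
            }
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // A fast body completes normally.
        runWithTimeout(() -> System.out.println("fast test passed"), 1_000);
        // A hanging body triggers the dump and a TimeoutException.
        try {
            runWithTimeout(() -> {
                try {
                    Thread.sleep(60_000); // simulates a hanging test
                } catch (InterruptedException ignored) {
                }
            }, 200);
        } catch (TimeoutException expected) {
            System.out.println("slow test timed out as expected");
        }
    }
}
```

In JUnit 4 this logic would typically live in a `TestRule`/`Statement` wrapper so it applies uniformly to all tests; the sketch keeps it as a plain method for clarity.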


On Mon, Apr 26, 2021 at 10:49 AM Till Rohrmann <tr...@apache.org> wrote:

> [...]

Re: [DISCUSS] Using timeouts in JUnit tests

Posted by Till Rohrmann <tr...@apache.org>.
+1. I think this rule makes a lot of sense.

Cheers,
Till

On Mon, Apr 26, 2021 at 10:08 AM Arvid Heise <ar...@apache.org> wrote:

> [...]

Re: [DISCUSS] Using timeouts in JUnit tests

Posted by Arvid Heise <ar...@apache.org>.
+1 from my side.

We should probably double-check whether we really need the 4h timeouts on
test tasks in AZP. It feels like 2h should be enough.

On Mon, Apr 26, 2021 at 9:54 AM Dawid Wysakowicz <dw...@apache.org>
wrote:

> [...]

Re: [DISCUSS] Using timeouts in JUnit tests

Posted by Piotr Nowojski <pn...@apache.org>.
Thanks for the response, Dawid. In that case I don't see a reason for
timeouts in JUnit. I agree with the previously mentioned points, and I
think it doesn't make sense to use them to limit test duration.

Piotrek

On Tue, 4 May 2021 at 17:44, Dawid Wysakowicz <dw...@apache.org> wrote:

> Hey all,
>
> Sorry I've not been active in the thread. I think the conclusion is
> rather positive and we do want to depend more on the Azure watchdog and
> discourage timeouts in JUnit tests from now on. I'll update our coding
> guidelines accordingly.
>
> @Piotr Yes, this was a problem in the past, but that was solved in the
> FLINK-21346, that I linked, quite recently. The thread dumps will be
> taken and logs uploaded if the job is reaching the global limit.
> Similarly as it happens for the e2e tests.
>
> Best,
>
> Dawid
>
>
> [1] https://issues.apache.org/jira/browse/FLINK-21346
>
> On 04/05/2021 17:24, Piotr Nowojski wrote:
> > Hi,
> >
> > I'm ok with removing most or even all timeouts. Just one thing.
> >
> > Reason behind using junit timeouts that I've heard (and I was adhering to
> > it) was that maven watchdog was doing the thread dump and killing the
> test
> > process using timeout based on logs inactivity. Some tests were by nature
> > prone to live lock (for example if a job had an infinite source), that
> was
> > preventing the watchdog from kicking in. And if I remember correctly we
> > didn't have a global timeout for all tests, just travis was killing our
> > jobs without even uploading any logs to S3. So junit level timeouts were
> > very valuable in those kinds of cases, to at least get some logs, even if
> > without a stack trace.
> >
> > I have no idea what's the state of our watch dogs right now, after all of
> > the changes in the past years, so I don't know how relevant is this
> > reasoning.
> >
> > Piotrek
> >
> > On Fri, 30 Apr 2021 at 11:52, Till Rohrmann <tr...@apache.org> wrote:
> >
> >> Yes, you can click on the test job where a test failed. Then you can
> click
> >> on 1 artifact produced (or on the overview page on the X published
> >> (artifacts)). This brings you to the published artifacts page where we
> >> upload for every job the logs.
> >>
> >> Cheers,
> >> Till
> >>
> >> On Fri, Apr 30, 2021 at 9:22 AM Dong Lin <li...@gmail.com> wrote:
> >>
> >>> Thanks Till. Yes you are right. The INFO logging is enabled. It is just
> >>> dumped to a file (the FileAppender) other than the console.
> >>>
> >>> There is probably a way to retrieve that log file from AZP. I will ask
> >>> other colleagues how to get this later.
> >>>
> >>> On Thu, Apr 29, 2021 at 4:51 PM Till Rohrmann <tr...@apache.org>
> >>> wrote:
> >>>
> >>>> I think for the maven tests we use this log4j.properties file [1].
> >>>>
> >>>> [1]
> >>> https://github.com/apache/flink/blob/master/tools/ci/log4j.properties
> >>>> Cheers,
> >>>> Till
> >>>>
> >>>> On Wed, Apr 28, 2021 at 4:47 AM Dong Lin <li...@gmail.com> wrote:
> >>>>
> >>>>> Thanks for the detailed explanations! Regarding the usage of timeout,
> >>>> now I
> >>>>> agree that it is better to remove per-test timeouts because it helps
> >>>>> make our testing results more reliable and consistent.
> >>>>>
> >>>>> My previous concern is that it might not be a good idea to
> >>> intentionally
> >>>>> let the test hang in AZP in order to get the thread dump. Now I get
> >>> that
> >>>>> there are a few practical concerns around the usage of timeout which
> >>>> makes
> >>>>> testing results unreliable (e.g. flakiness in the presence of VM
> >>>>> migration).
> >>>>>
> >>>>> Regarding the level logging on AZP, it appears that we actually set
> >>>>> "rootLogger.level = OFF" in most log4j2-test.properties, which means
> >>> that
> >>>>> no INFO log would be printed on AZP. For example, I tried to increase
> >>> the
> >>>>> log level in this <https://github.com/apache/flink/pull/15617> PR
> >> and
> >>>> was
> >>>>> suggested in this
> >>>>> <
> >>>>>
> >>
> https://issues.apache.org/jira/browse/FLINK-22085?focusedCommentId=17321055&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17321055
> >>>>> comment to avoid increasing the log level. Did I miss something here?
> >>>>>
> >>>>>
> >>>>> On Wed, Apr 28, 2021 at 2:22 AM Arvid Heise <ar...@apache.org>
> >> wrote:
> >>>>>> Just to add to Dong Lin's list of cons of allowing timeout:
> >>>>>> - Any timeout value that you manually set is arbitrary. If it's set
> >>> too
> >>>>>> low, you get test instabilities. What too low means depends on
> >>> numerous
> >>>>>> factors, such as hardware and current utilization (especially I/O).
> >>> If
> >>>>> you
> >>>>>> run in VMs and the VM is migrated while running a test, any
> >>> reasonable
> >>>>>> timeout will probably fail. While you could make a similar case for
> >>> the
> >>>>>> overall timeout of tests, any smaller hiccup in the range of
> >> minutes
> >>>> will
> >>>>>> not impact the overall runtime much. The probability of having a VM
> >>>>>> constantly migrating during the same stage is abysmally low.
> >>>>>> - A timeout is more maintenance-intensive. It's one more knob where
> >>> you
> >>>>> can
> >>>>>> tweak a build or not. If you change the test a bit, you also need
> >> to
> >>>>>> double-check the timeout. Hence, there have been quite a few
> >> commits
> >>>> that
> >>>>>> just increase timeouts.
> >>>>>> - Whether a test uses a timeout or not is arbitrary: Why do some
> >> ITs
> >>>>> have a
> >>>>>> timeout and others don't? All IT tests are prone to timeout if
> >> there
> >>>> are
> >>>>>> issues with resource allocation. Similarly, there are quite a few
> >>> unit
> >>>>>> tests with timeouts while others don't have them with no obvious
> >>>> pattern.
> >>>>>> - An ill-set timeout reduces build reproducibility. Imagine having
> >> a
> >>>>>> release with such a timeout and the users cannot build Flink
> >>> reliably.
> >>>>>> I'd like to also point out that we should not cater around unstable
> >>>> tests
> >>>>>> if our overall goal is to have as many green builds as possible. If
> >>> we
> >>>>>> assume that our builds fail more often than not, we should also
> >> look
> >>>> into
> >>>>>> the other direction and continue the builds on error. I'm not a big
> >>> fan
> >>>>> of
> >>>>>> that.
> >>>>>>
> >>>>>> One argument that I also heard is that it eases local debugging in
> >>> case
> >>>>> of
> >>>>>> refactorings as you can see multiple failures at the same time. But
> >>> no
> >>>>> one
> >>>>>> is keeping you from temporarily adding a timeout on your branch.
> >>> Then,
> >>>> we
> >>>>>> can be sure that the timeout is plausible for your hardware and
> >> avoid
> >>>> all
> >>>>>> above mentioned drawbacks.
> >>>>>>
> >>>>>> @Robert Metzger <rm...@apache.org>
> >>>>>>
> >>>>>>> If we had a global limit of 1 minute per test, we would have
> >> caught
> >>>>> this
> >>>>>>> case (and we would encourage people to be careful with CI time).
> >>>>>>>
> >>>>>> There are quite a few tests that run longer, especially on a well
> >>>>> utilized
> >>>>>> build machine. A global limit is even worse than individual limits
> >> as
> >>>>> there
> >>>>>> is no value that fits it all. If you screwed up and 200 tests hang,
> >>>> you'd
> >>>>>> also run into the global timeout anyway. I'm also not sure what
> >> these
> >>>>>> additional hangs bring you except a huge log.
> >>>>>>
> >>>>>> I'm also not sure if it's really better in terms of CI time. For
> >>>> example,
> >>>>>> for UnalignedCheckpointRescaleITCase, we test all known
> >> partitioners
> >>> in
> >>>>> one
> >>>>>> pipeline for correctness. For higher parallelism, that means the
> >> test
> >>>>> runs
> >>>>>> over 1 minute regularly. If there is a global limit, I'd need to
> >>> split
> >>>>> the
> >>>>>> test into smaller chunks, where I'm positive that the sum of the
> >>> chunks
> >>>>>> will be larger than before.
> >>>>>>
> >>>>>> PS: all tests on AZP will print INFO in the artifacts. There you
> >> can
> >>>> also
> >>>>>> retrieve the stacktraces.
> >>>>>> PPS: I also said that we should revalidate the current timeout on
> >>> AZP.
> >>>> So
> >>>>>> the argument that we have >2h of precious CI time wasted is kind of
> >>>>>> constructed and is just due to some random defaults.
> >>>>>>
> >>>>>> On Tue, Apr 27, 2021 at 6:42 PM Till Rohrmann <
> >> trohrmann@apache.org>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> I think we do capture the INFO logs of the test runs on AZP.
> >>>>>>>
> >>>>>>> I am also not sure whether we really caught slow tests with
> >> Junit's
> >>>>>> timeout
> >>>>>>> rule before. I think the default is usually to increase the
> >> timeout
> >>>> to
> >>>>>> make
> >>>>>>> the test pass. One way to find slow tests is to measure the time
> >>> and
> >>>>> look
> >>>>>>> at the outliers.
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> Till
> >>>>>>>
> >>>>>>> On Tue, Apr 27, 2021 at 3:49 PM Dong Lin <li...@gmail.com>
> >>>> wrote:
> >>>>>>>> There is one more point that may be useful to consider here.
> >>>>>>>>
> >>>>>>>> In order to debug deadlock that is not easily reproducible, it
> >> is
> >>>>>> likely
> >>>>>>>> not sufficient to see only the thread dump to figure out the
> >> root
> >>>>>> cause.
> >>>>>>> We
> >>>>>>>> likely need to enable the INFO level logging. Since AZP does
> >> not
> >>>>>> provide
> >>>>>>>> INFO level logging by default, we either need to reproduce the
> >>> bug
> >>>>>>> locally
> >>>>>>>> or change the AZP log4j temporarily. This further reduces the
> >>>> benefit
> >>>>>> of
> >>>>>>>> logging the thread dump (which comes at the cost of letting the
> >>> AZP
> >>>>> job
> >>>>>>>> hang).
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Tue, Apr 27, 2021 at 9:34 PM Dong Lin <li...@gmail.com>
> >>>>> wrote:
> >>>>>>>>> Just to make sure I understand the proposal correctly: is the
> >>>>>> proposal
> >>>>>>> to
> >>>>>>>>> disallow the usage of @Test(timeout=...) for Flink Junit
> >> tests?
> >>>>>>>>> Here is my understanding of the pros/cons according to the
> >>>>> discussion
> >>>>>>> so
> >>>>>>>>> far.
> >>>>>>>>>
> >>>>>>>>> Pros of allowing timeout:
> >>>>>>>>> 1) When there are tests that are unreasonably slow, it helps
> >> us
> >>>>>>>>> catch those tests and thus increase the quality of unit
> >> tests.
> >>>>>>>>> 2) When there are tests that cause deadlock, it helps the AZP
> >>> job
> >>>>>> fail
> >>>>>>>>> fast instead of being blocked for 4 hours. This saves
> >> resources
> >>>> and
> >>>>>>> also
> >>>>>>>>> allows developers to get their PR tested again earlier
> >> (useful
> >>>> when
> >>>>>> the
> >>>>>>>>> test failure is not relevant to their PR).
> >>>>>>>>>
> >>>>>>>>> Cons of allowing timeout:
> >>>>>>>>> 1) When there are tests that cause deadlock, we could not see
> >>> the
> >>>>>>> thread
> >>>>>>>>> dump of all threads, which makes debugging the issue harder.
> >>>>>>>>>
> >>>>>>>>> I would suggest that we should still allow timeout because
> >> the
> >>>> pros
> >>>>>>>>> outweigh the cons.
> >>>>>>>>>
> >>>>>>>>> As far as I can tell, if we allow timeout and encounter a
> >>>> deadlock
> >>>>>> bug
> >>>>>>> in
> >>>>>>>>> AZP, we still know which test (or test suite) fails. There
> >> is a
> >>>>> good
> >>>>>>>> chance
> >>>>>>>>> we can reproduce the deadlock locally (by running it 100
> >> times)
> >>>> and
> >>>>>> get
> >>>>>>>> the
> >>>>>>>>> debug information we need. In the rare case where the
> >> deadlock
> >>>>>> happens
> >>>>>>>> only
> >>>>>>>>> on AZP, we can just disable the timeout for that particular
> >>> test.
> >>>>> So
> >>>>>>> the
> >>>>>>>>> lack of thread dump is not really a concern.
> >>>>>>>>>
> >>>>>>>>> On the other hand, if we disallow timeout, it will be very
> >> hard
> >>>> for
> >>>>>> us
> >>>>>>> to
> >>>>>>>>> catch low-quality tests. I don't know if there is a good
> >>>>> alternative
> >>>>>>> way
> >>>>>>>> to
> >>>>>>>>> catch those tests.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Mon, Apr 26, 2021 at 3:54 PM Dawid Wysakowicz <
> >>>>>>> dwysakowicz@apache.org
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> [...]
>
>

Re: [DISCUSS] Using timeouts in JUnit tests

Posted by Dawid Wysakowicz <dw...@apache.org>.
Hey all,

Sorry I've not been active in the thread. I think the conclusion is
rather positive: we do want to depend more on the Azure watchdog and
discourage timeouts in JUnit tests from now on. I'll update our coding
guidelines accordingly.

@Piotr Yes, this was a problem in the past, but it was solved quite
recently in FLINK-21346[1], which I linked. Thread dumps will be
taken and logs uploaded if the job reaches the global limit,
similarly to what happens for the e2e tests.

Best,

Dawid


[1] https://issues.apache.org/jira/browse/FLINK-21346

On 04/05/2021 17:24, Piotr Nowojski wrote:
> [...]
>>>>>>>>> 1) When there are tests that cause deadlock, we could not see
>>> the
>>>>>>> thread
>>>>>>>>> dump of all threads, which makes debugging the issue harder.
>>>>>>>>>
>>>>>>>>> I would suggest that we should still allow timeout because
>> the
>>>> pros
>>>>>>>>> outweigh the cons.
>>>>>>>>>
>>>>>>>>> As far as I can tell, if we allow timeout and encounter a
>>>> deadlock
>>>>>> bug
>>>>>>> in
>>>>>>>>> AZP, we still know which test (or test suite) fails. There
>> is a
>>>>> good
>>>>>>>> chance
>>>>>>>>> we can reproduce the deadlock locally (by running it 100
>> times)
>>>> and
>>>>>> get
>>>>>>>> the
>>>>>>>>> debug information we need. In the rare case where the
>> deadlock
>>>>>> happens
>>>>>>>> only
>>>>>>>>> on AZP, we can just disable the timeout for that particular
>>> test.
>>>>> So
>>>>>>> the
>>>>>>>>> lack of thread dump is not really a concern.
>>>>>>>>>
>>>>>>>>> On the other hand, if we disallow timeout, it will be very
>> hard
>>>> for
>>>>>> us
>>>>>>> to
>>>>>>>>> catch low-quality tests. I don't know if there is a good
>>>>> alternative
>>>>>>> way
>>>>>>>> to
>>>>>>>>> catch those tests.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Apr 26, 2021 at 3:54 PM Dawid Wysakowicz <
>>>>>>> dwysakowicz@apache.org
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi devs!
>>>>>>>>>>
>>>>>>>>>> I wanted to bring up something that was discussed in a few
>>>>>> independent
>>>>>>>>>> groups of people in the past days. I'd like to revise using
>>>>> timeouts
>>>>>>> in
>>>>>>>>>> our JUnit tests. The suggestion would be not to use them
>>>> anymore.
>>>>>> The
>>>>>>>>>> problem with timeouts is that we have no thread dump and
>> stack
>>>>>> traces
>>>>>>> of
>>>>>>>>>> the system as it hangs. If we were not using a timeout, the
>> CI
>>>>>> runner
>>>>>>>>>> would have caught the timeout and created a thread dump
>> which
>>>>> often
>>>>>>> is a
>>>>>>>>>> great starting point for debugging.
>>>>>>>>>>
>>>>>>>>>> This problem has been spotted e.g. during debugging
>>>>> FLINK-22416[1].
>>>>>> In
>>>>>>>>>> the past thread dumps were not always taken for hanging
>> tests,
>>>> but
>>>>>> it
>>>>>>>>>> was changed quite recently in FLINK-21346[2]. I am happy to
>>> hear
>>>>>> your
>>>>>>>>>> opinions on it. If there are no objections I would like to
>> add
>>>> the
>>>>>>>>>> suggestion to the Coding Guidelines[3]
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>>
>>>>>>>>>> Dawid
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> [1] https://issues.apache.org/jira/browse/FLINK-22416
>>>>>>>>>>
>>>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-21346
>>>>>>>>>>
>>>>>>>>>> [3]
>>>>>>>>>>
>>>>>>>>>>
>> https://flink.apache.org/contributing/code-style-and-quality-java.html#java-language-features-and-libraries
>>>>>>>>>>
>>>>>>>>>>


Re: [DISCUSS] Using timeouts in JUnit tests

Posted by Piotr Nowojski <pn...@apache.org>.
Hi,

I'm ok with removing most or even all timeouts. Just one thing.

The reason I've heard for using JUnit timeouts (and I was adhering to it) was
that the Maven watchdog was taking the thread dump and killing the test
process based on log inactivity. Some tests were by nature prone to
live-lock (for example if a job had an infinite source), which prevented
the watchdog from kicking in. And if I remember correctly we didn't have a
global timeout for all tests; Travis was just killing our jobs without even
uploading any logs to S3. So JUnit-level timeouts were very valuable in
those kinds of cases, to at least get some logs, even if without a stack
trace.

I have no idea what the state of our watchdogs is right now, after all of
the changes in the past years, so I don't know how relevant this reasoning
still is.

Piotrek
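
[Editorial note: the watchdog behaviour described above can be sketched with the JDK alone. The snippet below is a minimal, illustrative stand-in (not Flink's actual CI tooling): a "test" thread hangs on a latch, and a watchdog-style step first dumps all thread stacks via `ThreadMXBean` (preserving the hanging stack traces, which a plain JUnit timeout would discard) and only then kills the test. The thread name and the 500 ms "inactivity timeout" are invented for the example.]

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.concurrent.CountDownLatch;

public class WatchdogSketch {
    public static void main(String[] args) throws Exception {
        CountDownLatch deadlock = new CountDownLatch(1);
        Thread hanging = new Thread(() -> {
            try {
                deadlock.await(); // never counted down: simulates a hanging test
            } catch (InterruptedException e) {
                // the watchdog gave up on this test
            }
        }, "hanging-test");
        hanging.start();

        // External-watchdog style: after a period of inactivity, first take a
        // thread dump (so the hanging stack traces survive), then kill the test.
        Thread.sleep(500); // stand-in for the log-inactivity timeout
        for (ThreadInfo info : ManagementFactory.getThreadMXBean()
                .dumpAllThreads(false, false)) {
            System.out.print(info);
        }
        hanging.interrupt();
        hanging.join();
        System.out.println("test process terminated after dump");
    }
}
```

With a per-test `@Test(timeout=...)`, by contrast, JUnit interrupts the test thread itself, so the dump step above never happens.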

On Fri, 30 Apr 2021 at 11:52 Till Rohrmann <tr...@apache.org> wrote:

> Yes, you can click on the test job where a test failed. Then you can click
> on 1 artifact produced (or on the overview page on the X published
> (artifacts)). This brings you to the published artifacts page where we
> upload for every job the logs.
>
> Cheers,
> Till
>
> On Fri, Apr 30, 2021 at 9:22 AM Dong Lin <li...@gmail.com> wrote:
>
> > Thanks Till. Yes you are right. The INFO logging is enabled. It is just
> > dumped to a file (the FileAppender) other than the console.
> >
> > There is probably a way to retrieve that log file from AZP. I will ask
> > other colleagues how to get this later.
> >
> > On Thu, Apr 29, 2021 at 4:51 PM Till Rohrmann <tr...@apache.org>
> > wrote:
> >
> > > I think for the maven tests we use this log4j.properties file [1].
> > >
> > > [1]
> > https://github.com/apache/flink/blob/master/tools/ci/log4j.properties
> > >
> > > Cheers,
> > > Till
> > >
> > > On Wed, Apr 28, 2021 at 4:47 AM Dong Lin <li...@gmail.com> wrote:
> > >
> > > > Thanks for the detailed explanations! Regarding the usage of timeout,
> > > now I
> > > > agree that it is better to remove per-test timeouts because it helps
> > > > make our testing results more reliable and consistent.
> > > >
> > > > My previous concern is that it might not be a good idea to
> > intentionally
> > > > let the test hang in AZP in order to get the thread dump. Now I get
> > that
> > > > there are a few practical concerns around the usage of timeout which
> > > makes
> > > > testing results unreliable (e.g. flakiness in the presence of VM
> > > > migration).
> > > >
> > > > Regarding the level logging on AZP, it appears that we actually set
> > > > "rootLogger.level = OFF" in most log4j2-test.properties, which means
> > that
> > > > no INFO log would be printed on AZP. For example, I tried to increase
> > the
> > > > log level in this <https://github.com/apache/flink/pull/15617> PR
> and
> > > was
> > > > suggested in this
> > > > <
> > > >
> > >
> >
> https://issues.apache.org/jira/browse/FLINK-22085?focusedCommentId=17321055&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17321055
> > > > >
> > > > comment to avoid increasing the log level. Did I miss something here?
> > > >
> > > >
> > > > On Wed, Apr 28, 2021 at 2:22 AM Arvid Heise <ar...@apache.org>
> wrote:
> > > >
> > > > > Just to add to Dong Lin's list of cons of allowing timeout:
> > > > > - Any timeout value that you manually set is arbitrary. If it's set
> > too
> > > > > low, you get test instabilities. What too low means depends on
> > numerous
> > > > > factors, such as hardware and current utilization (especially I/O).
> > If
> > > > you
> > > > > run in VMs and the VM is migrated while running a test, any
> > reasonable
> > > > > timeout will probably fail. While you could make a similar case for
> > the
> > > > > overall timeout of tests, any smaller hiccup in the range of
> minutes
> > > will
> > > > > not impact the overall runtime much. The probability of having a VM
> > > > > constantly migrating during the same stage is abysmally low.
> > > > > - A timeout is more maintenance-intensive. It's one more knob where
> > you
> > > > can
> > > > > tweak a build or not. If you change the test a bit, you also need
> to
> > > > > double-check the timeout. Hence, there have been quite a few
> commits
> > > that
> > > > > just increase timeouts.
> > > > > - Whether a test uses a timeout or not is arbitrary: Why do some
> ITs
> > > > have a
> > > > > timeout and others don't? All IT tests are prone to timeout if
> there
> > > are
> > > > > issues with resource allocation. Similarly, there are quite a few
> > unit
> > > > > tests with timeouts while others don't have them with no obvious
> > > pattern.
> > > > > - An ill-set timeout reduces build reproducibility. Imagine having
> a
> > > > > release with such a timeout and the users cannot build Flink
> > reliably.
> > > > >
> > > > > I'd like to also point out that we should not cater around unstable
> > > tests
> > > > > if our overall goal is to have as many green builds as possible. If
> > we
> > > > > assume that our builds fail more often than not, we should also
> look
> > > into
> > > > > the other direction and continue the builds on error. I'm not a big
> > fan
> > > > of
> > > > > that.
> > > > >
> > > > > One argument that I also heard is that it eases local debugging in
> > case
> > > > of
> > > > > refactorings as you can see multiple failures at the same time. But
> > no
> > > > one
> > > > > is keeping you from temporarily adding a timeout on your branch.
> > Then,
> > > we
> > > > > can be sure that the timeout is plausible for your hardware and
> avoid
> > > all
> > > > > above mentioned drawbacks.
> > > > >
> > > > > @Robert Metzger <rm...@apache.org>
> > > > >
> > > > > > If we had a global limit of 1 minute per test, we would have
> caught
> > > > this
> > > > > > case (and we would encourage people to be careful with CI time).
> > > > > >
> > > > > There are quite a few tests that run longer, especially on a well
> > > > utilized
> > > > > build machine. A global limit is even worse than individual limits
> as
> > > > there
> > > > > is no value that fits it all. If you screwed up and 200 tests hang,
> > > you'd
> > > > > also run into the global timeout anyway. I'm also not sure what
> these
> > > > > additional hangs bring you except a huge log.
> > > > >
> > > > > I'm also not sure if it's really better in terms of CI time. For
> > > example,
> > > > > for UnalignedCheckpointRescaleITCase, we test all known
> partitioners
> > in
> > > > one
> > > > > pipeline for correctness. For higher parallelism, that means the
> test
> > > > runs
> > > > > over 1 minute regularly. If there is a global limit, I'd need to
> > split
> > > > the
> > > > > test into smaller chunks, where I'm positive that the sum of the
> > chunks
> > > > > will be larger than before.
> > > > >
> > > > > PS: all tests on AZP will print INFO in the artifacts. There you
> can
> > > also
> > > > > retrieve the stacktraces.
> > > > > PPS: I also said that we should revalidate the current timeout on
> > AZP.
> > > So
> > > > > the argument that we have >2h of precious CI time wasted is kind of
> > > > > constructed and is just due to some random defaults.
> > > > >
> > > > > On Tue, Apr 27, 2021 at 6:42 PM Till Rohrmann <
> trohrmann@apache.org>
> > > > > wrote:
> > > > >
> > > > > > I think we do capture the INFO logs of the test runs on AZP.
> > > > > >
> > > > > > I am also not sure whether we really caught slow tests with
> Junit's
> > > > > timeout
> > > > > > rule before. I think the default is usually to increase the
> timeout
> > > to
> > > > > make
> > > > > > the test pass. One way to find slow tests is to measure the time
> > and
> > > > look
> > > > > > at the outliers.
> > > > > >
> > > > > > Cheers,
> > > > > > Till
> > > > > >
> > > > > > On Tue, Apr 27, 2021 at 3:49 PM Dong Lin <li...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > > There is one more point that may be useful to consider here.
> > > > > > >
> > > > > > > In order to debug deadlock that is not easily reproducible, it
> is
> > > > > likely
> > > > > > > not sufficient to see only the thread dump to figure out the
> root
> > > > > cause.
> > > > > > We
> > > > > > > likely need to enable the INFO level logging. Since AZP does
> not
> > > > > provide
> > > > > > > INFO level logging by default, we either need to reproduce the
> > bug
> > > > > > locally
> > > > > > > or change the AZP log4j temporarily. This further reduces the
> > > benefit
> > > > > of
> > > > > > > logging the thread dump (which comes at the cost of letting the
> > AZP
> > > > job
> > > > > > > hang).
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Apr 27, 2021 at 9:34 PM Dong Lin <li...@gmail.com>
> > > > wrote:
> > > > > > >
> > > > > > > > Just to make sure I understand the proposal correctly: is the
> > > > > proposal
> > > > > > to
> > > > > > > > disallow the usage of @Test(timeout=...) for Flink Junit
> tests?
> > > > > > > >
> > > > > > > > Here is my understanding of the pros/cons according to the
> > > > discussion
> > > > > > so
> > > > > > > > far.
> > > > > > > >
> > > > > > > > Pros of allowing timeout:
> > > > > > > > 1) When there are tests that are unreasonably slow, it helps
> us
> > > > > > > > catch those tests and thus increase the quality of unit
> tests.
> > > > > > > > 2) When there are tests that cause deadlock, it helps the AZP
> > job
> > > > > fail
> > > > > > > > fast instead of being blocked for 4 hours. This saves
> resources
> > > and
> > > > > > also
> > > > > > > > allows developers to get their PR tested again earlier
> (useful
> > > when
> > > > > the
> > > > > > > > test failure is not relevant to their PR).
> > > > > > > >
> > > > > > > > Cons of allowing timeout:
> > > > > > > > 1) When there are tests that cause deadlock, we could not see
> > the
> > > > > > thread
> > > > > > > > dump of all threads, which makes debugging the issue harder.
> > > > > > > >
> > > > > > > > I would suggest that we should still allow timeout because
> the
> > > pros
> > > > > > > > outweigh the cons.
> > > > > > > >
> > > > > > > > As far as I can tell, if we allow timeout and encounter a
> > > deadlock
> > > > > bug
> > > > > > in
> > > > > > > > AZP, we still know which test (or test suite) fails. There
> is a
> > > > good
> > > > > > > chance
> > > > > > > > we can reproduce the deadlock locally (by running it 100
> times)
> > > and
> > > > > get
> > > > > > > the
> > > > > > > > debug information we need. In the rare case where the
> deadlock
> > > > > happens
> > > > > > > only
> > > > > > > > on AZP, we can just disable the timeout for that particular
> > test.
> > > > So
> > > > > > the
> > > > > > > > lack of thread dump is not really a concern.
> > > > > > > >
> > > > > > > > On the other hand, if we disallow timeout, it will be very
> hard
> > > for
> > > > > us
> > > > > > to
> > > > > > > > catch low-quality tests. I don't know if there is a good
> > > > alternative
> > > > > > way
> > > > > > > to
> > > > > > > > catch those tests.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Mon, Apr 26, 2021 at 3:54 PM Dawid Wysakowicz <
> > > > > > dwysakowicz@apache.org
> > > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > >> Hi devs!
> > > > > > > >>
> > > > > > > >> I wanted to bring up something that was discussed in a few
> > > > > independent
> > > > > > > >> groups of people in the past days. I'd like to revise using
> > > > timeouts
> > > > > > in
> > > > > > > >> our JUnit tests. The suggestion would be not to use them
> > > anymore.
> > > > > The
> > > > > > > >> problem with timeouts is that we have no thread dump and
> stack
> > > > > traces
> > > > > > of
> > > > > > > >> the system as it hangs. If we were not using a timeout, the
> CI
> > > > > runner
> > > > > > > >> would have caught the timeout and created a thread dump
> which
> > > > often
> > > > > > is a
> > > > > > > >> great starting point for debugging.
> > > > > > > >>
> > > > > > > >> This problem has been spotted e.g. during debugging
> > > > FLINK-22416[1].
> > > > > In
> > > > > > > >> the past thread dumps were not always taken for hanging
> tests,
> > > but
> > > > > it
> > > > > > > >> was changed quite recently in FLINK-21346[2]. I am happy to
> > hear
> > > > > your
> > > > > > > >> opinions on it. If there are no objections I would like to
> add
> > > the
> > > > > > > >> suggestion to the Coding Guidelines[3]
> > > > > > > >>
> > > > > > > >> Best,
> > > > > > > >>
> > > > > > > >> Dawid
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> [1] https://issues.apache.org/jira/browse/FLINK-22416
> > > > > > > >>
> > > > > > > >> [2] https://issues.apache.org/jira/browse/FLINK-21346
> > > > > > > >>
> > > > > > > >> [3]
> > > > > > > >>
> > > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://flink.apache.org/contributing/code-style-and-quality-java.html#java-language-features-and-libraries
> > > > > > > >>
> > > > > > > >>
> > > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Using timeouts in JUnit tests

Posted by Till Rohrmann <tr...@apache.org>.
Yes, you can click on the test job where a test failed. Then you can click
on the 1 artifact produced (or, on the overview page, on the X published
artifacts). This brings you to the published artifacts page, where we
upload the logs for every job.

Cheers,
Till

On Fri, Apr 30, 2021 at 9:22 AM Dong Lin <li...@gmail.com> wrote:

> Thanks Till. Yes you are right. The INFO logging is enabled. It is just
> dumped to a file (the FileAppender) other than the console.
>
> There is probably a way to retrieve that log file from AZP. I will ask
> other colleagues how to get this later.
>
> On Thu, Apr 29, 2021 at 4:51 PM Till Rohrmann <tr...@apache.org>
> wrote:
>
> > I think for the maven tests we use this log4j.properties file [1].
> >
> > [1]
> https://github.com/apache/flink/blob/master/tools/ci/log4j.properties
> >
> > Cheers,
> > Till
> >
> > On Wed, Apr 28, 2021 at 4:47 AM Dong Lin <li...@gmail.com> wrote:
> >
> > > Thanks for the detailed explanations! Regarding the usage of timeout,
> > now I
> > > agree that it is better to remove per-test timeouts because it helps
> > > make our testing results more reliable and consistent.
> > >
> > > My previous concern is that it might not be a good idea to
> intentionally
> > > let the test hang in AZP in order to get the thread dump. Now I get
> that
> > > there are a few practical concerns around the usage of timeout which
> > makes
> > > testing results unreliable (e.g. flakiness in the presence of VM
> > > migration).
> > >
> > > Regarding the level logging on AZP, it appears that we actually set
> > > "rootLogger.level = OFF" in most log4j2-test.properties, which means
> that
> > > no INFO log would be printed on AZP. For example, I tried to increase
> the
> > > log level in this <https://github.com/apache/flink/pull/15617> PR and
> > was
> > > suggested in this
> > > <
> > >
> >
> https://issues.apache.org/jira/browse/FLINK-22085?focusedCommentId=17321055&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17321055
> > > >
> > > comment to avoid increasing the log level. Did I miss something here?
> > >
> > >
> > > On Wed, Apr 28, 2021 at 2:22 AM Arvid Heise <ar...@apache.org> wrote:
> > >
> > > > Just to add to Dong Lin's list of cons of allowing timeout:
> > > > - Any timeout value that you manually set is arbitrary. If it's set
> too
> > > > low, you get test instabilities. What too low means depends on
> numerous
> > > > factors, such as hardware and current utilization (especially I/O).
> If
> > > you
> > > > run in VMs and the VM is migrated while running a test, any
> reasonable
> > > > timeout will probably fail. While you could make a similar case for
> the
> > > > overall timeout of tests, any smaller hiccup in the range of minutes
> > will
> > > > not impact the overall runtime much. The probability of having a VM
> > > > constantly migrating during the same stage is abysmally low.
> > > > - A timeout is more maintenance-intensive. It's one more knob where
> you
> > > can
> > > > tweak a build or not. If you change the test a bit, you also need to
> > > > double-check the timeout. Hence, there have been quite a few commits
> > that
> > > > just increase timeouts.
> > > > - Whether a test uses a timeout or not is arbitrary: Why do some ITs
> > > have a
> > > > timeout and others don't? All IT tests are prone to timeout if there
> > are
> > > > issues with resource allocation. Similarly, there are quite a few
> unit
> > > > tests with timeouts while others don't have them with no obvious
> > pattern.
> > > > - An ill-set timeout reduces build reproducibility. Imagine having a
> > > > release with such a timeout and the users cannot build Flink
> reliably.
> > > >
> > > > I'd like to also point out that we should not cater around unstable
> > tests
> > > > if our overall goal is to have as many green builds as possible. If
> we
> > > > assume that our builds fail more often than not, we should also look
> > into
> > > > the other direction and continue the builds on error. I'm not a big
> fan
> > > of
> > > > that.
> > > >
> > > > One argument that I also heard is that it eases local debugging in
> case
> > > of
> > > > refactorings as you can see multiple failures at the same time. But
> no
> > > one
> > > > is keeping you from temporarily adding a timeout on your branch.
> Then,
> > we
> > > > can be sure that the timeout is plausible for your hardware and avoid
> > all
> > > > above mentioned drawbacks.
> > > >
> > > > @Robert Metzger <rm...@apache.org>
> > > >
> > > > > If we had a global limit of 1 minute per test, we would have caught
> > > this
> > > > > case (and we would encourage people to be careful with CI time).
> > > > >
> > > > There are quite a few tests that run longer, especially on a well
> > > utilized
> > > > build machine. A global limit is even worse than individual limits as
> > > there
> > > > is no value that fits it all. If you screwed up and 200 tests hang,
> > you'd
> > > > also run into the global timeout anyway. I'm also not sure what these
> > > > additional hangs bring you except a huge log.
> > > >
> > > > I'm also not sure if it's really better in terms of CI time. For
> > example,
> > > > for UnalignedCheckpointRescaleITCase, we test all known partitioners
> in
> > > one
> > > > pipeline for correctness. For higher parallelism, that means the test
> > > runs
> > > > over 1 minute regularly. If there is a global limit, I'd need to
> split
> > > the
> > > > test into smaller chunks, where I'm positive that the sum of the
> chunks
> > > > will be larger than before.
> > > >
> > > > PS: all tests on AZP will print INFO in the artifacts. There you can
> > also
> > > > retrieve the stacktraces.
> > > > PPS: I also said that we should revalidate the current timeout on
> AZP.
> > So
> > > > the argument that we have >2h of precious CI time wasted is kind of
> > > > constructed and is just due to some random defaults.
> > > >
> > > > On Tue, Apr 27, 2021 at 6:42 PM Till Rohrmann <tr...@apache.org>
> > > > wrote:
> > > >
> > > > > I think we do capture the INFO logs of the test runs on AZP.
> > > > >
> > > > > I am also not sure whether we really caught slow tests with Junit's
> > > > timeout
> > > > > rule before. I think the default is usually to increase the timeout
> > to
> > > > make
> > > > > the test pass. One way to find slow tests is to measure the time
> and
> > > look
> > > > > at the outliers.
> > > > >
> > > > > Cheers,
> > > > > Till
> > > > >
> > > > > On Tue, Apr 27, 2021 at 3:49 PM Dong Lin <li...@gmail.com>
> > wrote:
> > > > >
> > > > > > There is one more point that may be useful to consider here.
> > > > > >
> > > > > > In order to debug deadlock that is not easily reproducible, it is
> > > > likely
> > > > > > not sufficient to see only the thread dump to figure out the root
> > > > cause.
> > > > > We
> > > > > > likely need to enable the INFO level logging. Since AZP does not
> > > > provide
> > > > > > INFO level logging by default, we either need to reproduce the
> bug
> > > > > locally
> > > > > > or change the AZP log4j temporarily. This further reduces the
> > benefit
> > > > of
> > > > > > logging the thread dump (which comes at the cost of letting the
> AZP
> > > job
> > > > > > hang).
> > > > > >
> > > > > >
> > > > > > On Tue, Apr 27, 2021 at 9:34 PM Dong Lin <li...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > > Just to make sure I understand the proposal correctly: is the
> > > > proposal
> > > > > to
> > > > > > > disallow the usage of @Test(timeout=...) for Flink Junit tests?
> > > > > > >
> > > > > > > Here is my understanding of the pros/cons according to the
> > > discussion
> > > > > so
> > > > > > > far.
> > > > > > >
> > > > > > > Pros of allowing timeout:
> > > > > > > 1) When there are tests that are unreasonably slow, it helps us
> > > > > > > catch those tests and thus increase the quality of unit tests.
> > > > > > > 2) When there are tests that cause deadlock, it helps the AZP
> job
> > > > fail
> > > > > > > fast instead of being blocked for 4 hours. This saves resources
> > and
> > > > > also
> > > > > > > allows developers to get their PR tested again earlier (useful
> > when
> > > > the
> > > > > > > test failure is not relevant to their PR).
> > > > > > >
> > > > > > > Cons of allowing timeout:
> > > > > > > 1) When there are tests that cause deadlock, we could not see
> the
> > > > > thread
> > > > > > > dump of all threads, which makes debugging the issue harder.
> > > > > > >
> > > > > > > I would suggest that we should still allow timeout because the
> > pros
> > > > > > > outweigh the cons.
> > > > > > >
> > > > > > > As far as I can tell, if we allow timeout and encounter a
> > deadlock
> > > > bug
> > > > > in
> > > > > > > AZP, we still know which test (or test suite) fails. There is a
> > > good
> > > > > > chance
> > > > > > > we can reproduce the deadlock locally (by running it 100 times)
> > and
> > > > get
> > > > > > the
> > > > > > > debug information we need. In the rare case where the deadlock
> > > > happens
> > > > > > only
> > > > > > > on AZP, we can just disable the timeout for that particular
> test.
> > > So
> > > > > the
> > > > > > > lack of thread dump is not really a concern.
> > > > > > >
> > > > > > > On the other hand, if we disallow timeout, it will be very hard
> > for
> > > > us
> > > > > to
> > > > > > > catch low-quality tests. I don't know if there is a good
> > > alternative
> > > > > way
> > > > > > to
> > > > > > > catch those tests.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Apr 26, 2021 at 3:54 PM Dawid Wysakowicz <
> > > > > dwysakowicz@apache.org
> > > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > >> Hi devs!
> > > > > > >>
> > > > > > >> I wanted to bring up something that was discussed in a few
> > > > independent
> > > > > > >> groups of people in the past days. I'd like to revise using
> > > timeouts
> > > > > in
> > > > > > >> our JUnit tests. The suggestion would be not to use them
> > anymore.
> > > > The
> > > > > > >> problem with timeouts is that we have no thread dump and stack
> > > > traces
> > > > > of
> > > > > > >> the system as it hangs. If we were not using a timeout, the CI
> > > > runner
> > > > > > >> would have caught the timeout and created a thread dump which
> > > often
> > > > > is a
> > > > > > >> great starting point for debugging.
> > > > > > >>
> > > > > > >> This problem has been spotted e.g. during debugging
> > > FLINK-22416[1].
> > > > In
> > > > > > >> the past thread dumps were not always taken for hanging tests,
> > but
> > > > it
> > > > > > >> was changed quite recently in FLINK-21346[2]. I am happy to
> hear
> > > > your
> > > > > > >> opinions on it. If there are no objections I would like to add
> > the
> > > > > > >> suggestion to the Coding Guidelines[3]
> > > > > > >>
> > > > > > >> Best,
> > > > > > >>
> > > > > > >> Dawid
> > > > > > >>
> > > > > > >>
> > > > > > >> [1] https://issues.apache.org/jira/browse/FLINK-22416
> > > > > > >>
> > > > > > >> [2] https://issues.apache.org/jira/browse/FLINK-21346
> > > > > > >>
> > > > > > >> [3]
> > > > > > >>
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://flink.apache.org/contributing/code-style-and-quality-java.html#java-language-features-and-libraries
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Using timeouts in JUnit tests

Posted by Dong Lin <li...@gmail.com>.
Thanks Till. Yes, you are right. The INFO logging is enabled; it is just
dumped to a file (by the FileAppender) rather than to the console.

There is probably a way to retrieve that log file from AZP. I will ask
other colleagues how to get this later.
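
[Editorial note: the setup being discussed is roughly the following — an illustrative log4j2 properties sketch, not the exact contents of `tools/ci/log4j.properties`; the appender name, file name, and pattern are assumptions. Console output is suppressed while a file appender still captures INFO, so the logs only appear in the uploaded CI artifacts, not in the console view.]

```properties
# Sketch: INFO goes to a file (collected as a CI artifact), nothing to the console.
rootLogger.level = INFO
rootLogger.appenderRef.file.ref = FileAppender

appender.file.name = FileAppender
appender.file.type = File
appender.file.fileName = ${sys:log.dir}/mvn-test.log
appender.file.layout.type = PatternLayout
appender.file.layout.pattern = %d{HH:mm:ss,SSS} %-5p %-60c %x - %m%n
```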

On Thu, Apr 29, 2021 at 4:51 PM Till Rohrmann <tr...@apache.org> wrote:

> I think for the maven tests we use this log4j.properties file [1].
>
> [1] https://github.com/apache/flink/blob/master/tools/ci/log4j.properties
>
> Cheers,
> Till
>
> On Wed, Apr 28, 2021 at 4:47 AM Dong Lin <li...@gmail.com> wrote:
>
> > Thanks for the detailed explanations! Regarding the usage of timeout,
> now I
> > agree that it is better to remove per-test timeouts because it helps
> > make our testing results more reliable and consistent.
> >
> > My previous concern is that it might not be a good idea to intentionally
> > let the test hang in AZP in order to get the thread dump. Now I get that
> > there are a few practical concerns around the usage of timeout which
> makes
> > testing results unreliable (e.g. flakiness in the presence of VM
> > migration).
> >
> > Regarding the level logging on AZP, it appears that we actually set
> > "rootLogger.level = OFF" in most log4j2-test.properties, which means that
> > no INFO log would be printed on AZP. For example, I tried to increase the
> > log level in this <https://github.com/apache/flink/pull/15617> PR and
> was
> > suggested in this
> > <
> >
> https://issues.apache.org/jira/browse/FLINK-22085?focusedCommentId=17321055&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17321055
> > >
> > comment to avoid increasing the log level. Did I miss something here?
> >
> >
> > On Wed, Apr 28, 2021 at 2:22 AM Arvid Heise <ar...@apache.org> wrote:
> >
> > > Just to add to Dong Lin's list of cons of allowing timeout:
> > > - Any timeout value that you manually set is arbitrary. If it's set too
> > > low, you get test instabilities. What too low means depends on numerous
> > > factors, such as hardware and current utilization (especially I/O). If
> > you
> > > run in VMs and the VM is migrated while running a test, any reasonable
> > > timeout will probably fail. While you could make a similar case for the
> > > overall timeout of tests, any smaller hiccup in the range of minutes
> will
> > > not impact the overall runtime much. The probability of having a VM
> > > constantly migrating during the same stage is abysmally low.
> > > - A timeout is more maintenance-intensive. It's one more knob where you
> > can
> > > tweak a build or not. If you change the test a bit, you also need to
> > > double-check the timeout. Hence, there have been quite a few commits
> that
> > > just increase timeouts.
> > > - Whether a test uses a timeout or not is arbitrary: Why do some ITs
> > have a
> > > timeout and others don't? All IT tests are prone to timeout if there
> are
> > > issues with resource allocation. Similarly, there are quite a few unit
> > > tests with timeouts while others don't have them with no obvious
> pattern.
> > > - An ill-set timeout reduces build reproducibility. Imagine having a
> > > release with such a timeout and the users cannot build Flink reliably.
> > >
> > > I'd like to also point out that we should not cater around unstable
> tests
> > > if our overall goal is to have as many green builds as possible. If we
> > > assume that our builds fail more often than not, we should also look
> into
> > > the other direction and continue the builds on error. I'm not a big fan
> > of
> > > that.
> > >
> > > One argument that I also heard is that it eases local debugging in case
> > of
> > > refactorings as you can see multiple failures at the same time. But no
> > one
> > > is keeping you from temporarily adding a timeout on your branch. Then,
> we
> > > can be sure that the timeout is plausible for your hardware and avoid
> all
> > > above mentioned drawbacks.
> > >
> > > @Robert Metzger <rm...@apache.org>
> > >
> > > > If we had a global limit of 1 minute per test, we would have caught
> > this
> > > > case (and we would encourage people to be careful with CI time).
> > > >
> > > There are quite a few tests that run longer, especially on a well
> > utilized
> > > build machine. A global limit is even worse than individual limits as
> > there
> > > is no value that fits it all. If you screwed up and 200 tests hang,
> you'd
> > > also run into the global timeout anyway. I'm also not sure what these
> > > additional hangs bring you except a huge log.
> > >
> > > I'm also not sure if it's really better in terms of CI time. For
> example,
> > > for UnalignedCheckpointRescaleITCase, we test all known partitioners in
> > one
> > > pipeline for correctness. For higher parallelism, that means the test
> > runs
> > > over 1 minute regularly. If there is a global limit, I'd need to split
> > the
> > > test into smaller chunks, where I'm positive that the sum of the chunks
> > > will be larger than before.
> > >
> > > PS: all tests on AZP will print INFO in the artifacts. There you can
> also
> > > retrieve the stacktraces.
> > > PPS: I also said that we should revalidate the current timeout on AZP.
> So
> > > the argument that we have >2h of precious CI time wasted is kind of
> > > constructed and is just due to some random defaults.
> > >
> > > On Tue, Apr 27, 2021 at 6:42 PM Till Rohrmann <tr...@apache.org>
> > > wrote:
> > >
> > > > I think we do capture the INFO logs of the test runs on AZP.
> > > >
> > > > I am also not sure whether we really caught slow tests with Junit's
> > > timeout
> > > > rule before. I think the default is usually to increase the timeout
> to
> > > make
> > > > the test pass. One way to find slow tests is to measure the time and
> > look
> > > > at the outliers.
> > > >
> > > > Cheers,
> > > > Till
> > > >
> > > > On Tue, Apr 27, 2021 at 3:49 PM Dong Lin <li...@gmail.com>
> wrote:
> > > >
> > > > > There is one more point that may be useful to consider here.
> > > > >
> > > > > In order to debug deadlock that is not easily reproducible, it is
> > > likely
> > > > > not sufficient to see only the thread dump to figure out the root
> > > cause.
> > > > We
> > > > > likely need to enable the INFO level logging. Since AZP does not
> > > provide
> > > > > INFO level logging by default, we either need to reproduce the bug
> > > > locally
> > > > > or change the AZP log4j temporarily. This further reduces the
> benefit
> > > of
> > > > > logging the thread dump (which comes at the cost of letting the AZP
> > job
> > > > > hang).
> > > > >
> > > > >
> > > > > On Tue, Apr 27, 2021 at 9:34 PM Dong Lin <li...@gmail.com>
> > wrote:
> > > > >
> > > > > > Just to make sure I understand the proposal correctly: is the
> > > proposal
> > > > to
> > > > > > disallow the usage of @Test(timeout=...) for Flink Junit tests?
> > > > > >
> > > > > > Here is my understanding of the pros/cons according to the
> > discussion
> > > > so
> > > > > > far.
> > > > > >
> > > > > > Pros of allowing timeout:
> > > > > > 1) When there are tests that are unreasonably slow, it helps us
> > > > > > catch those tests and thus increase the quality of unit tests.
> > > > > > 2) When there are tests that cause deadlock, it helps the AZP job
> > > fail
> > > > > > fast instead of being blocked for 4 hours. This saves resources
> and
> > > > also
> > > > > > allows developers to get their PR tested again earlier (useful
> when
> > > the
> > > > > > test failure is not relevant to their PR).
> > > > > >
> > > > > > Cons of allowing timeout:
> > > > > > 1) When there are tests that cause deadlock, we could not see the
> > > > thread
> > > > > > dump of all threads, which makes debugging the issue harder.
> > > > > >
> > > > > > I would suggest that we should still allow timeout because the
> pros
> > > > > > outweigh the cons.
> > > > > >
> > > > > > As far as I can tell, if we allow timeout and encounter a
> deadlock
> > > bug
> > > > in
> > > > > > AZP, we still know which test (or test suite) fails. There is a
> > good
> > > > > chance
> > > > > > we can reproduce the deadlock locally (by running it 100 times)
> and
> > > get
> > > > > the
> > > > > > debug information we need. In the rare case where the deadlock
> > > happens
> > > > > only
> > > > > > on AZP, we can just disable the timeout for that particular test.
> > So
> > > > the
> > > > > > lack of thread dump is not really a concern.
> > > > > >
> > > > > > On the other hand, if we disallow timeout, it will be very hard
> for
> > > us
> > > > to
> > > > > > catch low-quality tests. I don't know if there is a good
> > alternative
> > > > way
> > > > > to
> > > > > > catch those tests.
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Mon, Apr 26, 2021 at 3:54 PM Dawid Wysakowicz <
> > > > dwysakowicz@apache.org
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > >> Hi devs!
> > > > > >>
> > > > > >> I wanted to bring up something that was discussed in a few
> > > independent
> > > > > >> groups of people in the past days. I'd like to revise using
> > timeouts
> > > > in
> > > > > >> our JUnit tests. The suggestion would be not to use them
> anymore.
> > > The
> > > > > >> problem with timeouts is that we have no thread dump and stack
> > > traces
> > > > of
> > > > > >> the system as it hangs. If we were not using a timeout, the CI
> > > runner
> > > > > >> would have caught the timeout and created a thread dump which
> > often
> > > > is a
> > > > > >> great starting point for debugging.
> > > > > >>
> > > > > >> This problem has been spotted e.g. during debugging
> > FLINK-22416[1].
> > > In
> > > > > >> the past thread dumps were not always taken for hanging tests,
> but
> > > it
> > > > > >> was changed quite recently in FLINK-21346[2]. I am happy to hear
> > > your
> > > > > >> opinions on it. If there are no objections I would like to add
> the
> > > > > >> suggestion to the Coding Guidelines[3]
> > > > > >>
> > > > > >> Best,
> > > > > >>
> > > > > >> Dawid
> > > > > >>
> > > > > >>
> > > > > >> [1] https://issues.apache.org/jira/browse/FLINK-22416
> > > > > >>
> > > > > >> [2] https://issues.apache.org/jira/browse/FLINK-21346
> > > > > >>
> > > > > >> [3]
> > > > > >>
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://flink.apache.org/contributing/code-style-and-quality-java.html#java-language-features-and-libraries
> > > > > >>
> > > > > >>
> > > > > >>
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Using timeouts in JUnit tests

Posted by Till Rohrmann <tr...@apache.org>.
I think for the maven tests we use this log4j.properties file [1].

[1] https://github.com/apache/flink/blob/master/tools/ci/log4j.properties

Cheers,
Till

On Wed, Apr 28, 2021 at 4:47 AM Dong Lin <li...@gmail.com> wrote:

> Thanks for the detailed explanations! Regarding the usage of timeout, now I
> agree that it is better to remove per-test timeouts because it helps
> make our testing results more reliable and consistent.
>
> My previous concern is that it might not be a good idea to intentionally
> let the test hang in AZP in order to get the thread dump. Now I get that
> there are a few practical concerns around the usage of timeout which makes
> testing results unreliable (e.g. flakiness in the presence of VM
> migration).
>
> Regarding the level logging on AZP, it appears that we actually set
> "rootLogger.level = OFF" in most log4j2-test.properties, which means that
> no INFO log would be printed on AZP. For example, I tried to increase the
> log level in this <https://github.com/apache/flink/pull/15617> PR and was
> suggested in this
> <
> https://issues.apache.org/jira/browse/FLINK-22085?focusedCommentId=17321055&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17321055
> >
> comment to avoid increasing the log level. Did I miss something here?
>
>
> On Wed, Apr 28, 2021 at 2:22 AM Arvid Heise <ar...@apache.org> wrote:
>
> > Just to add to Dong Lin's list of cons of allowing timeout:
> > - Any timeout value that you manually set is arbitrary. If it's set too
> > low, you get test instabilities. What too low means depends on numerous
> > factors, such as hardware and current utilization (especially I/O). If
> you
> > run in VMs and the VM is migrated while running a test, any reasonable
> > timeout will probably fail. While you could make a similar case for the
> > overall timeout of tests, any smaller hiccup in the range of minutes will
> > not impact the overall runtime much. The probability of having a VM
> > constantly migrating during the same stage is abysmally low.
> > - A timeout is more maintenance-intensive. It's one more knob where you
> can
> > tweak a build or not. If you change the test a bit, you also need to
> > double-check the timeout. Hence, there have been quite a few commits that
> > just increase timeouts.
> > - Whether a test uses a timeout or not is arbitrary: Why do some ITs
> have a
> > timeout and others don't? All IT tests are prone to timeout if there are
> > issues with resource allocation. Similarly, there are quite a few unit
> > tests with timeouts while others don't have them with no obvious pattern.
> > - An ill-set timeout reduces build reproducibility. Imagine having a
> > release with such a timeout and the users cannot build Flink reliably.
> >
> > I'd like to also point out that we should not cater around unstable tests
> > if our overall goal is to have as many green builds as possible. If we
> > assume that our builds fail more often than not, we should also look into
> > the other direction and continue the builds on error. I'm not a big fan
> of
> > that.
> >
> > One argument that I also heard is that it eases local debugging in case
> of
> > refactorings as you can see multiple failures at the same time. But no
> one
> > is keeping you from temporarily adding a timeout on your branch. Then, we
> > can be sure that the timeout is plausible for your hardware and avoid all
> > above mentioned drawbacks.
> >
> > @Robert Metzger <rm...@apache.org>
> >
> > > If we had a global limit of 1 minute per test, we would have caught
> this
> > > case (and we would encourage people to be careful with CI time).
> > >
> > There are quite a few tests that run longer, especially on a well
> utilized
> > build machine. A global limit is even worse than individual limits as
> there
> > is no value that fits it all. If you screwed up and 200 tests hang, you'd
> > also run into the global timeout anyway. I'm also not sure what these
> > additional hangs bring you except a huge log.
> >
> > I'm also not sure if it's really better in terms of CI time. For example,
> > for UnalignedCheckpointRescaleITCase, we test all known partitioners in
> one
> > pipeline for correctness. For higher parallelism, that means the test
> runs
> > over 1 minute regularly. If there is a global limit, I'd need to split
> the
> > test into smaller chunks, where I'm positive that the sum of the chunks
> > will be larger than before.
> >
> > PS: all tests on AZP will print INFO in the artifacts. There you can also
> > retrieve the stacktraces.
> > PPS: I also said that we should revalidate the current timeout on AZP. So
> > the argument that we have >2h of precious CI time wasted is kind of
> > constructed and is just due to some random defaults.
> >
> > On Tue, Apr 27, 2021 at 6:42 PM Till Rohrmann <tr...@apache.org>
> > wrote:
> >
> > > I think we do capture the INFO logs of the test runs on AZP.
> > >
> > > I am also not sure whether we really caught slow tests with Junit's
> > timeout
> > > rule before. I think the default is usually to increase the timeout to
> > make
> > > the test pass. One way to find slow tests is to measure the time and
> look
> > > at the outliers.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Tue, Apr 27, 2021 at 3:49 PM Dong Lin <li...@gmail.com> wrote:
> > >
> > > > There is one more point that may be useful to consider here.
> > > >
> > > > In order to debug deadlock that is not easily reproducible, it is
> > likely
> > > > not sufficient to see only the thread dump to figure out the root
> > cause.
> > > We
> > > > likely need to enable the INFO level logging. Since AZP does not
> > provide
> > > > INFO level logging by default, we either need to reproduce the bug
> > > locally
> > > > or change the AZP log4j temporarily. This further reduces the benefit
> > of
> > > > logging the thread dump (which comes at the cost of letting the AZP
> job
> > > > hang).
> > > >
> > > >
> > > > On Tue, Apr 27, 2021 at 9:34 PM Dong Lin <li...@gmail.com>
> wrote:
> > > >
> > > > > Just to make sure I understand the proposal correctly: is the
> > proposal
> > > to
> > > > > disallow the usage of @Test(timeout=...) for Flink Junit tests?
> > > > >
> > > > > Here is my understanding of the pros/cons according to the
> discussion
> > > so
> > > > > far.
> > > > >
> > > > > Pros of allowing timeout:
> > > > > 1) When there are tests that are unreasonably slow, it helps us
> > > > > catch those tests and thus increase the quality of unit tests.
> > > > > 2) When there are tests that cause deadlock, it helps the AZP job
> > fail
> > > > > fast instead of being blocked for 4 hours. This saves resources and
> > > also
> > > > > allows developers to get their PR tested again earlier (useful when
> > the
> > > > > test failure is not relevant to their PR).
> > > > >
> > > > > Cons of allowing timeout:
> > > > > 1) When there are tests that cause deadlock, we could not see the
> > > thread
> > > > > dump of all threads, which makes debugging the issue harder.
> > > > >
> > > > > I would suggest that we should still allow timeout because the pros
> > > > > outweigh the cons.
> > > > >
> > > > > As far as I can tell, if we allow timeout and encounter a deadlock
> > bug
> > > in
> > > > > AZP, we still know which test (or test suite) fails. There is a
> good
> > > > chance
> > > > > we can reproduce the deadlock locally (by running it 100 times) and
> > get
> > > > the
> > > > > debug information we need. In the rare case where the deadlock
> > happens
> > > > only
> > > > > on AZP, we can just disable the timeout for that particular test.
> So
> > > the
> > > > > lack of thread dump is not really a concern.
> > > > >
> > > > > On the other hand, if we disallow timeout, it will be very hard for
> > us
> > > to
> > > > > catch low-quality tests. I don't know if there is a good
> alternative
> > > way
> > > > to
> > > > > catch those tests.
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Apr 26, 2021 at 3:54 PM Dawid Wysakowicz <
> > > dwysakowicz@apache.org
> > > > >
> > > > > wrote:
> > > > >
> > > > >> Hi devs!
> > > > >>
> > > > >> I wanted to bring up something that was discussed in a few
> > independent
> > > > >> groups of people in the past days. I'd like to revise using
> timeouts
> > > in
> > > > >> our JUnit tests. The suggestion would be not to use them anymore.
> > The
> > > > >> problem with timeouts is that we have no thread dump and stack
> > traces
> > > of
> > > > >> the system as it hangs. If we were not using a timeout, the CI
> > runner
> > > > >> would have caught the timeout and created a thread dump which
> often
> > > is a
> > > > >> great starting point for debugging.
> > > > >>
> > > > >> This problem has been spotted e.g. during debugging
> FLINK-22416[1].
> > In
> > > > >> the past thread dumps were not always taken for hanging tests, but
> > it
> > > > >> was changed quite recently in FLINK-21346[2]. I am happy to hear
> > your
> > > > >> opinions on it. If there are no objections I would like to add the
> > > > >> suggestion to the Coding Guidelines[3]
> > > > >>
> > > > >> Best,
> > > > >>
> > > > >> Dawid
> > > > >>
> > > > >>
> > > > >> [1] https://issues.apache.org/jira/browse/FLINK-22416
> > > > >>
> > > > >> [2] https://issues.apache.org/jira/browse/FLINK-21346
> > > > >>
> > > > >> [3]
> > > > >>
> > > > >>
> > > >
> > >
> >
> https://flink.apache.org/contributing/code-style-and-quality-java.html#java-language-features-and-libraries
> > > > >>
> > > > >>
> > > > >>
> > > >
> > >
> >
>

Re: [DISCUSS] Using timeouts in JUnit tests

Posted by Dong Lin <li...@gmail.com>.
Thanks for the detailed explanations! Regarding the usage of timeouts, I now
agree that it is better to remove per-test timeouts because doing so helps
make our testing results more reliable and consistent.

My previous concern was that it might not be a good idea to intentionally
let the test hang on AZP in order to get the thread dump. Now I get that
there are a few practical concerns around the usage of timeouts which make
testing results unreliable (e.g. flakiness in the presence of VM migration).

Regarding the logging level on AZP, it appears that we actually set
"rootLogger.level = OFF" in most log4j2-test.properties files, which means that
no INFO logs would be printed on AZP. For example, I tried to increase the
log level in this <https://github.com/apache/flink/pull/15617> PR and was
suggested in this
<https://issues.apache.org/jira/browse/FLINK-22085?focusedCommentId=17321055&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17321055>
comment to avoid increasing the log level. Did I miss something here?
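For readers unfamiliar with the file being discussed: the "rootLogger.level"
line sits at the top of a module's log4j2-test.properties. A minimal sketch of
such a file is below; the appender name and pattern are illustrative, not
copied from the Flink repository. With the root level set to OFF, nothing below
it is emitted, so no INFO output reaches the AZP artifacts for that module.

```properties
# Illustrative log4j2-test.properties; appender details are assumptions.
# OFF suppresses all test logging; change to INFO to see test output.
rootLogger.level = OFF
rootLogger.appenderRef.test.ref = TestLogger

appender.testlogger.name = TestLogger
appender.testlogger.type = CONSOLE
appender.testlogger.target = SYSTEM_ERR
appender.testlogger.layout.type = PatternLayout
appender.testlogger.layout.pattern = %-4r [%t] %-5p %c %x - %m%n
```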


On Wed, Apr 28, 2021 at 2:22 AM Arvid Heise <ar...@apache.org> wrote:

> Just to add to Dong Lin's list of cons of allowing timeout:
> - Any timeout value that you manually set is arbitrary. If it's set too
> low, you get test instabilities. What too low means depends on numerous
> factors, such as hardware and current utilization (especially I/O). If you
> run in VMs and the VM is migrated while running a test, any reasonable
> timeout will probably fail. While you could make a similar case for the
> overall timeout of tests, any smaller hiccup in the range of minutes will
> not impact the overall runtime much. The probability of having a VM
> constantly migrating during the same stage is abysmally low.
> - A timeout is more maintenance-intensive. It's one more knob where you can
> tweak a build or not. If you change the test a bit, you also need to
> double-check the timeout. Hence, there have been quite a few commits that
> just increase timeouts.
> - Whether a test uses a timeout or not is arbitrary: Why do some ITs have a
> timeout and others don't? All IT tests are prone to timeout if there are
> issues with resource allocation. Similarly, there are quite a few unit
> tests with timeouts while others don't have them with no obvious pattern.
> - An ill-set timeout reduces build reproducibility. Imagine having a
> release with such a timeout and the users cannot build Flink reliably.
>
> I'd like to also point out that we should not cater around unstable tests
> if our overall goal is to have as many green builds as possible. If we
> assume that our builds fail more often than not, we should also look into
> the other direction and continue the builds on error. I'm not a big fan of
> that.
>
> One argument that I also heard is that it eases local debugging in case of
> refactorings as you can see multiple failures at the same time. But no one
> is keeping you from temporarily adding a timeout on your branch. Then, we
> can be sure that the timeout is plausible for your hardware and avoid all
> above mentioned drawbacks.
>
> @Robert Metzger <rm...@apache.org>
>
> > If we had a global limit of 1 minute per test, we would have caught this
> > case (and we would encourage people to be careful with CI time).
> >
> There are quite a few tests that run longer, especially on a well utilized
> build machine. A global limit is even worse than individual limits as there
> is no value that fits it all. If you screwed up and 200 tests hang, you'd
> also run into the global timeout anyway. I'm also not sure what these
> additional hangs bring you except a huge log.
>
> I'm also not sure if it's really better in terms of CI time. For example,
> for UnalignedCheckpointRescaleITCase, we test all known partitioners in one
> pipeline for correctness. For higher parallelism, that means the test runs
> over 1 minute regularly. If there is a global limit, I'd need to split the
> test into smaller chunks, where I'm positive that the sum of the chunks
> will be larger than before.
>
> PS: all tests on AZP will print INFO in the artifacts. There you can also
> retrieve the stacktraces.
> PPS: I also said that we should revalidate the current timeout on AZP. So
> the argument that we have >2h of precious CI time wasted is kind of
> constructed and is just due to some random defaults.
>
> On Tue, Apr 27, 2021 at 6:42 PM Till Rohrmann <tr...@apache.org>
> wrote:
>
> > I think we do capture the INFO logs of the test runs on AZP.
> >
> > I am also not sure whether we really caught slow tests with Junit's
> timeout
> > rule before. I think the default is usually to increase the timeout to
> make
> > the test pass. One way to find slow tests is to measure the time and look
> > at the outliers.
> >
> > Cheers,
> > Till
> >
> > On Tue, Apr 27, 2021 at 3:49 PM Dong Lin <li...@gmail.com> wrote:
> >
> > > There is one more point that may be useful to consider here.
> > >
> > > In order to debug deadlock that is not easily reproducible, it is
> likely
> > > not sufficient to see only the thread dump to figure out the root
> cause.
> > We
> > > likely need to enable the INFO level logging. Since AZP does not
> provide
> > > INFO level logging by default, we either need to reproduce the bug
> > locally
> > > or change the AZP log4j temporarily. This further reduces the benefit
> of
> > > logging the thread dump (which comes at the cost of letting the AZP job
> > > hang).
> > >
> > >
> > > On Tue, Apr 27, 2021 at 9:34 PM Dong Lin <li...@gmail.com> wrote:
> > >
> > > > Just to make sure I understand the proposal correctly: is the
> proposal
> > to
> > > > disallow the usage of @Test(timeout=...) for Flink Junit tests?
> > > >
> > > > Here is my understanding of the pros/cons according to the discussion
> > so
> > > > far.
> > > >
> > > > Pros of allowing timeout:
> > > > 1) When there are tests that are unreasonably slow, it helps us
> > > > catch those tests and thus increase the quality of unit tests.
> > > > 2) When there are tests that cause deadlock, it helps the AZP job
> fail
> > > > fast instead of being blocked for 4 hours. This saves resources and
> > also
> > > > allows developers to get their PR tested again earlier (useful when
> the
> > > > test failure is not relevant to their PR).
> > > >
> > > > Cons of allowing timeout:
> > > > 1) When there are tests that cause deadlock, we could not see the
> > thread
> > > > dump of all threads, which makes debugging the issue harder.
> > > >
> > > > I would suggest that we should still allow timeout because the pros
> > > > outweigh the cons.
> > > >
> > > > As far as I can tell, if we allow timeout and encounter a deadlock
> bug
> > in
> > > > AZP, we still know which test (or test suite) fails. There is a good
> > > chance
> > > > we can reproduce the deadlock locally (by running it 100 times) and
> get
> > > the
> > > > debug information we need. In the rare case where the deadlock
> happens
> > > only
> > > > on AZP, we can just disable the timeout for that particular test. So
> > the
> > > > lack of thread dump is not really a concern.
> > > >
> > > > On the other hand, if we disallow timeout, it will be very hard for
> us
> > to
> > > > catch low-quality tests. I don't know if there is a good alternative
> > way
> > > to
> > > > catch those tests.
> > > >
> > > >
> > > >
> > > > On Mon, Apr 26, 2021 at 3:54 PM Dawid Wysakowicz <
> > dwysakowicz@apache.org
> > > >
> > > > wrote:
> > > >
> > > >> Hi devs!
> > > >>
> > > >> I wanted to bring up something that was discussed in a few
> independent
> > > >> groups of people in the past days. I'd like to revise using timeouts
> > in
> > > >> our JUnit tests. The suggestion would be not to use them anymore.
> The
> > > >> problem with timeouts is that we have no thread dump and stack
> traces
> > of
> > > >> the system as it hangs. If we were not using a timeout, the CI
> runner
> > > >> would have caught the timeout and created a thread dump which often
> > is a
> > > >> great starting point for debugging.
> > > >>
> > > >> This problem has been spotted e.g. during debugging FLINK-22416[1].
> In
> > > >> the past thread dumps were not always taken for hanging tests, but
> it
> > > >> was changed quite recently in FLINK-21346[2]. I am happy to hear
> your
> > > >> opinions on it. If there are no objections I would like to add the
> > > >> suggestion to the Coding Guidelines[3]
> > > >>
> > > >> Best,
> > > >>
> > > >> Dawid
> > > >>
> > > >>
> > > >> [1] https://issues.apache.org/jira/browse/FLINK-22416
> > > >>
> > > >> [2] https://issues.apache.org/jira/browse/FLINK-21346
> > > >>
> > > >> [3]
> > > >>
> > > >>
> > >
> >
> https://flink.apache.org/contributing/code-style-and-quality-java.html#java-language-features-and-libraries
> > > >>
> > > >>
> > > >>
> > >
> >
>

Re: [DISCUSS] Using timeouts in JUnit tests

Posted by Arvid Heise <ar...@apache.org>.
Just to add to Dong Lin's list of cons of allowing timeout:
- Any timeout value that you manually set is arbitrary. If it's set too
low, you get test instabilities. What too low means depends on numerous
factors, such as hardware and current utilization (especially I/O). If you
run in VMs and the VM is migrated while running a test, any reasonable
timeout will probably fail. While you could make a similar case for the
overall timeout of tests, any smaller hiccup in the range of minutes will
not impact the overall runtime much. The probability of having a VM
constantly migrating during the same stage is abysmally low.
- A timeout is more maintenance-intensive. It's one more knob where you can
tweak a build or not. If you change the test a bit, you also need to
double-check the timeout. Hence, there have been quite a few commits that
just increase timeouts.
- Whether a test uses a timeout or not is arbitrary: Why do some ITs have a
timeout and others don't? All IT tests are prone to timeout if there are
issues with resource allocation. Similarly, there are quite a few unit
tests with timeouts while others don't have them with no obvious pattern.
- An ill-set timeout reduces build reproducibility. Imagine having a
release with such a timeout and the users cannot build Flink reliably.

I'd like to also point out that we should not cater to unstable tests
if our overall goal is to have as many green builds as possible. If we
assume that our builds fail more often than not, we should also look into
the other direction and continue the builds on error. I'm not a big fan of
that.

One argument that I also heard is that it eases local debugging in case of
refactorings as you can see multiple failures at the same time. But no one
is keeping you from temporarily adding a timeout on your branch. Then, we
can be sure that the timeout is plausible for your hardware and avoid all the
above-mentioned drawbacks.

@Robert Metzger <rm...@apache.org>

> If we had a global limit of 1 minute per test, we would have caught this
> case (and we would encourage people to be careful with CI time).
>
There are quite a few tests that run longer, especially on a well utilized
build machine. A global limit is even worse than individual limits as there
is no value that fits it all. If you screwed up and 200 tests hang, you'd
also run into the global timeout anyway. I'm also not sure what these
additional hangs bring you except a huge log.

I'm also not sure if it's really better in terms of CI time. For example,
for UnalignedCheckpointRescaleITCase, we test all known partitioners in one
pipeline for correctness. For higher parallelism, that means the test runs
over 1 minute regularly. If there is a global limit, I'd need to split the
test into smaller chunks, where I'm positive that the sum of the chunks
will be larger than before.

PS: all tests on AZP will print INFO in the artifacts. There you can also
retrieve the stacktraces.
PPS: I also said that we should revalidate the current timeout on AZP. So
the argument that we have >2h of precious CI time wasted is kind of
constructed and is just due to some random defaults.
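As a side note on the thread dumps mentioned throughout this thread: the dump
the CI watchdog produces for a hanging build is essentially what the JVM
exposes via Thread.getAllStackTraces(). A minimal, self-contained sketch (the
class and method names here are hypothetical, not Flink code):

```java
import java.util.Map;

public class ThreadDumpSketch {

    // Build a stack trace listing for every live thread, similar in spirit
    // to the dump the CI watchdog takes when a build exceeds its stage
    // timeout (FLINK-21346).
    public static String dumpAllThreads() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry :
                Thread.getAllStackTraces().entrySet()) {
            sb.append('"').append(entry.getKey().getName()).append("\"\n");
            for (StackTraceElement frame : entry.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dumpAllThreads());
    }
}
```

This is exactly the information a per-test @Test(timeout=...) failure throws
away: JUnit interrupts the test thread, so the state of all the other threads
at the moment of the hang is lost.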

On Tue, Apr 27, 2021 at 6:42 PM Till Rohrmann <tr...@apache.org> wrote:

> I think we do capture the INFO logs of the test runs on AZP.
>
> I am also not sure whether we really caught slow tests with Junit's timeout
> rule before. I think the default is usually to increase the timeout to make
> the test pass. One way to find slow tests is to measure the time and look
> at the outliers.
>
> Cheers,
> Till
>
> On Tue, Apr 27, 2021 at 3:49 PM Dong Lin <li...@gmail.com> wrote:
>
> > There is one more point that may be useful to consider here.
> >
> > In order to debug deadlock that is not easily reproducible, it is likely
> > not sufficient to see only the thread dump to figure out the root cause.
> We
> > likely need to enable the INFO level logging. Since AZP does not provide
> > INFO level logging by default, we either need to reproduce the bug
> locally
> > or change the AZP log4j temporarily. This further reduces the benefit of
> > logging the thread dump (which comes at the cost of letting the AZP job
> > hang).
> >
> >
> > On Tue, Apr 27, 2021 at 9:34 PM Dong Lin <li...@gmail.com> wrote:
> >
> > > Just to make sure I understand the proposal correctly: is the proposal
> to
> > > disallow the usage of @Test(timeout=...) for Flink Junit tests?
> > >
> > > Here is my understanding of the pros/cons according to the discussion
> so
> > > far.
> > >
> > > Pros of allowing timeout:
> > > 1) When there are tests that are unreasonably slow, it helps us
> > > catch those tests and thus increase the quality of unit tests.
> > > 2) When there are tests that cause deadlock, it helps the AZP job fail
> > > fast instead of being blocked for 4 hours. This saves resources and
> also
> > > allows developers to get their PR tested again earlier (useful when the
> > > test failure is not relevant to their PR).
> > >
> > > Cons of allowing timeout:
> > > 1) When there are tests that cause deadlock, we could not see the
> thread
> > > dump of all threads, which makes debugging the issue harder.
> > >
> > > I would suggest that we should still allow timeout because the pros
> > > outweigh the cons.
> > >
> > > As far as I can tell, if we allow timeout and encounter a deadlock bug
> in
> > > AZP, we still know which test (or test suite) fails. There is a good chance
> > > we can reproduce the deadlock locally (by running it 100 times) and get the
> > > debug information we need. In the rare case where the deadlock happens only
> > > on AZP, we can just disable the timeout for that particular test. So the
> > > lack of thread dump is not really a concern.
> > >
> > > On the other hand, if we disallow timeout, it will be very hard for us to
> > > catch low-quality tests. I don't know if there is a good alternative way to
> > > catch those tests.
> > >
> > >
> > >
> > > On Mon, Apr 26, 2021 at 3:54 PM Dawid Wysakowicz <dwysakowicz@apache.org>
> > > wrote:
> > >
> > >> Hi devs!
> > >>
> > >> I wanted to bring up something that was discussed in a few independent
> > >> groups of people in the past days. I'd like to revisit using timeouts in
> > >> our JUnit tests. The suggestion would be not to use them anymore. The
> > >> problem with timeouts is that we have no thread dump and stack traces of
> > >> the system as it hangs. If we were not using a timeout, the CI runner
> > >> would have caught the timeout and created a thread dump which often is a
> > >> great starting point for debugging.
> > >>
> > >> This problem has been spotted e.g. during debugging FLINK-22416[1]. In
> > >> the past thread dumps were not always taken for hanging tests, but it
> > >> was changed quite recently in FLINK-21346[2]. I am happy to hear your
> > >> opinions on it. If there are no objections I would like to add the
> > >> suggestion to the Coding Guidelines[3].
> > >>
> > >> Best,
> > >>
> > >> Dawid
> > >>
> > >>
> > >> [1] https://issues.apache.org/jira/browse/FLINK-22416
> > >>
> > >> [2] https://issues.apache.org/jira/browse/FLINK-21346
> > >>
> > >> [3]
> > >>
> > >> https://flink.apache.org/contributing/code-style-and-quality-java.html#java-language-features-and-libraries
> > >>
> > >>
> > >>
> >
>

Re: [DISCUSS] Using timeouts in JUnit tests

Posted by Till Rohrmann <tr...@apache.org>.
I think we do capture the INFO logs of the test runs on AZP.

I am also not sure whether we really caught slow tests with JUnit's timeout
rule before. I think the default is usually to increase the timeout to make
the test pass. One way to find slow tests is to measure the time and look
at the outliers.

Cheers,
Till
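
Till's suggestion of measuring durations and flagging outliers can be sketched with the standard library alone. The two-second threshold and the reporting format below are illustrative assumptions, not anything Flink's build actually uses:

```java
import java.time.Duration;
import java.time.Instant;

public class TestTimer {

    // Illustrative threshold; a real harness would derive outliers
    // statistically from the distribution of past runs.
    static final Duration SLOW_THRESHOLD = Duration.ofSeconds(2);

    /** Runs the given test body and returns how long it took. */
    static Duration time(Runnable testBody) {
        Instant start = Instant.now();
        testBody.run();
        return Duration.between(start, Instant.now());
    }

    public static void main(String[] args) {
        Duration elapsed = time(() -> {
            try {
                Thread.sleep(50); // stand-in for a real test
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        if (elapsed.compareTo(SLOW_THRESHOLD) > 0) {
            System.out.println("SLOW test: " + elapsed.toMillis() + " ms");
        } else {
            System.out.println("ok: " + elapsed.toMillis() + " ms");
        }
    }
}
```

JUnit 4 ships an `org.junit.rules.Stopwatch` rule that performs the same per-test measurement; the point either way is that slow tests get reported rather than killed.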

On Tue, Apr 27, 2021 at 3:49 PM Dong Lin <li...@gmail.com> wrote:

> There is one more point that may be useful to consider here.
>
> In order to debug deadlock that is not easily reproducible, it is likely
> not sufficient to see only the thread dump to figure out the root cause. We
> likely need to enable the INFO level logging. Since AZP does not provide
> INFO level logging by default, we either need to reproduce the bug locally
> or change the AZP log4j temporarily. This further reduces the benefit of
> logging the thread dump (which comes at the cost of letting the AZP job
> hang).
>
>
> On Tue, Apr 27, 2021 at 9:34 PM Dong Lin <li...@gmail.com> wrote:
>
> > Just to make sure I understand the proposal correctly: is the proposal to
> > disallow the usage of @Test(timeout=...) for Flink JUnit tests?
> >
> > Here is my understanding of the pros/cons according to the discussion so
> > far.
> >
> > Pros of allowing timeout:
> > 1) When there are tests that are unreasonably slow, it helps us
> > catch those tests and thus increase the quality of unit tests.
> > 2) When there are tests that cause deadlock, it helps the AZP job fail
> > fast instead of being blocked for 4 hours. This saves resources and also
> > allows developers to get their PR tested again earlier (useful when the
> > test failure is not relevant to their PR).
> >
> > Cons of allowing timeout:
> > 1) When there are tests that cause deadlock, we could not see the thread
> > dump of all threads, which makes debugging the issue harder.
> >
> > I would suggest that we should still allow timeout because the pros
> > outweigh the cons.
> >
> > As far as I can tell, if we allow timeout and encounter a deadlock bug in
> > AZP, we still know which test (or test suite) fails. There is a good chance
> > we can reproduce the deadlock locally (by running it 100 times) and get the
> > debug information we need. In the rare case where the deadlock happens only
> > on AZP, we can just disable the timeout for that particular test. So the
> > lack of thread dump is not really a concern.
> >
> > On the other hand, if we disallow timeout, it will be very hard for us to
> > catch low-quality tests. I don't know if there is a good alternative way to
> > catch those tests.
> >
> >
> >
> > On Mon, Apr 26, 2021 at 3:54 PM Dawid Wysakowicz <dwysakowicz@apache.org>
> > wrote:
> >
> >> Hi devs!
> >>
> >> I wanted to bring up something that was discussed in a few independent
> >> groups of people in the past days. I'd like to revisit using timeouts in
> >> our JUnit tests. The suggestion would be not to use them anymore. The
> >> problem with timeouts is that we have no thread dump and stack traces of
> >> the system as it hangs. If we were not using a timeout, the CI runner
> >> would have caught the timeout and created a thread dump which often is a
> >> great starting point for debugging.
> >>
> >> This problem has been spotted e.g. during debugging FLINK-22416[1]. In
> >> the past thread dumps were not always taken for hanging tests, but it
> >> was changed quite recently in FLINK-21346[2]. I am happy to hear your
> >> opinions on it. If there are no objections I would like to add the
> >> suggestion to the Coding Guidelines[3].
> >>
> >> Best,
> >>
> >> Dawid
> >>
> >>
> >> [1] https://issues.apache.org/jira/browse/FLINK-22416
> >>
> >> [2] https://issues.apache.org/jira/browse/FLINK-21346
> >>
> >> [3]
> >>
> >> https://flink.apache.org/contributing/code-style-and-quality-java.html#java-language-features-and-libraries
> >>
> >>
> >>
>

Re: [DISCUSS] Using timeouts in JUnit tests

Posted by Dong Lin <li...@gmail.com>.
There is one more point that may be useful to consider here.

In order to debug deadlock that is not easily reproducible, it is likely
not sufficient to see only the thread dump to figure out the root cause. We
likely need to enable the INFO level logging. Since AZP does not provide
INFO level logging by default, we either need to reproduce the bug locally
or change the AZP log4j temporarily. This further reduces the benefit of
logging the thread dump (which comes at the cost of letting the AZP job
hang).
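
For context, switching a test run to INFO logging means adjusting the log4j2 configuration the test module picks up. A minimal fragment in log4j2's properties format; the file location and appender name are illustrative, and Flink's actual CI configuration may differ:

```properties
# src/test/resources/log4j2-test.properties (illustrative location)
rootLogger.level = INFO
rootLogger.appenderRef.console.ref = ConsoleAppender

appender.console.name = ConsoleAppender
appender.console.type = CONSOLE
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{HH:mm:ss,SSS} %-5p %-60c %x - %m%n
```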


On Tue, Apr 27, 2021 at 9:34 PM Dong Lin <li...@gmail.com> wrote:

> Just to make sure I understand the proposal correctly: is the proposal to
> disallow the usage of @Test(timeout=...) for Flink JUnit tests?
>
> Here is my understanding of the pros/cons according to the discussion so
> far.
>
> Pros of allowing timeout:
> 1) When there are tests that are unreasonably slow, it helps us
> catch those tests and thus increase the quality of unit tests.
> 2) When there are tests that cause deadlock, it helps the AZP job fail
> fast instead of being blocked for 4 hours. This saves resources and also
> allows developers to get their PR tested again earlier (useful when the
> test failure is not relevant to their PR).
>
> Cons of allowing timeout:
> 1) When there are tests that cause deadlock, we could not see the thread
> dump of all threads, which makes debugging the issue harder.
>
> I would suggest that we should still allow timeout because the pros
> outweigh the cons.
>
> As far as I can tell, if we allow timeout and encounter a deadlock bug in
> AZP, we still know which test (or test suite) fails. There is a good chance
> we can reproduce the deadlock locally (by running it 100 times) and get the
> debug information we need. In the rare case where the deadlock happens only
> on AZP, we can just disable the timeout for that particular test. So the
> lack of thread dump is not really a concern.
>
> On the other hand, if we disallow timeout, it will be very hard for us to
> catch low-quality tests. I don't know if there is a good alternative way to
> catch those tests.
>
>
>
> On Mon, Apr 26, 2021 at 3:54 PM Dawid Wysakowicz <dw...@apache.org>
> wrote:
>
>> Hi devs!
>>
>> I wanted to bring up something that was discussed in a few independent
>> groups of people in the past days. I'd like to revisit using timeouts in
>> our JUnit tests. The suggestion would be not to use them anymore. The
>> problem with timeouts is that we have no thread dump and stack traces of
>> the system as it hangs. If we were not using a timeout, the CI runner
>> would have caught the timeout and created a thread dump which often is a
>> great starting point for debugging.
>>
>> This problem has been spotted e.g. during debugging FLINK-22416[1]. In
>> the past thread dumps were not always taken for hanging tests, but it
>> was changed quite recently in FLINK-21346[2]. I am happy to hear your
>> opinions on it. If there are no objections I would like to add the
>> suggestion to the Coding Guidelines[3].
>>
>> Best,
>>
>> Dawid
>>
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-22416
>>
>> [2] https://issues.apache.org/jira/browse/FLINK-21346
>>
>> [3]
>>
>> https://flink.apache.org/contributing/code-style-and-quality-java.html#java-language-features-and-libraries
>>
>>
>>

Re: [DISCUSS] Using timeouts in JUnit tests

Posted by Dong Lin <li...@gmail.com>.
Just to make sure I understand the proposal correctly: is the proposal to
disallow the usage of @Test(timeout=...) for Flink JUnit tests?

Here is my understanding of the pros/cons according to the discussion so
far.

Pros of allowing timeout:
1) When there are tests that are unreasonably slow, it helps us catch those
tests and thus increase the quality of unit tests.
2) When there are tests that cause deadlock, it helps the AZP job fail fast
instead of being blocked for 4 hours. This saves resources and also allows
developers to get their PR tested again earlier (useful when the test
failure is not relevant to their PR).

Cons of allowing timeout:
1) When there are tests that cause deadlock, we could not see the thread
dump of all threads, which makes debugging the issue harder.

I would suggest that we should still allow timeout because the pros
outweigh the cons.

As far as I can tell, if we allow timeout and encounter a deadlock bug in
AZP, we still know which test (or test suite) fails. There is a good chance
we can reproduce the deadlock locally (by running it 100 times) and get the
debug information we need. In the rare case where the deadlock happens only
on AZP, we can just disable the timeout for that particular test. So the
lack of thread dump is not really a concern.

On the other hand, if we disallow timeout, it will be very hard for us to
catch low-quality tests. I don't know if there is a good alternative way to
catch those tests.
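
For readers less familiar with the mechanism under discussion: JUnit 4's per-test timeout runs the test body on a separate thread and abandons it when the deadline passes, which is exactly why the hanging thread's stack is lost. Below is a stdlib-only sketch of that fail-fast behaviour; the class and method names are illustrative, not JUnit internals:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutRunner {

    /**
     * Mimics the effect of JUnit 4's @Test(timeout = ...): the test body runs
     * on a separate thread and is interrupted once the deadline passes, so the
     * build fails fast but the hanging thread's stack trace is thrown away.
     */
    static void runWithTimeout(Runnable testBody, long timeoutMillis) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<?> future = executor.submit(testBody);
            future.get(timeoutMillis, TimeUnit.MILLISECONDS); // TimeoutException on a hang
        } finally {
            executor.shutdownNow(); // interrupts the stuck test; its diagnostics are gone
        }
    }

    public static void main(String[] args) throws Exception {
        runWithTimeout(() -> { }, 1_000); // a fast test passes

        try {
            // Simulated hang: sleeps far longer than the deadline.
            runWithTimeout(() -> {
                try {
                    Thread.sleep(60_000);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, 100);
        } catch (TimeoutException e) {
            System.out.println("timed out, as @Test(timeout=...) would");
        }
    }
}
```

Without the per-test deadline, the same hang would instead stall the build until an external watchdog intervenes, with the stuck thread still alive and dumpable.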



On Mon, Apr 26, 2021 at 3:54 PM Dawid Wysakowicz <dw...@apache.org>
wrote:

> Hi devs!
>
> I wanted to bring up something that was discussed in a few independent
> groups of people in the past days. I'd like to revisit using timeouts in
> our JUnit tests. The suggestion would be not to use them anymore. The
> problem with timeouts is that we have no thread dump and stack traces of
> the system as it hangs. If we were not using a timeout, the CI runner
> would have caught the timeout and created a thread dump which often is a
> great starting point for debugging.
>
> This problem has been spotted e.g. during debugging FLINK-22416[1]. In
> the past thread dumps were not always taken for hanging tests, but it
> was changed quite recently in FLINK-21346[2]. I am happy to hear your
> opinions on it. If there are no objections I would like to add the
> suggestion to the Coding Guidelines[3].
>
> Best,
>
> Dawid
>
>
> [1] https://issues.apache.org/jira/browse/FLINK-22416
>
> [2] https://issues.apache.org/jira/browse/FLINK-21346
>
> [3]
>
> https://flink.apache.org/contributing/code-style-and-quality-java.html#java-language-features-and-libraries
>
>
>
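
The thread dump Dawid refers to (the one the CI watchdog captures for hanging builds since FLINK-21346) is the standard JVM dump of all live threads. For illustration, it can be produced programmatically with nothing but `java.lang.management`; this is a sketch of the idea, not the actual watchdog code:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumpSnippet {

    /**
     * Returns a textual dump of all live threads, including monitor and
     * synchronizer ownership, similar in spirit to what jstack or a CI
     * watchdog would capture from a hanging build.
     */
    public static String dumpAllThreads() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        // lockedMonitors/lockedSynchronizers = true so held locks show up,
        // which is what makes a deadlock diagnosable from the dump.
        for (ThreadInfo info : bean.dumpAllThreads(true, true)) {
            sb.append(info); // note: ThreadInfo.toString() truncates deep stacks
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dumpAllThreads());
    }
}
```

With a per-test timeout, this information disappears along with the interrupted thread; without one, the watchdog gets to collect it before killing the build.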