Posted to dev@flink.apache.org by Ufuk Celebi <uc...@apache.org> on 2016/06/02 11:14:41 UTC

Re: [ANNOUNCE] Build Issues Solved

With the recent fixes, the builds are more stable, but I still see
many failing, because of the Scala shell tests, which lead to the JVMs
crashing. I've researched this a little bit, but didn't find an
obvious solution to the problem.

Does it make sense to disable the tests until someone has time to look into it?

– Ufuk

On Tue, May 31, 2016 at 1:46 PM, Stephan Ewen <se...@apache.org> wrote:
> You are right, Chiwan.
>
> I think that this pattern you use should be supported, though. It would be
> good to check whether the job executes more often than necessary at the
> point of the "collect()" calls.
> That would explain the network buffer issue then...
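
A rough sketch of what "executes more often than necessary" can look like in a test: every eager operation such as collect() or count() triggers its own job execution against the environment, so several of them per test case quickly multiply the resource usage. This is illustrative Scala, not the actual KNN test code:

    import org.apache.flink.api.scala._

    object CollectTwice {
      def main(args: Array[String]): Unit = {
        val env = ExecutionEnvironment.getExecutionEnvironment
        val data = env.fromElements(1, 2, 3, 4)

        // Each eager call below runs a separate Flink job on the same environment.
        val doubled = data.map(_ * 2).collect() // job #1
        val total   = data.count()              // job #2, re-executes the source
        println(s"doubled=$doubled, total=$total")
      }
    }
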
>
> On Tue, May 31, 2016 at 12:18 PM, Chiwan Park <ch...@apache.org> wrote:
>
>> Hi Stephan,
>>
>> Yes, right. But KNNITSuite calls
>> ExecutionEnvironment.getExecutionEnvironment only once [1]. I’m testing
>> with the getExecutionEnvironment call moved into each test case.
>>
>> [1]:
>> https://github.com/apache/flink/blob/master/flink-libraries/flink-ml/src/test/scala/org/apache/flink/ml/nn/KNNITSuite.scala#L45
>>
>> Regards,
>> Chiwan Park
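
For reference, a minimal sketch of the per-test-case pattern described above: calling getExecutionEnvironment inside each test instead of sharing one suite-level environment. The suite name, test names and data are made up; the real ML suites additionally mix in a test base that starts a mini cluster:

    import org.apache.flink.api.scala._
    import org.scalatest.{FlatSpec, Matchers}

    class PerTestEnvironmentSuite extends FlatSpec with Matchers {

      "job A" should "use its own environment" in {
        // fresh environment per test case instead of a shared suite-level val
        val env = ExecutionEnvironment.getExecutionEnvironment
        env.fromElements(1, 2, 3).map(_ * 2).collect().sorted should be (Seq(2, 4, 6))
      }

      "job B" should "use its own environment" in {
        val env = ExecutionEnvironment.getExecutionEnvironment
        env.fromElements("a", "b").map(_.toUpperCase).collect().sorted should be (Seq("A", "B"))
      }
    }
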
>>
>> > On May 31, 2016, at 7:09 PM, Stephan Ewen <se...@apache.org> wrote:
>> >
>> > Hi Chiwan!
>> >
>> > I think the ExecutionEnvironment is not shared, because what the
>> > TestEnvironment sets is a Context Environment Factory. Every time you
>> > call "ExecutionEnvironment.getExecutionEnvironment()", you get a new
>> > environment.
>> >
>> > Stephan
>> >
>> >
>> > On Tue, May 31, 2016 at 11:53 AM, Chiwan Park <ch...@apache.org>
>> wrote:
>> >
>> >> I’ve created a JIRA issue [1] related to KNN test cases. I will send a
>> PR
>> >> for it.
>> >>
>> >> From my investigation [2], the cluster for the ML tests has only one
>> >> taskmanager with 4 slots. Are 2048 network buffers insufficient in total?
>> >> I still think the problem is sharing the ExecutionEnvironment between
>> >> test cases.
>> >>
>> >> [1]: https://issues.apache.org/jira/browse/FLINK-3994
>> >> [2]:
>> >>
>> https://github.com/apache/flink/blob/master/flink-test-utils/src/test/scala/org/apache/flink/test/util/FlinkTestBase.scala#L56
>> >>
>> >> Regards,
>> >> Chiwan Park
>> >>
>> >>> On May 31, 2016, at 6:05 PM, Maximilian Michels <mx...@apache.org>
>> wrote:
>> >>>
>> >>> Thanks Stephan for the synopsis of the last weeks of test instability
>> >>> madness. It's sad to see the shortcomings of the Maven test plugins, but
>> >>> another lesson learned is that our testing infrastructure should get a
>> >>> bit more attention. We have reached a point several times where our
>> >>> tests were inherently unstable. Now we saw that even more problems
>> >>> were hidden in the dark. I would like to see more maintenance
>> >>> dedicated to testing.
>> >>>
>> >>> @Chiwan: Please, no hotfix! Please open a JIRA issue and a pull
>> >>> request with a systematic fix. Those things are too crucial to be
>> >>> fixed on the go. The problem is that Travis reports the number of
>> >>> processors to be "32" (which is used for the number of task slots in
>> >>> local execution). The network buffers are not adjusted accordingly. We
>> >>> should set them correctly in the MiniCluster. Also, we could define an
>> >>> upper limit to the number of task slots for tests.
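
A rough sketch of the kind of explicit test configuration suggested here: capping the task slots instead of inheriting the detected processor count, and sizing the network buffers to match. The helper and the concrete numbers are illustrative, assuming Flink's standard taskmanager configuration keys of that time:

    import org.apache.flink.configuration.{ConfigConstants, Configuration}

    object TestClusterConfig {
      // Illustrative values: 1 taskmanager, a fixed slot count, and enough
      // network buffers, independent of how many cores the CI machine reports.
      def miniClusterConfig(slots: Int = 4, networkBuffers: Int = 2048): Configuration = {
        val config = new Configuration()
        config.setInteger(ConfigConstants.LOCAL_NUMBER_TASK_MANAGER, 1)
        config.setInteger(ConfigConstants.TASK_MANAGER_NUM_TASK_SLOTS, slots)
        config.setInteger(ConfigConstants.TASK_MANAGER_NETWORK_NUM_BUFFERS, networkBuffers)
        config
      }
    }
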
>> >>>
>> >>> On Tue, May 31, 2016 at 10:59 AM, Chiwan Park <ch...@apache.org>
>> >> wrote:
>> >>>> I think that the tests fail because of sharing the ExecutionEnvironment
>> >>>> between test cases. I’m not sure why it is a problem, but it is the only
>> >>>> difference from the other ML tests.
>> >>>>
>> >>>> I created a hotfix and pushed it to my repository. Once it seems fixed
>> >>>> [1], I’ll merge the hotfix into the master branch.
>> >>>>
>> >>>> [1]: https://travis-ci.org/chiwanpark/flink/builds/134104491
>> >>>>
>> >>>> Regards,
>> >>>> Chiwan Park
>> >>>>
>> >>>>> On May 31, 2016, at 5:43 PM, Chiwan Park <ch...@apache.org>
>> >> wrote:
>> >>>>>
>> >>>>> It is probably about the KNN test case which was merged yesterday.
>> >>>>> I’ll look into the ML test.
>> >>>>>
>> >>>>> Regards,
>> >>>>> Chiwan Park
>> >>>>>
>> >>>>>> On May 31, 2016, at 5:38 PM, Ufuk Celebi <uc...@apache.org> wrote:
>> >>>>>>
>> >>>>>> Currently, an ML test is reliably failing and occasionally some HA
>> >>>>>> tests. Is someone looking into the ML test?
>> >>>>>>
>> >>>>>> For HA, I will revert a commit, which might cause the HA
>> >>>>>> instabilities. Till is working on a proper fix as far as I know.
>> >>>>>>
>> >>>>>> On Tue, May 31, 2016 at 3:50 AM, Chiwan Park <chiwanpark@apache.org
>> >
>> >> wrote:
>> >>>>>>> Thanks for the great work! :-)
>> >>>>>>>
>> >>>>>>> Regards,
>> >>>>>>> Chiwan Park
>> >>>>>>>
>> >>>>>>>> On May 31, 2016, at 7:47 AM, Flavio Pompermaier <
>> >> pompermaier@okkam.it> wrote:
>> >>>>>>>>
>> >>>>>>>> Awesome work guys!
>> >>>>>>>> And even more thanks for the detailed report... This troubleshooting
>> >>>>>>>> summary will undoubtedly be useful for all our Maven projects!
>> >>>>>>>>
>> >>>>>>>> Best,
>> >>>>>>>> Flavio
>> >>>>>>>> On 30 May 2016 23:47, "Ufuk Celebi" <uc...@apache.org> wrote:
>> >>>>>>>>
>> >>>>>>>>> Thanks for the effort, Max and Stephan! Happy to see the green
>> >> light again.
>> >>>>>>>>>
>> >>>>>>>>> On Mon, May 30, 2016 at 11:03 PM, Stephan Ewen <sewen@apache.org
>> >
>> >> wrote:
>> >>>>>>>>>> Hi all!
>> >>>>>>>>>>
>> >>>>>>>>>> After a few weeks of terrible build issues, I am happy to
>> >> announce that
>> >>>>>>>>> the
>> >>>>>>>>>> build works again properly, and we actually get meaningful CI
>> >> results.
>> >>>>>>>>>>
>> >>>>>>>>>> Here is a story in many acts, from builds deep red to bright
>> >> green joy.
>> >>>>>>>>>> Kudos to Max, who did most of this troubleshooting. This evening,
>> >>>>>>>>>> Max and I debugged the final issue and got the build back on track.
>> >>>>>>>>>>
>> >>>>>>>>>> ------------------
>> >>>>>>>>>> The Journey
>> >>>>>>>>>> ------------------
>> >>>>>>>>>>
>> >>>>>>>>>> (1) Failsafe Plugin
>> >>>>>>>>>>
>> >>>>>>>>>> The Maven Failsafe Build Plugin had a critical bug due to which
>> >> failed
>> >>>>>>>>>> tests did not result in a failed build.
>> >>>>>>>>>>
>> >>>>>>>>>> That is a pretty bad bug for a plugin whose only task is to run
>> >> tests and
>> >>>>>>>>>> fail the build if a test fails.
>> >>>>>>>>>>
>> >>>>>>>>>> After we recognized that, we upgraded the Failsafe Plugin.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> (2) Failsafe Plugin Dependency Issues
>> >>>>>>>>>>
>> >>>>>>>>>> After the upgrade, the Failsafe Plugin behaved differently and
>> >> did not
>> >>>>>>>>>> interoperate with Dependency Shading any more.
>> >>>>>>>>>>
>> >>>>>>>>>> Because of that, we switched to the Surefire Plugin.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> (3) Fixing all the issues introduced in the meantime
>> >>>>>>>>>>
>> >>>>>>>>>> Naturally, a number of test instabilities had been introduced,
>> >> which
>> >>>>>>>>> needed
>> >>>>>>>>>> to be fixed.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> (4) Yarn Tests and Test Scope Refactoring
>> >>>>>>>>>>
>> >>>>>>>>>> In the meantime, a Pull Request was merged that moved the Yarn
>> >> Tests to
>> >>>>>>>>> the
>> >>>>>>>>>> test scope.
>> >>>>>>>>>> Because the configuration searched for tests in the "main"
>> scope,
>> >> no Yarn
>> >>>>>>>>>> tests were executed for a while, until the scope was fixed.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> (5) Yarn Tests and JMX Metrics
>> >>>>>>>>>>
>> >>>>>>>>>> After the Yarn Tests were re-activated, we saw them fail due to
>> >> warnings
>> >>>>>>>>>> created by the newly introduced metrics code. We could fix that
>> by
>> >>>>>>>>> updating
>> >>>>>>>>>> the metrics code and temporarily not registering JMX beans for
>> all
>> >>>>>>>>> metrics.
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> (6) Yarn / Surefire Deadlock
>> >>>>>>>>>>
>> >>>>>>>>>> Finally, some Yarn tests failed reliably in Maven (though not in
>> >> the
>> >>>>>>>>> IDE).
>> >>>>>>>>>> It turned out that those test a command line interface that
>> >> interacts
>> >>>>>>>>> with
>> >>>>>>>>>> the standard input stream.
>> >>>>>>>>>>
>> >>>>>>>>>> The newly deployed Surefire Plugin uses standard input as well,
>> >> for
>> >>>>>>>>>> communication with forked JVMs. Since Surefire internally locks
>> >> the
>> >>>>>>>>>> standard input stream, the Yarn CLI cannot poll the standard
>> >> input stream
>> >>>>>>>>>> without locking up and stalling the tests.
>> >>>>>>>>>>
>> >>>>>>>>>> We adjusted the tests and now the build happily builds again.
>> >>>>>>>>>>
>> >>>>>>>>>> -----------------
>> >>>>>>>>>> Conclusions:
>> >>>>>>>>>> -----------------
>> >>>>>>>>>>
>> >>>>>>>>>> - CI is terribly crucial. It took us weeks to deal with the fallout
>> >>>>>>>>>> of having a period of unreliable CI.
>> >>>>>>>>>>
>> >>>>>>>>>> - Maven could do a better job. A bug as crucial as the one that
>> >> started
>> >>>>>>>>>> our problem should not occur in a test plugin like Surefire.
>> >> Also, the
>> >>>>>>>>>> constant change of semantics and dependency scopes is annoying.
>> >> The
>> >>>>>>>>>> semantic changes are subtle, but for a build as complex as
>> Flink,
>> >> they
>> >>>>>>>>> make
>> >>>>>>>>>> a difference.
>> >>>>>>>>>>
>> >>>>>>>>>> - File-based communication is rarely a good idea. The bug in the
>> >>>>>>>>> failsafe
>> >>>>>>>>>> plugin was caused by improper file-based communication, and some
>> >> of our
>> >>>>>>>>>> discovered instabilities as well.
>> >>>>>>>>>>
>> >>>>>>>>>> Greetings,
>> >>>>>>>>>> Stephan
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> PS: Some issues and mysteries remain for us to solve: When we
>> >> allow our
>> >>>>>>>>>> metrics subsystem to register JMX beans, we see some tests
>> >> failing due to
>> >>>>>>>>>> spontaneous JVM process kills. Whoever has a pointer there,
>> >> please ping
>> >>>>>>>>> us!
>> >>>>>>>>>
>> >>>>>>>
>> >>>>>
>> >>>>
>> >>
>> >>
>>
>>

Re: [ANNOUNCE] Build Issues Solved

Posted by Maximilian Michels <mx...@apache.org>.
I think this is related to the Yarn bug with the YarnSessionCli we
just fixed. The problem is that the Surefire plugin uses STDIN to
communicate with its forked processes. The Scala shell also reads from
STDIN, which results in a deadlock from time to time...

Created an issue for that: https://issues.apache.org/jira/browse/FLINK-4010
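
One common way to avoid this kind of clash, sketched roughly below, is to have the component read from an injected InputStream instead of touching System.in directly, so a test can feed it canned input while Surefire keeps the real stdin for its fork communication. The class and method names here are made up for illustration, not the actual Scala shell or Yarn CLI code:

    import java.io.{BufferedReader, ByteArrayInputStream, InputStream, InputStreamReader}

    // Hypothetical shell-like loop that takes its input as a parameter
    // instead of reading System.in directly.
    class SimpleShell(in: InputStream = System.in) {
      private val reader = new BufferedReader(new InputStreamReader(in, "UTF-8"))

      def run(): List[String] =
        Iterator.continually(reader.readLine()).takeWhile(_ != null).toList
    }

    object SimpleShellExample {
      def main(args: Array[String]): Unit = {
        // The test never touches the real stdin, which Surefire owns.
        val shell = new SimpleShell(new ByteArrayInputStream("help\nquit\n".getBytes("UTF-8")))
        println(shell.run()) // List(help, quit)
      }
    }
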


On Thu, Jun 2, 2016 at 1:31 PM, Ufuk Celebi <uc...@apache.org> wrote:
> On Thu, Jun 2, 2016 at 1:26 PM, Maximilian Michels <mx...@apache.org> wrote:
>> I thought this had been fixed by Chiwan in the meantime. Could you
>
> Chiwan fixed the ML issues IMO. You can pick any of the recent builds
> from https://travis-ci.org/apache/flink/builds
>
> For example: https://s3.amazonaws.com/archive.travis-ci.org/jobs/134458335/log.txt

Re: [ANNOUNCE] Build Issues Solved

Posted by Ufuk Celebi <uc...@apache.org>.
On Thu, Jun 2, 2016 at 1:26 PM, Maximilian Michels <mx...@apache.org> wrote:
> I thought this had been fixed by Chiwan in the meantime. Could you

Chiwan fixed the ML issues IMO. You can pick any of the recent builds
from https://travis-ci.org/apache/flink/builds

For example: https://s3.amazonaws.com/archive.travis-ci.org/jobs/134458335/log.txt

Re: [ANNOUNCE] Build Issues Solved

Posted by Maximilian Michels <mx...@apache.org>.
I thought this had been fixed by Chiwan in the meantime. Could you
post a build log?

On Thu, Jun 2, 2016 at 1:14 PM, Ufuk Celebi <uc...@apache.org> wrote:
> With the recent fixes, the builds are more stable, but I still see
> many failing, because of the Scala shell tests, which lead to the JVMs
> crashing. I've researched this a little bit, but didn't find an
> obvious solution to the problem.
>
> Does it make sense to disable the tests until someone has time to look into it?
>
> – Ufuk