Posted to notifications@pekko.apache.org by "jrudolph (via GitHub)" <gi...@apache.org> on 2023/02/17 09:51:36 UTC

[GitHub] [incubator-pekko] jrudolph opened a new pull request, #202: move unstable nightly aeron builds into their own workflow

jrudolph opened a new pull request, #202:
URL: https://github.com/apache/incubator-pekko/pull/202

   So that we can get the main nightlies green, as the Aeron tests are the ones failing most often.




[GitHub] [incubator-pekko] jrudolph commented on pull request #202: move unstable nightly aeron builds into their own workflow

Posted by "jrudolph (via GitHub)" <gi...@apache.org>.
jrudolph commented on PR #202:
URL: https://github.com/apache/incubator-pekko/pull/202#issuecomment-1439708928

   I don't think anyone is guilty here :) The pekko core tests are super difficult to deal with because the non-determinism involved makes them so hard to stabilize. (Other projects remove sources of non-determinism to avoid this problem; however, that means the tests are less representative of real-world scenarios unless you manage to create deterministic versions of all the race conditions that could happen in reality, which might work for unit tests, but the bigger the scenarios get, the harder that gets...)
   
   So, even in the best case, there will be quite a bit of spam involved. The difficulty is to find a good signal-to-noise ratio (which has been a much-discussed topic in the past because of the amount of work required to get there...).
   
   The thing we are interested in is finding regressions. In the best case, a regression makes some test fail deterministically after a change, but tests could also start to fail non-deterministically when a regression is introduced. These days, in many cases, the tests themselves accidentally depend on some outcome that is not guaranteed.
   
   In the best case, all the flakiness could be fixed by hardening the tests (and this is an ongoing effort), but that is almost a job in itself. The next best solution is to identify tests that started failing at some point. In the past, I wrote different kinds of tools to deal with that (collecting test results over a period and then trying to extract the signal from the noise), which was somewhat easy because Jenkins would parse the JUnit XML results and provide them as JSON.
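   
   A minimal sketch of what such a tool could look like, assuming the JUnit XML reports from several runs have been collected into one directory and that the scala-xml module is on the classpath (class and directory names here are made up for illustration):
   
   ```scala
   import java.nio.file.{Files, Paths}
   import scala.jdk.CollectionConverters._
   import scala.xml.XML

   // Count how often each test failed across all collected JUnit XML reports,
   // to separate one-off noise from tests that fail consistently.
   object FlakyReport {
     def main(args: Array[String]): Unit = {
       val reportDir = Paths.get(args.headOption.getOrElse("collected-reports"))
       val failures = Files.walk(reportDir).iterator().asScala
         .filter(_.toString.endsWith(".xml"))
         .flatMap { path =>
           val suite = XML.loadFile(path.toFile)
           (suite \\ "testcase")
             .filter(tc => (tc \ "failure").nonEmpty || (tc \ "error").nonEmpty)
             .map(tc => s"${tc \@ "classname"}.${tc \@ "name"}")
         }
         .toSeq
       failures.groupBy(identity).map { case (test, hits) => test -> hits.size }
         .toSeq.sortBy(-_._2)
         .foreach { case (test, count) => println(f"$count%4d  $test") }
     }
   }
   ```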
   
   Another strategy is to separate flaky and stable tests and run them in different jobs (which already happens by tagging tests) to make sure the stable tests always succeed (but then the question is: who would watch the unstable tests?).
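   
   For reference, the tagging approach roughly looks like this in ScalaTest (the tag and suite names below are made up; the real build uses its own markers for timing-sensitive tests):
   
   ```scala
   import org.scalatest.Tag
   import org.scalatest.wordspec.AnyWordSpec

   // Hypothetical tag marking a timing-sensitive (potentially flaky) test.
   object FlakyTest extends Tag("org.example.FlakyTest")

   class RemotingSpec extends AnyWordSpec {
     "remoting" should {
       "deliver all messages under load" taggedAs FlakyTest in {
         // timing-sensitive assertions would go here
       }
     }
   }
   ```
   
   The stable job can then exclude the tag, e.g. with `Tests.Argument(TestFrameworks.ScalaTest, "-l", "org.example.FlakyTest")` in sbt, while a separate job runs only the tagged tests via `-n`.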
   
   In any case, it's not a simple problem.




[GitHub] [incubator-pekko] jrudolph merged pull request #202: move unstable nightly aeron builds into their own workflow

Posted by "jrudolph (via GitHub)" <gi...@apache.org>.
jrudolph merged PR #202:
URL: https://github.com/apache/incubator-pekko/pull/202




[GitHub] [incubator-pekko] mdedetrich commented on pull request #202: move unstable nightly aeron builds into their own workflow

Posted by "mdedetrich (via GitHub)" <gi...@apache.org>.
mdedetrich commented on PR #202:
URL: https://github.com/apache/incubator-pekko/pull/202#issuecomment-1439763699

   > This is part of the reason. We always thought of that as a feature, though. By putting more stress on the tests, we hoped to make them more resilient. Until GHA was adopted, all the tests used to run on ~10-year-old servers. The multi-node tests also ran on those machines and, after the move to GHA, needed dedicated VMs to run on (because the tests need ssh access to set them up).
   
   > One problem is that it might not be the slow build machines themselves that make the tests fail in the first place, but a high variance in processing power (e.g. on GHA, because the jobs run on unknown shared infrastructure with an unknown number of noisy neighbors).
   
   Yes, both of these go hand in hand. I can't speak much to the Akka/Pekko nightly tests because I am not that familiar with them, but in Kafka, because the way the tests are run is also not deterministic (irrespective of machine strength), you can hit cases where a test happens to be running alongside a lot of other tests and therefore has far fewer resources, which causes it to fail. One very common workaround is to increase the timeout, which does solve this specific problem but creates another one: when people run the tests locally, they complain that they take too long because of a 5-minute timeout (which was set to 5 minutes because of flaky CI issues), although this can be solved by configuring the timeout separately in CI vs. locally.
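   
   (As a rough sketch of that last point, assuming the standard `CI` environment variable that most CI systems set; the names are illustrative only:)
   
   ```scala
   import scala.concurrent.duration._

   object TestTimeouts {
     // Pick a generous timeout only on CI so that local runs stay fast.
     val onCi: Boolean = sys.env.get("CI").contains("true")
     val receiveTimeout: FiniteDuration = if (onCi) 5.minutes else 30.seconds
   }
   ```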
   
   On the point of deliberately using slow hardware to test the resiliency/memory usage of Pekko: while I can understand the sentiment, this sounds to me like we are conflating too many concerns in a single test, which then creates a scenario where the tests will by definition also be flaky (especially if, as you said, we don't even know the strength of the machines the CI runs on).
   
   Would it be wiser to separate the tests out by their primary concern so that we can handle them better? I.e. using self-hosted runners for general nightly tests makes sense since we need the raw power, and if we also want tests on deliberately less powerful hardware for resilience, it seems better to select the relevant subset of tests and run them on a dedicated machine inside LXC/Docker, so we can more precisely limit the resources available to the tests and help with determinism.




[GitHub] [incubator-pekko] jrudolph commented on pull request #202: move unstable nightly aeron builds into their own workflow

Posted by "jrudolph (via GitHub)" <gi...@apache.org>.
jrudolph commented on PR #202:
URL: https://github.com/apache/incubator-pekko/pull/202#issuecomment-1439918874

   > you can hit cases where a test happens to be running alongside a lot of other tests that happen to be expensive and therefore has far fewer resources, which causes it to fail
   
   Running tests in parallel is turned off for that reason.
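   
   (In sbt terms this corresponds roughly to the following; a sketch, not necessarily the project's exact settings:)
   
   ```scala
   // Run test suites sequentially so a heavy suite cannot starve a
   // timing-sensitive one that happens to run next to it.
   Test / parallelExecution := false
   // If tests are forked, also limit test execution to one JVM at a time.
   Global / concurrentRestrictions += Tags.limit(Tags.Test, 1)
   ```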
   
   > One very common workaround is to increase the timeout, which does solve this specific problem but creates another one: when people run the tests locally, they complain that they take too long because of a 5-minute timeout (which was set to 5 minutes because of flaky CI issues), although this can be solved by configuring the timeout separately in CI vs. locally.
   
   This is also already done by setting `timefactor=2` when running tests in CI.
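   
   (For illustration, assuming the Akka-style testkit API carried over to Pekko; the exact configuration key and usage may differ:)
   
   ```scala
   import org.apache.pekko.actor.ActorSystem
   import org.apache.pekko.testkit._
   import scala.concurrent.duration._

   object TimeFactorExample {
     implicit val system: ActorSystem = ActorSystem("example")
     // With e.g. -Dpekko.test.timefactor=2 set in CI, this dilated window
     // becomes 6 seconds there while staying at 3 seconds locally.
     val assertionWindow = 3.seconds.dilated
   }
   ```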
   
   > they complain that they take too long
   
   Comprehensive tests already take 1 hour+ whether you run them locally or on CI. For PR validation we at least have the logic to skip modules that should not be affected by a change.
   
   > Would it be wiser to separate the tests out by their primary concern so that we can better handle them?
   
   The question is: to what purpose? There are multiple goals here:
   
    * Getting test results faster (or running more comprehensive tests in the same time) -> use faster dedicated runners
    * Running tests in a more difficult (slow) environment to find bugs that would otherwise not turn up -> run tests on slower hardware
   
   I'm not against doing both: quicker (but happier-path) testing in the default case, but also nightlies running in more constrained environments (which only makes sense if someone has time to evaluate the results; this can be a full-time job).
   
   (I used to personally organize Jenkins runners for akka-http with exactly this reasoning of getting more stable tests, after having spent several weeks doing nothing but fixing flaky tests on slow hardware.)
   
   Also, the cost of dedicated hardware must be taken into account. I'm not sure if we can get GHA usage summaries, but at the current usage for pekko-core it might not be feasible to have dedicated runners, certainly not on cloud machines, where I would expect the cost to easily reach hundreds of dollars per month.




[GitHub] [incubator-pekko] mdedetrich commented on pull request #202: move unstable nightly aeron builds into their own workflow

Posted by "mdedetrich (via GitHub)" <gi...@apache.org>.
mdedetrich commented on PR #202:
URL: https://github.com/apache/incubator-pekko/pull/202#issuecomment-1439675874

   > All tests have been running, but no one is closely monitoring their results afaik. (We should enable mails for failed runs at some point; the question is where those should go. Maybe a new mailing list just for that?)
   
   I am guilty of this. A mailing list is a good idea; another one is to add a badge to `README.md` specifically for the nightly tests.




[GitHub] [incubator-pekko] mdedetrich commented on pull request #202: move unstable nightly aeron builds into their own workflow

Posted by "mdedetrich (via GitHub)" <gi...@apache.org>.
mdedetrich commented on PR #202:
URL: https://github.com/apache/incubator-pekko/pull/202#issuecomment-1438719627

   @jrudolph Are you fine with this being merged?




[GitHub] [incubator-pekko] mdedetrich commented on pull request #202: move unstable nightly aeron builds into their own workflow

Posted by "mdedetrich (via GitHub)" <gi...@apache.org>.
mdedetrich commented on PR #202:
URL: https://github.com/apache/incubator-pekko/pull/202#issuecomment-1439728291

   Regarding the visibility point, I just created a PR which adds a badge to `README.md` showing the current status of the Aeron nightly tests.
   
   > I don't think anyone is guilty here :) The pekko core tests are super difficult to deal with because the non-determinism involved makes them so hard to stabilize. (Other projects remove sources of non-determinism to avoid this problem; however, that means the tests are less representative of real-world scenarios unless you manage to create deterministic versions of all the race conditions that could happen in reality, which might work for unit tests, but the bigger the scenarios get, the harder that gets...)
   
   Indeed, I just manually ran the tests and noticed that one of them failed because the test didn't receive the expected number of messages (it received 160 and expected at least 200). I would suspect that one reason behind the flakiness is the strength of the machines; another Apache project that I work on (Kafka) has the same problem, where the machines are so weak that tests fail.
   
   One way to solve this would be to use our own self-hosted runners, which would make sense for the nightly tests. @pjfanning did some research into this and Apache supports it, albeit there are some security workarounds we have to apply (iirc some of those workarounds are irrelevant for nightly tests; they are only a concern on PRs).




[GitHub] [incubator-pekko] jrudolph commented on pull request #202: move unstable nightly aeron builds into their own workflow

Posted by "jrudolph (via GitHub)" <gi...@apache.org>.
jrudolph commented on PR #202:
URL: https://github.com/apache/incubator-pekko/pull/202#issuecomment-1439650349

   > lgtm, but quick question since I am on mobile, why weren't the nightly tests running in the first place (did we explicitly disable them or?)
   
   All tests have been running, but no one is closely monitoring their results afaik. (We should enable mails for failed runs at some point; the question is where those should go. Maybe a new mailing list just for that?)




[GitHub] [incubator-pekko] jrudolph commented on pull request #202: move unstable nightly aeron builds into their own workflow

Posted by "jrudolph (via GitHub)" <gi...@apache.org>.
jrudolph commented on PR #202:
URL: https://github.com/apache/incubator-pekko/pull/202#issuecomment-1439749570

   > I would suspect that one reason behind the flakiness is the strength of the CI machines; another Apache project that I work on (Kafka) has the same problem, where the machines are so weak that tests fail.
   
   This is part of the reason. We always thought of that as a feature, though. By putting more stress on the tests, we hoped to make them more resilient. Until GHA was adopted, all the tests used to run on ~10-year-old servers. The multi-node tests also ran on those machines and, after the move to GHA, needed dedicated VMs to run on (because the tests need ssh access to set them up).
   
   One problem is that it might not be the slow build machines themselves that make the tests fail in the first place, but a high variance in processing power (e.g. on GHA, because the jobs run on unknown shared infrastructure with an unknown number of noisy neighbors).

