Posted to dev@pulsar.apache.org by Lari Hotari <lh...@apache.org> on 2021/07/21 11:28:58 UTC

Integration tests failing in Pulsar CI with error code 137 (out of memory error)

Hi all,

Several integration test jobs are failing because the Docker container
run by Testcontainers gets terminated with exit code 137 (128 + 9, i.e.
SIGKILL, which usually points to an out-of-memory kill).

The failing jobs are:
CI - Integration - Sql -
https://github.com/apache/pulsar/actions/workflows/ci-integration-sql.yaml
(most fail)
CI - Integration - Process -
https://github.com/apache/pulsar/actions/workflows/ci-integration-process.yaml
(some succeed)
CI - Integration - Messaging -
https://github.com/apache/pulsar/actions/workflows/ci-integration-messaging.yaml
(some succeed)
CI - Integration - Function & IO -
https://github.com/apache/pulsar/actions/workflows/ci-integration-function.yaml
(some succeed)

This started happening yesterday for most PR builds.

For example:
https://github.com/apache/pulsar/runs/3111868662?check_suite_focus=true#step:14:1024

Error:  Tests run: 22, Failures: 1, Errors: 0, Skipped: 21, Time elapsed: 292.035 s <<< FAILURE! - in TestSuite
Error:  testPythonWordCountFunction(org.apache.pulsar.tests.integration.functions.PulsarStateTest)  Time elapsed: 43.416 s  <<< FAILURE!
org.apache.pulsar.tests.integration.docker.ContainerExecException: /pulsar/bin/pulsar-admin functions querystate --tenant public --namespace default --name test-wordcount-py-fn-tfhycxsf --key message-1 failed on 705ecb067214d1cc42cd16358df6fa6d7a8cacc6c5ddd0cdde84a73b3e2e1f76 with error code 137
    at org.apache.pulsar.tests.integration.utils.DockerUtils$2.onComplete(DockerUtils.java:248)
    at org.testcontainers.shaded.com.github.dockerjava.core.exec.AbstrAsyncDockerCmdExec$1.onComplete(AbstrAsyncDockerCmdExec.java:51)
    at org.testcontainers.shaded.com.github.dockerjava.core.DefaultInvocationBuilder.lambda$executeAndStream$1(DefaultInvocationBuilder.java:276)
    at java.base/java.lang.Thread.run(Thread.java:829)
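
For readers unfamiliar with the 137 in the trace above: exit statuses above
128 conventionally mean "terminated by signal (status - 128)". The following
is a minimal Python sketch of that decoding. The helper name
describe_exit_code is made up for illustration and is not part of Pulsar or
Testcontainers. Note that SIGKILL alone does not prove an OOM kill; it is
only the usual suspect when memory is tight.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a process exit status using the 128 + N signal convention."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {signum}"
        return f"terminated by {name} (signal {signum})"
    return f"exited with status {code}"


if __name__ == "__main__":
    # On Linux, 137 - 128 = 9, which is SIGKILL.
    print(describe_exit_code(137))
```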


I also ran an experiment in my own fork with a very old commit from 13 days
ago (experiment is https://github.com/lhotari/pulsar/pull/47). The build
failed in the same way. Therefore it doesn't seem to be caused by commits
in the master branch.
I'll try to debug the issue locally as the next step.

BR,
-Lari

Re: Integration tests failing in Pulsar CI with error code 137 (out of memory error)

Posted by Lari Hotari <lh...@apache.org>.
After a few more attempts, the issue was resolved.

It seems that GitHub made a change in their latest update to VMs which
leaves less memory available for builds.
The root cause seemed to be that swap space had been disabled for some
builds a long time ago to save disk space. Re-enabling the swap space
was the most impactful change.
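
Whether swap is actually available on a given runner can be checked from
/proc/meminfo. The sketch below is a hypothetical diagnostic, not the change
made in the Pulsar workflows; the function names parse_meminfo and
swap_summary are invented here. It assumes a Linux host and only reads the
SwapTotal and SwapFree fields.

```python
from pathlib import Path


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: number} (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def swap_summary(text: str) -> str:
    """Summarize swap availability from /proc/meminfo content."""
    info = parse_meminfo(text)
    swap_total = info.get("SwapTotal", 0)
    if swap_total == 0:
        return "swap is disabled"
    return f"swap enabled: {info.get('SwapFree', 0)} of {swap_total} kB free"


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.exists():  # the file only exists on Linux
        print(swap_summary(meminfo.read_text()))
```

Running this as an early step of a build would show in the job log whether
the VM image came with swap enabled.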

Linux cgroups are used in the GitHub Actions Runner VM to limit resources.
My assumption is that, with cgroups in place and no swap space, OOM kills
can happen even when the VM's RAM isn't fully consumed.
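
One way to test that assumption is to look at the memory limit and the OOM
kill counter of the cgroup the build runs in. The sketch below assumes the
cgroup v2 layout (files memory.max and memory.events under /sys/fs/cgroup);
whether the runner VM uses v1 or v2 is not something established in this
thread, and the function names are invented for illustration.

```python
from pathlib import Path

# Assumption: cgroup v2 unified hierarchy mounted at the usual location.
CGROUP_V2_ROOT = Path("/sys/fs/cgroup")


def parse_memory_max(text: str):
    """Return the limit in bytes, or None when memory.max is 'max' (no limit)."""
    value = text.strip()
    return None if value == "max" else int(value)


def parse_oom_kill_count(events_text: str) -> int:
    """Return the 'oom_kill' counter from memory.events content (0 if absent)."""
    for line in events_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "oom_kill":
            return int(fields[1])
    return 0


def report(root: Path = CGROUP_V2_ROOT) -> str:
    """Describe the memory limit and OOM kills of the cgroup found at root."""
    max_file = root / "memory.max"
    events_file = root / "memory.events"
    if not (max_file.exists() and events_file.exists()):
        return "no cgroup v2 memory controller files found here"
    limit = parse_memory_max(max_file.read_text())
    kills = parse_oom_kill_count(events_file.read_text())
    limit_text = "unlimited" if limit is None else f"{limit} bytes"
    return f"memory limit: {limit_text}, OOM kills so far: {kills}"


if __name__ == "__main__":
    print(report())
```

A limit well below the VM's physical RAM together with a non-zero oom_kill
counter would be consistent with the assumption above.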

Pulsar CI is still in bad shape. There are a few very flaky tests, such as
RackAwareTest, which causes most builds to fail. The root cause of the
RackAwareTest failures seems to be a production code issue, and a new issue
has been created to track it: https://github.com/apache/pulsar/issues/11433 .
RackAwareTest will be moved to the quarantine group to unblock CI; that is
part of PR https://github.com/apache/pulsar/pull/11370 .
There are also 2 other urgent PRs to address very flaky tests:
https://github.com/apache/pulsar/pull/11424 , which fixes
MessagePublishBufferThrottleTest.testBlockByPublishRateLimiting, and
https://github.com/apache/pulsar/pull/11425 , which fixes the flaky Presto
tiered storage integration tests.

*Before retrying a failed build, please check the failure. Before
re-running, check in the issue tracker whether the test failure has already
been reported as a flaky test. If not, please report it.*
*If it has been reported, please add a comment saying that you ran into it
as well.*
*If we all follow this principle, it will be easier to get the flaky test
issues under control. Hopefully, over time there will be ways to
automate this task, but we aren't there yet. Contributions to improve this
process are welcome.*

A new flaky test issue can be reported by choosing "Flaky test" on the new
issue page: https://github.com/apache/pulsar/issues/new/choose .
Existing flaky test issues can be searched here:
https://github.com/apache/pulsar/issues?q=is%3Aissue+is%3Aopen+flaky+sort%3Aupdated-desc
(enter "flaky" or the test method name as the search query)
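
The same search can also be scripted, which may be a small step toward
automating the check. The sketch below uses the public GitHub REST search
endpoint (api.github.com/search/issues); the function names are made up
here, and unauthenticated requests are rate limited, so treat it as a
starting point rather than a finished tool.

```python
import json
import urllib.parse
import urllib.request

SEARCH_ENDPOINT = "https://api.github.com/search/issues"


def flaky_search_url(term: str = "flaky", repo: str = "apache/pulsar") -> str:
    """Build a search URL for open issues matching term, newest updates first."""
    query = f"repo:{repo} is:issue is:open {term}"
    params = {"q": query, "sort": "updated", "order": "desc"}
    return f"{SEARCH_ENDPOINT}?{urllib.parse.urlencode(params)}"


def find_flaky_issues(term: str = "flaky"):
    """Fetch (title, url) pairs of matching open issues. Needs network access."""
    request = urllib.request.Request(
        flaky_search_url(term),
        headers={"Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    return [(item["title"], item["html_url"]) for item in payload["items"]]


if __name__ == "__main__":
    # Example: look for an existing report before filing a new one.
    for title, url in find_flaky_issues("RackAwareTest"):
        print(f"{title}\n  {url}")
```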

Please contribute to improving Pulsar CI stability!


-Lari



Re: Integration tests failing in Pulsar CI with error code 137 (out of memory error)

Posted by Lari Hotari <lh...@apache.org>.
I pushed PR https://github.com/apache/pulsar/pull/11414 to fix the issue.
Let's see if it does.

-Lari
