Posted to dev@pulsar.apache.org by tison <wa...@gmail.com> on 2023/06/28 01:52:14 UTC

[BLOCKING] Master branch always fails on testElasticSearch8Sink

See also https://github.com/apache/pulsar/issues/20661

Enrico and I both verified that it works well locally, so this could be an
environment issue or an unstable dependency - though I checked that the ES
image has not changed.

If we cannot locate the cause quickly, perhaps we should disable the test
first to unblock other PRs?

I tried to read the code, but found no obvious cause (and the test passes
locally). The log shows that the statistics recorded one received message
instead of the expected 20, but since the other test cases pass, it may not
be an issue in the core logic.

Best,
tison.

Re: [BLOCKING] Master branch always fails on testElasticSearch8Sink

Posted by Michael Marshall <mm...@apache.org>.
> I think it's better to disable (or quarantine) a test if it blocks master and there's no immediate solution.

It is regrettable to disable a test, but given that this test is
passing locally, I agree with quarantining the test.

Thanks,
Michael

Re: [BLOCKING] Master branch always fails on testElasticSearch8Sink

Posted by Lari Hotari <lh...@apache.org>.
I think it's better to disable (or quarantine) a test if it blocks master and there's no immediate solution.

-Lari

Re: [BLOCKING] Master branch always fails on testElasticSearch8Sink

Posted by Lari Hotari <lh...@apache.org>.
https://github.com/apache/pulsar/pull/20671 is now merged.

For existing PRs that are blocked, it is necessary to push new changes to the PR, or to
close and re-open the PR, to pick up the fix that has been merged to the master branch.

I have also cherry-picked the fix to branch-3.0 and branch-2.11.
There's an open PR for a backport to branch-2.10:
https://github.com/apache/pulsar/pull/20676

-Lari


Re: [BLOCKING] Master branch always fails on testElasticSearch8Sink

Posted by tison <wa...@gmail.com>.
Cool!

Best,
tison.


Re: [BLOCKING] Master branch always fails on testElasticSearch8Sink

Posted by Lari Hotari <lh...@apache.org>.
The root cause appears to be different from the geoip database download in Elastic.
By default, Elastic will stop writes when the disk usage goes over 90%. I've now added a setting to disable the disk usage threshold in the PR [1].
A similar setting is applied in elastic-github-actions [2].
Once the build passes for the PR [3], I'll proceed with merging it to unblock Pulsar CI.
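
For illustration, here's a minimal Testcontainers-style sketch of how such a setting can be applied to an Elasticsearch test container. This is not the actual change in the PR; the class name and image tag are made up for the example, and `cluster.routing.allocation.disk.threshold_enabled` is the standard Elasticsearch toggle for the disk-based allocation thresholds.

```java
// Sketch only, not Pulsar's test code: start an Elasticsearch 8 container with
// the disk usage thresholds disabled so the node keeps accepting writes even
// when the CI runner's disk is nearly full.
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.testcontainers.containers.output.Slf4jLogConsumer;
import org.testcontainers.elasticsearch.ElasticsearchContainer;
import org.testcontainers.utility.DockerImageName;

public class ElasticDiskThresholdSketch {
    private static final Logger log = LoggerFactory.getLogger(ElasticDiskThresholdSketch.class);

    public static void main(String[] args) {
        // The image tag is illustrative; CI may pin a different 8.x version.
        try (ElasticsearchContainer elastic = new ElasticsearchContainer(
                DockerImageName.parse("docker.elastic.co/elasticsearch/elasticsearch:8.8.1"))
                // Turn off the disk watermarks that make Elasticsearch block writes
                // on a nearly full disk.
                .withEnv("cluster.routing.allocation.disk.threshold_enabled", "false")
                // Test-only simplification so the sketch stays short.
                .withEnv("xpack.security.enabled", "false")
                // Forward container stdout/stderr to the test logs.
                .withLogConsumer(new Slf4jLogConsumer(log))) {
            elastic.start();
            log.info("Elasticsearch reachable at {}", elastic.getHttpHostAddress());
        }
    }
}
```

Disabling the thresholds is reasonable for a throwaway CI container, where nothing of value is lost if the disk actually fills up.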

-Lari

[1] - https://github.com/lhotari/pulsar/commit/d959eb4929d4192fb56c140a8b590e0ba25d866b
[2] - https://github.com/elastic/elastic-github-actions/blob/562b8b6ae4677da97273ff6bc4d630ce96ecbaa5/elasticsearch/run-elasticsearch.sh#L41
[3] - https://github.com/apache/pulsar/pull/20671

Re: [BLOCKING] Master branch always fails on testElasticSearch8Sink

Posted by tison <wa...@gmail.com>.
> I guess nobody proceeded with disabling the test.

Yeah. I'm not in a hurry, just bringing up the case. It seems no one is
urgently blocked and we have time to investigate it :D

Thanks for your investigation and patch! Indeed.

Best,
tison.


Re: [BLOCKING] Master branch always fails on testElasticSearch8Sink

Posted by Lari Hotari <lh...@apache.org>.
I guess nobody proceeded with disabling the test.

I have investigated the problem and written a short guide about investigating integration tests
in the real GitHub Actions VM environment using ssh.
This guide is a comment on the issue:
https://github.com/apache/pulsar/issues/20661#issuecomment-1611216464

While I was investigating the failing test, it suddenly started passing and I couldn't reproduce the issue, so I haven't caught the problem yet. This also means that the problem is transient.

I suspect that the geoip database download, which the Elastic container performs at startup, is causing the issues. There's also an Elastic issue #92335 about the default geoip download [1]. It can be disabled by setting `ingest.geoip.downloader.enabled` to `false` in the container environment.

The geoip download might not be the root cause, but I'm now testing a change that disables the geoip database download and enables logging of the Elastic container's stdout and stderr output.
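
For reference, those two tweaks could look roughly like this with Testcontainers. The helper below is a hypothetical sketch, not code from the Pulsar repo:

```java
// Sketch of an assumed helper (not from the Pulsar repo): configure an
// Elasticsearch test container to skip the startup-time geoip database
// download and to forward its stdout/stderr to the test logger.
import org.slf4j.LoggerFactory;
import org.testcontainers.containers.output.Slf4jLogConsumer;
import org.testcontainers.elasticsearch.ElasticsearchContainer;

final class ElasticGeoipSketch {
    static ElasticsearchContainer configure(ElasticsearchContainer container) {
        return container
                // Don't download the geoip database at startup (see the Elastic change in [1]).
                .withEnv("ingest.geoip.downloader.enabled", "false")
                // Capture container stdout/stderr so CI failures are easier to diagnose.
                .withLogConsumer(new Slf4jLogConsumer(
                        LoggerFactory.getLogger("elasticsearch-container")));
    }
}
```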

The PR is https://github.com/apache/pulsar/pull/20671 .

-Lari

[1] https://github.com/elastic/elasticsearch/pull/92335
