Posted to server-dev@james.apache.org by Jean Helou <je...@gmail.com> on 2020/12/03 20:21:34 UTC

Re: Jenkins CI setup

Hello fellow jamers !

The Jenkinsfile in the PR works, up until the test suite fails. The test
failures come from seemingly "unstable" tests that fail because of timing
issues. Benoit fixed the first one in
https://github.com/apache/james-project/pull/267 by disabling read repairs
during consistency checks (I have no idea what that means but it sounds
awesome :) ). I fixed the second one in
https://github.com/apache/james-project/pull/269, where the event bus sender
and receivers were closed out of order on shutdown, sometimes leading to
events being sent to a closed receiver.

After some cleanup, Matthieu recreated a buildable PR, which led to yet
another unstable test in
https://builds.apache.org/blue/organizations/jenkins/james%2FApacheJames/detail/PR-268/1/tests

I started investigating the issue and ended up roping in Matthieu since the
symptoms left me completely puzzled. Matthieu managed to
pinpoint the root cause: an NPE sometimes thrown from
within org.apache.james.server.core.MimeMessageCopyOnWriteProxy, which in
turn triggered further NullPointerExceptions in the mailet pipeline error
handling code.
We finally confirmed a concurrency issue in the refcounting management of
the proxy which, if I understand correctly, can lead to unrecoverable data
loss. We wrote a test to trigger it [1] in an almost deterministic manner.

Once we had a test to reproduce the race condition, we tried to fix the
issue, only to realize that the fix led to even more concurrency issues. The
rather depressing conclusion we reached yesterday is that the whole
implementation is currently unsound with regard to concurrency. I am unable
to estimate the resolution effort at this point; Matthieu has some ideas
and will work on it (as will I) when time allows.

Which leads me to my current questions: I feel that fixing such
long-standing issues in the test suite is not actually part of configuring
the apache CI, but I am unsure how to proceed.

Here is what I would like to do at this stage :
- Isolate the unstable tests with an "unstable" tag (akin to "feature
tags"; a sketch follows this list)
- exclude these tests from the default surefire execution profile (also
sketched below),
- add a parallel pipeline step for these tests where the step failure
doesn't fail the pipeline [2]
- ensure that the build is green
- merge so the project finally has a working public CI
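For illustration, here is a minimal sketch of what the tag could look like
with JUnit 5 (the tag name and the test class are illustrative, not
necessarily what will end up in the PR):

```java
import org.junit.jupiter.api.Tag;
import org.junit.jupiter.api.Test;

class SomeTimingSensitiveTest {

    // Marks this test as known-flaky so the default surefire run skips it.
    @Tag("unstable")
    @Test
    void shouldEventuallyDispatchTheEvent() {
        // timing-sensitive scenario goes here
    }
}
```

On the surefire side, excluding the tag by default could be as simple as
this (again a sketch; JUnit 5 tags map onto surefire's
groups/excludedGroups):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <!-- Tests tagged "unstable" are skipped unless a profile re-enables them. -->
    <excludedGroups>unstable</excludedGroups>
  </configuration>
</plugin>
```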

I intend to start working on this quickly so we can all enjoy a functional
public CI.

Alternatives:
- Merge the Jenkinsfile only after the whole pipeline has been tested in the
PR branch, which may not happen in the short to medium term...
- Merge as is, which means that many builds on PRs will end up failing, and
the last steps (snapshot publishing) might fail even when the test suite
succeeds, since those steps have never actually been exercised.
- Something I haven't thought of?

Another issue I want to raise is the availability of the CI builds. As you
have seen from my experiments, the CI trigger configuration will only
build commits from :
- all branches of the main repository
- all PRs opened from the main repository
- all PRs opened by someone with write access to the main repository

This means that PRs from external contributors will not be built at all.

I tried adding the issueCommentTrigger to the Jenkinsfile, but neither my
comments nor those of someone with commit access were able to trigger the
build.
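For reference, what I tried looked roughly like this (a sketch from memory;
issueCommentTrigger is provided by the Pipeline: GitHub plugin, so it can
only work if that plugin is installed on the Jenkins controller, which I
have not been able to verify):

```groovy
pipeline {
    agent any
    triggers {
        // Re-run the build when a PR comment matches this pattern.
        issueCommentTrigger('.*test this please.*')
    }
    stages {
        stage('Build') {
            steps {
                sh 'mvn -B -e compile'
            }
        }
    }
}
```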

I think that one of the project members should revise the current settings
to make it possible to build external contributors' PRs one way or another
(only project members have, or can have, access to the jenkins project
configuration).
Here are two options:
- the easiest and quickest modification is to let the CI build each and
every PR; there are relatively few PRs on james, so the burden on the CI
platform shouldn't be too bad.
- alternatively it may be possible to configure jenkins to require a
comment from someone with write access to trigger a build. Unfortunately I
am not certain how to set this up; maybe INFRA can help.

I know this was a long piece, I look forward to reading your opinions !
Jean

[1] see
https://github.com/jeantil/james-project/tree/james-3225-concurrency-bug-mimemessagecow
[2] see
https://stackoverflow.com/questions/44022775/jenkins-ignore-failure-in-pipeline-build-step

On Thu, Nov 26, 2020 at 11:22 AM Jean Helou <je...@gmail.com> wrote:

> The good news is that docker does indeed work, the bad news is that the
> tests fail with an issue that's too involved for me :/
>
> [INFO]
> [INFO] Results:
> [INFO]
> [ERROR] Failures:
> [ERROR]   CassandraMailboxManagerConsistencyTest$FailuresOnDeletion$DeleteOnce.deleteMailboxByPathShouldBeConsistentWhenMailboxPathDaoFails:433 Multiple Failures (1 failure)
> 	
> Expecting:
>   <[]>
> to contain exactly (and in same order):
>   <[#private:user:INBOX]>
> but could not find the following elements:
>   <[#private:user:INBOX]>
>
> at CassandraMailboxManagerConsistencyTest$FailuresOnDeletion$DeleteOnce.lambda$deleteMailboxByPathShouldBeConsistentWhenMailboxPathDaoFails$8(CassandraMailboxManagerConsistencyTest$FailuresOnDeletion$DeleteOnce.java:440)
>
> so unless the build for
>
> * 6fab99364a - JAMES-3448 Rewrite links to http://james.apache.org/server/3/ (Mon Nov 23 15:10:36 2020 +0700) <Benoit Tellier> N
>
> is broken, which sounds unlikely, I'm going to need help
>
> jean
>
> On Thu, Nov 26, 2020 at 10:53 AM Jean Helou <je...@gmail.com> wrote:
>
>> on a loosely related note : the test suite logs are scary to look at:
>> piles upon piles of stack traces and error logs, but the tests actually pass
>> ...
>>
>> On Thu, Nov 26, 2020 at 10:50 AM Jean Helou <je...@gmail.com> wrote:
>>
>>> Thanks benoit,
>>>
>>> Matthieu pointed me to numerous apache projects with jenkinsfiles which
>>> mention docker in
>>> https://github.com/search?q=org%3Aapache++filename%3AJenkinsfile+docker&type=Code
>>> so I'm trying out things based on that
>>>
>>> the logs seem promising so far :
>>> ```
>>>
>>> [INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.697 s - in org.apache.james.backends.rabbitmq.RabbitMQConnectionFactoryTest
>>>         ℹ︎ Checking the system...
>>>         ✔ Docker version should be at least 1.6.0
>>>         ✔ Docker environment should have more than 2GB free disk space
>>> [INFO] Running org.apache.james.backends.rabbitmq.RabbitMQTest
>>> ```
>>>
>>>
>>> On Thu, Nov 26, 2020 at 10:40 AM Tellier Benoit <bt...@apache.org>
>>> wrote:
>>>
>>>> Done
>>>>
>>>> On 26/11/2020 at 16:25, Jean Helou wrote:
>>>> > hi all,
>>>> >
>>>> > As you know I started a PR to set up the jenkins CI; the latest
>>>> > iteration sees the compilation of the project complete in 5 minutes
>>>> > (thanks to T1C) but the tests fail to initialize docker containers,
>>>> > with the disastrous consequences you can imagine :D
>>>> >
>>>> > I opened https://issues.apache.org/jira/browse/INFRA-21144 to ask if
>>>> > it is possible to have the docker service enabled on some nodes. Since
>>>> > I am not an official member of the project, I think it may be useful
>>>> > if you chimed in on the ticket to confirm that this is a legitimate
>>>> > request.
>>>> >
>>>> > Best regards,
>>>> > Jean
>>>> >
>>>>

Re: Jenkins CI setup

Posted by Matthieu Baechler <ma...@apache.org>.
Hi,

On Fri, 2020-12-04 at 14:59 +0700, btellier@linagora.com (OpenPaaS)
wrote:


[...]

> > 
> > Here is what I would like to do at this stage :
> > - Isolate the unstable tests with an "unstable" tag (akin to "feature
> > tags")
> I'd advocate a @Disabled tag, referencing both a JIRA ticket specific to
> the bugfix needed, and the JIRA of the CI build.
> 
> Having a list of such issues in the JIRA (CI setup) ticket would be
> valuable. I'd even advise doing subtickets to have a nice checklist.

Let's say there are 10 unstable tests that will prevent the CI PR from being
green: do you expect Jean to open 10 tickets with an explanation of each
problem? That would be a very high expectation.

> > - exclude these tests from the default surefire execution profile,
> > - add a parallel pipeline step for these tests where the step
> > failure
> > doesn't fail the pipeline [2]
> > - ensure that the build is green
> > - merge so the project finally has a working public CI
> > 
> > I intend to start working on this quickly so we can all enjoy a
> > functional
> > public CI.
> +1 I agree on the approach.

I think we can even skip the "add a parallel pipeline step" part
entirely. The simpler the better.

[...]

Cheers,

-- Matthieu Baechler




Re: Jenkins CI setup

Posted by Matthieu Baechler <ma...@apache.org>.
Hi Jean,

Thank you so much for delivering this awesome fight against the red CI.

In the past, such unstable builds have always been linked to some
resource leaks during tests.

At such points, we usually stopped implementing new things for a while
and focused on plugging leaks to bring back build stability.

I guess it's that time of the year again and you are the one paying the
price for now.

The first solution is to recycle JVMs less to mitigate the effects of leaks
(with the surefire reuseForks option).
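In surefire terms, that could look like this (a sketch; the forkCount value
would need tuning):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <!-- One fork per CPU core, and a fresh JVM per test class so leaked
         resources die with the fork instead of accumulating. -->
    <forkCount>1C</forkCount>
    <reuseForks>false</reuseForks>
  </configuration>
</plugin>
```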

If that's not enough, I'll have a look at plugging some leaks myself, if
nobody else cares enough.

Keep me posted about the outcome of this change; I care very much about
having a CI for James.

Cheers,

-- Matthieu Baechler

On Wed, 2021-01-13 at 12:50 +0100, Jean Helou wrote:
> [...]




Re: Jenkins CI setup

Posted by Jean Helou <je...@gmail.com>.
> Could there be some automatic calculation of memory resources that makes
> the build fail on some servers and not on others?
>

Maybe, but:


> Our CI servers have 28GiB of memory. Could Docker allocate an amount
> that is particularly suitable for our test suite?
>

the last memory error failure (
https://builds.apache.org/job/james/job/ApacheJames/job/PR-264 ) ran on H39;
according to
https://cwiki.apache.org/confluence/display/INFRA/Jenkins+node+labels this
node has 4TB disk and 96GB RAM.
builds 9->12 & 14 ran on H39
build 6, which also failed on memory, ran on H22, for which the wiki doesn't
have stats
builds 45, 48, 49 and 50 succeeded on H48
(I'm sticking to failures on purpose)

I can't really wait for 3-4 hours in front of the jenkins page to see if
another job starts on the worker my build is running on, and the runners
don't seem to keep build history (not even for 24h:
https://builds.apache.org/computer/H39/builds ; build #13 failed on OOM and
ran between 12:00 and 15:00 on 2021-01-13),

so it's really hard to correlate.

for this morning's build I happened to see a kafka job start on the runner
(H42) while build #15 was running:
Kafka » kafka-trunk-jdk15
<https://builds.apache.org/job/Kafka/job/kafka-trunk-jdk15/>
unfortunately that build was hit by the copy-on-write concurrency bug, so no
memory error this time \o/

However if you look at what I said about build #13

> I get a more classical `java.lang.OutOfMemoryError: Java heap space` a bit
> before (12:14:09.563 vs 12:45:57.988).
>  The last non error line before the fatal Direct buffer memory error is
> [INFO] Running
> org.apache.james.webadmin.integration.rabbitmq.RabbitMQReindexingWithEventDeadLettersTest
>  The last non error line before the nonfatal heap memory error is
> [INFO] Running
> org.apache.james.jmap.memory.cucumber.MemoryDownloadCucumberTest


According to all the documentation I have ever seen, the Java heap space
message means that the JVM could not allocate memory for an object within
the heap; it is not related to running out of memory outside the heap.
In my experience, which seems to match
https://stackoverflow.com/questions/46801741/jvm-crashes-with-error-cannot-allocate-memory-errno-12 ,
a native memory allocation error while trying to grow the heap will
crash the JVM, so the Java heap OOM error is not related to other processes
running on the machine.
I also don't see errors related to the docker containers used by the tests
failing to allocate memory (there are some errors about the docker
containers not finding an image once in a while, but that's it); the memory
errors are always in the james test code itself.
The OOM Direct buffer memory error seems to often be triggered by an attempt
to start GuiceJamesServer, which in turn starts multiple netty servers (for
the various ports); the default maximum off-heap memory can be related to
Xmx (see
https://blog.alexis-hassler.com/2020/05/15/direct-buffer-memory.html
for a recent French resource on the subject, or
http://www.mastertheboss.com/other/java-stuff/troubleshooting-outofmemoryerror-direct-buffer-memory
).
Also,
https://dzone.com/articles/troubleshooting-problems-with-native-off-heap-memo
says the error seems to come from hitting a limit of the JVM rather than
from malloc being unable to allocate native memory; I couldn't find the
corresponding JDK documentation.
So all this seems to confirm Matthieu's intuition of a resource leak in the
test suites rather than a native memory starvation issue. Why that leak
doesn't affect your CI is still a mystery to me.
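In the meantime, if the direct-memory ceiling really is derived from Xmx,
one mitigation worth trying would be to raise it explicitly in surefire's
argLine (a sketch; I have not verified these values against our build):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <!-- -XX:MaxDirectMemorySize decouples the off-heap (netty direct
         buffer) ceiling from the -Xmx heap setting. -->
    <argLine>-Xmx2g -XX:MaxDirectMemorySize=512m</argLine>
  </configuration>
</plugin>
```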

> I know that I had some pain to reproduce a green build on my local
> computer, while it works pretty smoothly on the CI.
>
> Raphaël.
>
> [...]
>

Re: Jenkins CI setup

Posted by Raphaël Ouazana <ro...@linagora.com>.
Could there be some automatic calculation of memory resources that makes 
the build fail on some servers and not on others?

Our CI servers have 28GiB of memory. Could Docker allocate an amount 
that is particularly suitable for our test suite?

I know that I had some pain to reproduce a green build on my local 
computer, while it works pretty smoothly on the CI.

Raphaël.

On 14/01/2021 at 09:37, Jean Helou wrote:
> [...]


Re: Jenkins CI setup

Posted by Jean Helou <je...@gmail.com>.
>
> My 2 cents trying to bring a little force in this nice project :)
>

Thanks Raphaël :)

> All our old CI is open source, so you can just check the source, Luke:
> https://github.com/linagora/james-jenkins/blob/master/workflow-job#L643


thanks for the pointer, I had not looked for that one. After digging
through the repo, I couldn't find any memory-specific settings either.

> and in particular
>
> https://github.com/linagora/james-project/blob/master/dockerfiles/compilation/java-11/compile.sh#L65
>

> So I'm sorry there is no magic mvn parameter...
>

well I was wondering if maybe there was something passed in
MVN_ADDITIONAL_ARG_LINE, but as I said I was unable to find anything
special in the james-jenkins repo.

This leaves me even more confused: don't you encounter these random-looking
failures on the linagora ci platform ?
I mean, I did manage to get a few green builds, but overall out of 62 builds
I had 5 successful ones; that's less than 10% !

I haven't kept detailed stats (I didn't think it would be this bad) but
from gut feeling, the primary causes seem to be:
- copy-on-write thread safety (which can arguably be explained by slower
computers), hence my impatience to see JAMES-3477 fixed, since this would
likely resolve a lot of unstable tests
- out of memory errors, which I find much harder to explain by slower
machines

For the out of memory errors I ended up increasing the Xmx of surefire
(from 1g to 2g) in the following pom files :

 server/protocols/jmap-draft-integration-testing/cassandra-jmap-draft-integration-testing/pom.xml
 server/protocols/webadmin-integration-test/distributed-webadmin-integration-test/pom.xml
 server/protocols/webadmin-integration-test/memory-webadmin-integration-test/pom.xml
 server/protocols/webadmin-integration-test/pom.xml
 server/protocols/jmap-rfc-8621-integration-tests/distributed-jmap-rfc-8621-integration-tests/pom.xml
 server/protocols/webadmin/webadmin-mailbox/pom.xml
 server/container/guice/cassandra-rabbitmq-guice/pom.xml
 server/protocols/jmap-draft-integration-testing/rabbitmq-jmap-draft-integration-testing/pom.xml
 server/protocols/jmap-draft-integration-testing/memory-jmap-draft-integration-testing/pom.xml

Obviously I very much look forward to the removal of the jmap draft module,
since I read in other exchanges that it is deprecated and will be removed.

If I still can't get the build to pass, I'll look into Matthieu's
suggestion :

> The first solution is to recycle JVMs less to mitigate the effects of leaks
> (with the surefire reuseForks option).


Cheers,
jean


> [...]

Re: Jenkins CI setup

Posted by Raphaël Ouazana <ro...@linagora.com>.
Hello Jean,

My 2 cents trying to bring a little force in this nice project :)

All our old CI is open source, so you can just check the source, Luke: 
https://github.com/linagora/james-jenkins/blob/master/workflow-job#L643

This calls images.jamesCompile which is built here: 
https://github.com/linagora/james-project/tree/master/dockerfiles/compilation/java-11 
and in particular 
https://github.com/linagora/james-project/blob/master/dockerfiles/compilation/java-11/compile.sh#L65

So I'm sorry there is no magic mvn parameter...

And happy new year to you!

Cheers,

Raphaël.

On 13/01/2021 at 12:50, Jean Helou wrote:
> [...]


Re: Jenkins CI setup

Posted by Jean Helou <je...@gmail.com>.
Happy new year fellow jamers !

In this thrilling new episode you might learn if 2021 will be the year the
james project gets a public ci rolling again !

CI wars
Episode 49e^55
The Memory errors strike back
The CI Resistance succeeded in configuring jenkins, fixed some tests,
exposed some bugs and tagged a lot of unstable tests as being unstable.
After such a striking defeat the empire of bugs reacted in the most vicious
way ever, it deployed "Direct buffer memory" errors throughout the galaxy
to find contributors to the CI effort and tear down their hope and
motivation. They found the apache jenkins and it will need help from all
the CI resistance members to fight them off !

On a bit more serious note,
I am at a loss as to how to fix this issue. My last four builds have failed
because a `java.lang.OutOfMemoryError: Direct buffer memory` caused the
forked JVM to crash, crashing the surefire plugin and the build with it,
and that has been a build failure cause for a lot of the 63 builds on the
apache CI. Until now I updated the pom files of the corresponding projects
to increase the heap to 2G, but the last failure occurred in a project where
the heap was already increased.

Looking at a specific log
https://builds.apache.org/blue/organizations/jenkins/james%2FApacheJames/detail/PR-264/12/pipeline
I get a more classical `java.lang.OutOfMemoryError: Java heap space` a bit
earlier (12:14:09.563 vs 12:45:57.988).
 The last non-error line before the fatal Direct buffer memory error is
[INFO] Running
org.apache.james.webadmin.integration.rabbitmq.RabbitMQReindexingWithEventDeadLettersTest
 The last non-error line before the nonfatal heap memory error is
[INFO] Running
org.apache.james.jmap.memory.cucumber.MemoryDownloadCucumberTest

I will try to increase surefire's heap for the
memory-jmap-draft-integration-testing project too, in case the initial heap
space OOM triggered the other one.
stackoverflow is not very helpful either:
https://stackoverflow.com/search?q=java.lang.OutOfMemoryError%3A+Direct+buffer+memory
or at least I have not been able to comprehend how the solutions there could help.

I have gone through the files in /dockerfiles without finding anything that
looked related to the memory configuration of maven itself. If people who
run the build locally with success, or on their own CI, could check their
MVN_OPTS and let me know if they override maven's Xmx itself, I would
appreciate it.

thanks for your help
jean



On Tue, Dec 29, 2020 at 9:10 AM Jean Helou <je...@gmail.com> wrote:

> [...]

Re: Jenkins CI setup

Posted by Jean Helou <je...@gmail.com>.
Hi Benoit,

> As someone operating another CI, I want to run even the unstable tests on
> every run. Is there some adaptation needed for this?
>

Yes, you will have to change your CI:
> mvn -B -e -fae test
now only runs the stable tests; to run the unstable tests you need an
additional step:
> mvn -B -e -fae test -Punstable-tests
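For reference, the shape of the profile is roughly this (a simplified
sketch, assuming the JUnit 5 tag is named "unstable"; see the PR for the
exact configuration):

```xml
<profile>
  <id>unstable-tests</id>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-surefire-plugin</artifactId>
        <configuration>
          <!-- Run only the tests tagged as unstable, and clear the
               default exclusion inherited from the base configuration. -->
          <groups>unstable</groups>
          <excludedGroups combine.self="override"/>
        </configuration>
      </plugin>
    </plugins>
  </build>
</profile>
```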

I believe your CI is also based on jenkins (because of the stress test
jenkinsfile at the root of the project), in which case you could configure
your jenkins to pick up the jenkinsfile and use the same pipeline as we use
on the apache CI.

cheers,
jean

Re: Jenkins CI setup

Posted by Tellier Benoit <bt...@apache.org>.
Hello Jean,

Nice work!

On 28/12/2020 at 23:21, Jean Helou wrote:
> Hello again jamers!
> 
> It's time for a new irregular report on the CI effort on apache infra 🎅 !
> 
> Let's start with the good news : today I finally reached a successful build
> https://builds.apache.org/blue/organizations/jenkins/james%2FApacheJames/detail/PR-268/45/pipeline
> (the first fully successful build on apache infra)

\o/

> 
> You can see in the pipeline that, as discussed before, the testing phase is
> split in 2 parts: stable tests vs unstable tests. Failures in the first
> phase will fail the build; failures in the unstable phase will not be
> considered a build failure (but should still collect the failed tests in
> the reports; however, the recent failures were mostly memory related, in
> which case the surefire report is not generated :( )
> 
> Over the last 2 weeks and 45 build attempts, I tagged all failing tests as
> "Unstable", and I also increased the heap in the forked surefire to resolve
> some of the OutOfMemoryException failures.
> 
> At this stage I would really like to see this merged (if only to be able to
> evaluate dangerous changes such as
> https://github.com/apache/james-project/pull/282)

+1 I think this however deserves a separate thread. I will start it now.

> 
> You can look at https://github.com/apache/james-project/pull/268 to see
> which tests have been marked as Unstable. It was rebased on master this
> morning and I intend to clean up the history tonight.

Will do, maybe tomorrow. In principle, it is a yes from my side.

As someone operating another CI, I want to run even the unstable tests on
every run. Is there some adaptation needed for this?

> I also removed some invasive logging from the webadmin test code (it used
> to log every single http request made in the tests); the full log is still a
> bit over 30MB...

Nice enhancement; it is likely some long-forgotten debug statements.
> 
> Best regards,
> Jean
> 
> On Fri, Dec 11, 2020 at 12:25 PM Jean Helou <je...@gmail.com> wrote:
>> [...]



Re: Jenkins CI setup

Posted by Jean Helou <je...@gmail.com>.
Hello again jamers!

It's time for a new irregular report on the CI effort on apache infra 🎅 !

Let's start with the good news : today I finally reached a successful build
https://builds.apache.org/blue/organizations/jenkins/james%2FApacheJames/detail/PR-268/45/pipeline
(the first fully successful build on apache infra)

You can see in the pipeline that, as discussed before, the testing phase is
split in 2 parts: stable tests vs unstable tests. Failures in the first
phase will fail the build; failures in the unstable phase will not be
considered a build failure (but should still collect the failed tests in
the reports; however, the recent failures were mostly memory related, in
which case the surefire report is not generated :( )
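The shape of that split in the Jenkinsfile is roughly the following (a
simplified sketch; stage names and options may differ from the actual PR):

```groovy
stage('Stable tests') {
    steps {
        sh 'mvn -B -e -fae test'
    }
}
stage('Unstable tests') {
    steps {
        // Record test results but never mark the build as failed.
        catchError(buildResult: 'SUCCESS', stageResult: 'UNSTABLE') {
            sh 'mvn -B -e -fae test -Punstable-tests'
        }
    }
}
```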

Over the last 2 weeks and 45 build attempts, I tagged all failing tests as
"Unstable", and I also increased the heap in the forked surefire to resolve
some of the OutOfMemoryException failures.

At this stage I would really like to see this merged (if only to be able to
evaluate dangerous changes such as
https://github.com/apache/james-project/pull/282).

You can look at https://github.com/apache/james-project/pull/268 to see
which tests have been marked as Unstable. It was rebased on master this
morning and I intend to clean up the history tonight.
I also removed some invasive logging from the webadmin test code (it used
to log every single http request made in the tests); the full log is still a
bit over 30MB...

Best regards,
Jean

On Fri, Dec 11, 2020 at 12:25 PM Jean Helou <je...@gmail.com> wrote:

> [...]

Re: Jenkins CI setup

Posted by Jean Helou <je...@gmail.com>.
I conclude that my effort to get the CI working is cursed by the gods.
Remember :

> > {"message":"No such image: quay.io/testcontainers/ryuk:0.2.3"}
>
> which repeats for most tests failures, this seems to be common enough that
> there is stack overflow for it
>
> https://stackoverflow.com/questions/61887363/testcontainers-cant-pull-ryuk-image-quay-io-is-not-reachable
> I have attempted to upgrade test containers to 1.15.0 (as it will pull
> ryuk from docker hub instead of quay.io since 1.14.3 and we were using
> 1.12)
> hopefully this will help :)
>

A docker API change broke most testcontainers versions, which won't be
able to pull the images if they are not already available locally !
https://github.com/testcontainers/testcontainers-java/issues/3574
> yes, this Docker API change applies to most of Testcontainers versions.

They should release a 1.15.1 to resolve the issue shortly. I have tried
explicitly pulling the image in the steps running the tests, but sadly it
doesn't seem to have helped :(
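(For reference, the explicit pull was just a shell step ahead of the test
run, roughly like this; the image tag is the one our testcontainers version
was asking for:)

```groovy
stage('Prefetch images') {
    steps {
        // Work around testcontainers being unable to pull images itself
        // after the docker API change.
        sh 'docker pull quay.io/testcontainers/ryuk:0.2.3'
    }
}
```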

jean

>

Re: Jenkins CI setup

Posted by Tellier Benoit <bt...@apache.org>.

On 11/12/2020 at 16:35, Jean Helou wrote:
>>>> I'm in favor of opening a dedicated ticket and merging a disabled version
>>>> of this test in order to document the problem.
>>>>
>>>
>>> it"s been busy and I haven't opened the ticket yet, nor have we managed
>> to
>>> fully fix the issue yet
>>>
>>
>> I can devote some of my time to support you on this.
>>
> 
> Thanks. If you can open the jira issue, I think I provided the relevant
> information in my previous mail, and if you feel lucky you can take a look
> at the branch I referred to; the test demonstrating the issue is still there
> :)
+1

I sadly already made unsuccessful attempts with read/write locks instead of
synchronized blocks; they do not seem to help...
> 
> [...]



Re: Jenkins CI setup

Posted by Jean Helou <je...@gmail.com>.
> >> I'm in favor of opening a dedicated ticket and merging a disabled version
> >> of this test in order to document the problem.
> >>
> >
> > it"s been busy and I haven't opened the ticket yet, nor have we managed
> to
> > fully fix the issue yet
> >
>
> I can devote some of my time to support you on this.
>

Thanks. If you can open the jira issue, I think I provided the relevant
information in my previous mail, and if you feel lucky you can take a look
at the branch I referred to; the test demonstrating the issue is still there
:)


> We use a static singleton approach in order for testcontainers docker
> containers to be initialised once per surefire fork and not once per
> test class. Combined with a reuseForks=true setting this dramatically
> reduces testing time!
>
> The only downside is cryptic NoClassDefFound errors if the given docker
> container can't start.
>

That was an extremely useful insight. I downloaded the full log and found the
following:

> {"message":"No such image: quay.io/testcontainers/ryuk:0.2.3"}

which repeats for most test failures. This seems to be common enough that
there is a stack overflow question for it:
https://stackoverflow.com/questions/61887363/testcontainers-cant-pull-ryuk-image-quay-io-is-not-reachable
I have attempted to upgrade testcontainers to 1.15.0 (as it pulls ryuk
from docker hub instead of quay.io since 1.14.3, and we were using 1.12);
hopefully this will help :)

> To be noted that:
>  - some tests reuse existing images
>  - some tests, like the LDAP one, use a Dockerfile to build their own
> image. This is likely the source of some instability.
>

I'll keep that in mind once the ryuk issue is fixed, thanks :)


> I also saw Cassandra containers failing to initialize in memory-constrained
> environments (eg on my laptop if less than 5GB are available).
>

RAM shouldn't be an issue on the apache CI; the builders have 48GB of RAM :)


> A quick action could be to decrease surefire concurrency in some
> projects using expensive-to-start containers (like mailbox/cassandra,
> mpt/impl/imap/cassandra, etc...). If we can use an ENV variable for this...


I'll keep that in mind as a backup plan
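For reference, a sketch of how that could be wired so the CI can tune it
(the property name is illustrative; surefire also accepts -DforkCount
directly on the command line):

```xml
<properties>
  <!-- Default: one surefire fork per CPU core. -->
  <test.forkCount>1C</test.forkCount>
</properties>

<!-- (elsewhere in the pom) -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <!-- Lower on constrained CI nodes: mvn test -Dtest.forkCount=1 -->
    <forkCount>${test.forkCount}</forkCount>
  </configuration>
</plugin>
```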

thanks
Jean

Re: Jenkins CI setup

Posted by Tellier Benoit <bt...@apache.org>.
Hello Jean,

On 11/12/2020 at 15:47, Jean Helou wrote:
> Hello again jamers!
> 
> It's time for your irregular report on the CI effort on apache infra :)

\o/

>> I'm in favor of opening a dedicated ticket and merging a disabled version
>> of this test in order to document the problem.
>>
> 
> it"s been busy and I haven't opened the ticket yet, nor have we managed to
> fully fix the issue yet
> 

I can devote some of my time to support you on this.

> 
>>> Here is what I would like to do at this stage :
>>> - Isolate the unstable tests with an "unstable" tag (akin to "feature
>>> tags")
>>> - exclude these tests from the default surefire execution profile,
>>> - add a parallel pipeline step for these tests where the step failure
>>> doesn't fail the pipeline [2]
>>> - ensure that the build is green
>>> - merge so the project finally has a working public CI
>>>
>>> I intend to start working on this quickly so we can all enjoy a
>> functional
>>> public CI.
>>
> 
> So I added an `Unstable.Tag` and started tagging the known unstable
> tests. Running the pipeline in parallel seemed to lead to more issues,
> so I reverted the parallel run to a serial run for now. I also changed
> from fail-at-first-error to fail-at-end to get an idea of the volume of
> unstable tests.

+1

> 
>> I'd advocate a @Disabled tag, referencing both a JIRA ticket specific to
>> the bugfix needed, and the JIRA of the CI build.
>> Having a list of such issues in the JIRA (CI setup) ticket would be
>> valuable. I'd even advise doing subtickets to have a nice checklist.
> 
> 
> Despite Matthieu's remarks, I was open to creating the tickets and
> adding the information to the Unstable.TAG or as a comment next to the
> tag.
> 
> However the recent CI results make the effort feel overwhelming:
> - https://builds.apache.org/blue/organizations/jenkins/james%2FApacheJames/detail/PR-268/7/tests : failures 4 new, 33 existing, 27 fixed, total 37
> - https://builds.apache.org/blue/organizations/jenkins/james%2FApacheJames/detail/PR-268/6/tests : failures 27 new, 27 existing, 0 fixed, total 54
> - https://builds.apache.org/blue/organizations/jenkins/james%2FApacheJames/detail/PR-268/5/tests : failures 0 new, 8 existing, 6 fixed, total 8
> - https://builds.apache.org/blue/organizations/jenkins/james%2FApacheJames/detail/PR-268/4/tests (issue with the jenkins file, the build did not run)
> - https://builds.apache.org/blue/organizations/jenkins/james%2FApacheJames/detail/PR-268/3/tests : 4 fixed, 92 existing failures (that uses parallel to run both stable and unstable)
> - https://builds.apache.org/blue/organizations/jenkins/james%2FApacheJames/detail/PR-268/2/tests : 92 existing failures (that uses parallel to run both stable and unstable)
> 
> Now the last 2 runs were triggered after I rebased on master.
> I will trigger the next 3 by modifying random comments in the jenkins
> file to see if the build has a reproducible failure pattern or if all
> these need to be tagged as Unstable.
> That's a lot of failures, some of which don't even make sense to me; in
> run 7 the first of the 4 new failures is:
> ```
> java.lang.NoClassDefFoundError: Could not initialize class
> org.apache.james.user.ldap.ReadOnlyUsersLDAPRepositoryTest
> at
> org.apache.james.user.ldap.ReadOnlyUsersLDAPRepositoryInvalidDnTest.setUp(ReadOnlyUsersLDAPRepositoryInvalidDnTest.java:62)
> ```
> and quite a few of the errors listed are similar NoClassDefFound errors,
> which I fail to understand... I would very much welcome feedback if any
> of you have encountered this kind of issue before, it feels like I am
> missing something :(

We use a static singleton approach so that the testcontainers docker
containers are initialised once per surefire fork and not once per
test class. Combined with a reuseForks=true setting, this dramatically
reduces testing time!

The only downside is cryptic NoClassDefFoundErrors if the given docker
container can't start.
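
Schematically the holder looks like this (a sketch, not the actual James
helper classes; the image name is illustrative):

```
import org.testcontainers.containers.GenericContainer;

// Sketch, not the actual James code: the container starts during class
// initialization, i.e. once per surefire fork.
class DockerContainerSingleton {
    static final GenericContainer<?> CONTAINER =
        new GenericContainer<>("some/image:latest"); // illustrative image

    static {
        CONTAINER.start();
        // If start() throws, class initialization fails once with an
        // ExceptionInInitializerError, and every later reference to this
        // class surfaces as the cryptic NoClassDefFoundError.
    }
}
```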

To be noted that:
 - some tests reuse existing images
 - some tests like the LDAP one use a Dockerfile to build their own
image. This is likely the source of some instability.

I also saw Cassandra containers failing to initialize in
memory-constrained environments (e.g. on my laptop if less than 5GB is
available).

Not sure if it helps.
Not sure if we can hope for changes in the docker environment.

A quick action could be to decrease surefire concurrency in some
projects using expensive-to-start containers (like mailbox/cassandra,
mpt/impl/imap/cassandra, etc...). If we can use an ENV variable for this...

To give some insight: at Linagora we run only one build per jenkins
slave, and each slave has a dedicated physical host. I doubt we have
such a setup in the Apache environment.

Hope it helps.

> 
>> Having a build in the first place, even with the restrictions you
>> describe, sounds like good progress to me.
>>
> 
> As it stands it looks like it's going to take a bit longer :(
> 

Cheers,

Benoit



Re: Jenkins CI setup

Posted by Jean Helou <je...@gmail.com>.
Hello again jamers!

It's time for your irregular report on the CI effort on apache infra :)

> I'm in favor of opening a dedicated ticket and merging a disabled version
> of this test in order to document the problem.
>

it"s been busy and I haven't opened the ticket yet, nor have we managed to
fully fix the issue yet


> > Here is what I would like to do at this stage :
> > - Isolate the unstable tests under with an unstable tag (akin to "feature
> > tags")
> > - exclude these tests from the default surefire execution profile,
> > - add a parallel pipeline step for these tests where the step failure
> > doesn't fail the pipeline [2]
> > - ensure that the build is green
> > - merge so the project finally has a working public CI
> >
> > I intend to start working on this quickly so we can all enjoy a
> functional
> > public CI.
>

So I added an `Unstable.Tag` and started tagging the known unstable
tests. Running the pipeline in parallel seemed to lead to more issues,
so I reverted the parallel run to a serial run for now. I also changed
from fail-at-first-error to fail-at-end to get an idea of the volume of
unstable tests.
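
For reference, the tagging looks roughly like this (a sketch: only the
Unstable.TAG constant mirrors what I added; the test class and scenario
are made up):

```
import org.junit.jupiter.api.Tag;
import org.junit.jupiter.api.Test;

// Sketch of the tagging approach; only Unstable.TAG mirrors the real
// constant, the test class and scenario are illustrative.
interface Unstable {
    String TAG = "unstable";
}

class SomeTimingSensitiveTest {
    @Tag(Unstable.TAG)
    @Test
    void operationShouldCompleteBeforeTimeout() {
        // timing-sensitive assertions would live here; the tag lets
        // surefire exclude the test from the default run (e.g. via
        // excludedGroups) and a separate, non-blocking pipeline step
        // can run the tagged tests instead
    }
}
```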

I'd advocate a @Disabled tag, referencing both a JIRA ticket specific to
> the bugfix needed, and the JIRA of the CI build.
> Having a list of such issues in the JIRA (CI setup) ticket would be
> valuable. I'd even advise doing subtickets to have a nice checklist.


Despite Matthieu's remarks, I was open to creating the tickets and
adding the information to the Unstable.TAG or as a comment next to the
tag.

However the recent CI results make the effort feel overwhelming:
- https://builds.apache.org/blue/organizations/jenkins/james%2FApacheJames/detail/PR-268/7/tests : failures 4 new, 33 existing, 27 fixed, total 37
- https://builds.apache.org/blue/organizations/jenkins/james%2FApacheJames/detail/PR-268/6/tests : failures 27 new, 27 existing, 0 fixed, total 54
- https://builds.apache.org/blue/organizations/jenkins/james%2FApacheJames/detail/PR-268/5/tests : failures 0 new, 8 existing, 6 fixed, total 8
- https://builds.apache.org/blue/organizations/jenkins/james%2FApacheJames/detail/PR-268/4/tests (issue with the jenkins file, the build did not run)
- https://builds.apache.org/blue/organizations/jenkins/james%2FApacheJames/detail/PR-268/3/tests : 4 fixed, 92 existing failures (that uses parallel to run both stable and unstable)
- https://builds.apache.org/blue/organizations/jenkins/james%2FApacheJames/detail/PR-268/2/tests : 92 existing failures (that uses parallel to run both stable and unstable)

Now the last 2 runs were triggered after I rebased on master.
I will trigger the next 3 by modifying random comments in the jenkins
file to see if the build has a reproducible failure pattern or if all
these need to be tagged as Unstable.
That's a lot of failures, some of which don't even make sense to me; in
run 7 the first of the 4 new failures is:
```
java.lang.NoClassDefFoundError: Could not initialize class
org.apache.james.user.ldap.ReadOnlyUsersLDAPRepositoryTest
at
org.apache.james.user.ldap.ReadOnlyUsersLDAPRepositoryInvalidDnTest.setUp(ReadOnlyUsersLDAPRepositoryInvalidDnTest.java:62)
```
and quite a few of the errors listed are similar NoClassDefFound errors,
which I fail to understand... I would very much welcome feedback if any
of you have encountered this kind of issue before, it feels like I am
missing something :(

> Having a build in the first place, even with the restrictions you
> describe, sounds like good progress to me.
>

As it stands it looks like it's going to take a bit longer :(

Re: Jenkins CI setup

Posted by "btellier@linagora.com (OpenPaaS)" <bt...@linagora.com>.
On 04/12/2020 at 03:21, Jean Helou wrote:
> Hello fellow jamers !
>
> The Jenkinsfile in the PR works, up until the test suite fails, the tests
> failures are from seemingly "unstable" tests that fail because of timing
> issues. Benoit fixed the first one in
> https://github.com/apache/james-project/pull/267 by disabling read repairs
> during consistency checks (I have no idea what it means but it sounds
> awesome :) ), I fixed the second one in
> https://github.com/apache/james-project/pull/269 where the event bus sender
> and receivers where closed out of order on shutdown sometimes leading up to
> events being sent to a closed receiver.
>
> After some cleanup, Matthieu recreated a buildable PR which lead to yet
> another unstable test in
> https://builds.apache.org/blue/organizations/jenkins/james%2FApacheJames/detail/PR-268/1/tests
We have been encountering this for a while. Thanks for digging in.
>
> I started investigating the issue and ended up roping in Matthieu since the
> symptoms for the issue left me completely puzzled. Matthieu managed to
> pinpoint the root cause to a NPE sometimes thrown from
> within org.apache.james.server.core.MimeMessageCopyOnWriteProxy which in
> turn triggered further NullPointerExceptions in the mailet pipeline error
> handling code.
> We finally confirmed a concurrency issue in the refcounting management of
> the proxy which if I understand correctly can lead to unrecoverable data
> loss. We wrote a test to trigger it [1] in an almost deterministic manner.
I'm in favor of opening a dedicated ticket and merging a disabled version
of this test in order to document the problem.
>
> Once we had a test to reproduce the race condition, we tried to fix the
> issue only to realize that it led to even more concurrency issues. The
> rather depressing conclusion we reached yesterday was that the whole
> implementation is currently unsound with regard to concurrency. I am unable
> to estimate the resolution effort at this point, Matthieu has some ideas
> and will work on it (as well as I) when time allows.
>
> Which leads me to my current interrogations: I feel that fixing such long
> standing issues in the test suite is not actually part of configuring the
> apache CI but I am unsure how to proceed.
+1
>
> Here is what I would like to do at this stage :
> - Isolate the unstable tests under with an unstable tag (akin to "feature
> tags")
I'd advocate a @Disabled tag, referencing both a JIRA ticket specific to
the bugfix needed, and the JIRA of the CI build.

Having a list of such issues in the JIRA (CI setup) ticket would be
valuable. I'd even advise doing subtickets to have a nice checklist.
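
In code that could look like this (a sketch; the class is illustrative and
JAMES-XXXX stands for whichever CI setup ticket gets created):

```
import org.junit.jupiter.api.Disabled;
import org.junit.jupiter.api.Test;

// Sketch of the suggested convention; the class and method are
// illustrative, JAMES-XXXX is a placeholder for the CI setup ticket.
class MimeMessageCopyOnWriteProxyRaceTest {
    @Disabled("JAMES-3225 refcounting race; see also CI setup ticket JAMES-XXXX")
    @Test
    void concurrentDisposeShouldNotLoseTheMessage() {
        // the reproduction from the branch would be merged here, disabled,
        // so that the problem stays documented in the codebase
    }
}
```
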
> - exclude these tests from the default surefire execution profile,
> - add a parallel pipeline step for these tests where the step failure
> doesn't fail the pipeline [2]
> - ensure that the build is green
> - merge so the project finally has a working public CI
>
> I intend to start working on this quickly so we can all enjoy a functional
> public CI.
+1 I agree on the approach.
>
> Alternatives:
> - Merge the jenkinsfile after the whole pipeline has been tested in the PR
> branch, which may not happen in a short-medium term...
> - Merging as is, means that many builds on PRs will end up failing and the
> last steps (snapshot publish) might fail even if the testsuite succeeds
> since it never ran.
> - Something I haven't thought of ?
>
> Another issue I want to raise is the availability of the CI builds. As you
> have seen from my experiments, the CI triggers configuration will only
> build commits from :
> - all branches of the main repository
> - all PRs opened from the main repository
> - all PRs opened by someone with write access to the main repository
>
> Which means that PRs for external contributors will not be built at all.
>
> I tried adding the  issueCommentTrigger to the jenkins file but neither my
> comments nor those of someone with commit access were able to trigger the
> build.
>
> I think that one of the project members should revise the current settings
> to make it possible to build external contributors PR one way or another.
> (only project members have access or can have access to the jenkins project
> configuration).
> Here are two options:
> - the easiest and quickest modification is to let the CI build all and
> every PR, there are relatively few PRs on james so the burden on the CI
> platform shouldn't be too bad.
> - alternatively it may be possible to configure jenkins to require a
> comment for someone with write access to trigger a build. unfortunately I
> am not certain how to set this up, maybe INFRA can help.
Having a build in the first place, even with the restrictions you
describe, sounds like good progress to me.

I agree we need to see what other Apache projects are doing, and if
needed ask INFRA.
>
> I know this was a long piece, I look forward to reading your opinions !
Thanks for your involvement on this topic.

Benoit
> Jean
>
> [1] see
> https://github.com/jeantil/james-project/tree/james-3225-concurrency-bug-mimemessagecow
> [2] see
> https://stackoverflow.com/questions/44022775/jenkins-ignore-failure-in-pipeline-build-step
>
> On Thu, Nov 26, 2020 at 11:22 AM Jean Helou <je...@gmail.com> wrote:
>
>> The good news is that docker does indeed work; the bad news is that the
>> tests fail with an issue that's too involved for me :/
>>
>> [INFO]
>> [INFO] Results:
>> [INFO]
>> [ERROR] Failures:
>> [ERROR]   CassandraMailboxManagerConsistencyTest$FailuresOnDeletion$DeleteOnce.deleteMailboxByPathShouldBeConsistentWhenMailboxPathDaoFails:433 Multiple Failures (1 failure)
>> 	
>> Expecting:
>>   <[]>
>> to contain exactly (and in same order):
>>   <[#private:user:INBOX]>
>> but could not find the following elements:
>>   <[#private:user:INBOX]>
>>
>> at CassandraMailboxManagerConsistencyTest$FailuresOnDeletion$DeleteOnce.lambda$deleteMailboxByPathShouldBeConsistentWhenMailboxPathDaoFails$8(CassandraMailboxManagerConsistencyTest$FailuresOnDeletion$DeleteOnce.java:440)
>>
>> so unless the build for
>>
>> * 6fab99364a - JAMES-3448 Rewrite links to http://james.apache.org/server/3/ (Mon Nov 23 15:10:36 2020 +0700) <Benoit Tellier> N
>>
>> is broken, which sounds unlikely, I'm going to need help
>>
>> jean
>>
>> On Thu, Nov 26, 2020 at 10:53 AM Jean Helou <je...@gmail.com> wrote:
>>
>>> on a loosely related note: the test suite logs are scary to look at,
>>> piles upon piles of stack traces and error logs, but the tests
>>> actually pass...
>>>
>>> On Thu, Nov 26, 2020 at 10:50 AM Jean Helou <je...@gmail.com> wrote:
>>>
>>>> Thanks Benoit,
>>>>
>>>> Matthieu pointed me to numerous apache projects with jenkinsfiles which
>>>> mention docker in
>>>> https://github.com/search?q=org%3Aapache++filename%3AJenkinsfile+docker&type=Code
>>>> so I'm trying out things based on that
>>>>
>>>> the logs seem promising so far :
>>>> ```
>>>>
>>>> [INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.697 s - in org.apache.james.backends.rabbitmq.RabbitMQConnectionFactoryTest
>>>>         ℹ︎ Checking the system...
>>>>         ✔ Docker version should be at least 1.6.0
>>>>         ✔ Docker environment should have more than 2GB free disk space
>>>> [INFO] Running org.apache.james.backends.rabbitmq.RabbitMQTest
>>>> ```
>>>>
>>>>
>>>> On Thu, Nov 26, 2020 at 10:40 AM Tellier Benoit <bt...@apache.org>
>>>> wrote:
>>>>
>>>>> Done
>>>>>
>>>>> On 26/11/2020 at 16:25, Jean Helou wrote:
>>>>>> hi all,
>>>>>>
>>>>>> As you know I started a PR to set up the jenkins CI; the latest
>>>>>> iteration sees the compilation of the project complete in 5 minutes
>>>>>> (thanks to T1C) but the tests fail to initialize docker containers,
>>>>>> with the disastrous consequences you can imagine :D
>>>>>>
>>>>>> I opened https://issues.apache.org/jira/browse/INFRA-21144 to ask if
>>>>>> it is possible to have the docker service enabled on some nodes. Since
>>>>>> I am not an official member of the project, I think it may be useful
>>>>>> if you chimed in on the ticket to confirm that this is a legitimate
>>>>>> request.
>>>>>>
>>>>>> Best regards,
>>>>>> Jean
>>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: server-dev-unsubscribe@james.apache.org
>>>>> For additional commands, e-mail: server-dev-help@james.apache.org
>>>>>
>>>>>
