Posted to dev@maven.apache.org by Arnaud Héritier <ah...@gmail.com> on 2021/08/23 15:52:23 UTC

Re: [JENKINS] - New Maven Controller for the project

To close this thread: the disconnections of the Linux agents were a side
effect of the kernel settings, which used too long a TCP timeout.
Gavin applied the settings recommended in this doc (
https://support.cloudbees.com/hc/en-us/articles/115001369667-Dedicated-SSH-agent-gets-disconnected
) and that solved the issue:

sysctl -w net.ipv4.tcp_keepalive_time=120
sysctl -w net.ipv4.tcp_keepalive_intvl=30
sysctl -w net.ipv4.tcp_keepalive_probes=8
sysctl -w net.ipv4.tcp_fin_timeout=30
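
As a rough sketch of what these values change (an illustration, not taken
from the CloudBees doc): for a socket with keepalive enabled, Linux declares
an idle, unreachable peer dead after about tcp_keepalive_time +
tcp_keepalive_intvl * tcp_keepalive_probes seconds. The Python snippet below
compares the values above with the usual kernel defaults (7200/75/9, assumed
here for comparison; the previous values on the agents are not stated in
this thread). The function name is only illustrative.

```python
def keepalive_detection_seconds(time_s: int, intvl_s: int, probes: int) -> int:
    """Approximate seconds before an idle, unreachable TCP peer is declared dead.

    time_s  -> net.ipv4.tcp_keepalive_time   (idle time before the first probe)
    intvl_s -> net.ipv4.tcp_keepalive_intvl  (delay between unanswered probes)
    probes  -> net.ipv4.tcp_keepalive_probes (unanswered probes before giving up)
    """
    return time_s + intvl_s * probes


# Values from the sysctl commands in this message.
tuned = keepalive_detection_seconds(120, 30, 8)

# Usual Linux defaults; an assumption, used for comparison only.
default = keepalive_detection_seconds(7200, 75, 9)

print(f"tuned:   {tuned} s")    # 120 + 30 * 8
print(f"default: {default} s")  # 7200 + 75 * 9
```

With the new values an idle connection gets its first keepalive probe after
2 minutes instead of 2 hours, so it no longer sits silent for long periods.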

Thanks a lot Gavin for your help

On Tue, Jul 27, 2021 at 10:48 AM Arnaud Héritier <ah...@gmail.com>
wrote:

> 👍 thanks
> As discussed on Slack, I will open a support case on the CloudBees side to
> study the instability of the Linux agents.
> I will verify that there is nothing wrong in the new setup (but personally I
> found nothing that could cause such an issue).
> The major change when we compare the ci-builds and ci-maven environments is
> that our agents are now running on Azure, and sadly I have heard about
> similar issues in other communities using Azure, like Jenkins.
> I will check whether we can find a solution or give Gavin enough details to
> open a case on the Azure side too.
> If we really don't find any solution with Azure, we'll see with Gavin about
> deploying our agents somewhere else (but let's give Azure a chance
> first).
>
> Cheers
>
>
>
> On Tue, Jul 27, 2021 at 10:37 AM Gavin McDonald <gm...@apache.org>
> wrote:
>
>> On Tue, Jul 27, 2021 at 10:18 AM Arnaud Héritier <ah...@gmail.com>
>> wrote:
>>
>> > Gavin, these JDKs are only for build agents, right ?
>> > Tibor was asking for the JVM used to host Tomcat/Jenkins.
>> > (And I suppose that the controller part is templatised)
>> >
>>
>> Oh right, sorry, yes the client controllers use a system openjdk 8
>>
>>
>> https://github.com/apache/infrastructure-p6/blob/production/modules/jenkins_client_master/manifests/init.pp
>>
>>
>> > On Mon, Jul 26, 2021 at 1:06 PM Gavin McDonald <gm...@apache.org>
>> > wrote:
>> >
>> >> There are MANY JDKs installed already: Oracle JDKs, OpenJDKs,
>> >> AdoptOpenJDKs - they are all already there.
>> >>
>> >>
>> https://cwiki.apache.org/confluence/display/INFRA/JDK+Installation+Matrix
>> >>
>> >> On Mon, Jul 26, 2021 at 11:47 AM Arnaud Héritier <ah...@gmail.com>
>> >> wrote:
>> >>
>> >> > It has to be discussed with infra.
>> >> > I am not sure which distro is used.
>> >> > It's a private build of OpenJDK 8 (today CloudBees CI doesn't support
>> >> > anything other than Java 8 - no comment).
>> >> > For now (with the data I reviewed) I don't have the feeling that it's a
>> >> > memory / GC issue (which could cause disconnections under high load).
>> >> > Here the stability issue occurs even when the controller does nothing
>> >> > (the controller cannot ping the agent, or vice versa), and it seems to
>> >> > affect the Linux agents more than the Windows ones (it's a pity).
>> >> >
>> >> >
>> >> >
>> >> > On Fri, Jul 23, 2021 at 12:06 AM Tibor Digana <
>> tibordigana@apache.org>
>> >> > wrote:
>> >> >
>> >> >> Can you install AdoptOpenJDK for the Jenkins controller?
>> >> >> It contains the Eclipse OpenJ9 garbage collector, which significantly
>> >> >> decreases the memory consumption of the application because the
>> >> >> metaspace goes to disk.
>> >> >> You should save 40 - 75% out of 3 GB.
>> >> >> I used G1, Shenandoah, ZGC and Eclipse OpenJ9; OpenJ9 saved the most
>> >> >> memory.
>> >> >>
>> >> >> On Thu, Jul 22, 2021 at 9:23 AM Arnaud Héritier <
>> aheritier@gmail.com>
>> >> >> wrote:
>> >> >>
>> >> >> > Yes, for the controller it depends on its size (number of jobs and
>> >> >> > types of jobs), but here we seem to be fine with our 3 GB.
>> >> >> >
>> >> >> > * Java
>> >> >> > - Version: 1.8.0_292
>> >> >> > - Maximum memory: 3.00 GB (3221225472)
>> >> >> > - Allocated memory: 3.00 GB (3221225472)
>> >> >> > - Free memory: 750.15 MB (786591664)
>> >> >> > - In-use memory: 2.27 GB (2434633808)
>> >> >> > - GC strategy: G1
>> >> >> > - Available CPUs: 2
>> >> >> >
>> >> >> > For agents I reduced the memory allocated to the agent process, but it
>> >> >> > doesn't seem to help much (even if it is still a good thing to do).
>> >> >> >
>> >> >> > What is strange is that I sometimes see our agents disconnected even
>> >> >> > when we have no activity on the Jenkins controller.
>> >> >> >
>> >> >> > Sadly Jenkins is deployed on Apache Tomcat, thus I cannot get access to
>> >> >> > its logs.
>> >> >> >
>> >> >> > In general the lost connection is detected by what we call the
>> >> >> > PingThread (
>> >> >> > https://www.jenkins.io/doc/book/system-administration/monitoring/#ping-thread
>> >> >> > ), but not only by it.
>> >> >> >
>> >> >> > https://ci-maven.apache.org/log/all
>> >> >> >
>> >> >> > For example, a few minutes ago we got 3 agents disconnected while
>> >> >> > nothing was running:
>> >> >> >
>> >> >> > 2021-07-22 06:58:21.769+0000 [id=106291] INFO hudson.slaves.ChannelPinger$1#onDead: Ping failed. Terminating the channel maven4.
>> >> >> > java.util.concurrent.TimeoutException: Ping started at 1626936861769 hasn't completed by 1626937101769
>> >> >> >     at hudson.remoting.PingThread.ping(PingThread.java:134)
>> >> >> >     at hudson.remoting.PingThread.run(PingThread.java:90)
>> >> >> > 2021-07-22 06:58:21.778+0000 [id=106292] INFO hudson.slaves.ChannelPinger$1#onDead: Ping failed. Terminating the channel maven3.
>> >> >> > java.util.concurrent.TimeoutException: Ping started at 1626936861777 hasn't completed by 1626937101778
>> >> >> >     at hudson.remoting.PingThread.ping(PingThread.java:134)
>> >> >> >     at hudson.remoting.PingThread.run(PingThread.java:90)
>> >> >> > 2021-07-22 06:58:21.983+0000 [id=106295] INFO hudson.slaves.ChannelPinger$1#onDead: Ping failed. Terminating the channel maven5.
>> >> >> > java.util.concurrent.TimeoutException: Ping started at 1626936861982 hasn't completed by 1626937101983
>> >> >> >     at hudson.remoting.PingThread.ping(PingThread.java:134)
>> >> >> >     at hudson.remoting.PingThread.run(PingThread.java:90)
>> >> >> >
>> >> >> > @Gavin McDonald <gm...@apache.org> In terms of network, is the
>> >> >> > environment we use today the same as the ci-builds.apache.org
>> >> >> > environment?
>> >> >> >
>> >> >> >
>> >> >> > On Wed, Jul 21, 2021 at 11:48 PM Tibor Digana <
>> >> tibordigana@apache.org>
>> >> >> > wrote:
>> >> >> >
>> >> >> > > In my company, I also used 1 GB for the Xmx (Java heap) of the
>> >> >> > > Jenkins JVM, and it was enough.
>> >> >> > > Subprocesses like Maven need much more memory for themselves than
>> >> >> > > the Jenkins JVM does.
>> >> >> > > T
>> >> >> > >
>> >> >> > > On Wed, Jul 21, 2021 at 6:38 PM Arnaud Héritier <
>> >> aheritier@gmail.com>
>> >> >> > > wrote:
>> >> >> > >
>> >> >> > > > I am looking at our builds and trying to understand why our agents
>> >> >> > > > are often disconnected during the builds.
>> >> >> > > > In general we have a stack trace like:
>> >> >> > > >
>> >> >> > > > maven6 was marked offline: Connection was broken: java.io.IOException:
>> >> >> > > > Pipe closed after 0 cycles
>> >> >> > > >         at org.apache.sshd.common.channel.ChannelPipedInputStream.read(ChannelPipedInputStream.java:118)
>> >> >> > > >         at org.apache.sshd.common.channel.ChannelPipedInputStream.read(ChannelPipedInputStream.java:101)
>> >> >> > > >         at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:92)
>> >> >> > > >         at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:73)
>> >> >> > > >         at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:103)
>> >> >> > > >         at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
>> >> >> > > >         at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
>> >> >> > > >         at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
>> >> >> > > >
>> >> >> > > >
>> >> >> > > >
>> >> >> > > > As far as I can see, we are using 16 GB "hosts" for the Linux agents.
>> >> >> > > >
>> >> >> > > > Something very strange is that the Jenkins agent (the small component
>> >> >> > > > linking the build host and the controller) is configured with
>> >> >> > > > `-Xms8g -Xmx8g`, thus we are reserving 50% of the server memory for
>> >> >> > > > it (even more because of the non-heap memory).
>> >> >> > > > In general it should require far less; 1 GB is already a lot in my
>> >> >> > > > experience.
>> >> >> > > > Because of this, the OS can see it as the biggest process on the host
>> >> >> > > > and decide to kill it when the rest of the memory is used by the
>> >> >> > > > build.
>> >> >> > > > I think we should decrease this value.
>> >> >> > > > (I can do it, but I don't know how the ci.apache.org agents were
>> >> >> > > > configured, and I would like not to add more issues if this setting
>> >> >> > > > was already there in the past.)
>> >> >> > > >
>> >> >> > > > I don't think it is the root cause of our instabilities (at least not
>> >> >> > > > of all of them), and there is something else I have to find, but it's
>> >> >> > > > a cheap fix to try.
>> >> >> > > >
>> >> >> > > > FYI our agent VMs look roughly like this today:
>> >> >> > > >
>> >> >> > > > - Java
>> >> >> > > > + Home: `/usr/local/asfpackages/java/oraclejdk-1.8.0-291/jre`
>> >> >> > > > + Vendor: Oracle Corporation
>> >> >> > > > + Version: 1.8.0_291
>> >> >> > > > + Maximum memory: 7.67 GB (8232370176)
>> >> >> > > > + Allocated memory: 7.67 GB (8232370176)
>> >> >> > > > + Free memory: 6.03 GB (6470953760)
>> >> >> > > > + In-use memory: 1.64 GB (1761416416)
>> >> >> > > > + GC strategy: ParallelGC
>> >> >> > > > + Available CPUs: 4
>> >> >> > > >
>> >> >> > > > 8 GB is reserved but only about 1.6 GB is in use (the GC does nothing
>> >> >> > > > because the free memory is high).
>> >> >> > > >
>> >> >> > > > I would be in favor of trying to launch them with -Xms128m
>> >> >> > > > -Xmx1g -XX:+UseG1GC -XX:+UseStringDeduplication
>> >> >> > > >
>> >> >> > > > I think that's enough customization to start with.
>> >> >> > > >
>> >> >> > > > Cheers
>> >> >> > > >
>> >> >> > > > On Wed, Jul 21, 2021 at 1:28 PM Arnaud Héritier <
>> >> >> aheritier@gmail.com>
>> >> >> > > > wrote:
>> >> >> > > >
>> >> >> > > > > I am not sure about the setup.
>> >> >> > > > > AFAICS we don't use any JDK installer (
>> >> >> > > > > https://ci-maven.apache.org/configureTools/ ), thus I suppose that
>> >> >> > > > > the different JDKs are supposed to be installed directly on the
>> >> >> > > > > agent?
>> >> >> > > > > I am not sure how it was done in the previous environment.
>> >> >> > > > >
>> >> >> > > > > On Sun, Jul 18, 2021 at 5:30 PM Tibor Digana <
>> >> >> tibordigana@apache.org
>> >> >> > >
>> >> >> > > > > wrote:
>> >> >> > > > >
>> >> >> > > > >> The new CI system has the following issue:
>> >> >> > > > >>
>> >> >> > > > >> /home/jenkins/tools/java/latest1.7/bin/java: not found
>> >> >> > > > >>
>> >> >> > > > >> https://ci-maven.apache.org/job/Maven/job/maven-box/job/maven-surefire/job/master/104/execution/node/183/log/
>> >> >> > > > >>
>> >> >> > > > >>
>> >> >> > > > >>
>> >> >> > > > >> On Wed, Jun 30, 2021 at 8:03 PM Gavin McDonald <
>> >> >> > gmcdonald@apache.org>
>> >> >> > > > >> wrote:
>> >> >> > > > >>
>> >> >> > > > >> > Hi Maven folks.
>> >> >> > > > >> >
>> >> >> > > > >> > Infra has decided to separate off the Maven build jobs from
>> >> >> > > > >> > ci-builds.apache.org over to its very own Jenkins Controller and
>> >> >> > > > >> > Agents.
>> >> >> > > > >> >
>> >> >> > > > >> > This means that Maven now has a dedicated Jenkins environment for
>> >> >> > > > >> > itself. It also means that no other project's build jobs can build
>> >> >> > > > >> > on the Maven nodes, and Maven jobs will no longer be able to build
>> >> >> > > > >> > on the ci-builds nodes.
>> >> >> > > > >> >
>> >> >> > > > >> > Your new Controller is set up as https://ci-maven.apache.org and
>> >> >> > > > >> > all Maven Committers can log in via LDAP and create jobs.
>> >> >> > > > >> >
>> >> >> > > > >> > At the time of writing, there is one node/agent attached, but I am
>> >> >> > > > >> > building 4 more - all Ubuntu 20.04 and based in our Azure account.
>> >> >> > > > >> >
>> >> >> > > > >> > We can automagically move all your jobs over from ci-builds to
>> >> >> > > > >> > ci-maven - I just need someone to tell me to go ahead and do it.
>> >> >> > > > >> >
>> >> >> > > > >> > In the meantime, feel free to have a test. The remaining 4 agents
>> >> >> > > > >> > will be online by tomorrow. We will review after a month whether 5
>> >> >> > > > >> > nodes is enough.
>> >> >> > > > >> >
>> >> >> > > > >> > Other projects with their own dedicated controller have taken
>> >> >> > > > >> > advantage of this isolation by having some nodes/agents given to
>> >> >> > > > >> > the project as a 'targeted donation', so someone here may know of
>> >> >> > > > >> > a company willing to donate 5 - 10 or more nodes specifically for
>> >> >> > > > >> > the Maven Jenkins environment. Infra can afford to hand you over 5
>> >> >> > > > >> > right now.
>> >> >> > > > >> >
>> >> >> > > > >> > Let me know if you have any questions, otherwise let me know when
>> >> >> > > > >> > I can make the transfer of your jobs.
>> >> >> > > > >> >
>> >> >> > > > >> > Thanks
>> >> >> > > > >> >
>> >> >> > > > >> > --
>> >> >> > > > >> >
>> >> >> > > > >> > *Gavin McDonald*
>> >> >> > > > >> > Systems Administrator
>> >> >> > > > >> > ASF Infrastructure Team
>> >> >> > > > >> >
>> >> >> > > > >>
>> >> >> > > > >
>> >> >> > > > >
>> >> >> > > > > --
>> >> >> > > > > Arnaud Héritier
>> >> >> > > > > Twitter/Skype : aheritier
>> >> >> > > > >
>> >> >> > > >
>> >> >> > > >
>> >> >> > > > --
>> >> >> > > > Arnaud Héritier
>> >> >> > > > Twitter/Skype : aheritier
>> >> >> > > >
>> >> >> > >
>> >> >> >
>> >> >> >
>> >> >> > --
>> >> >> > Arnaud Héritier
>> >> >> > Twitter/Skype : aheritier
>> >> >> >
>> >> >>
>> >> >
>> >> >
>> >> > --
>> >> > Arnaud Héritier
>> >> > Twitter/Skype : aheritier
>> >> >
>> >>
>> >>
>> >> --
>> >>
>> >> *Gavin McDonald*
>> >> Systems Administrator
>> >> ASF Infrastructure Team
>> >>
>> >
>> >
>> > --
>> > Arnaud Héritier
>> > Twitter/Skype : aheritier
>> >
>>
>>
>> --
>>
>> *Gavin McDonald*
>> Systems Administrator
>> ASF Infrastructure Team
>>
>
>
> --
> Arnaud Héritier
> Twitter/Skype : aheritier
>


-- 
Arnaud Héritier
Twitter/GitHub/... : aheritier