You are viewing a plain text version of this content. The canonical link for it is here.
Posted to users@maven.apache.org by Tibor Digana <ti...@apache.org> on 2019/04/06 21:25:46 UTC

Re: Failsafe: Killing self fork JVM. PING timeout elapsed.

Hi Jason,

I am sorry I haven't reply.
Excellent analysis!

Can you make an experiment and use pingJob on a "fixed delay" instead of a
"fixed rate"?
If you want to make more, feel free.

We should mabe think about how we would add more intelligence in these
timers in order to detect the "blackout".

Cheers
Tibor




On Fri, Mar 22, 2019 at 9:53 PM Jason Young <ja...@procentive.com>
wrote:

> Notice:
> * an over 3-minute gap between the first two entries above--neither the
> command reader thread nor the pingJob had a chance to run.
> * the command reader thread and the ping job executor evidently get a
> chance to run from 19:18:02 to 19:19:19. They ran with increased frequency
> during that time to "catch up": The command reader thread is programmed to
> read from stdin until there's nothing left to read, and the pingScheduler
> runs the pingJob at a "fixed" rate.
> * Another 3-minute gap and another attempt to catch up at 19:22:28. This
> time though, the command reader thread evidently did not get a turn in
> time.
>
> I did not bother to log what threads the ping job was running under; I can
> do that later. I also did not add any logging on the Maven process side; I
> can do that as well.
>
> This is on a VM with unthrottled access to 4 processor cores and 8GB RAM. I
> think the code under test is CPU- and RAM-intensive. There probably were GC
> pauses, but I don't have proper logging for that in place.
>
> Note that this is all very normal behavior: Both the command reader and the
> ping job's thread (run by the pingScheduler) are instructed to run until
> their work is done, not to sleep or yield. For high throughput, an OS will
> favor continuing to run a thread for a while once it is started over
> switching tasks, so it's normal both that each thread would run for a while
> after a pause, and (by extension) that each would have long pauses as other
> threads had to run, too. Because of the long gaps between runs of the
> command reader thread, my previous advice (schedule the pingJob on a "fixed
> delay" instead of a "fixed rate") would not have saved the build in this
> case (though "fixed delay" is more correct and would save the build in
> other possible cases). Our code base has some unit tests that rely on
> "normal case" timing, but they now have the @Ignore attribute because they
> would rarely, but too often, fail. For most teams and projects, unit tests
> have to be more reliable than that.
>
> So I think we need a solution that does not rely on preemptive scheduling
> from the OS or on thread scheduling, but which ensures that commnads have
> been read before running code depending on commands. There are multiple
> ways to do that; maybe I can find time to write a solution later.
>
>
> On Fri, Mar 22, 2019 at 3:46 PM Jason Young <ja...@procentive.com>
> wrote:
>
> > Thanks Tibor. I'll use `encodeAndWriteToOutput` later. For now, I hacked
> > together some temporary changes to ForkedBooter so that it "logs" (just
> > `System.out.println`) the current time every time:
> >
> >    - the command reader thread reads a NOOP via stdin,
> >    - it starts the ping job,
> >    - it checks for NOOP (`canUseNewPingMechanism(..)` returns `false`),
> >    and
> >    - it would have exited because there is no NOOP (`!pingDone`).
> >
> > I also commented out the code to exit in that case so it keeps running
> and
> > logging.
> >
> > The resulting output starts out showing that it's receiving ping every
> 10s
> > and checking for NOOP every 30s, as expected. But then one large test
> > begins, and:
> >
> > DEBUG 2019/03/22 19:14:40.223 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:02.723 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:03.139 GMT ForkedBooter: Running ping job (i.e.
> > checking if we read a ping earlier).
> > DEBUG 2019/03/22 19:18:03.172 GMT ForkedBooter: Checking for NOOP from
> > Maven process.
> > DEBUG 2019/03/22 19:18:03.186 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Running ping job (i.e.
> > checking if we read a ping earlier).
> > DEBUG 2019/03/22 19:18:03.188 GMT ForkedBooter: Checking for NOOP from
> > Maven process.
> > DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:03.243 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:03.307 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:03.308 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:03.202 GMT ForkedBooter: Running ping job (i.e.
> > checking if we read a ping earlier).
> > DEBUG 2019/03/22 19:18:03.323 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:03.337 GMT ForkedBooter: Checking for NOOP from
> > Maven process.
> > DEBUG 2019/03/22 19:18:03.351 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:03.351 GMT ForkedBooter: Running ping job (i.e.
> > checking if we read a ping earlier).
> > DEBUG 2019/03/22 19:18:03.351 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:03.352 GMT ForkedBooter: Checking for NOOP from
> > Maven process.
> > DEBUG 2019/03/22 19:18:03.366 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:03.393 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:03.380 GMT ForkedBooter: Running ping job (i.e.
> > checking if we read a ping earlier).
> > DEBUG 2019/03/22 19:18:03.409 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:03.424 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:03.423 GMT ForkedBooter: Checking for NOOP from
> > Maven process.
> > DEBUG 2019/03/22 19:18:03.424 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:03.438 GMT ForkedBooter: Running ping job (i.e.
> > checking if we read a ping earlier).
> > DEBUG 2019/03/22 19:18:03.438 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:03.453 GMT ForkedBooter: Checking for NOOP from
> > Maven process.
> > DEBUG 2019/03/22 19:18:03.454 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:03.476 GMT ForkedBooter: Running ping job (i.e.
> > checking if we read a ping earlier).
> > DEBUG 2019/03/22 19:18:03.476 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:03.505 GMT ForkedBooter: Checking for NOOP from
> > Maven process.
> > DEBUG 2019/03/22 19:18:03.505 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:08.872 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:18.859 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:19.282 GMT ForkedBooter: Running ping job (i.e.
> > checking if we read a ping earlier).
> > DEBUG 2019/03/22 19:18:19.330 GMT ForkedBooter: Checking for NOOP from
> > Maven process.
> > DEBUG 2019/03/22 19:18:28.845 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:38.832 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:48.838 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:18:49.241 GMT ForkedBooter: Running ping job (i.e.
> > checking if we read a ping earlier).
> > DEBUG 2019/03/22 19:18:49.265 GMT ForkedBooter: Checking for NOOP from
> > Maven process.
> > DEBUG 2019/03/22 19:19:01.162 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:19:10.069 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:19:19.136 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:19:19.199 GMT ForkedBooter: Running ping job (i.e.
> > checking if we read a ping earlier).
> > DEBUG 2019/03/22 19:19:19.199 GMT ForkedBooter: Checking for NOOP from
> > Maven process.
> > DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Running ping job (i.e.
> > checking if we read a ping earlier).
> > DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Checking for NOOP from
> > Maven process.
> > WARN 2019/03/22 19:22:28.698 GMT ForkedBooter: Did not receive NOOP from
> > Maven process.
> > DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Running ping job (i.e.
> > checking if we read a ping earlier).
> > DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Checking for NOOP from
> > Maven process.
> > WARN 2019/03/22 19:22:28.698 GMT ForkedBooter: Did not receive NOOP from
> > Maven process.
> > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job (i.e.
> > checking if we read a ping earlier).
> > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP from
> > Maven process.
> > WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP from
> > Maven process.
> > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job (i.e.
> > checking if we read a ping earlier).
> > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP from
> > Maven process.
> > WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP from
> > Maven process.
> > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job (i.e.
> > checking if we read a ping earlier).
> > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP from
> > Maven process.
> > WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP from
> > Maven process.
> > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job (i.e.
> > checking if we read a ping earlier).
> > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP from
> > Maven process.
> > WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP from
> > Maven process.
> > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from Maven
> > process.
> > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from Maven
> > process.
> > ...
> >
> > With these changes, a project which normally takes 15 to 20 minutes to
> run
> > actually took 80m. Would printing directly to stdout from within a test
> do
> > that somehow? If not, we have a problem in our code.
> >
> > My comments on the above report to follow...
> >
> > On Fri, Mar 22, 2019 at 11:58 AM Tibor Digana <ti...@apache.org>
> > wrote:
> >
> >> >> is there an -Xmx value you would recommend to ensure the parent
> process
> >> can send the NOOP to the surefire process?
> >>
> >> no such value is generic value to recommend. Each application or test
> must
> >> find our the value to prevent from OOM.
> >>
> >>
> >> >> I'm making a custom build of surefire-booter to workaround the
> initial
> >> problem by commenting out the code to exit.
> >>
> >> Very good step forward. Then post your findings and we can continue on
> >> making a pull request on GitHub.
> >>
> >>
> >> >> Is there a logging API which I can use instead?
> >>
> >> Yes, it is very low level of logging (sending events to parent process)
> >> but
> >> this will be turned to high level abstraction in several days (branch
> >> surefire-1222_2).
> >> Yet, pls use this function:
> >>
> >> encodeAndWriteToOutput()
> >>
> >>
> >> >>
> >>
> >>
> >>
> >>
> >>
> >> On Fri, Mar 22, 2019 at 4:44 PM Jason Young <jason.young@procentive.com
> >
> >> wrote:
> >>
> >> > To clarify, this is the image ours is based on:
> >> > https://hub.docker.com/_/alpine; we are not using a vanilla Maven
> >> image as
> >> > we have to add a few other items to this image as well. I don't know
> who
> >> > maintains that or adds /etc/mavenrc; I'll bring that up in another
> >> topic in
> >> > another forum. I was just answering a previous question, and also
> >> throwing
> >> > out info for anyone struggling with e.g. OOME in their Maven process
> (a
> >> lot
> >> > of people have that problem when running the Sonar scanner, for
> >> example). I
> >> > don't think we need to worry about why that file is added for this
> >> topic.
> >> >
> >> > But while we're on the subject, is there an -Xmx value you would
> >> recommend
> >> > to ensure the parent process can send the NOOP to the surefire
> process?
> >> >
> >> > I'm making a custom build of surefire-booter to workaround the initial
> >> > problem by commenting out the code to exit. I am also adding logging
> via
> >> > stdout so I can see if ForkedBooter is indeed running the ping job in
> >> small
> >> > increments of time. Is there a logging API which I can use instead?
> >> >
> >> > On Fri, Mar 22, 2019 at 5:44 AM Tibor Digana <ti...@apache.org>
> >> > wrote:
> >> >
> >> > > The base images are developed in
> >> > https://github.com/carlossg/docker-maven,
> >> > > right?
> >> > > Who creates "/etc/mavenrc"?
> >> > >
> >> > > On Thu, Mar 21, 2019 at 12:05 AM Jason Young <
> >> jason.young@procentive.com
> >> > >
> >> > > wrote:
> >> > >
> >> > > > Mikael, sorry I do not appear to have permission to view the link.
> >> > > >
> >> > > > I did some digging in the last couple of days. I see that the
> parent
> >> > > > process reads from stdin. I could not find anywhere that we are
> >> using
> >> > > > stdin. FWIW the failures nearly always happen at least 15m into a
> >> ~20m
> >> > > test
> >> > > > run, so perf is a likely culprit.
> >> > > >
> >> > > > I see also that ForkedBooter reads commands from stdin in one
> >> thread,
> >> > and
> >> > > > uses an executor service to check for a past ping in
> >> > > > ForkedBooter.listenToShutdownCommands(..). When it checks, it also
> >> sets
> >> > > > pingDone to false. The executor is configured to run up to 2
> threads
> >> > > > concurrently to handle the workload, and is set to run at a fixed
> >> rate
> >> > > (not
> >> > > > a fixed delay). If the test suite is busy with testing and GC and
> >> has
> >> > > lots
> >> > > > of threads running, it's entirely possible that a thread won't
> have
> >> a
> >> > > > chance to run for a long time (e.g. 5s). Maybe instead of a 30s
> >> delay,
> >> > > the
> >> > > > VM gets around to checking for a ping every 35s over a long span
> of
> >> > time.
> >> > > > Because we're running at a "fixed rate" and not a "fixed delay",
> >> then
> >> > > after
> >> > > > a couple of minutes we might be a full 30s behind schedule. It's
> >> > possible
> >> > > > the executor will create another thread to run the scheduled task
> >> > because
> >> > > > it's running behind schedule. This new thread checks for a ping,
> >> finds
> >> > > it,
> >> > > > and sets pingDone to false. But then the original thread also
> runs,
> >> > say,
> >> > > 2
> >> > > > seconds afterwards, checks pingDone, and finds it is false.
> >> > > >
> >> > > > So to mitigate the problem, can we a) make the executor run only 1
> >> > thread
> >> > > > and b) schedule the task at a fixed rate? For that matter, is
> there
> >> > > another
> >> > > > scheduled executor we can reuse? I understand why checking for
> ping
> >> > > > requires a separate executor. Should I ask in github?
> >> > > >
> >> > > > Regarding a previous question, I found out that Alpine's Maven
> >> package
> >> > > > comes with an /etc/mavenrc that sets `MAVEN_OPTS="$MAVEN_OPTS
> >> > -Xmx512m"`
> >> > > > which cannot be undone by setting `MAVEN_OPTS` at the command
> line;
> >> you
> >> > > end
> >> > > > up with e.g. `-Xmx1g -Xmx512m`. (Note this applies to the Maven
> >> > (parent)
> >> > > > process, not the surefire/failsafe (child) process.)
> >> > > >
> >> > > > On Wed, Mar 20, 2019 at 3:46 AM Bernd Eckenfels <
> >> > ecki@zusammenkunft.net>
> >> > > > wrote:
> >> > > >
> >> > > > > I guess a timeout caused by FullGC can happen with TCP as well.
> >> > > > Increasing
> >> > > > > the timeout might not be nice but does look like it would help
> in
> >> > both
> >> > > > > cases. (Problems with stdout are more related to unexpected JVM
> >> > > messages
> >> > > > I
> >> > > > > guess)
> >> > > > >
> >> > > > > Gruss
> >> > > > > Bernd
> >> > > > > --
> >> > > > > http://bernd.eckenfels.net
> >> > > > >
> >> > > > > ________________________________
> >> > > > > Von: Mikael Åsberg <m....@gmail.com>
> >> > > > > Gesendet: Mittwoch, März 20, 2019 9:40 AM
> >> > > > > An: Maven Users List
> >> > > > > Betreff: Re: Failsafe: Killing self fork JVM. PING timeout
> >> elapsed.
> >> > > > >
> >> > > > > These issues regarding communication with forked JVMs, won't
> they
> >> be
> >> > > > > resolved once surefire moves to interprocess communication using
> >> > > > > tcp/ip sockets? This happens to be the target feature to be
> >> included
> >> > > > > in the next surefire 3.0.0 milestone:
> >> > > > >
> >> https://issues.apache.org/jira/projects/SUREFIRE/versions/12344668
> >> > > > >
> >> > > > > There are soooo many issues relating to surefire reading stdout
> of
> >> > > > > forked processes (which is my understanding that it is currently
> >> > > > > doing). Many of us are really looking forward to the next
> >> milestone.
> >> > > > >
> >> > > > > On Tue, Mar 19, 2019 at 8:59 PM Jason Young <
> >> > > jason.young@procentive.com>
> >> > > > > wrote:
> >> > > > > >
> >> > > > > > Getting back to my original questions, I know that "ping"
> means
> >> to
> >> > > see
> >> > > > > if a
> >> > > > > > process is there, and "NOOP" implies it's not a command to do
> >> > > anything.
> >> > > > > But
> >> > > > > > what do the terms "ping" and "NOOP" mean in this context, i.e.
> >> how
> >> > do
> >> > > > the
> >> > > > > > processes communicate? I assume they don't sonar. Do other
> >> > processes
> >> > > > also
> >> > > > > > ping NOOPs? Can I PING Chrome with a NOOP from bash? Is it
> with
> >> > TCP?
> >> > > > > >
> >> > > > > > I'm confused about what I should do regarding GC pauses.
> >> > Previously I
> >> > > > had
> >> > > > > > code that would write the amount of remaining heap space (or
> >> > > something
> >> > > > > like
> >> > > > > > that) to stdout after every test to troubleshoot OOMEs. Can
> >> writing
> >> > > to
> >> > > > > > stdout cause the communication failure somehow?
> >> > > > > >
> >> > > > > > On Wed, Mar 13, 2019 at 5:57 PM Jason Young <
> >> > > > jason.young@procentive.com>
> >> > > > > > wrote:
> >> > > > > >
> >> > > > > > > I upgraded failsafe and surefire to 3.0.0-M3 as advised; we
> >> > > > encountered
> >> > > > > > > the same exception. (Still using -Xmx5g, will switch to
> OpenJ9
> >> > soon
> >> > > > in
> >> > > > > case
> >> > > > > > > that helps.)
> >> > > > > > >
> >> > > > > > > BTW I also asked on StackOverflow previously, for anyone
> >> > > interested:
> >> > > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> https://stackoverflow.com/questions/54755846/killing-self-fork-jvm-ping-timeout-elapsed
> >> > > > > > >
> >> > > > > > > On Tue, Feb 26, 2019 at 6:40 PM Jason Young <
> >> > > > > jason.young@procentive.com>
> >> > > > > > > wrote:
> >> > > > > > >
> >> > > > > > >> Thanks again for the information.
> >> > > > > > >>
> >> > > > > > >> We had increased the RAM to 3g some time ago to prevent
> >> OOMEs.
> >> > > More
> >> > > > > > >> recently, I increased the RAM again to 5g for extra
> headroom
> >> > since
> >> > > > we
> >> > > > > had
> >> > > > > > >> more headroom available; the problem hasn't happened since,
> >> but
> >> > it
> >> > > > > hasn't
> >> > > > > > >> been very long.
> >> > > > > > >>
> >> > > > > > >> We use a more customized image based on Alpine 3.8.2. The
> JDK
> >> > and
> >> > > > > Maven
> >> > > > > > >> are obtained via apk.
> >> > > > > > >>
> >> > > > > > >> I will try upgrading failsafe (and surefire while I'm at
> it)
> >> > > sooner,
> >> > > > > and
> >> > > > > > >> probably do some experimentation with JVMs another time
> (not
> >> > > > pressing
> >> > > > > for
> >> > > > > > >> me ATM).
> >> > > > > > >>
> >> > > > > > >> On Tue, Feb 26, 2019 at 12:20 PM Tibor Digana <
> >> > > > tibordigana@apache.org
> >> > > > > >
> >> > > > > > >> wrote:
> >> > > > > > >>
> >> > > > > > >>> >> I'll try to enable some logging about GC pauses to see
> >> > what's
> >> > > up
> >> > > > > > >>>
> >> > > > > > >>> Pls do not keep such setting after tuning the GC because
> >> this
> >> > may
> >> > > > > > >>> sometime
> >> > > > > > >>> break the interprocess communication between Maven process
> >> and
> >> > > > > surefire
> >> > > > > > >>> process.
> >> > > > > > >>> It's worth to list GC information in a file and not in the
> >> > > console
> >> > > > > logs.
> >> > > > > > >>> This can be configured, I guess.
> >> > > > > > >>>
> >> > > > > > >>> >> Do you think the value is simply too low?
> >> > > > > > >>>
> >> > > > > > >>> GCing many objects may take some time and I remember we
> had
> >> a
> >> > > user
> >> > > > > who
> >> > > > > > >>> had
> >> > > > > > >>> this problem a year or two ago.
> >> > > > > > >>> We check every third NOOP (which is 3 x 10 sec) as a fix
> >> > instead
> >> > > of
> >> > > > > every
> >> > > > > > >>> NOP. So 30 seconds looked satisfactory.
> >> > > > > > >>> I think you use old version 2.20 or something like that.
> The
> >> > > fixes
> >> > > > > for
> >> > > > > > >>> docker have been done so far, so please use the latest
> >> version
> >> > > > > 3.0.0-M3.
> >> > > > > > >>> See this page
> >> > > > > > >>>
> >> > > >
> https://maven.apache.org/surefire/maven-surefire-plugin/docker.html
> >> ,
> >> > > > > we
> >> > > > > > >>> used maven:3.5.3-jdk-8-alpine in this test. Which base
> image
> >> > did
> >> > > > you
> >> > > > > use?
> >> > > > > > >>>
> >> > > > > > >>> Cheers
> >> > > > > > >>> Tibor
> >> > > > > > >>>
> >> > > > > > >>> On Tue, Feb 26, 2019 at 5:24 PM Jason Young <
> >> > > > > jason.young@procentive.com>
> >> > > > > > >>> wrote:
> >> > > > > > >>>
> >> > > > > > >>> > Thanks for the information. It's good to see someone
> >> > > understands
> >> > > > a
> >> > > > > > >>> little
> >> > > > > > >>> > about this.
> >> > > > > > >>> >
> >> > > > > > >>> > Incidentally, we have been looking at other GCs and VMs
> >> for
> >> > the
> >> > > > > > >>> application
> >> > > > > > >>> > in production environments, so I'll look into how these
> >> > affect
> >> > > > > tests as
> >> > > > > > >>> > well. I'll try to enable some logging about GC pauses to
> >> see
> >> > > > > what's up.
> >> > > > > > >>> >
> >> > > > > > >>> > How would `-Xmx3g` cause long GC cycles? Do you think
> the
> >> > value
> >> > > > is
> >> > > > > > >>> simply
> >> > > > > > >>> > too low?
> >> > > > > > >>> >
> >> > > > > > >>> > FWIW we're running the Maven build in an Alpine-based
> >> Docker
> >> > > > > container.
> >> > > > > > >>> >
> >> > > > > > >>> > On Sat, Feb 23, 2019 at 6:36 AM Tibor Digana <
> >> > > > > tibordigana@apache.org>
> >> > > > > > >>> > wrote:
> >> > > > > > >>> >
> >> > > > > > >>> > > Hi Jason,
> >> > > > > > >>> > >
> >> > > > > > >>> > > We spoke about this issue on our chat in ASF Slack:
> >> > > > > > >>> > > "I think his tests have been paused for a long GC
> >> periods
> >> > and
> >> > > > > timed
> >> > > > > > >>> out
> >> > > > > > >>> > 3x
> >> > > > > > >>> > > PING period = 30 seconds. After this period forked JVM
> >> > > supposed
> >> > > > > the
> >> > > > > > >>> Maven
> >> > > > > > >>> > > process was killed by JenkinsCI and therefore all
> >> surefire
> >> > > > > processes
> >> > > > > > >>> are
> >> > > > > > >>> > > killed as well and all the file handlers and memory
> >> > > > consumptions
> >> > > > > are
> >> > > > > > >>> > > freed."
> >> > > > > > >>> > >
> >> > > > > > >>> > > "But I have to say that `-Xmx3g` may cause long GC
> >> cycles,
> >> > > see
> >> > > > > > >>> > >
> >> > > > > > >>> > >
> >> > > > > > >>> >
> >> > > > > > >>>
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> https://maven.apache.org/surefire/maven-surefire-plugin/examples/shutdown.html
> >> > > > > > >>> > > "
> >> > > > > > >>> > >
> >> > > > > > >>> > > You are using java-1.8-openjdk. I guess you should use
> >> > > > > Shenandoah GC
> >> > > > > > >>> > which
> >> > > > > > >>> > > is an experimental algorithm in JVM 1.8. This would
> >> > > > significantly
> >> > > > > > >>> short
> >> > > > > > >>> > > the GC cycles.
> >> > > > > > >>> > >
> >> > > > > > >>> > > We should of cource provide a new configuration
> >> parameter
> >> > to
> >> > > > give
> >> > > > > > >>> you a
> >> > > > > > >>> > > chance to prolong the PING.
> >> > > > > > >>> > >
> >> > > > > > >>> > > Cheers
> >> > > > > > >>> > > Tibor
> >> > > > > > >>> > >
> >> > > > > > >>> >
> >> > > > > > >>> >
> >> > > > > > >>> > --
> >> > > > > > >>> >
> >> > > > > > >>> > Jason Young
> >> > > > > > >>> >
> >> > > > > > >>>
> >> > > > > > >>
> >> > > > > > >>
> >> > > > > >
> >> > > > > > --
> >> > > > > > Jason Young
> >> > > > > > Software Engineer | PROCENTIVE
> >> > > > > > [image: Phone] 715 245 8000 x7609
> >> > > > > > [image: Mobile] 706 870 3540
> >> > > > > > [image: Web] procentive.com
> >> > > > > > Confidentiality Notice: This message is intended for the sole
> >> use
> >> > of
> >> > > > the
> >> > > > > > individual and entity to which it is addressed, and may
> contain
> >> > > > > information
> >> > > > > > that is privileged, confidential and exempt from disclosure
> >> under
> >> > > > > > applicable law. Any unauthorized review, use, disclosure or
> >> > > > distribution
> >> > > > > of
> >> > > > > > this email message, including any attachment, is prohibited.
> If
> >> you
> >> > > are
> >> > > > > not
> >> > > > > > the intended recipient, please advise the sender by reply
> email
> >> and
> >> > > > > destroy
> >> > > > > > all copies of the original message.
> >> > > > >
> >> > > > >
> >> ---------------------------------------------------------------------
> >> > > > > To unsubscribe, e-mail: users-unsubscribe@maven.apache.org
> >> > > > > For additional commands, e-mail: users-help@maven.apache.org
> >> > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
>

Re: Failsafe: Killing self fork JVM. PING timeout elapsed.

Posted by Tibor Digana <ti...@apache.org>.
Hi Jason,

Thanks for investigating and pointing to *pcprop*. I will continue on it.

If I made an improvement in a branch, would you be able to install it in
your repository and run your project with Surefire 3.0.0-SNAPSHOT?

Cheers
Tibor

On Wed, Apr 10, 2019 at 3:18 AM Jason Young <ja...@procentive.com>
wrote:

> Thanks.
>
> I finally took a little time to look at why the `ps` command would not
> work. For our project, we cannot use a default Maven image, so I made a
> custom one based on alpine. One piece of software the Maven alpine image
> has is `pcprops`; adding that fixes the problem:
>
> $ docker run -it alpine sh
> > / # apk update
> > ...
> > OK: 8445 distinct packages available
> > / # apk upgrade
> > ...
> > OK: 4 MiB in 13 packages
> > / # ps -o etime= -p 1
> > ps: unrecognized option: p
> > BusyBox v1.26.2 (2018-05-30 13:51:20 GMT) multi-call binary.
> > Usage: ps [-o COL1,COL2=HEADER]
> > Show list of processes
> > -o COL1,COL2=HEADER Select columns for display
> > / # apk add procps
> > ...
> > OK: 12 MiB in 19 packages
> > / # ps -o etime= -p 1
> >       05:21
>
>
> Hope that helps anyone else using a custom-built alpine-based Maven image.
>
> I'm not sure if I'll have time to experiment much with Surefire changes in
> the near future; my focus at the moment is figuring out all the reasons our
> test suite sometimes takes 5 or more times as long as normal (often also
> causes Surefire to kill the child process, but even if it didn't, we have a
> problem!).
>
> On Sun, Apr 7, 2019 at 2:04 AM Tibor Digana <ti...@apache.org>
> wrote:
>
> > Hi Jason,
> >
> > I am sorry I haven't reply.
> > Excellent analysis!
> >
> > Can you make an experiment and use pingJob on a "fixed delay" instead of
> a
> > "fixed rate"?
> > If you want to make more, feel free.
> >
> > We should mabe think about how we would add more intelligence in these
> > timers in order to detect the "blackout".
> >
> > Cheers
> > Tibor
> >
> >
> >
> >
> > On Fri, Mar 22, 2019 at 9:53 PM Jason Young <ja...@procentive.com>
> > wrote:
> >
> > > Notice:
> > > * an over 3-minute gap between the first two entries above--neither the
> > > command reader thread nor the pingJob had a chance to run.
> > > * the command reader thread and the ping job executor evidently get a
> > > chance to run from 19:18:02 to 19:19:19. They ran with increased
> > frequency
> > > during that time to "catch up": The command reader thread is programmed
> > to
> > > read from stdin until there's nothing left to read, and the
> pingScheduler
> > > runs the pingJob at a "fixed" rate.
> > > * Another 3-minute gap and another attempt to catch up at 19:22:28.
> This
> > > time though, the command reader thread evidently did not get a turn in
> > > time.
> > >
> > > I did not bother to log what threads the ping job was running under; I
> > can
> > > do that later. I also did not add any logging on the Maven process
> side;
> > I
> > > can do that as well.
> > >
> > > This is on a VM with unthrottled access to 4 processor cores and 8GB
> > RAM. I
> > > think the code under test is CPU- and RAM-intensive. There probably
> were
> > GC
> > > pauses, but I don't have proper logging for that in place.
> > >
> > > Note that this is all very normal behavior: Both the command reader and
> > the
> > > ping job's thread (run by the pingScheduler) are instructed to run
> until
> > > their work is done, not to sleep or yield. For high throughput, an OS
> > will
> > > favor continuing to run a thread for a while once it is started over
> > > switching tasks, so it's normal both that each thread would run for a
> > while
> > > after a pause, and (by extension) that each would have long pauses as
> > other
> > > threads had to run, too. Because of the long gaps between runs of the
> > > command reader thread, my previous advice (schedule the pingJob on a
> > "fixed
> > > delay" instead of a "fixed rate") would not have saved the build in
> this
> > > case (though "fixed delay" is more correct and would save the build in
> > > other possible cases). Our code base has some unit tests that rely on
> > > "normal case" timing, but they now have the @Ignore attribute because
> > they
> > > would rarely, but too often, fail. For most teams and projects, unit
> > tests
> > > have to be more reliable than that.
> > >
> > > So I think we need a solution that does not rely on preemptive
> scheduling
> > > from the OS or on thread scheduling, but which ensures that commnads
> have
> > > been read before running code depending on commands. There are multiple
> > > ways to do that; maybe I can find time to write a solution later.
> > >
> > >
> > > On Fri, Mar 22, 2019 at 3:46 PM Jason Young <
> jason.young@procentive.com>
> > > wrote:
> > >
> > > > Thanks Tibor. I'll use `encodeAndWriteToOutput` later. For now, I
> > hacked
> > > > together some temporary changes to ForkedBooter so that it "logs"
> (just
> > > > `System.out.println`) the current time every time:
> > > >
> > > >    - the command reader thread reads a NOOP via stdin,
> > > >    - it starts the ping job,
> > > >    - it checks for NOOP (`canUseNewPingMechanism(..)` returns
> `false`),
> > > >    and
> > > >    - it would have exited because there is no NOOP (`!pingDone`).
> > > >
> > > > I also commented out the code to exit in that case so it keeps
> running
> > > and
> > > > logging.
> > > >
> > > > The resulting output starts out showing that it's receiving ping
> every
> > > 10s
> > > > and checking for NOOP every 30s, as expected. But then one large test
> > > > begins, and:
> > > >
> > > > DEBUG 2019/03/22 19:14:40.223 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:02.723 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.139 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:18:03.172 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:18:03.186 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:18:03.188 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.243 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.307 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.308 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.202 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:18:03.323 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.337 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:18:03.351 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.351 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:18:03.351 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.352 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:18:03.366 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.393 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.380 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:18:03.409 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.424 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.423 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:18:03.424 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.438 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:18:03.438 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.453 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:18:03.454 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.476 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:18:03.476 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.505 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:18:03.505 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:08.872 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:18.859 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:19.282 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:18:19.330 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:18:28.845 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:38.832 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:48.838 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:49.241 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:18:49.265 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:19:01.162 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:19:10.069 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:19:19.136 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:19:19.199 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:19:19.199 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > WARN 2019/03/22 19:22:28.698 GMT ForkedBooter: Did not receive NOOP
> > from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > WARN 2019/03/22 19:22:28.698 GMT ForkedBooter: Did not receive NOOP
> > from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP
> > from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP
> > from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP
> > from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP
> > from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > ...
> > > >
> > > > With these changes, a project which normally takes 15 to 20 minutes
> to
> > > run
> > > > actually took 80m. Would printing directly to stdout from within a
> test
> > > do
> > > > that somehow? If not, we have a problem in our code.
> > > >
> > > > My comments on the above report to follow...
> > > >
> > > > On Fri, Mar 22, 2019 at 11:58 AM Tibor Digana <
> tibordigana@apache.org>
> > > > wrote:
> > > >
> > > >> >> is there an -Xmx value you would recommend to ensure the parent
> > > process
> > > >> can send the NOOP to the surefire process?
> > > >>
> > > >> no such value is generic value to recommend. Each application or
> test
> > > must
> > > >> find our the value to prevent from OOM.
> > > >>
> > > >>
> > > >> >> I'm making a custom build of surefire-booter to workaround the
> > > initial
> > > >> problem by commenting out the code to exit.
> > > >>
> > > >> Very good step forward. Then post your findings and we can continue
> on
> > > >> making a pull request on GitHub.
> > > >>
> > > >>
> > > >> >> Is there a logging API which I can use instead?
> > > >>
> > > >> Yes, it is very low level of logging (sending events to parent
> > process)
> > > >> but
> > > >> this will be turned to high level abstraction in several days
> (branch
> > > >> surefire-1222_2).
> > > >> Yet, pls use this function:
> > > >>
> > > >> encodeAndWriteToOutput()
> > > >>
> > > >>
> > > >> >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> On Fri, Mar 22, 2019 at 4:44 PM Jason Young <
> > jason.young@procentive.com
> > > >
> > > >> wrote:
> > > >>
> > > >> > To clarify, this is the image ours is based on:
> > > >> > https://hub.docker.com/_/alpine; we are not using a vanilla Maven
> > > >> image as
> > > >> > we have to add a few other items to this image as well. I don't
> know
> > > who
> > > >> > maintains that or adds /etc/mavenrc; I'll bring that up in another
> > > >> topic in
> > > >> > another forum. I was just answering a previous question, and also
> > > >> throwing
> > > >> > out info for anyone struggling with e.g. OOME in their Maven
> process
> > > (a
> > > >> lot
> > > >> > of people have that problem when running the Sonar scanner, for
> > > >> example). I
> > > >> > don't think we need to worry about why that file is added for this
> > > >> topic.
> > > >> >
> > > >> > But while we're on the subject, is there an -Xmx value you would
> > > >> recommend
> > > >> > to ensure the parent process can send the NOOP to the surefire
> > > process?
> > > >> >
> > > >> > I'm making a custom build of surefire-booter to workaround the
> > initial
> > > >> > problem by commenting out the code to exit. I am also adding
> logging
> > > via
> > > >> > stdout so I can see if ForkedBooter is indeed running the ping job
> > in
> > > >> small
> > > >> > increments of time. Is there a logging API which I can use
> instead?
> > > >> >
> > > >> > On Fri, Mar 22, 2019 at 5:44 AM Tibor Digana <
> > tibordigana@apache.org>
> > > >> > wrote:
> > > >> >
> > > >> > > The base images are developed in
> > > >> > https://github.com/carlossg/docker-maven,
> > > >> > > right?
> > > >> > > Who creates "/etc/mavenrc"?
> > > >> > >
> > > >> > > On Thu, Mar 21, 2019 at 12:05 AM Jason Young <
> > > >> jason.young@procentive.com
> > > >> > >
> > > >> > > wrote:
> > > >> > >
> > > >> > > > Mikael, sorry I do not appear to have permission to view the
> > link.
> > > >> > > >
> > > >> > > > I did some digging in the last couple of days. I see that the
> > > parent
> > > >> > > > process reads from stdin. I could not find anywhere that we
> are
> > > >> using
> > > >> > > > stdin. FWIW the failures nearly always happen at least 15m
> into
> > a
> > > >> ~20m
> > > >> > > test
> > > >> > > > run, so perf is a likely culprit.
> > > >> > > >
> > > >> > > > I see also that ForkedBooter reads commands from stdin in one
> > > >> thread,
> > > >> > and
> > > >> > > > uses an executor service to check for a past ping in
> > > >> > > > ForkedBooter.listenToShutdownCommands(..). When it checks, it
> > also
> > > >> sets
> > > >> > > > pingDone to false. The executor is configured to run up to 2
> > > threads
> > > >> > > > concurrently to handle the workload, and is set to run at a
> > fixed
> > > >> rate
> > > >> > > (not
> > > >> > > > a fixed delay). If the test suite is busy with testing and GC
> > and
> > > >> has
> > > >> > > lots
> > > >> > > > of threads running, it's entirely possible that a thread won't
> > > have
> > > >> a
> > > >> > > > chance to run for a long time (e.g. 5s). Maybe instead of a
> 30s
> > > >> delay,
> > > >> > > the
> > > >> > > > VM gets around to checking for a ping every 35s over a long
> span
> > > of
> > > >> > time.
> > > >> > > > Because we're running at a "fixed rate" and not a "fixed
> delay",
> > > >> then
> > > >> > > after
> > > >> > > > a couple of minutes we might be a full 30s behind schedule.
> It's
> > > >> > possible
> > > >> > > > the executor will create another thread to run the scheduled
> > task
> > > >> > because
> > > >> > > > it's running behind schedule. This new thread checks for a
> ping,
> > > >> finds
> > > >> > > it,
> > > >> > > > and sets pingDone to false. But then the original thread also
> > > runs,
> > > >> > say,
> > > >> > > 2
> > > >> > > > seconds afterwards, checks pingDone, and finds it is false.
> > > >> > > >
> > > >> > > > So to mitigate the problem, can we a) make the executor run
> > only 1
> > > >> > thread
> > > >> > > > and b) schedule the task at a fixed rate? For that matter, is
> > > there
> > > >> > > another
> > > >> > > > scheduled executor we can reuse? I understand why checking for
> > > ping
> > > >> > > > requires a separate executor. Should I ask in github?
> > > >> > > >
> > > >> > > > Regarding a previous question, I found out that Alpine's Maven
> > > >> package
> > > >> > > > comes with an /etc/mavenrc that sets `MAVEN_OPTS="$MAVEN_OPTS
> > > >> > -Xmx512m"`
> > > >> > > > which cannot be undone by setting `MAVEN_OPTS` at the command
> > > line;
> > > >> you
> > > >> > > end
> > > >> > > > up with e.g. `-Xmx1g -Xmx512m`. (Note this applies to the
> Maven
> > > >> > (parent)
> > > >> > > > process, not the surefire/failsafe (child) process.)
> > > >> > > >
> > > >> > > > On Wed, Mar 20, 2019 at 3:46 AM Bernd Eckenfels <
> > > >> > ecki@zusammenkunft.net>
> > > >> > > > wrote:
> > > >> > > >
> > > >> > > > > I guess a timeout caused by FullGC can happen with TCP as
> > well.
> > > >> > > > Increasing
> > > >> > > > > the timeout might not be nice but does look like it would
> help
> > > in
> > > >> > both
> > > >> > > > > cases. (Problems with stdout are more related to unexpected
> > JVM
> > > >> > > messages
> > > >> > > > I
> > > >> > > > > guess)
> > > >> > > > >
> > > >> > > > > Gruss
> > > >> > > > > Bernd
> > > >> > > > > --
> > > >> > > > > http://bernd.eckenfels.net
> > > >> > > > >
> > > >> > > > > ________________________________
> > > >> > > > > Von: Mikael Åsberg <m....@gmail.com>
> > > >> > > > > Gesendet: Mittwoch, März 20, 2019 9:40 AM
> > > >> > > > > An: Maven Users List
> > > >> > > > > Betreff: Re: Failsafe: Killing self fork JVM. PING timeout
> > > >> elapsed.
> > > >> > > > >
> > > >> > > > > These issues regarding communication with forked JVMs, won't
> > > they
> > > >> be
> > > >> > > > > resolved once surefire moves to interprocess communication
> > using
> > > >> > > > > tcp/ip sockets? This happens to be the target feature to be
> > > >> included
> > > >> > > > > in the next surefire 3.0.0 milestone:
> > > >> > > > >
> > > >> https://issues.apache.org/jira/projects/SUREFIRE/versions/12344668
> > > >> > > > >
> > > >> > > > > There are soooo many issues relating to surefire reading
> > stdout
> > > of
> > > >> > > > > forked processes (which is my understanding that it is
> > currently
> > > >> > > > > doing). Many of us are really looking forward to the next
> > > >> milestone.
> > > >> > > > >
> > > >> > > > > On Tue, Mar 19, 2019 at 8:59 PM Jason Young <
> > > >> > > jason.young@procentive.com>
> > > >> > > > > wrote:
> > > >> > > > > >
> > > >> > > > > > Getting back to my original questions, I know that "ping"
> > > means
> > > >> to
> > > >> > > see
> > > >> > > > > if a
> > > >> > > > > > process is there, and "NOOP" implies it's not a command to
> > do
> > > >> > > anything.
> > > >> > > > > But
> > > >> > > > > > what do the terms "ping" and "NOOP" mean in this context,
> > i.e.
> > > >> how
> > > >> > do
> > > >> > > > the
> > > >> > > > > > processes communicate? I assume they don't sonar. Do other
> > > >> > processes
> > > >> > > > also
> > > >> > > > > > ping NOOPs? Can I PING Chrome with a NOOP from bash? Is it
> > > with
> > > >> > TCP?
> > > >> > > > > >
> > > >> > > > > > I'm confused about what I should do regarding GC pauses.
> > > >> > Previously I
> > > >> > > > had
> > > >> > > > > > code that would write the amount of remaining heap space
> (or
> > > >> > > something
> > > >> > > > > like
> > > >> > > > > > that) to stdout after every test to troubleshoot OOMEs.
> Can
> > > >> writing
> > > >> > > to
> > > >> > > > > > stdout cause the communication failure somehow?
> > > >> > > > > >
> > > >> > > > > > On Wed, Mar 13, 2019 at 5:57 PM Jason Young <
> > > >> > > > jason.young@procentive.com>
> > > >> > > > > > wrote:
> > > >> > > > > >
> > > >> > > > > > > I upgraded failsafe and surefire to 3.0.0-M3 as advised;
> > we
> > > >> > > > encountered
> > > >> > > > > > > the same exception. (Still using -Xmx5g, will switch to
> > > OpenJ9
> > > >> > soon
> > > >> > > > in
> > > >> > > > > case
> > > >> > > > > > > that helps.)
> > > >> > > > > > >
> > > >> > > > > > > BTW I also asked on StackOverflow previously, for anyone
> > > >> > > interested:
> > > >> > > > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
> https://stackoverflow.com/questions/54755846/killing-self-fork-jvm-ping-timeout-elapsed
> > > >> > > > > > >
> > > >> > > > > > > On Tue, Feb 26, 2019 at 6:40 PM Jason Young <
> > > >> > > > > jason.young@procentive.com>
> > > >> > > > > > > wrote:
> > > >> > > > > > >
> > > >> > > > > > >> Thanks again for the information.
> > > >> > > > > > >>
> > > >> > > > > > >> We had increased the RAM to 3g some time ago to prevent
> > > >> OOMEs.
> > > >> > > More
> > > >> > > > > > >> recently, I increased the RAM again to 5g for extra
> > > headroom
> > > >> > since
> > > >> > > > we
> > > >> > > > > had
> > > >> > > > > > >> more headroom available; the problem hasn't happened
> > since,
> > > >> but
> > > >> > it
> > > >> > > > > hasn't
> > > >> > > > > > >> been very long.
> > > >> > > > > > >>
> > > >> > > > > > >> We use a more customized image based on Alpine 3.8.2.
> The
> > > JDK
> > > >> > and
> > > >> > > > > Maven
> > > >> > > > > > >> are obtained via apk.
> > > >> > > > > > >>
> > > >> > > > > > >> I will try upgrading failsafe (and surefire while I'm
> at
> > > it)
> > > >> > > sooner,
> > > >> > > > > and
> > > >> > > > > > >> probably do some experimentation with JVMs another time
> > > (not
> > > >> > > > pressing
> > > >> > > > > for
> > > >> > > > > > >> me ATM).
> > > >> > > > > > >>
> > > >> > > > > > >> On Tue, Feb 26, 2019 at 12:20 PM Tibor Digana <
> > > >> > > > tibordigana@apache.org
> > > >> > > > > >
> > > >> > > > > > >> wrote:
> > > >> > > > > > >>
> > > >> > > > > > >>> >> I'll try to enable some logging about GC pauses to
> > see
> > > >> > what's
> > > >> > > up
> > > >> > > > > > >>>
> > > >> > > > > > >>> Pls do not keep such setting after tuning the GC
> because
> > > >> this
> > > >> > may
> > > >> > > > > > >>> sometime
> > > >> > > > > > >>> break the interprocess communication between Maven
> > process
> > > >> and
> > > >> > > > > surefire
> > > >> > > > > > >>> process.
> > > >> > > > > > >>> It's worth to list GC information in a file and not in
> > the
> > > >> > > console
> > > >> > > > > logs.
> > > >> > > > > > >>> This can be configured, I guess.
> > > >> > > > > > >>>
> > > >> > > > > > >>> >> Do you think the value is simply too low?
> > > >> > > > > > >>>
> > > >> > > > > > >>> GCing many objects may take some time and I remember
> we
> > > had
> > > >> a
> > > >> > > user
> > > >> > > > > who
> > > >> > > > > > >>> had
> > > >> > > > > > >>> this problem a year or two ago.
> > > >> > > > > > >>> We check every third NOOP (which is 3 x 10 sec) as a
> fix
> > > >> > instead
> > > >> > > of
> > > >> > > > > every
> > > >> > > > > > >>> NOP. So 30 seconds looked satisfactory.
> > > >> > > > > > >>> I think you use old version 2.20 or something like
> that.
> > > The
> > > >> > > fixes
> > > >> > > > > for
> > > >> > > > > > >>> docker have been done so far, so please use the latest
> > > >> version
> > > >> > > > > 3.0.0-M3.
> > > >> > > > > > >>> See this page
> > > >> > > > > > >>>
> > > >> > > >
> > > https://maven.apache.org/surefire/maven-surefire-plugin/docker.html
> > > >> ,
> > > >> > > > > we
> > > >> > > > > > >>> used maven:3.5.3-jdk-8-alpine in this test. Which base
> > > image
> > > >> > did
> > > >> > > > you
> > > >> > > > > use?
> > > >> > > > > > >>>
> > > >> > > > > > >>> Cheers
> > > >> > > > > > >>> Tibor
> > > >> > > > > > >>>
> > > >> > > > > > >>> On Tue, Feb 26, 2019 at 5:24 PM Jason Young <
> > > >> > > > > jason.young@procentive.com>
> > > >> > > > > > >>> wrote:
> > > >> > > > > > >>>
> > > >> > > > > > >>> > Thanks for the information. It's good to see someone
> > > >> > > understands
> > > >> > > > a
> > > >> > > > > > >>> little
> > > >> > > > > > >>> > about this.
> > > >> > > > > > >>> >
> > > >> > > > > > >>> > Incidentally, we have been looking at other GCs and
> > VMs
> > > >> for
> > > >> > the
> > > >> > > > > > >>> application
> > > >> > > > > > >>> > in production environments, so I'll look into how
> > these
> > > >> > affect
> > > >> > > > > tests as
> > > >> > > > > > >>> > well. I'll try to enable some logging about GC
> pauses
> > to
> > > >> see
> > > >> > > > > what's up.
> > > >> > > > > > >>> >
> > > >> > > > > > >>> > How would `-Xmx3g` cause long GC cycles? Do you
> think
> > > the
> > > >> > value
> > > >> > > > is
> > > >> > > > > > >>> simply
> > > >> > > > > > >>> > too low?
> > > >> > > > > > >>> >
> > > >> > > > > > >>> > FWIW we're running the Maven build in an
> Alpine-based
> > > >> Docker
> > > >> > > > > container.
> > > >> > > > > > >>> >
> > > >> > > > > > >>> > On Sat, Feb 23, 2019 at 6:36 AM Tibor Digana <
> > > >> > > > > tibordigana@apache.org>
> > > >> > > > > > >>> > wrote:
> > > >> > > > > > >>> >
> > > >> > > > > > >>> > > Hi Jason,
> > > >> > > > > > >>> > >
> > > >> > > > > > >>> > > We spoke about this issue on our chat in ASF
> Slack:
> > > >> > > > > > >>> > > "I think his tests have been paused for a long GC
> > > >> periods
> > > >> > and
> > > >> > > > > timed
> > > >> > > > > > >>> out
> > > >> > > > > > >>> > 3x
> > > >> > > > > > >>> > > PING period = 30 seconds. After this period forked
> > JVM
> > > >> > > supposed
> > > >> > > > > the
> > > >> > > > > > >>> Maven
> > > >> > > > > > >>> > > process was killed by JenkinsCI and therefore all
> > > >> surefire
> > > >> > > > > processes
> > > >> > > > > > >>> are
> > > >> > > > > > >>> > > killed as well and all the file handlers and
> memory
> > > >> > > > consumptions
> > > >> > > > > are
> > > >> > > > > > >>> > > freed."
> > > >> > > > > > >>> > >
> > > >> > > > > > >>> > > "But I have to say that `-Xmx3g` may cause long GC
> > > >> cycles,
> > > >> > > see
> > > >> > > > > > >>> > >
> > > >> > > > > > >>> > >
> > > >> > > > > > >>> >
> > > >> > > > > > >>>
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
> https://maven.apache.org/surefire/maven-surefire-plugin/examples/shutdown.html
> > > >> > > > > > >>> > > "
> > > >> > > > > > >>> > >
> > > >> > > > > > >>> > > You are using java-1.8-openjdk. I guess you should
> > use
> > > >> > > > > Shenandoah GC
> > > >> > > > > > >>> > which
> > > >> > > > > > >>> > > is an experimental algorithm in JVM 1.8. This
> would
> > > >> > > > significantly
> > > >> > > > > > >>> short
> > > >> > > > > > >>> > > the GC cycles.
> > > >> > > > > > >>> > >
> > > >> > > > > > >>> > > We should of cource provide a new configuration
> > > >> parameter
> > > >> > to
> > > >> > > > give
> > > >> > > > > > >>> you a
> > > >> > > > > > >>> > > chance to prolong the PING.
> > > >> > > > > > >>> > >
> > > >> > > > > > >>> > > Cheers
> > > >> > > > > > >>> > > Tibor
> > > >> > > > > > >>> > >
> > > >> > > > > > >>> >
> > > >> > > > > > >>> >
> > > >> > > > > > >>> > --
> > > >> > > > > > >>> >
> > > >> > > > > > >>> > Jason Young
> > > >> > > > > > >>> >
> > > >> > > > > > >>>
> > > >> > > > > > >>
> > > >> > > > > > >>
> > > >> > > > > >
> > > >> > > > > > --
> > > >> > > > > > Jason Young
> > > >> > > > > > Software Engineer | PROCENTIVE
> > > >> > > > > > [image: Phone] 715 245 8000 x7609
> > > >> > > > > > [image: Mobile] 706 870 3540
> > > >> > > > > > [image: Web] procentive.com
> > > >> > > > > > Confidentiality Notice: This message is intended for the
> > sole
> > > >> use
> > > >> > of
> > > >> > > > the
> > > >> > > > > > individual and entity to which it is addressed, and may
> > > contain
> > > >> > > > > information
> > > >> > > > > > that is privileged, confidential and exempt from
> disclosure
> > > >> under
> > > >> > > > > > applicable law. Any unauthorized review, use, disclosure
> or
> > > >> > > > distribution
> > > >> > > > > of
> > > >> > > > > > this email message, including any attachment, is
> prohibited.
> > > If
> > > >> you
> > > >> > > are
> > > >> > > > > not
> > > >> > > > > > the intended recipient, please advise the sender by reply
> > > email
> > > >> and
> > > >> > > > > destroy
> > > >> > > > > > all copies of the original message.
> > > >> > > > >
> > > >> > > > >
> > > >>
> ---------------------------------------------------------------------
> > > >> > > > > To unsubscribe, e-mail: users-unsubscribe@maven.apache.org
> > > >> > > > > For additional commands, e-mail:
> users-help@maven.apache.org
> > > >> > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > >
> >
>

Re: Failsafe: Killing self fork JVM. PING timeout elapsed.

Posted by Tibor Digana <ti...@apache.org>.
Pls check this commit.
It is the algorithm
https://github.com/apache/maven-surefire/pull/231/commits/112f3e4348f8ed731182064561604ff0fa9ca30c
which avoids killing JVM when GC has paused the JVM for longer than 15
seconds.

On Sat, Apr 20, 2019 at 5:27 PM Tibor Digana <ti...@apache.org> wrote:

> Hey Jason,
>
> Here is the PR I promissed
> https://github.com/apache/maven-surefire/pull/231
> I will think about more tolerant algorithm.
> Feel free to help us in this PR on GitHub.
>
> Cheers
> Tibor
>
>
> On Wed, Apr 10, 2019 at 3:18 AM Jason Young <ja...@procentive.com>
> wrote:
>
>> Thanks.
>>
>> I finally took a little time to look at why the `ps` command would not
>> work. For our project, we cannot use a default Maven image, so I made a
>> custom one based on alpine. One piece of software the Maven alpine image
>> has is `pcprops`; adding that fixes the problem:
>>
>> $ docker run -it alpine sh
>> > / # apk update
>> > ...
>> > OK: 8445 distinct packages available
>> > / # apk upgrade
>> > ...
>> > OK: 4 MiB in 13 packages
>> > / # ps -o etime= -p 1
>> > ps: unrecognized option: p
>> > BusyBox v1.26.2 (2018-05-30 13:51:20 GMT) multi-call binary.
>> > Usage: ps [-o COL1,COL2=HEADER]
>> > Show list of processes
>> > -o COL1,COL2=HEADER Select columns for display
>> > / # apk add procps
>> > ...
>> > OK: 12 MiB in 19 packages
>> > / # ps -o etime= -p 1
>> >       05:21
>>
>>
>> Hope that helps anyone else using a custom-built alpine-based Maven image.
>>
>> I'm not sure if I'll have time to experiment much with Surefire changes in
>> the near future; my focus at the moment is figuring out all the reasons
>> our
>> test suite sometimes takes 5 or more times as long as normal (often also
>> causes Surefire to kill the child process, but even if it didn't, we have
>> a
>> problem!).
>>
>> On Sun, Apr 7, 2019 at 2:04 AM Tibor Digana <ti...@apache.org>
>> wrote:
>>
>> > Hi Jason,
>> >
>> > I am sorry I haven't reply.
>> > Excellent analysis!
>> >
>> > Can you make an experiment and use pingJob on a "fixed delay" instead
>> of a
>> > "fixed rate"?
>> > If you want to make more, feel free.
>> >
>> > We should mabe think about how we would add more intelligence in these
>> > timers in order to detect the "blackout".
>> >
>> > Cheers
>> > Tibor
>> >
>> >
>> >
>> >
>> > On Fri, Mar 22, 2019 at 9:53 PM Jason Young <jason.young@procentive.com
>> >
>> > wrote:
>> >
>> > > Notice:
>> > > * an over 3-minute gap between the first two entries above--neither
>> the
>> > > command reader thread nor the pingJob had a chance to run.
>> > > * the command reader thread and the ping job executor evidently get a
>> > > chance to run from 19:18:02 to 19:19:19. They ran with increased
>> > frequency
>> > > during that time to "catch up": The command reader thread is
>> programmed
>> > to
>> > > read from stdin until there's nothing left to read, and the
>> pingScheduler
>> > > runs the pingJob at a "fixed" rate.
>> > > * Another 3-minute gap and another attempt to catch up at 19:22:28.
>> This
>> > > time though, the command reader thread evidently did not get a turn in
>> > > time.
>> > >
>> > > I did not bother to log what threads the ping job was running under; I
>> > can
>> > > do that later. I also did not add any logging on the Maven process
>> side;
>> > I
>> > > can do that as well.
>> > >
>> > > This is on a VM with unthrottled access to 4 processor cores and 8GB
>> > RAM. I
>> > > think the code under test is CPU- and RAM-intensive. There probably
>> were
>> > GC
>> > > pauses, but I don't have proper logging for that in place.
>> > >
>> > > Note that this is all very normal behavior: Both the command reader
>> and
>> > the
>> > > ping job's thread (run by the pingScheduler) are instructed to run
>> until
>> > > their work is done, not to sleep or yield. For high throughput, an OS
>> > will
>> > > favor continuing to run a thread for a while once it is started over
>> > > switching tasks, so it's normal both that each thread would run for a
>> > while
>> > > after a pause, and (by extension) that each would have long pauses as
>> > other
>> > > threads had to run, too. Because of the long gaps between runs of the
>> > > command reader thread, my previous advice (schedule the pingJob on a
>> > "fixed
>> > > delay" instead of a "fixed rate") would not have saved the build in
>> this
>> > > case (though "fixed delay" is more correct and would save the build in
>> > > other possible cases). Our code base has some unit tests that rely on
>> > > "normal case" timing, but they now have the @Ignore attribute because
>> > they
>> > > would rarely, but too often, fail. For most teams and projects, unit
>> > tests
>> > > have to be more reliable than that.
>> > >
>> > > So I think we need a solution that does not rely on preemptive
>> scheduling
>> > > from the OS or on thread scheduling, but which ensures that commnads
>> have
>> > > been read before running code depending on commands. There are
>> multiple
>> > > ways to do that; maybe I can find time to write a solution later.
>> > >
>> > >
>> > > On Fri, Mar 22, 2019 at 3:46 PM Jason Young <
>> jason.young@procentive.com>
>> > > wrote:
>> > >
>> > > > Thanks Tibor. I'll use `encodeAndWriteToOutput` later. For now, I
>> > hacked
>> > > > together some temporary changes to ForkedBooter so that it "logs"
>> (just
>> > > > `System.out.println`) the current time every time:
>> > > >
>> > > >    - the command reader thread reads a NOOP via stdin,
>> > > >    - it starts the ping job,
>> > > >    - it checks for NOOP (`canUseNewPingMechanism(..)` returns
>> `false`),
>> > > >    and
>> > > >    - it would have exited because there is no NOOP (`!pingDone`).
>> > > >
>> > > > I also commented out the code to exit in that case so it keeps
>> running
>> > > and
>> > > > logging.
>> > > >
>> > > > The resulting output starts out showing that it's receiving ping
>> every
>> > > 10s
>> > > > and checking for NOOP every 30s, as expected. But then one large
>> test
>> > > > begins, and:
>> > > >
>> > > > DEBUG 2019/03/22 19:14:40.223 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:18:02.723 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:18:03.139 GMT ForkedBooter: Running ping job
>> (i.e.
>> > > > checking if we read a ping earlier).
>> > > > DEBUG 2019/03/22 19:18:03.172 GMT ForkedBooter: Checking for NOOP
>> from
>> > > > Maven process.
>> > > > DEBUG 2019/03/22 19:18:03.186 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Running ping job
>> (i.e.
>> > > > checking if we read a ping earlier).
>> > > > DEBUG 2019/03/22 19:18:03.188 GMT ForkedBooter: Checking for NOOP
>> from
>> > > > Maven process.
>> > > > DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:18:03.243 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:18:03.307 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:18:03.308 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:18:03.202 GMT ForkedBooter: Running ping job
>> (i.e.
>> > > > checking if we read a ping earlier).
>> > > > DEBUG 2019/03/22 19:18:03.323 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:18:03.337 GMT ForkedBooter: Checking for NOOP
>> from
>> > > > Maven process.
>> > > > DEBUG 2019/03/22 19:18:03.351 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:18:03.351 GMT ForkedBooter: Running ping job
>> (i.e.
>> > > > checking if we read a ping earlier).
>> > > > DEBUG 2019/03/22 19:18:03.351 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:18:03.352 GMT ForkedBooter: Checking for NOOP
>> from
>> > > > Maven process.
>> > > > DEBUG 2019/03/22 19:18:03.366 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:18:03.393 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:18:03.380 GMT ForkedBooter: Running ping job
>> (i.e.
>> > > > checking if we read a ping earlier).
>> > > > DEBUG 2019/03/22 19:18:03.409 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:18:03.424 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:18:03.423 GMT ForkedBooter: Checking for NOOP
>> from
>> > > > Maven process.
>> > > > DEBUG 2019/03/22 19:18:03.424 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:18:03.438 GMT ForkedBooter: Running ping job
>> (i.e.
>> > > > checking if we read a ping earlier).
>> > > > DEBUG 2019/03/22 19:18:03.438 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:18:03.453 GMT ForkedBooter: Checking for NOOP
>> from
>> > > > Maven process.
>> > > > DEBUG 2019/03/22 19:18:03.454 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:18:03.476 GMT ForkedBooter: Running ping job
>> (i.e.
>> > > > checking if we read a ping earlier).
>> > > > DEBUG 2019/03/22 19:18:03.476 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:18:03.505 GMT ForkedBooter: Checking for NOOP
>> from
>> > > > Maven process.
>> > > > DEBUG 2019/03/22 19:18:03.505 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:18:08.872 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:18:18.859 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:18:19.282 GMT ForkedBooter: Running ping job
>> (i.e.
>> > > > checking if we read a ping earlier).
>> > > > DEBUG 2019/03/22 19:18:19.330 GMT ForkedBooter: Checking for NOOP
>> from
>> > > > Maven process.
>> > > > DEBUG 2019/03/22 19:18:28.845 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:18:38.832 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:18:48.838 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:18:49.241 GMT ForkedBooter: Running ping job
>> (i.e.
>> > > > checking if we read a ping earlier).
>> > > > DEBUG 2019/03/22 19:18:49.265 GMT ForkedBooter: Checking for NOOP
>> from
>> > > > Maven process.
>> > > > DEBUG 2019/03/22 19:19:01.162 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:19:10.069 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:19:19.136 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:19:19.199 GMT ForkedBooter: Running ping job
>> (i.e.
>> > > > checking if we read a ping earlier).
>> > > > DEBUG 2019/03/22 19:19:19.199 GMT ForkedBooter: Checking for NOOP
>> from
>> > > > Maven process.
>> > > > DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Running ping job
>> (i.e.
>> > > > checking if we read a ping earlier).
>> > > > DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Checking for NOOP
>> from
>> > > > Maven process.
>> > > > WARN 2019/03/22 19:22:28.698 GMT ForkedBooter: Did not receive NOOP
>> > from
>> > > > Maven process.
>> > > > DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Running ping job
>> (i.e.
>> > > > checking if we read a ping earlier).
>> > > > DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Checking for NOOP
>> from
>> > > > Maven process.
>> > > > WARN 2019/03/22 19:22:28.698 GMT ForkedBooter: Did not receive NOOP
>> > from
>> > > > Maven process.
>> > > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job
>> (i.e.
>> > > > checking if we read a ping earlier).
>> > > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP
>> from
>> > > > Maven process.
>> > > > WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP
>> > from
>> > > > Maven process.
>> > > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job
>> (i.e.
>> > > > checking if we read a ping earlier).
>> > > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP
>> from
>> > > > Maven process.
>> > > > WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP
>> > from
>> > > > Maven process.
>> > > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job
>> (i.e.
>> > > > checking if we read a ping earlier).
>> > > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP
>> from
>> > > > Maven process.
>> > > > WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP
>> > from
>> > > > Maven process.
>> > > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job
>> (i.e.
>> > > > checking if we read a ping earlier).
>> > > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP
>> from
>> > > > Maven process.
>> > > > WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP
>> > from
>> > > > Maven process.
>> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
>> > Maven
>> > > > process.
>> > > > ...
>> > > >
>> > > > With these changes, a project which normally takes 15 to 20 minutes
>> to
>> > > run
>> > > > actually took 80m. Would printing directly to stdout from within a
>> test
>> > > do
>> > > > that somehow? If not, we have a problem in our code.
>> > > >
>> > > > My comments on the above report to follow...
>> > > >
>> > > > On Fri, Mar 22, 2019 at 11:58 AM Tibor Digana <
>> tibordigana@apache.org>
>> > > > wrote:
>> > > >
>> > > >> >> is there an -Xmx value you would recommend to ensure the parent
>> > > process
>> > > >> can send the NOOP to the surefire process?
>> > > >>
>> > > >> no such value is generic value to recommend. Each application or
>> test
>> > > must
>> > > >> find our the value to prevent from OOM.
>> > > >>
>> > > >>
>> > > >> >> I'm making a custom build of surefire-booter to workaround the
>> > > initial
>> > > >> problem by commenting out the code to exit.
>> > > >>
>> > > >> Very good step forward. Then post your findings and we can
>> continue on
>> > > >> making a pull request on GitHub.
>> > > >>
>> > > >>
>> > > >> >> Is there a logging API which I can use instead?
>> > > >>
>> > > >> Yes, it is very low level of logging (sending events to parent
>> > process)
>> > > >> but
>> > > >> this will be turned to high level abstraction in several days
>> (branch
>> > > >> surefire-1222_2).
>> > > >> Yet, pls use this function:
>> > > >>
>> > > >> encodeAndWriteToOutput()
>> > > >>
>> > > >>
>> > > >> >>
>> > > >>
>> > > >>
>> > > >>
>> > > >>
>> > > >>
>> > > >> On Fri, Mar 22, 2019 at 4:44 PM Jason Young <
>> > jason.young@procentive.com
>> > > >
>> > > >> wrote:
>> > > >>
>> > > >> > To clarify, this is the image ours is based on:
>> > > >> > https://hub.docker.com/_/alpine; we are not using a vanilla
>> Maven
>> > > >> image as
>> > > >> > we have to add a few other items to this image as well. I don't
>> know
>> > > who
>> > > >> > maintains that or adds /etc/mavenrc; I'll bring that up in
>> another
>> > > >> topic in
>> > > >> > another forum. I was just answering a previous question, and also
>> > > >> throwing
>> > > >> > out info for anyone struggling with e.g. OOME in their Maven
>> process
>> > > (a
>> > > >> lot
>> > > >> > of people have that problem when running the Sonar scanner, for
>> > > >> example). I
>> > > >> > don't think we need to worry about why that file is added for
>> this
>> > > >> topic.
>> > > >> >
>> > > >> > But while we're on the subject, is there an -Xmx value you would
>> > > >> recommend
>> > > >> > to ensure the parent process can send the NOOP to the surefire
>> > > process?
>> > > >> >
>> > > >> > I'm making a custom build of surefire-booter to workaround the
>> > initial
>> > > >> > problem by commenting out the code to exit. I am also adding
>> logging
>> > > via
>> > > >> > stdout so I can see if ForkedBooter is indeed running the ping
>> job
>> > in
>> > > >> small
>> > > >> > increments of time. Is there a logging API which I can use
>> instead?
>> > > >> >
>> > > >> > On Fri, Mar 22, 2019 at 5:44 AM Tibor Digana <
>> > tibordigana@apache.org>
>> > > >> > wrote:
>> > > >> >
>> > > >> > > The base images are developed in
>> > > >> > https://github.com/carlossg/docker-maven,
>> > > >> > > right?
>> > > >> > > Who creates "/etc/mavenrc"?
>> > > >> > >
>> > > >> > > On Thu, Mar 21, 2019 at 12:05 AM Jason Young <
>> > > >> jason.young@procentive.com
>> > > >> > >
>> > > >> > > wrote:
>> > > >> > >
>> > > >> > > > Mikael, sorry I do not appear to have permission to view the
>> > link.
>> > > >> > > >
>> > > >> > > > I did some digging in the last couple of days. I see that the
>> > > parent
>> > > >> > > > process reads from stdin. I could not find anywhere that we
>> are
>> > > >> using
>> > > >> > > > stdin. FWIW the failures nearly always happen at least 15m
>> into
>> > a
>> > > >> ~20m
>> > > >> > > test
>> > > >> > > > run, so perf is a likely culprit.
>> > > >> > > >
>> > > >> > > > I see also that ForkedBooter reads commands from stdin in one
>> > > >> thread,
>> > > >> > and
>> > > >> > > > uses an executor service to check for a past ping in
>> > > >> > > > ForkedBooter.listenToShutdownCommands(..). When it checks, it
>> > also
>> > > >> sets
>> > > >> > > > pingDone to false. The executor is configured to run up to 2
>> > > threads
>> > > >> > > > concurrently to handle the workload, and is set to run at a
>> > fixed
>> > > >> rate
>> > > >> > > (not
>> > > >> > > > a fixed delay). If the test suite is busy with testing and GC
>> > and
>> > > >> has
>> > > >> > > lots
>> > > >> > > > of threads running, it's entirely possible that a thread
>> won't
>> > > have
>> > > >> a
>> > > >> > > > chance to run for a long time (e.g. 5s). Maybe instead of a
>> 30s
>> > > >> delay,
>> > > >> > > the
>> > > >> > > > VM gets around to checking for a ping every 35s over a long
>> span
>> > > of
>> > > >> > time.
>> > > >> > > > Because we're running at a "fixed rate" and not a "fixed
>> delay",
>> > > >> then
>> > > >> > > after
>> > > >> > > > a couple of minutes we might be a full 30s behind schedule.
>> It's
>> > > >> > possible
>> > > >> > > > the executor will create another thread to run the scheduled
>> > task
>> > > >> > because
>> > > >> > > > it's running behind schedule. This new thread checks for a
>> ping,
>> > > >> finds
>> > > >> > > it,
>> > > >> > > > and sets pingDone to false. But then the original thread also
>> > > runs,
>> > > >> > say,
>> > > >> > > 2
>> > > >> > > > seconds afterwards, checks pingDone, and finds it is false.
>> > > >> > > >
>> > > >> > > > So to mitigate the problem, can we a) make the executor run
>> > only 1
>> > > >> > thread
>> > > >> > > > and b) schedule the task at a fixed rate? For that matter, is
>> > > there
>> > > >> > > another
>> > > >> > > > scheduled executor we can reuse? I understand why checking
>> for
>> > > ping
>> > > >> > > > requires a separate executor. Should I ask in github?
>> > > >> > > >
>> > > >> > > > Regarding a previous question, I found out that Alpine's
>> Maven
>> > > >> package
>> > > >> > > > comes with an /etc/mavenrc that sets `MAVEN_OPTS="$MAVEN_OPTS
>> > > >> > -Xmx512m"`
>> > > >> > > > which cannot be undone by setting `MAVEN_OPTS` at the command
>> > > line;
>> > > >> you
>> > > >> > > end
>> > > >> > > > up with e.g. `-Xmx1g -Xmx512m`. (Note this applies to the
>> Maven
>> > > >> > (parent)
>> > > >> > > > process, not the surefire/failsafe (child) process.)
>> > > >> > > >
>> > > >> > > > On Wed, Mar 20, 2019 at 3:46 AM Bernd Eckenfels <
>> > > >> > ecki@zusammenkunft.net>
>> > > >> > > > wrote:
>> > > >> > > >
>> > > >> > > > > I guess a timeout caused by FullGC can happen with TCP as
>> > well.
>> > > >> > > > Increasing
>> > > >> > > > > the timeout might not be nice but does look like it would
>> help
>> > > in
>> > > >> > both
>> > > >> > > > > cases. (Problems with stdout are more related to unexpected
>> > JVM
>> > > >> > > messages
>> > > >> > > > I
>> > > >> > > > > guess)
>> > > >> > > > >
>> > > >> > > > > Gruss
>> > > >> > > > > Bernd
>> > > >> > > > > --
>> > > >> > > > > http://bernd.eckenfels.net
>> > > >> > > > >
>> > > >> > > > > ________________________________
>> > > >> > > > > Von: Mikael Åsberg <m....@gmail.com>
>> > > >> > > > > Gesendet: Mittwoch, März 20, 2019 9:40 AM
>> > > >> > > > > An: Maven Users List
>> > > >> > > > > Betreff: Re: Failsafe: Killing self fork JVM. PING timeout
>> > > >> elapsed.
>> > > >> > > > >
>> > > >> > > > > These issues regarding communication with forked JVMs,
>> won't
>> > > they
>> > > >> be
>> > > >> > > > > resolved once surefire moves to interprocess communication
>> > using
>> > > >> > > > > tcp/ip sockets? This happens to be the target feature to be
>> > > >> included
>> > > >> > > > > in the next surefire 3.0.0 milestone:
>> > > >> > > > >
>> > > >> https://issues.apache.org/jira/projects/SUREFIRE/versions/12344668
>> > > >> > > > >
>> > > >> > > > > There are soooo many issues relating to surefire reading
>> > stdout
>> > > of
>> > > >> > > > > forked processes (which is my understanding that it is
>> > currently
>> > > >> > > > > doing). Many of us are really looking forward to the next
>> > > >> milestone.
>> > > >> > > > >
>> > > >> > > > > On Tue, Mar 19, 2019 at 8:59 PM Jason Young <
>> > > >> > > jason.young@procentive.com>
>> > > >> > > > > wrote:
>> > > >> > > > > >
>> > > >> > > > > > Getting back to my original questions, I know that "ping"
>> > > means
>> > > >> to
>> > > >> > > see
>> > > >> > > > > if a
>> > > >> > > > > > process is there, and "NOOP" implies it's not a command
>> to
>> > do
>> > > >> > > anything.
>> > > >> > > > > But
>> > > >> > > > > > what do the terms "ping" and "NOOP" mean in this context,
>> > i.e.
>> > > >> how
>> > > >> > do
>> > > >> > > > the
>> > > >> > > > > > processes communicate? I assume they don't sonar. Do
>> other
>> > > >> > processes
>> > > >> > > > also
>> > > >> > > > > > ping NOOPs? Can I PING Chrome with a NOOP from bash? Is
>> it
>> > > with
>> > > >> > TCP?
>> > > >> > > > > >
>> > > >> > > > > > I'm confused about what I should do regarding GC pauses.
>> > > >> > Previously I
>> > > >> > > > had
>> > > >> > > > > > code that would write the amount of remaining heap space
>> (or
>> > > >> > > something
>> > > >> > > > > like
>> > > >> > > > > > that) to stdout after every test to troubleshoot OOMEs.
>> Can
>> > > >> writing
>> > > >> > > to
>> > > >> > > > > > stdout cause the communication failure somehow?
>> > > >> > > > > >
>> > > >> > > > > > On Wed, Mar 13, 2019 at 5:57 PM Jason Young <
>> > > >> > > > jason.young@procentive.com>
>> > > >> > > > > > wrote:
>> > > >> > > > > >
>> > > >> > > > > > > I upgraded failsafe and surefire to 3.0.0-M3 as
>> advised;
>> > we
>> > > >> > > > encountered
>> > > >> > > > > > > the same exception. (Still using -Xmx5g, will switch to
>> > > OpenJ9
>> > > >> > soon
>> > > >> > > > in
>> > > >> > > > > case
>> > > >> > > > > > > that helps.)
>> > > >> > > > > > >
>> > > >> > > > > > > BTW I also asked on StackOverflow previously, for
>> anyone
>> > > >> > > interested:
>> > > >> > > > > > >
>> > > >> > > > >
>> > > >> > > >
>> > > >> > >
>> > > >> >
>> > > >>
>> > >
>> >
>> https://stackoverflow.com/questions/54755846/killing-self-fork-jvm-ping-timeout-elapsed
>> > > >> > > > > > >
>> > > >> > > > > > > On Tue, Feb 26, 2019 at 6:40 PM Jason Young <
>> > > >> > > > > jason.young@procentive.com>
>> > > >> > > > > > > wrote:
>> > > >> > > > > > >
>> > > >> > > > > > >> Thanks again for the information.
>> > > >> > > > > > >>
>> > > >> > > > > > >> We had increased the RAM to 3g some time ago to
>> prevent
>> > > >> OOMEs.
>> > > >> > > More
>> > > >> > > > > > >> recently, I increased the RAM again to 5g for extra
>> > > headroom
>> > > >> > since
>> > > >> > > > we
>> > > >> > > > > had
>> > > >> > > > > > >> more headroom available; the problem hasn't happened
>> > since,
>> > > >> but
>> > > >> > it
>> > > >> > > > > hasn't
>> > > >> > > > > > >> been very long.
>> > > >> > > > > > >>
>> > > >> > > > > > >> We use a more customized image based on Alpine 3.8.2.
>> The
>> > > JDK
>> > > >> > and
>> > > >> > > > > Maven
>> > > >> > > > > > >> are obtained via apk.
>> > > >> > > > > > >>
>> > > >> > > > > > >> I will try upgrading failsafe (and surefire while I'm
>> at
>> > > it)
>> > > >> > > sooner,
>> > > >> > > > > and
>> > > >> > > > > > >> probably do some experimentation with JVMs another
>> time
>> > > (not
>> > > >> > > > pressing
>> > > >> > > > > for
>> > > >> > > > > > >> me ATM).
>> > > >> > > > > > >>
>> > > >> > > > > > >> On Tue, Feb 26, 2019 at 12:20 PM Tibor Digana <
>> > > >> > > > tibordigana@apache.org
>> > > >> > > > > >
>> > > >> > > > > > >> wrote:
>> > > >> > > > > > >>
>> > > >> > > > > > >>> >> I'll try to enable some logging about GC pauses to
>> > see
>> > > >> > what's
>> > > >> > > up
>> > > >> > > > > > >>>
>> > > >> > > > > > >>> Pls do not keep such setting after tuning the GC
>> because
>> > > >> this
>> > > >> > may
>> > > >> > > > > > >>> sometime
>> > > >> > > > > > >>> break the interprocess communication between Maven
>> > process
>> > > >> and
>> > > >> > > > > surefire
>> > > >> > > > > > >>> process.
>> > > >> > > > > > >>> It's worth to list GC information in a file and not
>> in
>> > the
>> > > >> > > console
>> > > >> > > > > logs.
>> > > >> > > > > > >>> This can be configured, I guess.
>> > > >> > > > > > >>>
>> > > >> > > > > > >>> >> Do you think the value is simply too low?
>> > > >> > > > > > >>>
>> > > >> > > > > > >>> GCing many objects may take some time and I remember
>> we
>> > > had
>> > > >> a
>> > > >> > > user
>> > > >> > > > > who
>> > > >> > > > > > >>> had
>> > > >> > > > > > >>> this problem a year or two ago.
>> > > >> > > > > > >>> We check every third NOOP (which is 3 x 10 sec) as a
>> fix
>> > > >> > instead
>> > > >> > > of
>> > > >> > > > > every
>> > > >> > > > > > >>> NOP. So 30 seconds looked satisfactory.
>> > > >> > > > > > >>> I think you use old version 2.20 or something like
>> that.
>> > > The
>> > > >> > > fixes
>> > > >> > > > > for
>> > > >> > > > > > >>> docker have been done so far, so please use the
>> latest
>> > > >> version
>> > > >> > > > > 3.0.0-M3.
>> > > >> > > > > > >>> See this page
>> > > >> > > > > > >>>
>> > > >> > > >
>> > > https://maven.apache.org/surefire/maven-surefire-plugin/docker.html
>> > > >> ,
>> > > >> > > > > we
>> > > >> > > > > > >>> used maven:3.5.3-jdk-8-alpine in this test. Which
>> base
>> > > image
>> > > >> > did
>> > > >> > > > you
>> > > >> > > > > use?
>> > > >> > > > > > >>>
>> > > >> > > > > > >>> Cheers
>> > > >> > > > > > >>> Tibor
>> > > >> > > > > > >>>
>> > > >> > > > > > >>> On Tue, Feb 26, 2019 at 5:24 PM Jason Young <
>> > > >> > > > > jason.young@procentive.com>
>> > > >> > > > > > >>> wrote:
>> > > >> > > > > > >>>
>> > > >> > > > > > >>> > Thanks for the information. It's good to see
>> someone
>> > > >> > > understands
>> > > >> > > > a
>> > > >> > > > > > >>> little
>> > > >> > > > > > >>> > about this.
>> > > >> > > > > > >>> >
>> > > >> > > > > > >>> > Incidentally, we have been looking at other GCs and
>> > VMs
>> > > >> for
>> > > >> > the
>> > > >> > > > > > >>> application
>> > > >> > > > > > >>> > in production environments, so I'll look into how
>> > these
>> > > >> > affect
>> > > >> > > > > tests as
>> > > >> > > > > > >>> > well. I'll try to enable some logging about GC
>> pauses
>> > to
>> > > >> see
>> > > >> > > > > what's up.
>> > > >> > > > > > >>> >
>> > > >> > > > > > >>> > How would `-Xmx3g` cause long GC cycles? Do you
>> think
>> > > the
>> > > >> > value
>> > > >> > > > is
>> > > >> > > > > > >>> simply
>> > > >> > > > > > >>> > too low?
>> > > >> > > > > > >>> >
>> > > >> > > > > > >>> > FWIW we're running the Maven build in an
>> Alpine-based
>> > > >> Docker
>> > > >> > > > > container.
>> > > >> > > > > > >>> >
>> > > >> > > > > > >>> > On Sat, Feb 23, 2019 at 6:36 AM Tibor Digana <
>> > > >> > > > > tibordigana@apache.org>
>> > > >> > > > > > >>> > wrote:
>> > > >> > > > > > >>> >
>> > > >> > > > > > >>> > > Hi Jason,
>> > > >> > > > > > >>> > >
>> > > >> > > > > > >>> > > We spoke about this issue on our chat in ASF
>> Slack:
>> > > >> > > > > > >>> > > "I think his tests have been paused for a long GC
>> > > >> periods
>> > > >> > and
>> > > >> > > > > timed
>> > > >> > > > > > >>> out
>> > > >> > > > > > >>> > 3x
>> > > >> > > > > > >>> > > PING period = 30 seconds. After this period
>> forked
>> > JVM
>> > > >> > > supposed
>> > > >> > > > > the
>> > > >> > > > > > >>> Maven
>> > > >> > > > > > >>> > > process was killed by JenkinsCI and therefore all
>> > > >> surefire
>> > > >> > > > > processes
>> > > >> > > > > > >>> are
>> > > >> > > > > > >>> > > killed as well and all the file handlers and
>> memory
>> > > >> > > > consumptions
>> > > >> > > > > are
>> > > >> > > > > > >>> > > freed."
>> > > >> > > > > > >>> > >
>> > > >> > > > > > >>> > > "But I have to say that `-Xmx3g` may cause long
>> GC
>> > > >> cycles,
>> > > >> > > see
>> > > >> > > > > > >>> > >
>> > > >> > > > > > >>> > >
>> > > >> > > > > > >>> >
>> > > >> > > > > > >>>
>> > > >> > > > >
>> > > >> > > >
>> > > >> > >
>> > > >> >
>> > > >>
>> > >
>> >
>> https://maven.apache.org/surefire/maven-surefire-plugin/examples/shutdown.html
>> > > >> > > > > > >>> > > "
>> > > >> > > > > > >>> > >
>> > > >> > > > > > >>> > > You are using java-1.8-openjdk. I guess you
>> should
>> > use
>> > > >> > > > > Shenandoah GC
>> > > >> > > > > > >>> > which
>> > > >> > > > > > >>> > > is an experimental algorithm in JVM 1.8. This
>> would
>> > > >> > > > significantly
>> > > >> > > > > > >>> short
>> > > >> > > > > > >>> > > the GC cycles.
>> > > >> > > > > > >>> > >
>> > > >> > > > > > >>> > > We should of cource provide a new configuration
>> > > >> parameter
>> > > >> > to
>> > > >> > > > give
>> > > >> > > > > > >>> you a
>> > > >> > > > > > >>> > > chance to prolong the PING.
>> > > >> > > > > > >>> > >
>> > > >> > > > > > >>> > > Cheers
>> > > >> > > > > > >>> > > Tibor
>> > > >> > > > > > >>> > >
>> > > >> > > > > > >>> >
>> > > >> > > > > > >>> >
>> > > >> > > > > > >>> > --
>> > > >> > > > > > >>> >
>> > > >> > > > > > >>> > Jason Young
>> > > >> > > > > > >>> >
>> > > >> > > > > > >>>
>> > > >> > > > > > >>
>> > > >> > > > > > >>
>> > > >> > > > > >
>> > > >> > > > > > --
>> > > >> > > > > > Jason Young
>> > > >> > > > > > Software Engineer | PROCENTIVE
>> > > >> > > > > > [image: Phone] 715 245 8000 x7609
>> > > >> > > > > > [image: Mobile] 706 870 3540
>> > > >> > > > > > [image: Web] procentive.com
>> > > >> > > > > > Confidentiality Notice: This message is intended for the
>> > sole
>> > > >> use
>> > > >> > of
>> > > >> > > > the
>> > > >> > > > > > individual and entity to which it is addressed, and may
>> > > contain
>> > > >> > > > > information
>> > > >> > > > > > that is privileged, confidential and exempt from
>> disclosure
>> > > >> under
>> > > >> > > > > > applicable law. Any unauthorized review, use, disclosure
>> or
>> > > >> > > > distribution
>> > > >> > > > > of
>> > > >> > > > > > this email message, including any attachment, is
>> prohibited.
>> > > If
>> > > >> you
>> > > >> > > are
>> > > >> > > > > not
>> > > >> > > > > > the intended recipient, please advise the sender by reply
>> > > email
>> > > >> and
>> > > >> > > > > destroy
>> > > >> > > > > > all copies of the original message.
>> > > >> > > > >
>> > > >> > > > >
>> > > >>
>> ---------------------------------------------------------------------
>> > > >> > > > > To unsubscribe, e-mail: users-unsubscribe@maven.apache.org
>> > > >> > > > > For additional commands, e-mail:
>> users-help@maven.apache.org
>> > > >> > > > >
>> > > >> > > > >
>> > > >> > > >
>> > > >> > >
>> > > >> >
>> > > >>
>> > > >
>> > >
>> >
>>
>

Re: Failsafe: Killing self fork JVM. PING timeout elapsed.

Posted by Tibor Digana <ti...@apache.org>.
Hey Jason,

Here is the PR I promissed https://github.com/apache/maven-surefire/pull/231
I will think about more tolerant algorithm.
Feel free to help us in this PR on GitHub.

Cheers
Tibor


On Wed, Apr 10, 2019 at 3:18 AM Jason Young <ja...@procentive.com>
wrote:

> Thanks.
>
> I finally took a little time to look at why the `ps` command would not
> work. For our project, we cannot use a default Maven image, so I made a
> custom one based on alpine. One piece of software the Maven alpine image
> has is `pcprops`; adding that fixes the problem:
>
> $ docker run -it alpine sh
> > / # apk update
> > ...
> > OK: 8445 distinct packages available
> > / # apk upgrade
> > ...
> > OK: 4 MiB in 13 packages
> > / # ps -o etime= -p 1
> > ps: unrecognized option: p
> > BusyBox v1.26.2 (2018-05-30 13:51:20 GMT) multi-call binary.
> > Usage: ps [-o COL1,COL2=HEADER]
> > Show list of processes
> > -o COL1,COL2=HEADER Select columns for display
> > / # apk add procps
> > ...
> > OK: 12 MiB in 19 packages
> > / # ps -o etime= -p 1
> >       05:21
>
>
> Hope that helps anyone else using a custom-built alpine-based Maven image.
>
> I'm not sure if I'll have time to experiment much with Surefire changes in
> the near future; my focus at the moment is figuring out all the reasons our
> test suite sometimes takes 5 or more times as long as normal (often also
> causes Surefire to kill the child process, but even if it didn't, we have a
> problem!).
>
> On Sun, Apr 7, 2019 at 2:04 AM Tibor Digana <ti...@apache.org>
> wrote:
>
> > Hi Jason,
> >
> > I am sorry I haven't reply.
> > Excellent analysis!
> >
> > Can you make an experiment and use pingJob on a "fixed delay" instead of
> a
> > "fixed rate"?
> > If you want to make more, feel free.
> >
> > We should mabe think about how we would add more intelligence in these
> > timers in order to detect the "blackout".
> >
> > Cheers
> > Tibor
> >
> >
> >
> >
> > On Fri, Mar 22, 2019 at 9:53 PM Jason Young <ja...@procentive.com>
> > wrote:
> >
> > > Notice:
> > > * an over 3-minute gap between the first two entries above--neither the
> > > command reader thread nor the pingJob had a chance to run.
> > > * the command reader thread and the ping job executor evidently get a
> > > chance to run from 19:18:02 to 19:19:19. They ran with increased
> > frequency
> > > during that time to "catch up": The command reader thread is programmed
> > to
> > > read from stdin until there's nothing left to read, and the
> pingScheduler
> > > runs the pingJob at a "fixed" rate.
> > > * Another 3-minute gap and another attempt to catch up at 19:22:28.
> This
> > > time though, the command reader thread evidently did not get a turn in
> > > time.
> > >
> > > I did not bother to log what threads the ping job was running under; I
> > can
> > > do that later. I also did not add any logging on the Maven process
> side;
> > I
> > > can do that as well.
> > >
> > > This is on a VM with unthrottled access to 4 processor cores and 8GB
> > RAM. I
> > > think the code under test is CPU- and RAM-intensive. There probably
> were
> > GC
> > > pauses, but I don't have proper logging for that in place.
> > >
> > > Note that this is all very normal behavior: Both the command reader and
> > the
> > > ping job's thread (run by the pingScheduler) are instructed to run
> until
> > > their work is done, not to sleep or yield. For high throughput, an OS
> > will
> > > favor continuing to run a thread for a while once it is started over
> > > switching tasks, so it's normal both that each thread would run for a
> > while
> > > after a pause, and (by extension) that each would have long pauses as
> > other
> > > threads had to run, too. Because of the long gaps between runs of the
> > > command reader thread, my previous advice (schedule the pingJob on a
> > "fixed
> > > delay" instead of a "fixed rate") would not have saved the build in
> this
> > > case (though "fixed delay" is more correct and would save the build in
> > > other possible cases). Our code base has some unit tests that rely on
> > > "normal case" timing, but they now have the @Ignore attribute because
> > they
> > > would rarely, but too often, fail. For most teams and projects, unit
> > tests
> > > have to be more reliable than that.
> > >
> > > So I think we need a solution that does not rely on preemptive
> scheduling
> > > from the OS or on thread scheduling, but which ensures that commnads
> have
> > > been read before running code depending on commands. There are multiple
> > > ways to do that; maybe I can find time to write a solution later.
> > >
> > >
> > > On Fri, Mar 22, 2019 at 3:46 PM Jason Young <
> jason.young@procentive.com>
> > > wrote:
> > >
> > > > Thanks Tibor. I'll use `encodeAndWriteToOutput` later. For now, I
> > hacked
> > > > together some temporary changes to ForkedBooter so that it "logs"
> (just
> > > > `System.out.println`) the current time every time:
> > > >
> > > >    - the command reader thread reads a NOOP via stdin,
> > > >    - it starts the ping job,
> > > >    - it checks for NOOP (`canUseNewPingMechanism(..)` returns
> `false`),
> > > >    and
> > > >    - it would have exited because there is no NOOP (`!pingDone`).
> > > >
> > > > I also commented out the code to exit in that case so it keeps
> running
> > > and
> > > > logging.
> > > >
> > > > The resulting output starts out showing that it's receiving ping
> every
> > > 10s
> > > > and checking for NOOP every 30s, as expected. But then one large test
> > > > begins, and:
> > > >
> > > > DEBUG 2019/03/22 19:14:40.223 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:02.723 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.139 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:18:03.172 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:18:03.186 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:18:03.188 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.243 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.307 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.308 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.202 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:18:03.323 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.337 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:18:03.351 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.351 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:18:03.351 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.352 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:18:03.366 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.393 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.380 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:18:03.409 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.424 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.423 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:18:03.424 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.438 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:18:03.438 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.453 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:18:03.454 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.476 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:18:03.476 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:03.505 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:18:03.505 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:08.872 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:18.859 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:19.282 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:18:19.330 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:18:28.845 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:38.832 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:48.838 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:18:49.241 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:18:49.265 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:19:01.162 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:19:10.069 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:19:19.136 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:19:19.199 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:19:19.199 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > WARN 2019/03/22 19:22:28.698 GMT ForkedBooter: Did not receive NOOP
> > from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > WARN 2019/03/22 19:22:28.698 GMT ForkedBooter: Did not receive NOOP
> > from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP
> > from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP
> > from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP
> > from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job
> (i.e.
> > > > checking if we read a ping earlier).
> > > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP
> from
> > > > Maven process.
> > > > WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP
> > from
> > > > Maven process.
> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
> > Maven
> > > > process.
> > > > ...
> > > >
> > > > With these changes, a project which normally takes 15 to 20 minutes
> to
> > > run
> > > > actually took 80m. Would printing directly to stdout from within a
> test
> > > do
> > > > that somehow? If not, we have a problem in our code.
> > > >
> > > > My comments on the above report to follow...
> > > >
> > > > On Fri, Mar 22, 2019 at 11:58 AM Tibor Digana <
> tibordigana@apache.org>
> > > > wrote:
> > > >
> > > >> >> is there an -Xmx value you would recommend to ensure the parent
> > > process
> > > >> can send the NOOP to the surefire process?
> > > >>
> > > >> no such value is generic value to recommend. Each application or
> test
> > > must
> > > >> find our the value to prevent from OOM.
> > > >>
> > > >>
> > > >> >> I'm making a custom build of surefire-booter to workaround the
> > > initial
> > > >> problem by commenting out the code to exit.
> > > >>
> > > >> Very good step forward. Then post your findings and we can continue
> on
> > > >> making a pull request on GitHub.
> > > >>
> > > >>
> > > >> >> Is there a logging API which I can use instead?
> > > >>
> > > >> Yes, it is very low level of logging (sending events to parent
> > process)
> > > >> but
> > > >> this will be turned to high level abstraction in several days
> (branch
> > > >> surefire-1222_2).
> > > >> Yet, pls use this function:
> > > >>
> > > >> encodeAndWriteToOutput()
> > > >>
> > > >>
> > > >> >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> On Fri, Mar 22, 2019 at 4:44 PM Jason Young <
> > jason.young@procentive.com
> > > >
> > > >> wrote:
> > > >>
> > > >> > To clarify, this is the image ours is based on:
> > > >> > https://hub.docker.com/_/alpine; we are not using a vanilla Maven
> > > >> image as
> > > >> > we have to add a few other items to this image as well. I don't
> know
> > > who
> > > >> > maintains that or adds /etc/mavenrc; I'll bring that up in another
> > > >> topic in
> > > >> > another forum. I was just answering a previous question, and also
> > > >> throwing
> > > >> > out info for anyone struggling with e.g. OOME in their Maven
> process
> > > (a
> > > >> lot
> > > >> > of people have that problem when running the Sonar scanner, for
> > > >> example). I
> > > >> > don't think we need to worry about why that file is added for this
> > > >> topic.
> > > >> >
> > > >> > But while we're on the subject, is there an -Xmx value you would
> > > >> recommend
> > > >> > to ensure the parent process can send the NOOP to the surefire
> > > process?
> > > >> >
> > > >> > I'm making a custom build of surefire-booter to workaround the
> > initial
> > > >> > problem by commenting out the code to exit. I am also adding
> logging
> > > via
> > > >> > stdout so I can see if ForkedBooter is indeed running the ping job
> > in
> > > >> small
> > > >> > increments of time. Is there a logging API which I can use
> instead?
> > > >> >
> > > >> > On Fri, Mar 22, 2019 at 5:44 AM Tibor Digana <
> > tibordigana@apache.org>
> > > >> > wrote:
> > > >> >
> > > >> > > The base images are developed in
> > > >> > https://github.com/carlossg/docker-maven,
> > > >> > > right?
> > > >> > > Who creates "/etc/mavenrc"?
> > > >> > >
> > > >> > > On Thu, Mar 21, 2019 at 12:05 AM Jason Young <
> > > >> jason.young@procentive.com
> > > >> > >
> > > >> > > wrote:
> > > >> > >
> > > >> > > > Mikael, sorry I do not appear to have permission to view the
> > link.
> > > >> > > >
> > > >> > > > I did some digging in the last couple of days. I see that the
> > > parent
> > > >> > > > process reads from stdin. I could not find anywhere that we
> are
> > > >> using
> > > >> > > > stdin. FWIW the failures nearly always happen at least 15m
> into
> > a
> > > >> ~20m
> > > >> > > test
> > > >> > > > run, so perf is a likely culprit.
> > > >> > > >
> > > >> > > > I see also that ForkedBooter reads commands from stdin in one
> > > >> thread,
> > > >> > and
> > > >> > > > uses an executor service to check for a past ping in
> > > >> > > > ForkedBooter.listenToShutdownCommands(..). When it checks, it
> > also
> > > >> sets
> > > >> > > > pingDone to false. The executor is configured to run up to 2
> > > threads
> > > >> > > > concurrently to handle the workload, and is set to run at a
> > fixed
> > > >> rate
> > > >> > > (not
> > > >> > > > a fixed delay). If the test suite is busy with testing and GC
> > and
> > > >> has
> > > >> > > lots
> > > >> > > > of threads running, it's entirely possible that a thread won't
> > > have
> > > >> a
> > > >> > > > chance to run for a long time (e.g. 5s). Maybe instead of a
> 30s
> > > >> delay,
> > > >> > > the
> > > >> > > > VM gets around to checking for a ping every 35s over a long
> span
> > > of
> > > >> > time.
> > > >> > > > Because we're running at a "fixed rate" and not a "fixed
> delay",
> > > >> then
> > > >> > > after
> > > >> > > > a couple of minutes we might be a full 30s behind schedule.
> It's
> > > >> > possible
> > > >> > > > the executor will create another thread to run the scheduled
> > task
> > > >> > because
> > > >> > > > it's running behind schedule. This new thread checks for a
> ping,
> > > >> finds
> > > >> > > it,
> > > >> > > > and sets pingDone to false. But then the original thread also
> > > runs,
> > > >> > say,
> > > >> > > 2
> > > >> > > > seconds afterwards, checks pingDone, and finds it is false.
> > > >> > > >
> > > >> > > > So to mitigate the problem, can we a) make the executor run
> > only 1
> > > >> > thread
> > > >> > > > and b) schedule the task at a fixed rate? For that matter, is
> > > there
> > > >> > > another
> > > >> > > > scheduled executor we can reuse? I understand why checking for
> > > ping
> > > >> > > > requires a separate executor. Should I ask in github?
> > > >> > > >
> > > >> > > > Regarding a previous question, I found out that Alpine's Maven
> > > >> package
> > > >> > > > comes with an /etc/mavenrc that sets `MAVEN_OPTS="$MAVEN_OPTS
> > > >> > -Xmx512m"`
> > > >> > > > which cannot be undone by setting `MAVEN_OPTS` at the command
> > > line;
> > > >> you
> > > >> > > end
> > > >> > > > up with e.g. `-Xmx1g -Xmx512m`. (Note this applies to the
> Maven
> > > >> > (parent)
> > > >> > > > process, not the surefire/failsafe (child) process.)
> > > >> > > >
> > > >> > > > On Wed, Mar 20, 2019 at 3:46 AM Bernd Eckenfels <
> > > >> > ecki@zusammenkunft.net>
> > > >> > > > wrote:
> > > >> > > >
> > > >> > > > > I guess a timeout caused by FullGC can happen with TCP as
> > well.
> > > >> > > > Increasing
> > > >> > > > > the timeout might not be nice but does look like it would
> help
> > > in
> > > >> > both
> > > >> > > > > cases. (Problems with stdout are more related to unexpected
> > JVM
> > > >> > > messages
> > > >> > > > I
> > > >> > > > > guess)
> > > >> > > > >
> > > >> > > > > Gruss
> > > >> > > > > Bernd
> > > >> > > > > --
> > > >> > > > > http://bernd.eckenfels.net
> > > >> > > > >
> > > >> > > > > ________________________________
> > > >> > > > > Von: Mikael Åsberg <m....@gmail.com>
> > > >> > > > > Gesendet: Mittwoch, März 20, 2019 9:40 AM
> > > >> > > > > An: Maven Users List
> > > >> > > > > Betreff: Re: Failsafe: Killing self fork JVM. PING timeout
> > > >> elapsed.
> > > >> > > > >
> > > >> > > > > These issues regarding communication with forked JVMs, won't
> > > they
> > > >> be
> > > >> > > > > resolved once surefire moves to interprocess communication
> > using
> > > >> > > > > tcp/ip sockets? This happens to be the target feature to be
> > > >> included
> > > >> > > > > in the next surefire 3.0.0 milestone:
> > > >> > > > >
> > > >> https://issues.apache.org/jira/projects/SUREFIRE/versions/12344668
> > > >> > > > >
> > > >> > > > > There are soooo many issues relating to surefire reading
> > stdout
> > > of
> > > >> > > > > forked processes (which is my understanding that it is
> > currently
> > > >> > > > > doing). Many of us are really looking forward to the next
> > > >> milestone.
> > > >> > > > >
> > > >> > > > > On Tue, Mar 19, 2019 at 8:59 PM Jason Young <
> > > >> > > jason.young@procentive.com>
> > > >> > > > > wrote:
> > > >> > > > > >
> > > >> > > > > > Getting back to my original questions, I know that "ping"
> > > means
> > > >> to
> > > >> > > see
> > > >> > > > > if a
> > > >> > > > > > process is there, and "NOOP" implies it's not a command to
> > do
> > > >> > > anything.
> > > >> > > > > But
> > > >> > > > > > what do the terms "ping" and "NOOP" mean in this context,
> > i.e.
> > > >> how
> > > >> > do
> > > >> > > > the
> > > >> > > > > > processes communicate? I assume they don't sonar. Do other
> > > >> > processes
> > > >> > > > also
> > > >> > > > > > ping NOOPs? Can I PING Chrome with a NOOP from bash? Is it
> > > with
> > > >> > TCP?
> > > >> > > > > >
> > > >> > > > > > I'm confused about what I should do regarding GC pauses.
> > > >> > Previously I
> > > >> > > > had
> > > >> > > > > > code that would write the amount of remaining heap space
> (or
> > > >> > > something
> > > >> > > > > like
> > > >> > > > > > that) to stdout after every test to troubleshoot OOMEs.
> Can
> > > >> writing
> > > >> > > to
> > > >> > > > > > stdout cause the communication failure somehow?
> > > >> > > > > >
> > > >> > > > > > On Wed, Mar 13, 2019 at 5:57 PM Jason Young <
> > > >> > > > jason.young@procentive.com>
> > > >> > > > > > wrote:
> > > >> > > > > >
> > > >> > > > > > > I upgraded failsafe and surefire to 3.0.0-M3 as advised;
> > we
> > > >> > > > encountered
> > > >> > > > > > > the same exception. (Still using -Xmx5g, will switch to
> > > OpenJ9
> > > >> > soon
> > > >> > > > in
> > > >> > > > > case
> > > >> > > > > > > that helps.)
> > > >> > > > > > >
> > > >> > > > > > > BTW I also asked on StackOverflow previously, for anyone
> > > >> > > interested:
> > > >> > > > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
> https://stackoverflow.com/questions/54755846/killing-self-fork-jvm-ping-timeout-elapsed
> > > >> > > > > > >
> > > >> > > > > > > On Tue, Feb 26, 2019 at 6:40 PM Jason Young <
> > > >> > > > > jason.young@procentive.com>
> > > >> > > > > > > wrote:
> > > >> > > > > > >
> > > >> > > > > > >> Thanks again for the information.
> > > >> > > > > > >>
> > > >> > > > > > >> We had increased the RAM to 3g some time ago to prevent
> > > >> OOMEs.
> > > >> > > More
> > > >> > > > > > >> recently, I increased the RAM again to 5g for extra
> > > headroom
> > > >> > since
> > > >> > > > we
> > > >> > > > > had
> > > >> > > > > > >> more headroom available; the problem hasn't happened
> > since,
> > > >> but
> > > >> > it
> > > >> > > > > hasn't
> > > >> > > > > > >> been very long.
> > > >> > > > > > >>
> > > >> > > > > > >> We use a more customized image based on Alpine 3.8.2.
> The
> > > JDK
> > > >> > and
> > > >> > > > > Maven
> > > >> > > > > > >> are obtained via apk.
> > > >> > > > > > >>
> > > >> > > > > > >> I will try upgrading failsafe (and surefire while I'm
> at
> > > it)
> > > >> > > sooner,
> > > >> > > > > and
> > > >> > > > > > >> probably do some experimentation with JVMs another time
> > > (not
> > > >> > > > pressing
> > > >> > > > > for
> > > >> > > > > > >> me ATM).
> > > >> > > > > > >>
> > > >> > > > > > >> On Tue, Feb 26, 2019 at 12:20 PM Tibor Digana <
> > > >> > > > tibordigana@apache.org
> > > >> > > > > >
> > > >> > > > > > >> wrote:
> > > >> > > > > > >>
> > > >> > > > > > >>> >> I'll try to enable some logging about GC pauses to
> > see
> > > >> > what's
> > > >> > > up
> > > >> > > > > > >>>
> > > >> > > > > > >>> Pls do not keep such setting after tuning the GC
> because
> > > >> this
> > > >> > may
> > > >> > > > > > >>> sometime
> > > >> > > > > > >>> break the interprocess communication between Maven
> > process
> > > >> and
> > > >> > > > > surefire
> > > >> > > > > > >>> process.
> > > >> > > > > > >>> It's worth to list GC information in a file and not in
> > the
> > > >> > > console
> > > >> > > > > logs.
> > > >> > > > > > >>> This can be configured, I guess.
> > > >> > > > > > >>>
> > > >> > > > > > >>> >> Do you think the value is simply too low?
> > > >> > > > > > >>>
> > > >> > > > > > >>> GCing many objects may take some time and I remember
> we
> > > had
> > > >> a
> > > >> > > user
> > > >> > > > > who
> > > >> > > > > > >>> had
> > > >> > > > > > >>> this problem a year or two ago.
> > > >> > > > > > >>> We check every third NOOP (which is 3 x 10 sec) as a
> fix
> > > >> > instead
> > > >> > > of
> > > >> > > > > every
> > > >> > > > > > >>> NOP. So 30 seconds looked satisfactory.
> > > >> > > > > > >>> I think you use old version 2.20 or something like
> that.
> > > The
> > > >> > > fixes
> > > >> > > > > for
> > > >> > > > > > >>> docker have been done so far, so please use the latest
> > > >> version
> > > >> > > > > 3.0.0-M3.
> > > >> > > > > > >>> See this page
> > > >> > > > > > >>>
> > > >> > > >
> > > https://maven.apache.org/surefire/maven-surefire-plugin/docker.html
> > > >> ,
> > > >> > > > > we
> > > >> > > > > > >>> used maven:3.5.3-jdk-8-alpine in this test. Which base
> > > image
> > > >> > did
> > > >> > > > you
> > > >> > > > > use?
> > > >> > > > > > >>>
> > > >> > > > > > >>> Cheers
> > > >> > > > > > >>> Tibor
> > > >> > > > > > >>>
> > > >> > > > > > >>> On Tue, Feb 26, 2019 at 5:24 PM Jason Young <
> > > >> > > > > jason.young@procentive.com>
> > > >> > > > > > >>> wrote:
> > > >> > > > > > >>>
> > > >> > > > > > >>> > Thanks for the information. It's good to see someone
> > > >> > > understands
> > > >> > > > a
> > > >> > > > > > >>> little
> > > >> > > > > > >>> > about this.
> > > >> > > > > > >>> >
> > > >> > > > > > >>> > Incidentally, we have been looking at other GCs and
> > VMs
> > > >> for
> > > >> > the
> > > >> > > > > > >>> application
> > > >> > > > > > >>> > in production environments, so I'll look into how
> > these
> > > >> > affect
> > > >> > > > > tests as
> > > >> > > > > > >>> > well. I'll try to enable some logging about GC
> pauses
> > to
> > > >> see
> > > >> > > > > what's up.
> > > >> > > > > > >>> >
> > > >> > > > > > >>> > How would `-Xmx3g` cause long GC cycles? Do you
> think
> > > the
> > > >> > value
> > > >> > > > is
> > > >> > > > > > >>> simply
> > > >> > > > > > >>> > too low?
> > > >> > > > > > >>> >
> > > >> > > > > > >>> > FWIW we're running the Maven build in an
> Alpine-based
> > > >> Docker
> > > >> > > > > container.
> > > >> > > > > > >>> >
> > > >> > > > > > >>> > On Sat, Feb 23, 2019 at 6:36 AM Tibor Digana <
> > > >> > > > > tibordigana@apache.org>
> > > >> > > > > > >>> > wrote:
> > > >> > > > > > >>> >
> > > >> > > > > > >>> > > Hi Jason,
> > > >> > > > > > >>> > >
> > > >> > > > > > >>> > > We spoke about this issue on our chat in ASF
> Slack:
> > > >> > > > > > >>> > > "I think his tests have been paused for a long GC
> > > >> periods
> > > >> > and
> > > >> > > > > timed
> > > >> > > > > > >>> out
> > > >> > > > > > >>> > 3x
> > > >> > > > > > >>> > > PING period = 30 seconds. After this period forked
> > JVM
> > > >> > > supposed
> > > >> > > > > the
> > > >> > > > > > >>> Maven
> > > >> > > > > > >>> > > process was killed by JenkinsCI and therefore all
> > > >> surefire
> > > >> > > > > processes
> > > >> > > > > > >>> are
> > > >> > > > > > >>> > > killed as well and all the file handlers and
> memory
> > > >> > > > consumptions
> > > >> > > > > are
> > > >> > > > > > >>> > > freed."
> > > >> > > > > > >>> > >
> > > >> > > > > > >>> > > "But I have to say that `-Xmx3g` may cause long GC
> > > >> cycles,
> > > >> > > see
> > > >> > > > > > >>> > >
> > > >> > > > > > >>> > >
> > > >> > > > > > >>> >
> > > >> > > > > > >>>
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
> https://maven.apache.org/surefire/maven-surefire-plugin/examples/shutdown.html
> > > >> > > > > > >>> > > "
> > > >> > > > > > >>> > >
> > > >> > > > > > >>> > > You are using java-1.8-openjdk. I guess you should
> > use
> > > >> > > > > Shenandoah GC
> > > >> > > > > > >>> > which
> > > >> > > > > > >>> > > is an experimental algorithm in JVM 1.8. This
> would
> > > >> > > > significantly
> > > >> > > > > > >>> short
> > > >> > > > > > >>> > > the GC cycles.
> > > >> > > > > > >>> > >
> > > >> > > > > > >>> > > We should of cource provide a new configuration
> > > >> parameter
> > > >> > to
> > > >> > > > give
> > > >> > > > > > >>> you a
> > > >> > > > > > >>> > > chance to prolong the PING.
> > > >> > > > > > >>> > >
> > > >> > > > > > >>> > > Cheers
> > > >> > > > > > >>> > > Tibor
> > > >> > > > > > >>> > >
> > > >> > > > > > >>> >
> > > >> > > > > > >>> >
> > > >> > > > > > >>> > --
> > > >> > > > > > >>> >
> > > >> > > > > > >>> > Jason Young
> > > >> > > > > > >>> >
> > > >> > > > > > >>>
> > > >> > > > > > >>
> > > >> > > > > > >>
> > > >> > > > > >
> > > >> > > > > > --
> > > >> > > > > > Jason Young
> > > >> > > > > > Software Engineer | PROCENTIVE
> > > >> > > > > > [image: Phone] 715 245 8000 x7609
> > > >> > > > > > [image: Mobile] 706 870 3540
> > > >> > > > > > [image: Web] procentive.com
> > > >> > > > > > Confidentiality Notice: This message is intended for the
> > sole
> > > >> use
> > > >> > of
> > > >> > > > the
> > > >> > > > > > individual and entity to which it is addressed, and may
> > > contain
> > > >> > > > > information
> > > >> > > > > > that is privileged, confidential and exempt from
> disclosure
> > > >> under
> > > >> > > > > > applicable law. Any unauthorized review, use, disclosure
> or
> > > >> > > > distribution
> > > >> > > > > of
> > > >> > > > > > this email message, including any attachment, is
> prohibited.
> > > If
> > > >> you
> > > >> > > are
> > > >> > > > > not
> > > >> > > > > > the intended recipient, please advise the sender by reply
> > > email
> > > >> and
> > > >> > > > > destroy
> > > >> > > > > > all copies of the original message.
> > > >> > > > >
> > > >> > > > >
> > > >>
> ---------------------------------------------------------------------
> > > >> > > > > To unsubscribe, e-mail: users-unsubscribe@maven.apache.org
> > > >> > > > > For additional commands, e-mail:
> users-help@maven.apache.org
> > > >> > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > >
> >
>

Re: Failsafe: Killing self fork JVM. PING timeout elapsed.

Posted by Jason Young <ja...@procentive.com>.
Thanks.

I finally took a little time to look at why the `ps` command would not
work. For our project, we cannot use a default Maven image, so I made a
custom one based on alpine. One piece of software the Maven alpine image
has is `pcprops`; adding that fixes the problem:

$ docker run -it alpine sh
> / # apk update
> ...
> OK: 8445 distinct packages available
> / # apk upgrade
> ...
> OK: 4 MiB in 13 packages
> / # ps -o etime= -p 1
> ps: unrecognized option: p
> BusyBox v1.26.2 (2018-05-30 13:51:20 GMT) multi-call binary.
> Usage: ps [-o COL1,COL2=HEADER]
> Show list of processes
> -o COL1,COL2=HEADER Select columns for display
> / # apk add procps
> ...
> OK: 12 MiB in 19 packages
> / # ps -o etime= -p 1
>       05:21


Hope that helps anyone else using a custom-built alpine-based Maven image.

I'm not sure if I'll have time to experiment much with Surefire changes in
the near future; my focus at the moment is figuring out all the reasons our
test suite sometimes takes 5 or more times as long as normal (often also
causes Surefire to kill the child process, but even if it didn't, we have a
problem!).

On Sun, Apr 7, 2019 at 2:04 AM Tibor Digana <ti...@apache.org> wrote:

> Hi Jason,
>
> I am sorry I haven't reply.
> Excellent analysis!
>
> Can you make an experiment and use pingJob on a "fixed delay" instead of a
> "fixed rate"?
> If you want to make more, feel free.
>
> We should mabe think about how we would add more intelligence in these
> timers in order to detect the "blackout".
>
> Cheers
> Tibor
>
>
>
>
> On Fri, Mar 22, 2019 at 9:53 PM Jason Young <ja...@procentive.com>
> wrote:
>
> > Notice:
> > * an over 3-minute gap between the first two entries above--neither the
> > command reader thread nor the pingJob had a chance to run.
> > * the command reader thread and the ping job executor evidently get a
> > chance to run from 19:18:02 to 19:19:19. They ran with increased
> frequency
> > during that time to "catch up": The command reader thread is programmed
> to
> > read from stdin until there's nothing left to read, and the pingScheduler
> > runs the pingJob at a "fixed" rate.
> > * Another 3-minute gap and another attempt to catch up at 19:22:28. This
> > time though, the command reader thread evidently did not get a turn in
> > time.
> >
> > I did not bother to log what threads the ping job was running under; I
> can
> > do that later. I also did not add any logging on the Maven process side;
> I
> > can do that as well.
> >
> > This is on a VM with unthrottled access to 4 processor cores and 8GB
> RAM. I
> > think the code under test is CPU- and RAM-intensive. There probably were
> GC
> > pauses, but I don't have proper logging for that in place.
> >
> > Note that this is all very normal behavior: Both the command reader and
> the
> > ping job's thread (run by the pingScheduler) are instructed to run until
> > their work is done, not to sleep or yield. For high throughput, an OS
> will
> > favor continuing to run a thread for a while once it is started over
> > switching tasks, so it's normal both that each thread would run for a
> while
> > after a pause, and (by extension) that each would have long pauses as
> other
> > threads had to run, too. Because of the long gaps between runs of the
> > command reader thread, my previous advice (schedule the pingJob on a
> "fixed
> > delay" instead of a "fixed rate") would not have saved the build in this
> > case (though "fixed delay" is more correct and would save the build in
> > other possible cases). Our code base has some unit tests that rely on
> > "normal case" timing, but they now have the @Ignore attribute because
> they
> > would rarely, but too often, fail. For most teams and projects, unit
> tests
> > have to be more reliable than that.
> >
> > So I think we need a solution that does not rely on preemptive scheduling
> > from the OS or on thread scheduling, but which ensures that commnads have
> > been read before running code depending on commands. There are multiple
> > ways to do that; maybe I can find time to write a solution later.
> >
> >
> > On Fri, Mar 22, 2019 at 3:46 PM Jason Young <ja...@procentive.com>
> > wrote:
> >
> > > Thanks Tibor. I'll use `encodeAndWriteToOutput` later. For now, I
> hacked
> > > together some temporary changes to ForkedBooter so that it "logs" (just
> > > `System.out.println`) the current time every time:
> > >
> > >    - the command reader thread reads a NOOP via stdin,
> > >    - it starts the ping job,
> > >    - it checks for NOOP (`canUseNewPingMechanism(..)` returns `false`),
> > >    and
> > >    - it would have exited because there is no NOOP (`!pingDone`).
> > >
> > > I also commented out the code to exit in that case so it keeps running
> > and
> > > logging.
> > >
> > > The resulting output starts out showing that it's receiving ping every
> > 10s
> > > and checking for NOOP every 30s, as expected. But then one large test
> > > begins, and:
> > >
> > > DEBUG 2019/03/22 19:14:40.223 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:02.723 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:03.139 GMT ForkedBooter: Running ping job (i.e.
> > > checking if we read a ping earlier).
> > > DEBUG 2019/03/22 19:18:03.172 GMT ForkedBooter: Checking for NOOP from
> > > Maven process.
> > > DEBUG 2019/03/22 19:18:03.186 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Running ping job (i.e.
> > > checking if we read a ping earlier).
> > > DEBUG 2019/03/22 19:18:03.188 GMT ForkedBooter: Checking for NOOP from
> > > Maven process.
> > > DEBUG 2019/03/22 19:18:03.187 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:03.243 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:03.307 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:03.308 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:03.202 GMT ForkedBooter: Running ping job (i.e.
> > > checking if we read a ping earlier).
> > > DEBUG 2019/03/22 19:18:03.323 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:03.337 GMT ForkedBooter: Checking for NOOP from
> > > Maven process.
> > > DEBUG 2019/03/22 19:18:03.351 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:03.351 GMT ForkedBooter: Running ping job (i.e.
> > > checking if we read a ping earlier).
> > > DEBUG 2019/03/22 19:18:03.351 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:03.352 GMT ForkedBooter: Checking for NOOP from
> > > Maven process.
> > > DEBUG 2019/03/22 19:18:03.366 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:03.393 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:03.380 GMT ForkedBooter: Running ping job (i.e.
> > > checking if we read a ping earlier).
> > > DEBUG 2019/03/22 19:18:03.409 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:03.424 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:03.423 GMT ForkedBooter: Checking for NOOP from
> > > Maven process.
> > > DEBUG 2019/03/22 19:18:03.424 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:03.438 GMT ForkedBooter: Running ping job (i.e.
> > > checking if we read a ping earlier).
> > > DEBUG 2019/03/22 19:18:03.438 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:03.453 GMT ForkedBooter: Checking for NOOP from
> > > Maven process.
> > > DEBUG 2019/03/22 19:18:03.454 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:03.476 GMT ForkedBooter: Running ping job (i.e.
> > > checking if we read a ping earlier).
> > > DEBUG 2019/03/22 19:18:03.476 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:03.505 GMT ForkedBooter: Checking for NOOP from
> > > Maven process.
> > > DEBUG 2019/03/22 19:18:03.505 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:08.872 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:18.859 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:19.282 GMT ForkedBooter: Running ping job (i.e.
> > > checking if we read a ping earlier).
> > > DEBUG 2019/03/22 19:18:19.330 GMT ForkedBooter: Checking for NOOP from
> > > Maven process.
> > > DEBUG 2019/03/22 19:18:28.845 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:38.832 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:48.838 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:18:49.241 GMT ForkedBooter: Running ping job (i.e.
> > > checking if we read a ping earlier).
> > > DEBUG 2019/03/22 19:18:49.265 GMT ForkedBooter: Checking for NOOP from
> > > Maven process.
> > > DEBUG 2019/03/22 19:19:01.162 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:19:10.069 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:19:19.136 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:19:19.199 GMT ForkedBooter: Running ping job (i.e.
> > > checking if we read a ping earlier).
> > > DEBUG 2019/03/22 19:19:19.199 GMT ForkedBooter: Checking for NOOP from
> > > Maven process.
> > > DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Running ping job (i.e.
> > > checking if we read a ping earlier).
> > > DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Checking for NOOP from
> > > Maven process.
> > > WARN 2019/03/22 19:22:28.698 GMT ForkedBooter: Did not receive NOOP
> from
> > > Maven process.
> > > DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Running ping job (i.e.
> > > checking if we read a ping earlier).
> > > DEBUG 2019/03/22 19:22:28.698 GMT ForkedBooter: Checking for NOOP from
> > > Maven process.
> > > WARN 2019/03/22 19:22:28.698 GMT ForkedBooter: Did not receive NOOP
> from
> > > Maven process.
> > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job (i.e.
> > > checking if we read a ping earlier).
> > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP from
> > > Maven process.
> > > WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP
> from
> > > Maven process.
> > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job (i.e.
> > > checking if we read a ping earlier).
> > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP from
> > > Maven process.
> > > WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP
> from
> > > Maven process.
> > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job (i.e.
> > > checking if we read a ping earlier).
> > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP from
> > > Maven process.
> > > WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP
> from
> > > Maven process.
> > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Running ping job (i.e.
> > > checking if we read a ping earlier).
> > > DEBUG 2019/03/22 19:22:28.699 GMT ForkedBooter: Checking for NOOP from
> > > Maven process.
> > > WARN 2019/03/22 19:22:28.699 GMT ForkedBooter: Did not receive NOOP
> from
> > > Maven process.
> > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:22:28.703 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > DEBUG 2019/03/22 19:22:28.704 GMT ForkedBooter: Received ping from
> Maven
> > > process.
> > > ...
> > >
> > > With these changes, a project which normally takes 15 to 20 minutes to
> > run
> > > actually took 80m. Would printing directly to stdout from within a test
> > do
> > > that somehow? If not, we have a problem in our code.
> > >
> > > My comments on the above report to follow...
> > >
> > > On Fri, Mar 22, 2019 at 11:58 AM Tibor Digana <ti...@apache.org>
> > > wrote:
> > >
> > >> >> is there an -Xmx value you would recommend to ensure the parent
> > process
> > >> can send the NOOP to the surefire process?
> > >>
> > >> no such value is generic value to recommend. Each application or test
> > must
> > >> find our the value to prevent from OOM.
> > >>
> > >>
> > >> >> I'm making a custom build of surefire-booter to workaround the
> > initial
> > >> problem by commenting out the code to exit.
> > >>
> > >> Very good step forward. Then post your findings and we can continue on
> > >> making a pull request on GitHub.
> > >>
> > >>
> > >> >> Is there a logging API which I can use instead?
> > >>
> > >> Yes, it is very low level of logging (sending events to parent
> process)
> > >> but
> > >> this will be turned to high level abstraction in several days (branch
> > >> surefire-1222_2).
> > >> Yet, pls use this function:
> > >>
> > >> encodeAndWriteToOutput()
> > >>
> > >>
> > >> >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Fri, Mar 22, 2019 at 4:44 PM Jason Young <
> jason.young@procentive.com
> > >
> > >> wrote:
> > >>
> > >> > To clarify, this is the image ours is based on:
> > >> > https://hub.docker.com/_/alpine; we are not using a vanilla Maven
> > >> image as
> > >> > we have to add a few other items to this image as well. I don't know
> > who
> > >> > maintains that or adds /etc/mavenrc; I'll bring that up in another
> > >> topic in
> > >> > another forum. I was just answering a previous question, and also
> > >> throwing
> > >> > out info for anyone struggling with e.g. OOME in their Maven process
> > (a
> > >> lot
> > >> > of people have that problem when running the Sonar scanner, for
> > >> example). I
> > >> > don't think we need to worry about why that file is added for this
> > >> topic.
> > >> >
> > >> > But while we're on the subject, is there an -Xmx value you would
> > >> recommend
> > >> > to ensure the parent process can send the NOOP to the surefire
> > process?
> > >> >
> > >> > I'm making a custom build of surefire-booter to workaround the
> initial
> > >> > problem by commenting out the code to exit. I am also adding logging
> > via
> > >> > stdout so I can see if ForkedBooter is indeed running the ping job
> in
> > >> small
> > >> > increments of time. Is there a logging API which I can use instead?
> > >> >
> > >> > On Fri, Mar 22, 2019 at 5:44 AM Tibor Digana <
> tibordigana@apache.org>
> > >> > wrote:
> > >> >
> > >> > > The base images are developed in
> > >> > https://github.com/carlossg/docker-maven,
> > >> > > right?
> > >> > > Who creates "/etc/mavenrc"?
> > >> > >
> > >> > > On Thu, Mar 21, 2019 at 12:05 AM Jason Young <
> > >> jason.young@procentive.com
> > >> > >
> > >> > > wrote:
> > >> > >
> > >> > > > Mikael, sorry I do not appear to have permission to view the
> link.
> > >> > > >
> > >> > > > I did some digging in the last couple of days. I see that the
> > parent
> > >> > > > process reads from stdin. I could not find anywhere that we are
> > >> using
> > >> > > > stdin. FWIW the failures nearly always happen at least 15m into
> a
> > >> ~20m
> > >> > > test
> > >> > > > run, so perf is a likely culprit.
> > >> > > >
> > >> > > > I see also that ForkedBooter reads commands from stdin in one
> > >> thread,
> > >> > and
> > >> > > > uses an executor service to check for a past ping in
> > >> > > > ForkedBooter.listenToShutdownCommands(..). When it checks, it
> also
> > >> sets
> > >> > > > pingDone to false. The executor is configured to run up to 2
> > threads
> > >> > > > concurrently to handle the workload, and is set to run at a
> fixed
> > >> rate
> > >> > > (not
> > >> > > > a fixed delay). If the test suite is busy with testing and GC
> and
> > >> has
> > >> > > lots
> > >> > > > of threads running, it's entirely possible that a thread won't
> > have
> > >> a
> > >> > > > chance to run for a long time (e.g. 5s). Maybe instead of a 30s
> > >> delay,
> > >> > > the
> > >> > > > VM gets around to checking for a ping every 35s over a long span
> > of
> > >> > time.
> > >> > > > Because we're running at a "fixed rate" and not a "fixed delay",
> > >> then
> > >> > > after
> > >> > > > a couple of minutes we might be a full 30s behind schedule. It's
> > >> > possible
> > >> > > > the executor will create another thread to run the scheduled
> task
> > >> > because
> > >> > > > it's running behind schedule. This new thread checks for a ping,
> > >> finds
> > >> > > it,
> > >> > > > and sets pingDone to false. But then the original thread also
> > runs,
> > >> > say,
> > >> > > 2
> > >> > > > seconds afterwards, checks pingDone, and finds it is false.
> > >> > > >
> > >> > > > So to mitigate the problem, can we a) make the executor run
> only 1
> > >> > thread
> > >> > > > and b) schedule the task at a fixed rate? For that matter, is
> > there
> > >> > > another
> > >> > > > scheduled executor we can reuse? I understand why checking for
> > ping
> > >> > > > requires a separate executor. Should I ask in github?
> > >> > > >
> > >> > > > Regarding a previous question, I found out that Alpine's Maven
> > >> package
> > >> > > > comes with an /etc/mavenrc that sets `MAVEN_OPTS="$MAVEN_OPTS
> > >> > -Xmx512m"`
> > >> > > > which cannot be undone by setting `MAVEN_OPTS` at the command
> > line;
> > >> you
> > >> > > end
> > >> > > > up with e.g. `-Xmx1g -Xmx512m`. (Note this applies to the Maven
> > >> > (parent)
> > >> > > > process, not the surefire/failsafe (child) process.)
> > >> > > >
> > >> > > > On Wed, Mar 20, 2019 at 3:46 AM Bernd Eckenfels <
> > >> > ecki@zusammenkunft.net>
> > >> > > > wrote:
> > >> > > >
> > >> > > > > I guess a timeout caused by FullGC can happen with TCP as
> well.
> > >> > > > Increasing
> > >> > > > > the timeout might not be nice but does look like it would help
> > in
> > >> > both
> > >> > > > > cases. (Problems with stdout are more related to unexpected
> JVM
> > >> > > messages
> > >> > > > I
> > >> > > > > guess)
> > >> > > > >
> > >> > > > > Gruss
> > >> > > > > Bernd
> > >> > > > > --
> > >> > > > > http://bernd.eckenfels.net
> > >> > > > >
> > >> > > > > ________________________________
> > >> > > > > Von: Mikael Åsberg <m....@gmail.com>
> > >> > > > > Gesendet: Mittwoch, März 20, 2019 9:40 AM
> > >> > > > > An: Maven Users List
> > >> > > > > Betreff: Re: Failsafe: Killing self fork JVM. PING timeout
> > >> elapsed.
> > >> > > > >
> > >> > > > > These issues regarding communication with forked JVMs, won't
> > they
> > >> be
> > >> > > > > resolved once surefire moves to interprocess communication
> using
> > >> > > > > tcp/ip sockets? This happens to be the target feature to be
> > >> included
> > >> > > > > in the next surefire 3.0.0 milestone:
> > >> > > > >
> > >> https://issues.apache.org/jira/projects/SUREFIRE/versions/12344668
> > >> > > > >
> > >> > > > > There are soooo many issues relating to surefire reading
> stdout
> > of
> > >> > > > > forked processes (which is my understanding that it is
> currently
> > >> > > > > doing). Many of us are really looking forward to the next
> > >> milestone.
> > >> > > > >
> > >> > > > > On Tue, Mar 19, 2019 at 8:59 PM Jason Young <
> > >> > > jason.young@procentive.com>
> > >> > > > > wrote:
> > >> > > > > >
> > >> > > > > > Getting back to my original questions, I know that "ping"
> > means
> > >> to
> > >> > > see
> > >> > > > > if a
> > >> > > > > > process is there, and "NOOP" implies it's not a command to
> do
> > >> > > anything.
> > >> > > > > But
> > >> > > > > > what do the terms "ping" and "NOOP" mean in this context,
> i.e.
> > >> how
> > >> > do
> > >> > > > the
> > >> > > > > > processes communicate? I assume they don't sonar. Do other
> > >> > processes
> > >> > > > also
> > >> > > > > > ping NOOPs? Can I PING Chrome with a NOOP from bash? Is it
> > with
> > >> > TCP?
> > >> > > > > >
> > >> > > > > > I'm confused about what I should do regarding GC pauses.
> > >> > Previously I
> > >> > > > had
> > >> > > > > > code that would write the amount of remaining heap space (or
> > >> > > something
> > >> > > > > like
> > >> > > > > > that) to stdout after every test to troubleshoot OOMEs. Can
> > >> writing
> > >> > > to
> > >> > > > > > stdout cause the communication failure somehow?
> > >> > > > > >
> > >> > > > > > On Wed, Mar 13, 2019 at 5:57 PM Jason Young <
> > >> > > > jason.young@procentive.com>
> > >> > > > > > wrote:
> > >> > > > > >
> > >> > > > > > > I upgraded failsafe and surefire to 3.0.0-M3 as advised;
> we
> > >> > > > encountered
> > >> > > > > > > the same exception. (Still using -Xmx5g, will switch to
> > OpenJ9
> > >> > soon
> > >> > > > in
> > >> > > > > case
> > >> > > > > > > that helps.)
> > >> > > > > > >
> > >> > > > > > > BTW I also asked on StackOverflow previously, for anyone
> > >> > > interested:
> > >> > > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://stackoverflow.com/questions/54755846/killing-self-fork-jvm-ping-timeout-elapsed
> > >> > > > > > >
> > >> > > > > > > On Tue, Feb 26, 2019 at 6:40 PM Jason Young <
> > >> > > > > jason.young@procentive.com>
> > >> > > > > > > wrote:
> > >> > > > > > >
> > >> > > > > > >> Thanks again for the information.
> > >> > > > > > >>
> > >> > > > > > >> We had increased the RAM to 3g some time ago to prevent
> > >> OOMEs.
> > >> > > More
> > >> > > > > > >> recently, I increased the RAM again to 5g for extra
> > headroom
> > >> > since
> > >> > > > we
> > >> > > > > had
> > >> > > > > > >> more headroom available; the problem hasn't happened
> since,
> > >> but
> > >> > it
> > >> > > > > hasn't
> > >> > > > > > >> been very long.
> > >> > > > > > >>
> > >> > > > > > >> We use a more customized image based on Alpine 3.8.2. The
> > JDK
> > >> > and
> > >> > > > > Maven
> > >> > > > > > >> are obtained via apk.
> > >> > > > > > >>
> > >> > > > > > >> I will try upgrading failsafe (and surefire while I'm at
> > it)
> > >> > > sooner,
> > >> > > > > and
> > >> > > > > > >> probably do some experimentation with JVMs another time
> > (not
> > >> > > > pressing
> > >> > > > > for
> > >> > > > > > >> me ATM).
> > >> > > > > > >>
> > >> > > > > > >> On Tue, Feb 26, 2019 at 12:20 PM Tibor Digana <
> > >> > > > tibordigana@apache.org
> > >> > > > > >
> > >> > > > > > >> wrote:
> > >> > > > > > >>
> > >> > > > > > >>> >> I'll try to enable some logging about GC pauses to
> see
> > >> > what's
> > >> > > up
> > >> > > > > > >>>
> > >> > > > > > >>> Pls do not keep such setting after tuning the GC because
> > >> this
> > >> > may
> > >> > > > > > >>> sometime
> > >> > > > > > >>> break the interprocess communication between Maven
> process
> > >> and
> > >> > > > > surefire
> > >> > > > > > >>> process.
> > >> > > > > > >>> It's worth to list GC information in a file and not in
> the
> > >> > > console
> > >> > > > > logs.
> > >> > > > > > >>> This can be configured, I guess.
> > >> > > > > > >>>
> > >> > > > > > >>> >> Do you think the value is simply too low?
> > >> > > > > > >>>
> > >> > > > > > >>> GCing many objects may take some time and I remember we
> > had
> > >> a
> > >> > > user
> > >> > > > > who
> > >> > > > > > >>> had
> > >> > > > > > >>> this problem a year or two ago.
> > >> > > > > > >>> We check every third NOOP (which is 3 x 10 sec) as a fix
> > >> > instead
> > >> > > of
> > >> > > > > every
> > >> > > > > > >>> NOP. So 30 seconds looked satisfactory.
> > >> > > > > > >>> I think you use old version 2.20 or something like that.
> > The
> > >> > > fixes
> > >> > > > > for
> > >> > > > > > >>> docker have been done so far, so please use the latest
> > >> version
> > >> > > > > 3.0.0-M3.
> > >> > > > > > >>> See this page
> > >> > > > > > >>>
> > >> > > >
> > https://maven.apache.org/surefire/maven-surefire-plugin/docker.html
> > >> ,
> > >> > > > > we
> > >> > > > > > >>> used maven:3.5.3-jdk-8-alpine in this test. Which base
> > image
> > >> > did
> > >> > > > you
> > >> > > > > use?
> > >> > > > > > >>>
> > >> > > > > > >>> Cheers
> > >> > > > > > >>> Tibor
> > >> > > > > > >>>
> > >> > > > > > >>> On Tue, Feb 26, 2019 at 5:24 PM Jason Young <
> > >> > > > > jason.young@procentive.com>
> > >> > > > > > >>> wrote:
> > >> > > > > > >>>
> > >> > > > > > >>> > Thanks for the information. It's good to see someone
> > >> > > understands
> > >> > > > a
> > >> > > > > > >>> little
> > >> > > > > > >>> > about this.
> > >> > > > > > >>> >
> > >> > > > > > >>> > Incidentally, we have been looking at other GCs and
> VMs
> > >> for
> > >> > the
> > >> > > > > > >>> application
> > >> > > > > > >>> > in production environments, so I'll look into how
> these
> > >> > affect
> > >> > > > > tests as
> > >> > > > > > >>> > well. I'll try to enable some logging about GC pauses
> to
> > >> see
> > >> > > > > what's up.
> > >> > > > > > >>> >
> > >> > > > > > >>> > How would `-Xmx3g` cause long GC cycles? Do you think
> > the
> > >> > value
> > >> > > > is
> > >> > > > > > >>> simply
> > >> > > > > > >>> > too low?
> > >> > > > > > >>> >
> > >> > > > > > >>> > FWIW we're running the Maven build in an Alpine-based
> > >> Docker
> > >> > > > > container.
> > >> > > > > > >>> >
> > >> > > > > > >>> > On Sat, Feb 23, 2019 at 6:36 AM Tibor Digana <
> > >> > > > > tibordigana@apache.org>
> > >> > > > > > >>> > wrote:
> > >> > > > > > >>> >
> > >> > > > > > >>> > > Hi Jason,
> > >> > > > > > >>> > >
> > >> > > > > > >>> > > We spoke about this issue on our chat in ASF Slack:
> > >> > > > > > >>> > > "I think his tests have been paused for a long GC
> > >> periods
> > >> > and
> > >> > > > > timed
> > >> > > > > > >>> out
> > >> > > > > > >>> > 3x
> > >> > > > > > >>> > > PING period = 30 seconds. After this period forked
> JVM
> > >> > > supposed
> > >> > > > > the
> > >> > > > > > >>> Maven
> > >> > > > > > >>> > > process was killed by JenkinsCI and therefore all
> > >> surefire
> > >> > > > > processes
> > >> > > > > > >>> are
> > >> > > > > > >>> > > killed as well and all the file handlers and memory
> > >> > > > consumptions
> > >> > > > > are
> > >> > > > > > >>> > > freed."
> > >> > > > > > >>> > >
> > >> > > > > > >>> > > "But I have to say that `-Xmx3g` may cause long GC
> > >> cycles,
> > >> > > see
> > >> > > > > > >>> > >
> > >> > > > > > >>> > >
> > >> > > > > > >>> >
> > >> > > > > > >>>
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://maven.apache.org/surefire/maven-surefire-plugin/examples/shutdown.html
> > >> > > > > > >>> > > "
> > >> > > > > > >>> > >
> > >> > > > > > >>> > > You are using java-1.8-openjdk. I guess you should
> use
> > >> > > > > Shenandoah GC
> > >> > > > > > >>> > which
> > >> > > > > > >>> > > is an experimental algorithm in JVM 1.8. This would
> > >> > > > significantly
> > >> > > > > > >>> short
> > >> > > > > > >>> > > the GC cycles.
> > >> > > > > > >>> > >
> > >> > > > > > >>> > > We should of cource provide a new configuration
> > >> parameter
> > >> > to
> > >> > > > give
> > >> > > > > > >>> you a
> > >> > > > > > >>> > > chance to prolong the PING.
> > >> > > > > > >>> > >
> > >> > > > > > >>> > > Cheers
> > >> > > > > > >>> > > Tibor
> > >> > > > > > >>> > >
> > >> > > > > > >>> >
> > >> > > > > > >>> >
> > >> > > > > > >>> > --
> > >> > > > > > >>> >
> > >> > > > > > >>> > Jason Young
> > >> > > > > > >>> >
> > >> > > > > > >>>
> > >> > > > > > >>
> > >> > > > > > >>
> > >> > > > > >
> > >> > > > > > --
> > >> > > > > > Jason Young
> > >> > > > > > Software Engineer | PROCENTIVE
> > >> > > > > > [image: Phone] 715 245 8000 x7609
> > >> > > > > > [image: Mobile] 706 870 3540
> > >> > > > > > [image: Web] procentive.com
> > >> > > > > > Confidentiality Notice: This message is intended for the
> sole
> > >> use
> > >> > of
> > >> > > > the
> > >> > > > > > individual and entity to which it is addressed, and may
> > contain
> > >> > > > > information
> > >> > > > > > that is privileged, confidential and exempt from disclosure
> > >> under
> > >> > > > > > applicable law. Any unauthorized review, use, disclosure or
> > >> > > > distribution
> > >> > > > > of
> > >> > > > > > this email message, including any attachment, is prohibited.
> > If
> > >> you
> > >> > > are
> > >> > > > > not
> > >> > > > > > the intended recipient, please advise the sender by reply
> > email
> > >> and
> > >> > > > > destroy
> > >> > > > > > all copies of the original message.
> > >> > > > >
> > >> > > > >
> > >> ---------------------------------------------------------------------
> > >> > > > > To unsubscribe, e-mail: users-unsubscribe@maven.apache.org
> > >> > > > > For additional commands, e-mail: users-help@maven.apache.org
> > >> > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> >
>