
Posted to dev@lucene.apache.org by Dawid Weiss <da...@gmail.com> on 2018/12/14 08:22:23 UTC

Hadoop threads leaking and falling in an endless loop, logging

Jira is down, but for the record -- the tests that have recently been
filling up disk space and causing havoc on test machines fail because of
Hadoop threads that fall into an endless loop in DataXceiverServer
(when the test framework calls interrupt on leaked threads).
Simplifying a bit, it looks like this:

  public void run() {
    Peer peer = null;
    while (datanode.shouldRun && !datanode.shutdownForUpgrade) {
      try {
        peer = peerServer.accept();
        ...
      } catch (IOException ie) {
        // No exit condition here: if accept() keeps failing, this logs on every iteration.
        IOUtils.cleanup(null, peer);
        LOG.warn(datanode.getDisplayName() + ":DataXceiverServer: ", ie);
      }
    }
  }

There are no timeouts on this loop; it just keeps logging forever.
I don't know if this "datanode" can be cleaned up properly, but it
definitely should be (in an afterclass hook). Otherwise the logs will
keep growing and there's not much we can do about it (from the test
infrastructure's point of view).
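
To make the failure mode concrete, here is a minimal, self-contained
Java sketch. It is not Hadoop's code: AcceptLoopSketch, spinningLoop,
guardedLoop and closedServer are hypothetical names, and it assumes that
accept() starts failing immediately and repeatedly once the underlying
socket is closed (which is how a plain java.net.ServerSocket behaves).
The maxFailures cap exists only so the demo terminates.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

/** Hypothetical stand-in for the accept loop; not Hadoop's actual code. */
class AcceptLoopSketch {

  /**
   * Mirrors the problematic shape: every IOException is swallowed and the
   * loop retries. On a closed socket accept() fails at once, so the loop
   * spins. maxFailures is an artificial cap that keeps this demo finite.
   */
  static int spinningLoop(ServerSocket server, int maxFailures) {
    int failures = 0;
    while (failures < maxFailures) {
      try (Socket peer = server.accept()) {
        // a real server would hand the peer off here
      } catch (IOException e) {
        failures++; // the original loop writes a log line here instead
      }
    }
    return failures;
  }

  /** Guarded variant: stops once the socket is closed or the thread is interrupted. */
  static int guardedLoop(ServerSocket server, int maxFailures) {
    int failures = 0;
    while (failures < maxFailures
        && !server.isClosed()
        && !Thread.currentThread().isInterrupted()) {
      try (Socket peer = server.accept()) {
        // a real server would hand the peer off here
      } catch (IOException e) {
        failures++;
      }
    }
    return failures;
  }

  /** Creates an unbound server socket and closes it right away (no network needed). */
  static ServerSocket closedServer() {
    try {
      ServerSocket server = new ServerSocket();
      server.close();
      return server;
    } catch (IOException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("spinning: " + spinningLoop(closedServer(), 1000) + " failures");
    System.out.println("guarded:  " + guardedLoop(closedServer(), 1000) + " failures");
  }
}
```

A loop shaped like guardedLoop would stop on its own. For code we don't
control, shutting the datanode down explicitly after the suite still
looks like the practical route.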

D.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Hadoop threads leaking and falling in an endless loop, logging

Posted by David Smiley <da...@gmail.com>.

+1 to the self-destruct timer. Great analysis as usual, Dawid!
On Fri, Dec 14, 2018 at 8:48 PM Dawid Weiss <da...@gmail.com> wrote:

> So, digging into the huge logs problem, I discovered a subtle issue with
> suite timeouts (thanks for keeping an eye open, Steve!) -- the
> framework can in fact hang while trying to interrupt leaked threads;
> this, in combination with the spinning Hadoop zombie threads that keep
> on logging, results in filling up all the disk space. And once the disk
> space is exhausted, everything goes down.
>
> The hanging suite timeouts issue is interesting. The loop in the
> randomized runner used the thread.join(timeoutMillis) method and an
> iteration count to try to kill leaked threads. Well, it turns out
> join(timeout) can hang indefinitely because this method is synchronized
> on the thread being joined... So regardless of the timeout value, it'll
> never return if the thread's monitor is never released... Nasty.
>
> https://github.com/randomizedtesting/randomizedtesting/issues/275
>
> I wonder if we should add the universal JVM kill switch option to
> forked JVMs... This wouldn't solve things, but would at least kill
> those hung forked processes before they become insane. The option that
> kills the JVM after a certain amount of time in OpenJDK is
> -XX:SelfDestructTimer=[mins]. Just a thought...
>
> Dawid
>

--
Lucene/Solr Search Committer (PMC), Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com

Re: Hadoop threads leaking and falling in an endless loop, logging

Posted by Dawid Weiss <da...@gmail.com>.

So, digging into the huge logs problem, I discovered a subtle issue with
suite timeouts (thanks for keeping an eye open, Steve!) -- the
framework can in fact hang while trying to interrupt leaked threads;
this, in combination with the spinning Hadoop zombie threads that keep
on logging, results in filling up all the disk space. And once the disk
space is exhausted, everything goes down.

The hanging suite timeouts issue is interesting. The loop in the
randomized runner used the thread.join(timeoutMillis) method and an
iteration count to try to kill leaked threads. Well, it turns out
join(timeout) can hang indefinitely because this method is synchronized
on the thread being joined... So regardless of the timeout value, it'll
never return if the thread's monitor is never released... Nasty.

https://github.com/randomizedtesting/randomizedtesting/issues/275
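
The effect is easy to reproduce in isolation. The sketch below is a
hypothetical demo (JoinMonitorDemo and its method names are made up for
illustration, and this is not the runner's actual code): a worker holds
the monitor of its own Thread object, so another thread calling
worker.join(50) cannot even enter join and stays stuck long after the
50 ms have passed. awaitTermination shows one possible alternative that
polls isAlive() instead; the sketch makes no claim about how the issue
above was actually fixed.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical demo class; all names are illustrative only. */
class JoinMonitorDemo {

  /**
   * Starts a worker that holds its own Thread monitor until released, then
   * calls worker.join(joinTimeoutMillis) from a helper thread. Returns true
   * if the helper is still blocked after observeMillis.
   */
  static boolean joinOutlivesTimeout(long joinTimeoutMillis, long observeMillis) {
    CountDownLatch monitorHeld = new CountDownLatch(1);
    CountDownLatch release = new CountDownLatch(1);

    Thread worker = new Thread(() -> {
      synchronized (Thread.currentThread()) { // the same monitor join() needs
        monitorHeld.countDown();
        awaitQuietly(release);
      }
    }, "worker");

    Thread joiner = new Thread(() -> {
      try {
        worker.join(joinTimeoutMillis); // blocks on monitor entry, timeout never starts
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, "joiner");

    worker.start();
    awaitQuietly(monitorHeld); // make sure the monitor is taken before joining
    joiner.start();
    boolean stuck = !awaitTermination(joiner, observeMillis);

    release.countDown(); // let both threads finish so the demo exits cleanly
    awaitTermination(worker, 5_000);
    awaitTermination(joiner, 5_000);
    return stuck;
  }

  /** Polls isAlive() instead of calling join(); never touches the target's monitor. */
  static boolean awaitTermination(Thread t, long timeoutMillis) {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    while (t.isAlive() && System.nanoTime() - deadline < 0) {
      try {
        Thread.sleep(10);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        break;
      }
    }
    return !t.isAlive();
  }

  private static void awaitQuietly(CountDownLatch latch) {
    try {
      latch.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  public static void main(String[] args) {
    System.out.println("join(50) still blocked after 500 ms: " + joinOutlivesTimeout(50, 500));
  }
}
```

The polling variant trades a little latency (the 10 ms sleep) for the
guarantee that the caller gets control back when the deadline passes.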

I wonder if we should add the universal JVM kill switch option to
forked JVMs... This wouldn't solve things, but would at least kill
those hung forked processes before they become insane. The option that
kills the JVM after a certain amount of time in OpenJDK is
-XX:SelfDestructTimer=[mins]. Just a thought...
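
In case that flag is not an option, a rough in-process analogue could
look like the sketch below. This is an illustration under assumptions,
not something the test framework provides: SelfDestructWatchdog and arm
are hypothetical names, and the action is passed in as a Runnable so the
sketch can be run without actually killing anything.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical in-process kill switch; not part of any test framework. */
class SelfDestructWatchdog {

  /**
   * Starts a daemon thread that runs onExpire once timeoutMillis has elapsed.
   * Interrupting the returned thread disarms the watchdog.
   */
  static Thread arm(long timeoutMillis, Runnable onExpire) {
    Thread watchdog = new Thread(() -> {
      try {
        Thread.sleep(timeoutMillis);
      } catch (InterruptedException e) {
        return; // disarmed before the deadline
      }
      onExpire.run();
    }, "self-destruct-watchdog");
    watchdog.setDaemon(true); // must not keep a healthy JVM alive
    watchdog.start();
    return watchdog;
  }

  public static void main(String[] args) throws InterruptedException {
    AtomicBoolean fired = new AtomicBoolean();
    Thread watchdog = arm(100, () -> fired.set(true));
    watchdog.join();
    System.out.println("watchdog fired: " + fired.get());
  }
}
```

In a forked test JVM the action would presumably be
() -> Runtime.getRuntime().halt(1); halt() skips shutdown hooks, which
seems preferable when the process is already wedged.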

Dawid

On Fri, Dec 14, 2018 at 6:43 PM Dawid Weiss <da...@gmail.com> wrote:
>
> Correction: I can reproduce the problem (on Linux). Looking into why
> suite timeout doesn't work properly.
>
> D.

Re: Hadoop threads leaking and falling in an endless loop, logging

Posted by Dawid Weiss <da...@gmail.com>.

Correction: I can reproduce the problem (on Linux). Looking into why
suite timeout doesn't work properly.

D.
On Fri, Dec 14, 2018 at 6:12 PM Dawid Weiss <da...@gmail.com> wrote:
>
> > Hadoop is up to 2.9.2, we're on 2.7.4. Have you seen any hint that
> > this behavior is better in a more recent version?
>
> I can't reproduce this problem, unfortunately. Even with the thread
> locked in an endless loop the runner should make progress and
> eventually terminate. That it doesn't is very suspicious; could be a
> bug somewhere, but without a stack trace from the hung JVM I can't
> really figure out why it's stalling.
>
> A better way to move forward would be to remove those annotations that
> currently leak threads and resources between tests, but I realize it's
> difficult with external software we don't have full control over.
>
> D.

Re: Hadoop threads leaking and falling in an endless loop, logging

Posted by Dawid Weiss <da...@gmail.com>.

> Hadoop is up to 2.9.2, we're on 2.7.4. Have you seen any hint that
> this behavior is better in a more recent version?

I can't reproduce this problem, unfortunately. Even with the thread
locked in an endless loop the runner should make progress and
eventually terminate. That it doesn't is very suspicious; could be a
bug somewhere, but without a stack trace from the hung JVM I can't
really figure out why it's stalling.

A better way to move forward would be to remove those annotations that
currently leak threads and resources between tests, but I realize it's
difficult with external software we don't have full control over.

D.

Re: Hadoop threads leaking and falling in an endless loop, logging

Posted by Erick Erickson <er...@gmail.com>.

Hadoop is up to 2.9.2, we're on 2.7.4. Have you seen any hint that
this behavior is better in a more recent version? Or has anyone
else, for that matter?

If so, we should raise a ticket. Or perhaps we should upgrade on
general principles anyway...

Erick

