Posted to solr-user@lucene.apache.org by Steven Bower <sb...@alcyon.net> on 2013/05/15 23:06:08 UTC

Transaction Logs Leaking FileDescriptors

We have a system in which a client is sending 1 record at a time (via REST)
followed by a commit. This has produced ~65k tlog files and the JVM has run
out of file descriptors... I grabbed a heap dump from the JVM and I can see
~52k "unreachable" FileDescriptors... This leads me to believe that the
TransactionLog is not properly closing all of its files before getting rid
of the object...

I've verified with lsof that indeed there are ~60k tlog files that are open
currently..
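
(For reference, one quick way to get that count from the shell; the PID is a
placeholder for the Solr JVM's process id:)

    lsof -p <solr-pid> | grep -c tlog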

This is Solr 4.3.0

Thanks,

steve

Re: Transaction Logs Leaking FileDescriptors

Posted by Yonik Seeley <yo...@lucidworks.com>.
On Wed, May 15, 2013 at 5:06 PM, Steven Bower <sb...@alcyon.net> wrote:
> This leads me to believe that the
> TransactionLog is not properly closing all of it's files before getting rid
> of the object...

I tried some ad hoc tests, and I can't reproduce this behavior yet.
There must be some other code path that increments the reference
count, but does not decrement.

-Yonik
http://lucidworks.com

Re: Transaction Logs Leaking FileDescriptors

Posted by Yonik Seeley <yo...@lucidworks.com>.
On Wed, May 15, 2013 at 5:20 PM, Steven Bower <sb...@alcyon.net> wrote:
> I'm hunting through the UpdateHandler code to try and find where this
> happens now..

UpdateLog.addOldLog()

-Yonik
http://lucidworks.com

Re: Transaction Logs Leaking FileDescriptors

Posted by Steven Bower <sb...@alcyon.net>.
Created https://issues.apache.org/jira/browse/SOLR-4831 to capture this
issue


On Thu, May 16, 2013 at 10:10 AM, Steven Bower <sb...@alcyon.net> wrote:

> Looking at the timestamps on the tlog files they seem to have all been
> created around the same time (04:55).. starting around this time I start
> seeing the exception below (there were 1628).. in fact its getting tons of
> these (200k+) but most of the time inside regular commits...
>
> 2013-15-05 04:55:06.634 ERROR UpdateLog [recoveryExecutor-6-thread-7922] -
> java.lang.ArrayIndexOutOfBoundsException: 2603
>         at
> org.apache.lucene.codecs.lucene40.BitVector.get(BitVector.java:146)
>         at
> org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.nextDoc(Lucene41PostingsReader.java:492)
>         at
> org.apache.lucene.index.BufferedDeletesStream.applyTermDeletes(BufferedDeletesStream.java:407)
>         at
> org.apache.lucene.index.BufferedDeletesStream.applyDeletes(BufferedDeletesStream.java:273)
>         at
> org.apache.lucene.index.IndexWriter.applyAllDeletes(IndexWriter.java:2973)
>         at
> org.apache.lucene.index.IndexWriter.maybeApplyDeletes(IndexWriter.java:2964)
>         at
> org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2704)
>         at
> org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2839)
>         at
> org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2819)
>         at
> org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:536)
>         at
> org.apache.solr.update.UpdateLog$LogReplayer.doReplay(UpdateLog.java:1339)
>         at
> org.apache.solr.update.UpdateLog$LogReplayer.run(UpdateLog.java:1163)
>         at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>         at
> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>         at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>         at
> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:619)
>
>
>
> On Thu, May 16, 2013 at 9:35 AM, Yonik Seeley <yo...@lucidworks.com>wrote:
>
>> See https://issues.apache.org/jira/browse/SOLR-3939
>>
>> Do you see these log messages from this in your logs?
>>       log.info("I may be the new leader - try and sync");
>>
>> How reproducible is this bug for you?  It would be great to know if
>> the patch in the issue fixes things.
>>
>> -Yonik
>> http://lucidworks.com
>>
>>
>> On Wed, May 15, 2013 at 6:04 PM, Steven Bower <sb...@alcyon.net> wrote:
>> > They are visible to ls...
>> >
>> >
>> > On Wed, May 15, 2013 at 5:49 PM, Yonik Seeley <yo...@lucidworks.com>
>> wrote:
>> >
>> >> On Wed, May 15, 2013 at 5:20 PM, Steven Bower <sb...@alcyon.net>
>> wrote:
>> >> > when the TransactionLog objects are dereferenced
>> >> > their RandomAccessFile object is not closed..
>> >>
>> >> Have the files been deleted (unlinked from the directory), or are they
>> >> still visible via "ls"?
>> >>
>> >> -Yonik
>> >> http://lucidworks.com
>> >>
>>
>
>

Re: Transaction Logs Leaking FileDescriptors

Posted by Steven Bower <sb...@alcyon.net>.
Looking at the timestamps on the tlog files, they seem to have all been
created around the same time (04:55).. starting around this time I start
seeing the exception below (there were 1628).. in fact it's getting tons of
these (200k+) but most of the time inside regular commits...

2013-15-05 04:55:06.634 ERROR UpdateLog [recoveryExecutor-6-thread-7922] -
java.lang.ArrayIndexOutOfBoundsException: 2603
        at org.apache.lucene.codecs.lucene40.BitVector.get(BitVector.java:146)
        at org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.nextDoc(Lucene41PostingsReader.java:492)
        at org.apache.lucene.index.BufferedDeletesStream.applyTermDeletes(BufferedDeletesStream.java:407)
        at org.apache.lucene.index.BufferedDeletesStream.applyDeletes(BufferedDeletesStream.java:273)
        at org.apache.lucene.index.IndexWriter.applyAllDeletes(IndexWriter.java:2973)
        at org.apache.lucene.index.IndexWriter.maybeApplyDeletes(IndexWriter.java:2964)
        at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2704)
        at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2839)
        at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2819)
        at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:536)
        at org.apache.solr.update.UpdateLog$LogReplayer.doReplay(UpdateLog.java:1339)
        at org.apache.solr.update.UpdateLog$LogReplayer.run(UpdateLog.java:1163)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)



On Thu, May 16, 2013 at 9:35 AM, Yonik Seeley <yo...@lucidworks.com> wrote:

> See https://issues.apache.org/jira/browse/SOLR-3939
>
> Do you see these log messages from this in your logs?
>       log.info("I may be the new leader - try and sync");
>
> How reproducible is this bug for you?  It would be great to know if
> the patch in the issue fixes things.
>
> -Yonik
> http://lucidworks.com
>
>
> On Wed, May 15, 2013 at 6:04 PM, Steven Bower <sb...@alcyon.net> wrote:
> > They are visible to ls...
> >
> >
> > On Wed, May 15, 2013 at 5:49 PM, Yonik Seeley <yo...@lucidworks.com>
> wrote:
> >
> >> On Wed, May 15, 2013 at 5:20 PM, Steven Bower <sb...@alcyon.net>
> wrote:
> >> > when the TransactionLog objects are dereferenced
> >> > their RandomAccessFile object is not closed..
> >>
> >> Have the files been deleted (unlinked from the directory), or are they
> >> still visible via "ls"?
> >>
> >> -Yonik
> >> http://lucidworks.com
> >>
>

Re: Transaction Logs Leaking FileDescriptors

Posted by Yonik Seeley <yo...@lucidworks.com>.
See https://issues.apache.org/jira/browse/SOLR-3939

Do you see these log messages from this in your logs?
      log.info("I may be the new leader - try and sync");

How reproducible is this bug for you?  It would be great to know if
the patch in the issue fixes things.

-Yonik
http://lucidworks.com


On Wed, May 15, 2013 at 6:04 PM, Steven Bower <sb...@alcyon.net> wrote:
> They are visible to ls...
>
>
> On Wed, May 15, 2013 at 5:49 PM, Yonik Seeley <yo...@lucidworks.com> wrote:
>
>> On Wed, May 15, 2013 at 5:20 PM, Steven Bower <sb...@alcyon.net> wrote:
>> > when the TransactionLog objects are dereferenced
>> > their RandomAccessFile object is not closed..
>>
>> Have the files been deleted (unlinked from the directory), or are they
>> still visible via "ls"?
>>
>> -Yonik
>> http://lucidworks.com
>>

Re: Transaction Logs Leaking FileDescriptors

Posted by Steven Bower <sb...@alcyon.net>.
They are visible to ls...


On Wed, May 15, 2013 at 5:49 PM, Yonik Seeley <yo...@lucidworks.com> wrote:

> On Wed, May 15, 2013 at 5:20 PM, Steven Bower <sb...@alcyon.net> wrote:
> > when the TransactionLog objects are dereferenced
> > their RandomAccessFile object is not closed..
>
> Have the files been deleted (unlinked from the directory), or are they
> still visible via "ls"?
>
> -Yonik
> http://lucidworks.com
>

Re: Transaction Logs Leaking FileDescriptors

Posted by Yonik Seeley <yo...@lucidworks.com>.
On Wed, May 15, 2013 at 5:20 PM, Steven Bower <sb...@alcyon.net> wrote:
> when the TransactionLog objects are dereferenced
> their RandomAccessFile object is not closed..

Have the files been deleted (unlinked from the directory), or are they
still visible via "ls"?

-Yonik
http://lucidworks.com

Re: Transaction Logs Leaking FileDescriptors

Posted by Steven Bower <sb...@alcyon.net>.
There seem to be quite a few places throughout the code where the
RecentUpdates class is used but not properly created/closed...

For example in RecoveryStrategy it does this correctly:

    UpdateLog.RecentUpdates recentUpdates = null;
    try {
      recentUpdates = ulog.getRecentUpdates();
      recentVersions = recentUpdates.getVersions(ulog.numRecordsToKeep);
    } catch (Throwable t) {
      SolrException.log(log, "Corrupt tlog - ignoring. core=" + coreName, t);
      recentVersions = new ArrayList<Long>(0);
    } finally {
      if (recentUpdates != null) {
        recentUpdates.close();
      }
    }

But in a number of other places its used more like this:

UpdateLog.RecentUpdates recentUpdates = ulog.getRecentUpdates();
try {
  ... some code ...
} finally {
  recentUpdates.close();
}

The problem, it would seem, is that UpdateLog.getRecentUpdates() can fail
when it calls update(), since it is doing IO on the log itself; in that case
you'll get orphaned references to the log...
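
As a rough illustration of that failure mode, here is a self-contained sketch
(made-up names, not the actual Solr code): the factory method itself has to
release whatever it has already taken if the IO step throws, because the
caller never receives a handle it could close in its finally block.

    import java.io.Closeable;
    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.ArrayList;
    import java.util.List;

    // Sketch only: RandomAccessFile stands in for a reference-counted tlog.
    public class RecentUpdatesSketch {

      static List<Closeable> openRecentUpdates(List<File> tlogFiles) throws IOException {
        List<Closeable> taken = new ArrayList<Closeable>();
        boolean success = false;
        try {
          for (File f : tlogFiles) {
            taken.add(new RandomAccessFile(f, "r"));  // analogous to incref() on each log
          }
          // ... read the logs here, analogous to update(); this IO can throw ...
          success = true;
          return taken;
        } finally {
          if (!success) {
            for (Closeable c : taken) {
              c.close();  // release what was already taken, so no reference is orphaned
            }
          }
        }
      }

      public static void main(String[] args) throws IOException {
        List<Closeable> updates = openRecentUpdates(new ArrayList<File>());
        for (Closeable c : updates) {
          c.close();  // the caller-side close, as in the RecoveryStrategy snippet above
        }
      }
    }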

I'm not 100% sure this is my problem.. I'm scouring the logs to see if this
codepath was triggered...

steve


On Wed, May 15, 2013 at 5:26 PM, Walter Underwood <wu...@wunderwood.org> wrote:

> Maybe we need a flag in the update handler to ignore commit requests.
>
> I just enabled a similar thing for our JVM, because something, somewhere
> was calling System.gc(). You can completely ignore explicit GC calls or you
> can turn them into requests for a concurrent GC.
>
> A similar setting for Solr might map commit requests to hard commit
> (default), soft commit, or none.
>
> wunder
>
> On May 15, 2013, at 2:20 PM, Steven Bower wrote:
>
> > Most definetly understand the don't commit after each record...
> > unfortunately the data is being fed by another team which I cannot
> > control...
> >
> > Limiting the number of potential tlog files is good but I think there is
> > also an issue in that when the TransactionLog objects are dereferenced
> > their RandomAccessFile object is not closed.. thus delaying release of
> the
> > descriptor until the object is GC'd...
> >
> > I'm hunting through the UpdateHandler code to try and find where this
> > happens now..
> >
> > steve
> >
> >
> > On Wed, May 15, 2013 at 5:13 PM, Yonik Seeley <yo...@lucidworks.com>
> wrote:
> >
> >> Hmmm, we keep open a number of tlog files based on the number of
> >> records in each file (so we always have a certain amount of history),
> >> but IIRC, the number of tlog files is also capped.  Perhaps there is a
> >> bug when the limit to tlog files is reached (as opposed to the number
> >> of documents in the tlog files).
> >>
> >> I'll see if I can create a test case to reproduce this.
> >>
> >> Separately, you'll get a lot better performance if you don't commit
> >> per update of course (or at least use something like commitWithin).
> >>
> >> -Yonik
> >> http://lucidworks.com
> >>
> >> On Wed, May 15, 2013 at 5:06 PM, Steven Bower <sb...@alcyon.net>
> wrote:
> >>> We have a system in which a client is sending 1 record at a time (via
> >> REST)
> >>> followed by a commit. This has produced ~65k tlog files and the JVM has
> >> run
> >>> out of file descriptors... I grabbed a heap dump from the JVM and I can
> >> see
> >>> ~52k "unreachable" FileDescriptors... This leads me to believe that the
> >>> TransactionLog is not properly closing all of it's files before getting
> >> rid
> >>> of the object...
> >>>
> >>> I've verified with lsof that indeed there are ~60k tlog files that are
> >> open
> >>> currently..
> >>>
> >>> This is Solr 4.3.0
> >>>
> >>> Thanks,
> >>>
> >>> steve
> >>
>
>
>
>
>

Re: Transaction Logs Leaking FileDescriptors

Posted by Walter Underwood <wu...@wunderwood.org>.
Maybe we need a flag in the update handler to ignore commit requests.

I just enabled a similar thing for our JVM, because something, somewhere was calling System.gc(). You can completely ignore explicit GC calls or you can turn them into requests for a concurrent GC.
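
For reference, the standard HotSpot options for those two behaviors
(presumably what is being described here) are:

    -XX:+DisableExplicitGC            ignore System.gc() calls entirely
    -XX:+ExplicitGCInvokesConcurrent  run System.gc() as a concurrent collection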

A similar setting for Solr might map commit requests to hard commit (default), soft commit, or none.

wunder

On May 15, 2013, at 2:20 PM, Steven Bower wrote:

> Most definetly understand the don't commit after each record...
> unfortunately the data is being fed by another team which I cannot
> control...
> 
> Limiting the number of potential tlog files is good but I think there is
> also an issue in that when the TransactionLog objects are dereferenced
> their RandomAccessFile object is not closed.. thus delaying release of the
> descriptor until the object is GC'd...
> 
> I'm hunting through the UpdateHandler code to try and find where this
> happens now..
> 
> steve
> 
> 
> On Wed, May 15, 2013 at 5:13 PM, Yonik Seeley <yo...@lucidworks.com> wrote:
> 
>> Hmmm, we keep open a number of tlog files based on the number of
>> records in each file (so we always have a certain amount of history),
>> but IIRC, the number of tlog files is also capped.  Perhaps there is a
>> bug when the limit to tlog files is reached (as opposed to the number
>> of documents in the tlog files).
>> 
>> I'll see if I can create a test case to reproduce this.
>> 
>> Separately, you'll get a lot better performance if you don't commit
>> per update of course (or at least use something like commitWithin).
>> 
>> -Yonik
>> http://lucidworks.com
>> 
>> On Wed, May 15, 2013 at 5:06 PM, Steven Bower <sb...@alcyon.net> wrote:
>>> We have a system in which a client is sending 1 record at a time (via
>> REST)
>>> followed by a commit. This has produced ~65k tlog files and the JVM has
>> run
>>> out of file descriptors... I grabbed a heap dump from the JVM and I can
>> see
>>> ~52k "unreachable" FileDescriptors... This leads me to believe that the
>>> TransactionLog is not properly closing all of it's files before getting
>> rid
>>> of the object...
>>> 
>>> I've verified with lsof that indeed there are ~60k tlog files that are
>> open
>>> currently..
>>> 
>>> This is Solr 4.3.0
>>> 
>>> Thanks,
>>> 
>>> steve
>> 





Re: Transaction Logs Leaking FileDescriptors

Posted by Steven Bower <sb...@alcyon.net>.
Most definitely understand the "don't commit after each record" advice...
unfortunately the data is being fed by another team which I cannot
control...

Limiting the number of potential tlog files is good but I think there is
also an issue in that when the TransactionLog objects are dereferenced
their RandomAccessFile object is not closed.. thus delaying release of the
descriptor until the object is GC'd...
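
To make that distinction concrete, here is a small, self-contained sketch
(hypothetical classes, not Solr's TransactionLog/UpdateLog) of the bookkeeping
a capped tlog history needs: each evicted log gets an explicit decref(), so its
descriptor is released immediately rather than whenever the object is GC'd.

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.ArrayDeque;
    import java.util.Deque;

    // Hypothetical stand-in for a reference-counted tlog: the descriptor is
    // released when the count reaches zero, not when the object is collected.
    class RefCountedLog {
      private final RandomAccessFile raf;
      private int refCount = 1;

      RefCountedLog(File f) throws IOException { raf = new RandomAccessFile(f, "rw"); }

      synchronized void incref() { refCount++; }

      synchronized void decref() throws IOException {
        if (--refCount == 0) raf.close();  // prompt release of the file descriptor
      }
    }

    public class OldLogEvictionSketch {

      // Every log evicted from the capped history needs a matching decref(),
      // otherwise its descriptor stays open until GC, which is the symptom here.
      static void addOldLog(Deque<RefCountedLog> oldLogs, RefCountedLog newLog, int maxLogs)
          throws IOException {
        oldLogs.addLast(newLog);
        while (oldLogs.size() > maxLogs) {
          oldLogs.removeFirst().decref();
        }
      }

      public static void main(String[] args) throws IOException {
        Deque<RefCountedLog> oldLogs = new ArrayDeque<RefCountedLog>();
        for (int i = 0; i < 5; i++) {
          File f = File.createTempFile("tlog-demo-", ".bin");
          f.deleteOnExit();
          addOldLog(oldLogs, new RefCountedLog(f), 3);  // never more than 3 descriptors held
        }
        while (!oldLogs.isEmpty()) {
          oldLogs.removeFirst().decref();  // release the remaining history on shutdown
        }
      }
    }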

I'm hunting through the UpdateHandler code to try and find where this
happens now..

steve


On Wed, May 15, 2013 at 5:13 PM, Yonik Seeley <yo...@lucidworks.com> wrote:

> Hmmm, we keep open a number of tlog files based on the number of
> records in each file (so we always have a certain amount of history),
> but IIRC, the number of tlog files is also capped.  Perhaps there is a
> bug when the limit to tlog files is reached (as opposed to the number
> of documents in the tlog files).
>
> I'll see if I can create a test case to reproduce this.
>
> Separately, you'll get a lot better performance if you don't commit
> per update of course (or at least use something like commitWithin).
>
> -Yonik
> http://lucidworks.com
>
> On Wed, May 15, 2013 at 5:06 PM, Steven Bower <sb...@alcyon.net> wrote:
> > We have a system in which a client is sending 1 record at a time (via
> REST)
> > followed by a commit. This has produced ~65k tlog files and the JVM has
> run
> > out of file descriptors... I grabbed a heap dump from the JVM and I can
> see
> > ~52k "unreachable" FileDescriptors... This leads me to believe that the
> > TransactionLog is not properly closing all of it's files before getting
> rid
> > of the object...
> >
> > I've verified with lsof that indeed there are ~60k tlog files that are
> open
> > currently..
> >
> > This is Solr 4.3.0
> >
> > Thanks,
> >
> > steve
>

Re: Transaction Logs Leaking FileDescriptors

Posted by Yonik Seeley <yo...@lucidworks.com>.
Hmmm, we keep open a number of tlog files based on the number of
records in each file (so we always have a certain amount of history),
but IIRC, the number of tlog files is also capped.  Perhaps there is a
bug when the limit to tlog files is reached (as opposed to the number
of documents in the tlog files).

I'll see if I can create a test case to reproduce this.

Separately, you'll get a lot better performance if you don't commit
per update of course (or at least use something like commitWithin).
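
For example, a minimal SolrJ sketch of the commitWithin approach (the URL, core
name, and field values are placeholders); the client never issues an explicit
commit and the server folds updates into a commit within the window:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class CommitWithinExample {
      public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        server.add(doc, 10000);  // commit within 10 seconds; no explicit commit() call
        server.shutdown();
      }
    }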

-Yonik
http://lucidworks.com

On Wed, May 15, 2013 at 5:06 PM, Steven Bower <sb...@alcyon.net> wrote:
> We have a system in which a client is sending 1 record at a time (via REST)
> followed by a commit. This has produced ~65k tlog files and the JVM has run
> out of file descriptors... I grabbed a heap dump from the JVM and I can see
> ~52k "unreachable" FileDescriptors... This leads me to believe that the
> TransactionLog is not properly closing all of it's files before getting rid
> of the object...
>
> I've verified with lsof that indeed there are ~60k tlog files that are open
> currently..
>
> This is Solr 4.3.0
>
> Thanks,
>
> steve