Posted to java-user@lucene.apache.org by Vitaly Funstein <vf...@gmail.com> on 2014/09/08 20:45:27 UTC

Re: BlockTreeTermsReader consumes crazy amount of memory

UPDATE:

After making the changes we discussed to enable sharing of SegmentReaders
between the NRT reader and a commit point reader, specifically calling
through to DirectoryReader.openIfChanged(DirectoryReader, IndexCommit), I
am seeing this exception, sporadically:

Caused by: java.lang.NullPointerException
        at java.io.File.<init>(File.java:305)
        at org.terracotta.shaded.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:80)
        at org.terracotta.shaded.lucene.codecs.lucene40.BitVector.<init>(BitVector.java:327)
        at org.terracotta.shaded.lucene.codecs.lucene40.Lucene40LiveDocsFormat.readLiveDocs(Lucene40LiveDocsFormat.java:90)
        at org.terracotta.shaded.lucene.index.SegmentReader.<init>(SegmentReader.java:131)
        at org.terracotta.shaded.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:194)
        at org.terracotta.shaded.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:326)
        at org.terracotta.shaded.lucene.index.StandardDirectoryReader$2.doBody(StandardDirectoryReader.java:320)
        at org.terracotta.shaded.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:702)
        at org.terracotta.shaded.lucene.index.StandardDirectoryReader.doOpenFromCommit(StandardDirectoryReader.java:315)
        at org.terracotta.shaded.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:278)
        at org.terracotta.shaded.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:260)
        at org.terracotta.shaded.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:183)

Looking at the source quickly, it appears the child argument to the File
ctor is null. Is it somehow possible that the segment infos in the commit
point weren't fully written out on the prior commit? That sounds unlikely,
yet disturbing... but nothing else has changed in my code, i.e. the way
commits are performed and indexes are reopened.
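
A minimal sketch of the sharing scheme discussed in the quoted exchange
below - an MRU cache of commit point readers feeding
DirectoryReader.openIfChanged(DirectoryReader, IndexCommit) - assuming
Lucene 4.6 APIs; the class and member names here are illustrative, not
from the actual code:

import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexCommit;

class CommitReaderCache {
  // accessOrder=true makes the LinkedHashMap iterate from least- to
  // most-recently-accessed, i.e. an MRU structure
  private final Map<IndexCommit, DirectoryReader> readers =
      new LinkedHashMap<IndexCommit, DirectoryReader>(16, 0.75f, true);

  synchronized DirectoryReader openAt(IndexCommit commit) throws IOException {
    DirectoryReader cached = readers.get(commit);
    if (cached != null) {
      return cached;
    }
    DirectoryReader newest = mostRecentlyUsed();
    DirectoryReader reader;
    if (newest == null) {
      reader = DirectoryReader.open(commit);  // nothing to share with yet
    } else {
      // reuse every SegmentReader the two commit points have in common
      reader = DirectoryReader.openIfChanged(newest, commit);
      if (reader == null) {
        reader = newest;  // newest is already positioned on this commit
      }
    }
    readers.put(commit, reader);
    return reader;
  }

  private DirectoryReader mostRecentlyUsed() {
    DirectoryReader last = null;
    for (DirectoryReader r : readers.values()) {
      last = r;  // last value in access order = most recently used
    }
    return last;
  }
}

Grabbing the most recently used reader maximizes the chance that its
SegmentReaders overlap with the requested commit point; closing evicted
readers and reference counting are left out of the sketch.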


On Fri, Aug 29, 2014 at 2:03 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> On Thu, Aug 28, 2014 at 5:38 PM, Vitaly Funstein <vf...@gmail.com>
> wrote:
> > On Thu, Aug 28, 2014 at 1:25 PM, Michael McCandless <
> > lucene@mikemccandless.com> wrote:
> >
> >>
> >> The segments_N file can be different, that's fine: after that, we then
> >> re-use SegmentReaders when they are in common between the two commit
> >> points.  Each segments_N file refers to many segments...
> >>
> >>
> > Yes, you are totally right - I didn't follow the code far enough the
> first
> > time around. :) This is an excellent idea, actually - I can probably
> > arrange maintained commit points as an MRU data structure (e.g.
> > LinkedHashMap with access order), and simply grab the most recently
> opened
> > reader to pass in when obtaining a new one from the new commit point - to
> > maximize segment reader reuse.
>
> That's great!
>
> >> You can set it (min and max) as high as you want; the only hard
> >> requirement is that max >= 2*(min-1), I believe.
> >>
> >
> > Looks like this is used inside Lucene41PostingsFormat, which simply
> passes
> > in those defaults - so you are effectively saying the minimum (and
> > therefore, maximum) block size can be raised to reduce the size of the
> > terms index inside those TreeMap nodes?
>
> Yes, but it then increases cost at search time to locate a given term,
> because more scanning is then required once we seek to the block that
> might have the term.
>
> This reduces the size of the FST, but if RAM is being used by
> something else inside BT, it won't help.  But from your screen shot it
> looked like it was almost entirely the FST, which is what I would
> expect.
>
> >> > We are already using a customized codec though, so perhaps adding
> >> > this to the codec is okay and transparent?
> >>
> >> Hmmm :)  Customized in what manner?
> >>
> >>
> > We need to have the ability to turn off stored fields compression, so
> there
> > is one codec in case the system is configured that way. The other one
> > exists for compression on, but there I tweaked stored fields format for
> > bias toward decompression, as well as a smaller chunk size - based on
> some
> > empirical observations in executed tests. I am guessing I'll just add
> > another customization to both that deals with the block sizing for
> postings
> > format, and see what difference that makes...
>
> Ahh, OK.  Yes, just add this custom terms index block sizing too.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
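
A minimal sketch of the custom terms index block sizing discussed above,
assuming Lucene 4.6: Lucene41PostingsFormat exposes a public
(minTermBlockSize, maxTermBlockSize) constructor, and since block sizes
are recorded in the terms dictionary files themselves, segments written
this way should still read back through the stock codec - which is what
makes the tweak transparent. The sizes below are illustrative only:

import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat;
import org.apache.lucene.codecs.lucene46.Lucene46Codec;

public class LargeBlockCodec extends Lucene46Codec {
  // Larger blocks shrink the terms-index FST at the cost of more scanning
  // inside each block per lookup; 128 >= 2*(64-1) satisfies the invariant
  // quoted above (max >= 2*(min-1)).
  private final PostingsFormat postings = new Lucene41PostingsFormat(64, 128);

  @Override
  public PostingsFormat getPostingsFormatForField(String field) {
    return postings;
  }
}

The codec would then be hooked up via
IndexWriterConfig.setCodec(new LargeBlockCodec()).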

Re: BlockTreeTermsReader consumes crazy amount of memory

Posted by Vitaly Funstein <vf...@gmail.com>.
Very nice - it does pass now!

I wish there were a better way of incorporating the patch than just
shadowing the original StandardDirectoryReader with a patched one, but
unfortunately that class is final, and FilterDirectoryReader doesn't seem
to help here, making a cleaner approach seemingly impossible... but this
is a separate issue.

On Wed, Sep 10, 2014 at 6:35 PM, Robert Muir <rc...@gmail.com> wrote:

> Yes, there is also a safety check, but IMO it should be removed.
>
> See the patch on the issue, the test passes now.
>
> On Wed, Sep 10, 2014 at 9:31 PM, Vitaly Funstein <vf...@gmail.com>
> wrote:
> > Seems to me the bug occurs regardless of whether the passed in newer
> reader
> > is NRT or non-NRT. This is because the user operates at the level of
> > DirectoryReader, not SegmentReader and modifying the test code to do the
> > following reproduces the bug:
> >
> >     writer.commit();
> >     DirectoryReader latest = DirectoryReader.open(writer, true);
> >
> >     // This reader will be used for searching against commit point 1
> >     DirectoryReader searchReader = DirectoryReader.openIfChanged(latest,
> > ic1); //  <=== Exception/Assertion thrown here
> >
> >
> > On Wed, Sep 10, 2014 at 6:26 PM, Robert Muir <rc...@gmail.com> wrote:
> >
> >> Thats because there are 3 constructors in segmentreader:
> >>
> >> 1. one used for opening new (checks hasDeletions, only reads liveDocs if
> >> so)
> >> 2. one used for non-NRT reopen <-- problem one for you
> >> 3. one used for NRT reopen (takes a LiveDocs as a param, so no bug)
> >>
> >> so personally i think you should be able to do this, we just have to
> >> add the hasDeletions check to #2
> >>
> >> On Wed, Sep 10, 2014 at 7:46 PM, Vitaly Funstein <vf...@gmail.com>
> >> wrote:
> >> > One other observation - if instead of a reader opened at a later
> commit
> >> > point (T1), I pass in an NRT reader *without* doing the second commit
> on
> >> > the index prior, then there is no exception. This probably also
> hinges on
> >> > the assumption that no buffered docs have been flushed after T0, thus
> >> > creating new segment files, as well... unfortunately, our system can't
> >> make
> >> > either assumption.
> >> >
> >> > On Wed, Sep 10, 2014 at 4:30 PM, Vitaly Funstein <vfunstein@gmail.com
> >
> >> > wrote:
> >> >
> >> >> Normally, reopens only go forwards in time, so if you could ensure
> >> >>> that when you reopen one reader to another, the 2nd one is always
> >> >>> "newer", then I think you should never hit this issue
> >> >>
> >> >>
> >> >> Mike, I'm not sure if I fully understand your suggestion. In a
> nutshell,
> >> >> the use here case is as follows: I want to be able to search the
> index
> >> at a
> >> >> particular point in time, let's call it T0. To that end, I save the
> >> state
> >> >> at that time via a commit and take a snapshot of the index. After
> that,
> >> the
> >> >> index is free to move on, to another point in time, say T1 - and
> likely
> >> >> does. The optimization we have been discussing (and this is what the
> >> test
> >> >> code I posted does) basically asks the reader to go back to point T0,
> >> while
> >> >> reusing as much of the state of the index from T1, as long as it is
> >> >> unchanged between the two.
> >> >>
> >> >> That's what DirectoryReader.openIfChanged(DirectoryReader,
> IndexCommit)
> >> is
> >> >> supposed to do internally... or am I misinterpreting the
> >> >> intent/implementation of it?
> >> >>
> >>

Re: BlockTreeTermsReader consumes crazy amount of memory

Posted by Robert Muir <rc...@gmail.com>.
Yes, there is also a safety check, but IMO it should be removed.

See the patch on the issue, the test passes now.

On Wed, Sep 10, 2014 at 9:31 PM, Vitaly Funstein <vf...@gmail.com> wrote:
> Seems to me the bug occurs regardless of whether the passed in newer reader
> is NRT or non-NRT. This is because the user operates at the level of
> DirectoryReader, not SegmentReader and modifying the test code to do the
> following reproduces the bug:
>
>     writer.commit();
>     DirectoryReader latest = DirectoryReader.open(writer, true);
>
>     // This reader will be used for searching against commit point 1
>     DirectoryReader searchReader = DirectoryReader.openIfChanged(latest,
> ic1); //  <=== Exception/Assertion thrown here
>
>
> On Wed, Sep 10, 2014 at 6:26 PM, Robert Muir <rc...@gmail.com> wrote:
>
>> Thats because there are 3 constructors in segmentreader:
>>
>> 1. one used for opening new (checks hasDeletions, only reads liveDocs if
>> so)
>> 2. one used for non-NRT reopen <-- problem one for you
>> 3. one used for NRT reopen (takes a LiveDocs as a param, so no bug)
>>
>> so personally i think you should be able to do this, we just have to
>> add the hasDeletions check to #2
>>
>> On Wed, Sep 10, 2014 at 7:46 PM, Vitaly Funstein <vf...@gmail.com>
>> wrote:
>> > One other observation - if instead of a reader opened at a later commit
>> > point (T1), I pass in an NRT reader *without* doing the second commit on
>> > the index prior, then there is no exception. This probably also hinges on
>> > the assumption that no buffered docs have been flushed after T0, thus
>> > creating new segment files, as well... unfortunately, our system can't
>> make
>> > either assumption.
>> >
>> > On Wed, Sep 10, 2014 at 4:30 PM, Vitaly Funstein <vf...@gmail.com>
>> > wrote:
>> >
>> >> Normally, reopens only go forwards in time, so if you could ensure
>> >>> that when you reopen one reader to another, the 2nd one is always
>> >>> "newer", then I think you should never hit this issue
>> >>
>> >>
>> >> Mike, I'm not sure if I fully understand your suggestion. In a nutshell,
>> >> the use here case is as follows: I want to be able to search the index
>> at a
>> >> particular point in time, let's call it T0. To that end, I save the
>> state
>> >> at that time via a commit and take a snapshot of the index. After that,
>> the
>> >> index is free to move on, to another point in time, say T1 - and likely
>> >> does. The optimization we have been discussing (and this is what the
>> test
>> >> code I posted does) basically asks the reader to go back to point T0,
>> while
>> >> reusing as much of the state of the index from T1, as long as it is
>> >> unchanged between the two.
>> >>
>> >> That's what DirectoryReader.openIfChanged(DirectoryReader, IndexCommit)
>> is
>> >> supposed to do internally... or am I misinterpreting the
>> >> intent/implementation of it?
>> >>
>>

Re: BlockTreeTermsReader consumes crazy amount of memory

Posted by Vitaly Funstein <vf...@gmail.com>.
Seems to me the bug occurs regardless of whether the passed-in newer reader
is NRT or non-NRT. This is because the user operates at the level of
DirectoryReader, not SegmentReader, and modifying the test code to do the
following reproduces the bug:

    writer.commit();
    DirectoryReader latest = DirectoryReader.open(writer, true);

    // This reader will be used for searching against commit point 1
    DirectoryReader searchReader = DirectoryReader.openIfChanged(latest, ic1); // <=== Exception/Assertion thrown here
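
(Here ic1 is the IndexCommit captured at the first commit point. A hedged
reconstruction of the surrounding repro sequence - snapshotPolicy, doc,
and the "id" term are assumed from the test setup:)

    writer.addDocument(doc);
    writer.commit();                              // commit point 1
    IndexCommit ic1 = snapshotPolicy.snapshot();  // pin it against deletion

    writer.deleteDocuments(new Term("id", "0"));  // deletes the NRT reader will see
    writer.commit();                              // commit point 2

    DirectoryReader latest = DirectoryReader.open(writer, true);

    // Reopen "backwards" to commit point 1, sharing SegmentReaders with latest
    DirectoryReader searchReader = DirectoryReader.openIfChanged(latest, ic1);
    //  <=== NPE on 4.6.1; assertion/exception on newer versions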


On Wed, Sep 10, 2014 at 6:26 PM, Robert Muir <rc...@gmail.com> wrote:

> Thats because there are 3 constructors in segmentreader:
>
> 1. one used for opening new (checks hasDeletions, only reads liveDocs if
> so)
> 2. one used for non-NRT reopen <-- problem one for you
> 3. one used for NRT reopen (takes a LiveDocs as a param, so no bug)
>
> so personally i think you should be able to do this, we just have to
> add the hasDeletions check to #2
>
> On Wed, Sep 10, 2014 at 7:46 PM, Vitaly Funstein <vf...@gmail.com>
> wrote:
> > One other observation - if instead of a reader opened at a later commit
> > point (T1), I pass in an NRT reader *without* doing the second commit on
> > the index prior, then there is no exception. This probably also hinges on
> > the assumption that no buffered docs have been flushed after T0, thus
> > creating new segment files, as well... unfortunately, our system can't
> make
> > either assumption.
> >
> > On Wed, Sep 10, 2014 at 4:30 PM, Vitaly Funstein <vf...@gmail.com>
> > wrote:
> >
> >> Normally, reopens only go forwards in time, so if you could ensure
> >>> that when you reopen one reader to another, the 2nd one is always
> >>> "newer", then I think you should never hit this issue
> >>
> >>
> >> Mike, I'm not sure if I fully understand your suggestion. In a nutshell,
> >> the use here case is as follows: I want to be able to search the index
> at a
> >> particular point in time, let's call it T0. To that end, I save the
> state
> >> at that time via a commit and take a snapshot of the index. After that,
> the
> >> index is free to move on, to another point in time, say T1 - and likely
> >> does. The optimization we have been discussing (and this is what the
> test
> >> code I posted does) basically asks the reader to go back to point T0,
> while
> >> reusing as much of the state of the index from T1, as long as it is
> >> unchanged between the two.
> >>
> >> That's what DirectoryReader.openIfChanged(DirectoryReader, IndexCommit)
> is
> >> supposed to do internally... or am I misinterpreting the
> >> intent/implementation of it?
> >>
>

Re: BlockTreeTermsReader consumes crazy amount of memory

Posted by Robert Muir <rc...@gmail.com>.
That's because there are 3 constructors in SegmentReader:

1. one used for opening new (checks hasDeletions, only reads liveDocs if so)
2. one used for non-NRT reopen <-- the problem one for you
3. one used for NRT reopen (takes liveDocs as a param, so no bug)

so personally I think you should be able to do this, we just have to
add the hasDeletions check to #2
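
A hedged sketch of what that check might look like in constructor #2 - an
illustration of the idea, not the actual patch from the issue:

  // non-NRT reopen: only read live docs when the target commit actually
  // has deletes; otherwise pass null, meaning "all documents are live"
  SegmentReader(SegmentCommitInfo si, SegmentReader sr) throws IOException {
    this(si, sr,
         si.hasDeletions()
             ? si.info.getCodec().liveDocsFormat().readLiveDocs(si.info.dir, si, IOContext.READONCE)
             : null,
         si.info.getDocCount() - si.getDelCount());
  }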

On Wed, Sep 10, 2014 at 7:46 PM, Vitaly Funstein <vf...@gmail.com> wrote:
> One other observation - if instead of a reader opened at a later commit
> point (T1), I pass in an NRT reader *without* doing the second commit on
> the index prior, then there is no exception. This probably also hinges on
> the assumption that no buffered docs have been flushed after T0, thus
> creating new segment files, as well... unfortunately, our system can't make
> either assumption.
>
> On Wed, Sep 10, 2014 at 4:30 PM, Vitaly Funstein <vf...@gmail.com>
> wrote:
>
>> Normally, reopens only go forwards in time, so if you could ensure
>>> that when you reopen one reader to another, the 2nd one is always
>>> "newer", then I think you should never hit this issue
>>
>>
>> Mike, I'm not sure if I fully understand your suggestion. In a nutshell,
>> the use here case is as follows: I want to be able to search the index at a
>> particular point in time, let's call it T0. To that end, I save the state
>> at that time via a commit and take a snapshot of the index. After that, the
>> index is free to move on, to another point in time, say T1 - and likely
>> does. The optimization we have been discussing (and this is what the test
>> code I posted does) basically asks the reader to go back to point T0, while
>> reusing as much of the state of the index from T1, as long as it is
>> unchanged between the two.
>>
>> That's what DirectoryReader.openIfChanged(DirectoryReader, IndexCommit) is
>> supposed to do internally... or am I misinterpreting the
>> intent/implementation of it?
>>



Re: BlockTreeTermsReader consumes crazy amount of memory

Posted by Vitaly Funstein <vf...@gmail.com>.
One other observation - if instead of a reader opened at a later commit
point (T1), I pass in an NRT reader *without* doing the second commit on
the index prior, then there is no exception. This probably also hinges on
the assumption that no buffered docs have been flushed after T0 (which
would create new segment files)... unfortunately, our system can't make
either assumption.

On Wed, Sep 10, 2014 at 4:30 PM, Vitaly Funstein <vf...@gmail.com>
wrote:

> Normally, reopens only go forwards in time, so if you could ensure
>> that when you reopen one reader to another, the 2nd one is always
>> "newer", then I think you should never hit this issue
>
>
> Mike, I'm not sure if I fully understand your suggestion. In a nutshell,
> the use here case is as follows: I want to be able to search the index at a
> particular point in time, let's call it T0. To that end, I save the state
> at that time via a commit and take a snapshot of the index. After that, the
> index is free to move on, to another point in time, say T1 - and likely
> does. The optimization we have been discussing (and this is what the test
> code I posted does) basically asks the reader to go back to point T0, while
> reusing as much of the state of the index from T1, as long as it is
> unchanged between the two.
>
> That's what DirectoryReader.openIfChanged(DirectoryReader, IndexCommit) is
> supposed to do internally... or am I misinterpreting the
> intent/implementation of it?
>

Re: BlockTreeTermsReader consumes crazy amount of memory

Posted by Vitaly Funstein <vf...@gmail.com>.
>
> Normally, reopens only go forwards in time, so if you could ensure
> that when you reopen one reader to another, the 2nd one is always
> "newer", then I think you should never hit this issue


Mike, I'm not sure if I fully understand your suggestion. In a nutshell,
the use case here is as follows: I want to be able to search the index at a
particular point in time, let's call it T0. To that end, I save the state
at that time via a commit and take a snapshot of the index. After that, the
index is free to move on, to another point in time, say T1 - and likely
does. The optimization we have been discussing (and this is what the test
code I posted does) basically asks the reader to go back to point T0, while
reusing as much of the state of the index from T1, as long as it is
unchanged between the two.

That's what DirectoryReader.openIfChanged(DirectoryReader, IndexCommit) is
supposed to do internally... or am I misinterpreting the
intent/implementation of it?

Re: BlockTreeTermsReader consumes crazy amount of memory

Posted by Michael McCandless <lu...@mikemccandless.com>.
Thanks, I'll look at the issue soon.

Right, segment merging won't spontaneously create deletes.  Deletes
are only made if you explicitly delete OR (tricky) there is a
non-aborting exception (e.g. an analysis problem) hit while indexing a
document; in that case IW indexes a portion of the document (up to
where the exception happened) and then marks the document deleted.
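
For instance (a hedged illustration, not from the thread):

    try {
      writer.addDocument(doc);  // suppose the analyzer throws mid-document
    } catch (RuntimeException e) {
      // non-aborting: the writer stays usable, but the partially indexed
      // document is now marked deleted - so a segment can gain deletes
      // without any explicit deleteDocuments() call
    }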

Normally, reopens only go forwards in time, so if you could ensure
that when you reopen one reader to another, the 2nd one is always
"newer", then I think you should never hit this issue?

Mike McCandless

http://blog.mikemccandless.com


On Tue, Sep 9, 2014 at 9:09 PM, Vitaly Funstein <vf...@gmail.com> wrote:
> Okay, created LUCENE-5931 for this. As it turns out, my original test
> actually does do deletes on the index so please disregard my question about
> segment merging.
>
>
> On Tue, Sep 9, 2014 at 3:00 PM, <vf...@gmail.com> wrote:
>
>> I'm on 4.6.1. I'll file an issue for sure, but is there a workaround you
>> could think of in the meantime? As you probably remember, the reason for
>> doing this in the first place was to prevent the catastrophic heap
>> exhaustion when SegmentReader instances are opened from scratch for every
>> new IndexReader created.
>>
>> Coming to think of it, I think this issue can be observed even when the
>> "old" reader is not NRT but even another commit point.
>>
>> But just to check my understanding of Lucene in general - was my hunch
>> correct in that deletes can be present as a result of segment merging and
>> not actual user-driven index updates?
>>
>> > On Sep 9, 2014, at 10:46 AM, Michael McCandless <
>> lucene@mikemccandless.com> wrote:
>> >
>> > Hmm, which Lucene version are you using?  We recently beefed up the
>> > checking in this code, so you ought to be hitting an exception in
>> > newer versions.
>> >
>> > But that being said, I think the bug is real: if you try to reopen
>> > from a newer NRT reader down to an older (commit point) reader then
>> > you can hit this.
>> >
>> > Can you open an issue and maybe post a test case showing it?  Thanks.
>> >
>> > Mike McCandless
>> >
>> > http://blog.mikemccandless.com
>> >
>> >
>> >> On Tue, Sep 9, 2014 at 2:30 AM, Vitaly Funstein <vf...@gmail.com>
>> wrote:
>> >> I think I see the bug here, but maybe I'm wrong. Here's my theory:
>> >>
>> >> Suppose no segments at a particular commit point contain any deletes.
>> Now,
>> >> we also hold open an NRT reader into the index, which may end up with
>> some
>> >> deletes, after the commit occurred. Then, according to the following
>> >> conditional in StandardDirectoryReader, we shall get into the two arg
>> ctor
>> >> of SegmentReader:
>> >>
>> >>            if (newReaders[i].getSegmentInfo().getDelGen() ==
>> >> infos.info(i).getDelGen())
>> >> {
>> >>              // only DV updates
>> >>              newReaders[i] = new SegmentReader(infos.info(i),
>> >> newReaders[i], newReaders[i].getLiveDocs(), newReaders[i].numDocs());
>> >>            } else {
>> >>              // both DV and liveDocs have changed
>> >>              newReaders[i] = new SegmentReader(infos.info(i),
>> >> newReaders[i]);
>> >>            }
>> >>
>> >> That constructor looks like this:
>> >>
>> >>  SegmentReader(SegmentCommitInfo si, SegmentReader sr) throws
>> IOException {
>> >>    this(si, sr,
>> >>         si.info.getCodec().liveDocsFormat().readLiveDocs(si.info.dir,
>> si,
>> >> IOContext.READONCE),
>> >>         si.info.getDocCount() - si.getDelCount());
>> >>  }
>> >>
>> >> At this point, the SegmentInfo we're tryng to read live docs on is from
>> the
>> >> commit point, and if there weren't any deletes, then the following
>> results
>> >> in a null for the relevant file name, in
>> >> Lucene40LiveDocsFormat.readLiveDocs():
>> >>
>> >>    String filename = IndexFileNames.fileNameFromGeneration(
>> info.info.name,
>> >> DELETES_EXTENSION, info.getDelGen());
>> >>    final BitVector liveDocs = new BitVector(dir, filename, context);
>> >>
>> >> This is where filename ends up being null, which gets passed all the way
>> >> down to the File constructor.
>> >>
>> >> In a nutshell, I think the bug is that it is assumed that the segments
>> from
>> >> commit point have deletes, when they may not, yet the original
>> >> SegmentReader for the segment that we are trying to reuse does.
>> >>
>> >> What I am not quite clear yet is how we arrive at this point, because
>> the
>> >> test that causes the exception doesn't do any deletes/updates... could
>> >> deletes occur as a result of a segment merge? This might explain the
>> >> sporadic nature of this exception, since merge timings aren't
>> deterministic.
>> >>
>> >>
>> >> On Mon, Sep 8, 2014 at 11:45 AM, Vitaly Funstein <vf...@gmail.com>
>> >> wrote:
>> >>
>> >>> UPDATE:
>> >>>
>> >>> After making the changes we discussed to enable sharing of
>> SegmentReaders
>> >>> between the NRT reader and a commit point reader, specifically calling
>> >>> through to DirectoryReader.openIfChanged(DirectoryReader,
>> IndexCommit), I
>> >>> am seeing this exception, sporadically:
>> >>>
>> >>> Caused by: java.lang.NullPointerException
>> >>>        at java.io.File.<init>(File.java:305)
>> >>>        at
>> >>>
>> org.terracotta.shaded.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:80)
>> >>>        at
>> >>>
>> org.terracotta.shaded.lucene.codecs.lucene40.BitVector.<init>(BitVector.java:327)
>> >>>        at
>> >>>
>> org.terracotta.shaded.lucene.codecs.lucene40.Lucene40LiveDocsFormat.readLiveDocs(Lucene40LiveDocsFormat.java:90)
>> >>>        at
>> >>>
>> org.terracotta.shaded.lucene.index.SegmentReader.<init>(SegmentReader.java:131)
>> >>>        at
>> >>>
>> org.terracotta.shaded.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:194)
>> >>>        at
>> >>>
>> org.terracotta.shaded.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:326)
>> >>>        at
>> >>>
>> org.terracotta.shaded.lucene.index.StandardDirectoryReader$2.doBody(StandardDirectoryReader.java:320)
>> >>>        at
>> >>>
>> org.terracotta.shaded.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:702)
>> >>>        at
>> >>>
>> org.terracotta.shaded.lucene.index.StandardDirectoryReader.doOpenFromCommit(StandardDirectoryReader.java:315)
>> >>>        at
>> >>>
>> org.terracotta.shaded.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:278)
>> >>>        at
>> >>>
>> org.terracotta.shaded.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:260)
>> >>>        at
>> >>>
>> org.terracotta.shaded.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:183)
>> >>>
>> >>> Looking at the source quickly, it appears the child argument to the
>> File
>> >>> ctor is null; is it somehow possible that the segment infos in the
>> commit
>> >>> point wasn't fully written out somehow, on prior commit? Sounds
>> unlikely,
>> >>> yet disturbing... but nothing else has changed in my code, i.e. the way
>> >>> commits are performed and indexes are reopened.
>> >>>
>> >>>
>> >>> On Fri, Aug 29, 2014 at 2:03 AM, Michael McCandless <
>> >>> lucene@mikemccandless.com> wrote:
>> >>>
>> >>>> On Thu, Aug 28, 2014 at 5:38 PM, Vitaly Funstein <vfunstein@gmail.com
>> >
>> >>>> wrote:
>> >>>>> On Thu, Aug 28, 2014 at 1:25 PM, Michael McCandless <
>> >>>>> lucene@mikemccandless.com> wrote:
>> >>>>>
>> >>>>>>
>> >>>>>> The segments_N file can be different, that's fine: after that, we
>> then
>> >>>>>> re-use SegmentReaders when they are in common between the two commit
>> >>>>>> points.  Each segments_N file refers to many segments...
>> >>>>> Yes, you are totally right - I didn't follow the code far enough the
>> >>>> first
>> >>>>> time around. :) This is an excellent idea, actually - I can probably
>> >>>>> arrange maintained commit points as an MRU data structure (e.g.
>> >>>>> LinkedHashMap with access order), and simply grab the most recently
>> >>>> opened
>> >>>>> reader to pass in when obtaining a new one from the new commit point
>> -
>> >>>> to
>> >>>>> maximize segment reader reuse.
>> >>>>
>> >>>> That's great!
>> >>>>
>> >>>>>> You can set it (min and max) as high as you want; the only hard
>> >>>>>> requirement is that max >= 2*(min-1), I believe.
>> >>>>>
>> >>>>> Looks like this is used inside Lucene41PostingsFormat, which simply
>> >>>> passes
>> >>>>> in those defaults - so you are effectively saying the minimum (and
>> >>>>> therefore, maximum) block size can be raised to reuse the size of the
>> >>>> terms
>> >>>>> index inside those TreeMap nodes?
>> >>>>
>> >>>> Yes, but it then increases cost at search time to locate a given term,
>> >>>> because more scanning is then required once we seek to the block that
>> >>>> might have the term.
>> >>>>
>> >>>> This reduces the size of the FST, but if RAM is being used by
>> >>>> something else inside BT, it won't help.  But from your screen shot it
>> >>>> looked like it was almost entirely the FST, which is what I would
>> >>>> expect.
>> >>>>
>> >>>>>>> We are already using a customized codec though, so perhaps adding
>> >>>>>>> this to the codec is okay and transparent?
>> >>>>>>
>> >>>>>> Hmmm :)  Customized in what manner?
>> >>>>> We need to have the ability to turn off stored fields compression, so
>> >>>> there
>> >>>>> is one codec in case the system is configured that way. The other one
>> >>>>> exists for compression on, but there I tweaked stored fields format
>> for
>> >>>>> bias toward decompression, as well as a smaller chunk size - based on
>> >>>> some
>> >>>>> empirical observations in executed tests. I am guessing I'll just add
>> >>>>> another customization to both that deals with the block sizing for
>> >>>> postings
>> >>>>> format, and see what difference that makes...
>> >>>>
>> >>>> Ahh, OK.  Yes, just add this custom terms index block sizing too.
>> >>>>
>> >>>> Mike McCandless
>> >>>>
>> >>>> http://blog.mikemccandless.com
>> >>>>



Re: BlockTreeTermsReader consumes crazy amount of memory

Posted by Vitaly Funstein <vf...@gmail.com>.
Okay, created LUCENE-5931 for this. As it turns out, my original test
actually does do deletes on the index, so please disregard my question
about segment merging.


On Tue, Sep 9, 2014 at 3:00 PM, <vf...@gmail.com> wrote:

> I'm on 4.6.1. I'll file an issue for sure, but is there a workaround you
> could think of in the meantime? As you probably remember, the reason for
> doing this in the first place was to prevent the catastrophic heap
> exhaustion when SegmentReader instances are opened from scratch for every
> new IndexReader created.
>
> Coming to think of it, I think this issue can be observed even when the
> "old" reader is not NRT but even another commit point.
>
> But just to check my understanding of Lucene in general - was my hunch
> correct in that deletes can be present as a result of segment merging and
> not actual user-driven index updates?
>
> > On Sep 9, 2014, at 10:46 AM, Michael McCandless <
> lucene@mikemccandless.com> wrote:
> >
> > Hmm, which Lucene version are you using?  We recently beefed up the
> > checking in this code, so you ought to be hitting an exception in
> > newer versions.
> >
> > But that being said, I think the bug is real: if you try to reopen
> > from a newer NRT reader down to an older (commit point) reader then
> > you can hit this.
> >
> > Can you open an issue and maybe post a test case showing it?  Thanks.
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> >
> >> On Tue, Sep 9, 2014 at 2:30 AM, Vitaly Funstein <vf...@gmail.com>
> wrote:
> >> I think I see the bug here, but maybe I'm wrong. Here's my theory:
> >>
> >> Suppose no segments at a particular commit point contain any deletes.
> Now,
> >> we also hold open an NRT reader into the index, which may end up with
> some
> >> deletes, after the commit occurred. Then, according to the following
> >> conditional in StandardDirectoryReader, we shall get into the two arg
> ctor
> >> of SegmentReader:
> >>
> >>            if (newReaders[i].getSegmentInfo().getDelGen() ==
> >> infos.info(i).getDelGen())
> >> {
> >>              // only DV updates
> >>              newReaders[i] = new SegmentReader(infos.info(i),
> >> newReaders[i], newReaders[i].getLiveDocs(), newReaders[i].numDocs());
> >>            } else {
> >>              // both DV and liveDocs have changed
> >>              newReaders[i] = new SegmentReader(infos.info(i),
> >> newReaders[i]);
> >>            }
> >>
> >> That constructor looks like this:
> >>
> >>  SegmentReader(SegmentCommitInfo si, SegmentReader sr) throws
> IOException {
> >>    this(si, sr,
> >>         si.info.getCodec().liveDocsFormat().readLiveDocs(si.info.dir,
> si,
> >> IOContext.READONCE),
> >>         si.info.getDocCount() - si.getDelCount());
> >>  }
> >>
> >> At this point, the SegmentInfo we're tryng to read live docs on is from
> the
> >> commit point, and if there weren't any deletes, then the following
> results
> >> in a null for the relevant file name, in
> >> Lucene40LiveDocsFormat.readLiveDocs():
> >>
> >>    String filename = IndexFileNames.fileNameFromGeneration(
> info.info.name,
> >> DELETES_EXTENSION, info.getDelGen());
> >>    final BitVector liveDocs = new BitVector(dir, filename, context);
> >>
> >> This is where filename ends up being null, which gets passed all the way
> >> down to the File constructor.
> >>
> >> In a nutshell, I think the bug is that it is assumed that the segments
> from
> >> commit point have deletes, when they may not, yet the original
> >> SegmentReader for the segment that we are trying to reuse does.
> >>
> >> What I am not quite clear yet is how we arrive at this point, because
> the
> >> test that causes the exception doesn't do any deletes/updates... could
> >> deletes occur as a result of a segment merge? This might explain the
> >> sporadic nature of this exception, since merge timings aren't
> deterministic.
> >>
> >>
> >> On Mon, Sep 8, 2014 at 11:45 AM, Vitaly Funstein <vf...@gmail.com>
> >> wrote:
> >>
> >>> UPDATE:
> >>>
> >>> After making the changes we discussed to enable sharing of
> SegmentReaders
> >>> between the NRT reader and a commit point reader, specifically calling
> >>> through to DirectoryReader.openIfChanged(DirectoryReader,
> IndexCommit), I
> >>> am seeing this exception, sporadically:
> >>>
> >>> Caused by: java.lang.NullPointerException
> >>>        at java.io.File.<init>(File.java:305)
> >>>        at
> >>>
> org.terracotta.shaded.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:80)
> >>>        at
> >>>
> org.terracotta.shaded.lucene.codecs.lucene40.BitVector.<init>(BitVector.java:327)
> >>>        at
> >>>
> org.terracotta.shaded.lucene.codecs.lucene40.Lucene40LiveDocsFormat.readLiveDocs(Lucene40LiveDocsFormat.java:90)
> >>>        at
> >>>
> org.terracotta.shaded.lucene.index.SegmentReader.<init>(SegmentReader.java:131)
> >>>        at
> >>>
> org.terracotta.shaded.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:194)
> >>>        at
> >>>
> org.terracotta.shaded.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:326)
> >>>        at
> >>>
> org.terracotta.shaded.lucene.index.StandardDirectoryReader$2.doBody(StandardDirectoryReader.java:320)
> >>>        at
> >>>
> org.terracotta.shaded.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:702)
> >>>        at
> >>>
> org.terracotta.shaded.lucene.index.StandardDirectoryReader.doOpenFromCommit(StandardDirectoryReader.java:315)
> >>>        at
> >>>
> org.terracotta.shaded.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:278)
> >>>        at
> >>>
> org.terracotta.shaded.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:260)
> >>>        at
> >>>
> org.terracotta.shaded.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:183)
> >>>
> >>> Looking at the source quickly, it appears the child argument to the
> File
> >>> ctor is null; is it somehow possible that the segment infos in the
> commit
> >>> point wasn't fully written out somehow, on prior commit? Sounds
> unlikely,
> >>> yet disturbing... but nothing else has changed in my code, i.e. the way
> >>> commits are performed and indexes are reopened.
> >>>
> >>>
> >>> On Fri, Aug 29, 2014 at 2:03 AM, Michael McCandless <
> >>> lucene@mikemccandless.com> wrote:
> >>>
> >>>> On Thu, Aug 28, 2014 at 5:38 PM, Vitaly Funstein <vfunstein@gmail.com
> >
> >>>> wrote:
> >>>>> On Thu, Aug 28, 2014 at 1:25 PM, Michael McCandless <
> >>>>> lucene@mikemccandless.com> wrote:
> >>>>>
> >>>>>>
> >>>>>> The segments_N file can be different, that's fine: after that, we
> then
> >>>>>> re-use SegmentReaders when they are in common between the two commit
> >>>>>> points.  Each segments_N file refers to many segments...
> >>>>> Yes, you are totally right - I didn't follow the code far enough the
> >>>> first
> >>>>> time around. :) This is an excellent idea, actually - I can probably
> >>>>> arrange maintained commit points as an MRU data structure (e.g.
> >>>>> LinkedHashMap with access order), and simply grab the most recently
> >>>> opened
> >>>>> reader to pass in when obtaining a new one from the new commit point
> -
> >>>> to
> >>>>> maximize segment reader reuse.
> >>>>
> >>>> That's great!
> >>>>
> >>>>>> You can set it (min and max) as high as you want; the only hard
> >>>>>> requirement is that max >= 2*(min-1), I believe.
> >>>>>
> >>>>> Looks like this is used inside Lucene41PostingsFormat, which simply
> >>>> passes
> >>>>> in those defaults - so you are effectively saying the minimum (and
> >>>>> therefore, maximum) block size can be raised to reuse the size of the
> >>>> terms
> >>>>> index inside those TreeMap nodes?
> >>>>
> >>>> Yes, but it then increases cost at search time to locate a given term,
> >>>> because more scanning is then required once we seek to the block that
> >>>> might have the term.
> >>>>
> >>>> This reduces the size of the FST, but if RAM is being used by
> >>>> something else inside BT, it won't help.  But from your screen shot it
> >>>> looked like it was almost entirely the FST, which is what I would
> >>>> expect.
> >>>>
> >>>>>>> We are already using a customized codec though, so perhaps adding
> >>>>>>> this to the codec is okay and transparent?
> >>>>>>
> >>>>>> Hmmm :)  Customized in what manner?
> >>>>> We need to have the ability to turn off stored fields compression, so
> >>>> there
> >>>>> is one codec in case the system is configured that way. The other one
> >>>>> exists for compression on, but there I tweaked stored fields format
> for
> >>>>> bias toward decompression, as well as a smaller chunk size - based on
> >>>> some
> >>>>> empirical observations in executed tests. I am guessing I'll just add
> >>>>> another customization to both that deals with the block sizing for
> >>>> postings
> >>>>> format, and see what difference that makes...
> >>>>
> >>>> Ahh, OK.  Yes, just add this custom terms index block sizing too.
> >>>>
> >>>> Mike McCandless
> >>>>
> >>>> http://blog.mikemccandless.com
> >>>>

Re: BlockTreeTermsReader consumes crazy amount of memory

Posted by vf...@gmail.com.
I'm on 4.6.1. I'll file an issue for sure, but is there a workaround you could think of in the meantime? As you probably remember, the reason for doing this in the first place was to prevent the catastrophic heap exhaustion when SegmentReader instances are opened from scratch for every new IndexReader created. 

Come to think of it, I think this issue can be observed even when the "old" reader is not NRT but another commit point reader.

But just to check my understanding of Lucene in general - was my hunch correct in that deletes can be present as a result of segment merging and not actual user-driven index updates?

> On Sep 9, 2014, at 10:46 AM, Michael McCandless <lu...@mikemccandless.com> wrote:
> 
> Hmm, which Lucene version are you using?  We recently beefed up the
> checking in this code, so you ought to be hitting an exception in
> newer versions.
> 
> But that being said, I think the bug is real: if you try to reopen
> from a newer NRT reader down to an older (commit point) reader then
> you can hit this.
> 
> Can you open an issue and maybe post a test case showing it?  Thanks.
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> 
>> On Tue, Sep 9, 2014 at 2:30 AM, Vitaly Funstein <vf...@gmail.com> wrote:
>> I think I see the bug here, but maybe I'm wrong. Here's my theory:
>> 
>> Suppose no segments at a particular commit point contain any deletes. Now,
>> we also hold open an NRT reader into the index, which may end up with some
>> deletes, after the commit occurred. Then, according to the following
>> conditional in StandardDirectoryReader, we shall get into the two arg ctor
>> of SegmentReader:
>> 
>>            if (newReaders[i].getSegmentInfo().getDelGen() ==
>> infos.info(i).getDelGen())
>> {
>>              // only DV updates
>>              newReaders[i] = new SegmentReader(infos.info(i),
>> newReaders[i], newReaders[i].getLiveDocs(), newReaders[i].numDocs());
>>            } else {
>>              // both DV and liveDocs have changed
>>              newReaders[i] = new SegmentReader(infos.info(i),
>> newReaders[i]);
>>            }
>> 
>> That constructor looks like this:
>> 
>>  SegmentReader(SegmentCommitInfo si, SegmentReader sr) throws IOException {
>>    this(si, sr,
>>         si.info.getCodec().liveDocsFormat().readLiveDocs(si.info.dir, si,
>> IOContext.READONCE),
>>         si.info.getDocCount() - si.getDelCount());
>>  }
>> 
>> At this point, the SegmentInfo we're tryng to read live docs on is from the
>> commit point, and if there weren't any deletes, then the following results
>> in a null for the relevant file name, in
>> Lucene40LiveDocsFormat.readLiveDocs():
>> 
>>    String filename = IndexFileNames.fileNameFromGeneration(info.info.name,
>> DELETES_EXTENSION, info.getDelGen());
>>    final BitVector liveDocs = new BitVector(dir, filename, context);
>> 
>> This is where filename ends up being null, which gets passed all the way
>> down to the File constructor.
>> 
>> In a nutshell, I think the bug is that it is assumed that the segments from
>> commit point have deletes, when they may not, yet the original
>> SegmentReader for the segment that we are trying to reuse does.
>> 
>> What I am not quite clear yet is how we arrive at this point, because the
>> test that causes the exception doesn't do any deletes/updates... could
>> deletes occur as a result of a segment merge? This might explain the
>> sporadic nature of this exception, since merge timings aren't deterministic.
>> 
>> 
>> On Mon, Sep 8, 2014 at 11:45 AM, Vitaly Funstein <vf...@gmail.com>
>> wrote:
>> 
>>> UPDATE:
>>> 
>>> After making the changes we discussed to enable sharing of SegmentReaders
>>> between the NRT reader and a commit point reader, specifically calling
>>> through to DirectoryReader.openIfChanged(DirectoryReader, IndexCommit), I
>>> am seeing this exception, sporadically:
>>> 
>>> Caused by: java.lang.NullPointerException
>>>        at java.io.File.<init>(File.java:305)
>>>        at
>>> org.terracotta.shaded.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:80)
>>>        at
>>> org.terracotta.shaded.lucene.codecs.lucene40.BitVector.<init>(BitVector.java:327)
>>>        at
>>> org.terracotta.shaded.lucene.codecs.lucene40.Lucene40LiveDocsFormat.readLiveDocs(Lucene40LiveDocsFormat.java:90)
>>>        at
>>> org.terracotta.shaded.lucene.index.SegmentReader.<init>(SegmentReader.java:131)
>>>        at
>>> org.terracotta.shaded.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:194)
>>>        at
>>> org.terracotta.shaded.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:326)
>>>        at
>>> org.terracotta.shaded.lucene.index.StandardDirectoryReader$2.doBody(StandardDirectoryReader.java:320)
>>>        at
>>> org.terracotta.shaded.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:702)
>>>        at
>>> org.terracotta.shaded.lucene.index.StandardDirectoryReader.doOpenFromCommit(StandardDirectoryReader.java:315)
>>>        at
>>> org.terracotta.shaded.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:278)
>>>        at
>>> org.terracotta.shaded.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:260)
>>>        at
>>> org.terracotta.shaded.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:183)
>>> 
>>> Looking at the source quickly, it appears the child argument to the File
>>> ctor is null; is it somehow possible that the segment infos in the commit
>>> point wasn't fully written out somehow, on prior commit? Sounds unlikely,
>>> yet disturbing... but nothing else has changed in my code, i.e. the way
>>> commits are performed and indexes are reopened.
>>> 
>>> 
>>> On Fri, Aug 29, 2014 at 2:03 AM, Michael McCandless <
>>> lucene@mikemccandless.com> wrote:
>>> 
>>>> On Thu, Aug 28, 2014 at 5:38 PM, Vitaly Funstein <vf...@gmail.com>
>>>> wrote:
>>>>> On Thu, Aug 28, 2014 at 1:25 PM, Michael McCandless <
>>>>> lucene@mikemccandless.com> wrote:
>>>>> 
>>>>>> 
>>>>>> The segments_N file can be different, that's fine: after that, we then
>>>>>> re-use SegmentReaders when they are in common between the two commit
>>>>>> points.  Each segments_N file refers to many segments...
>>>>> Yes, you are totally right - I didn't follow the code far enough the
>>>> first
>>>>> time around. :) This is an excellent idea, actually - I can probably
>>>>> arrange maintained commit points as an MRU data structure (e.g.
>>>>> LinkedHashMap with access order), and simply grab the most recently
>>>> opened
>>>>> reader to pass in when obtaining a new one from the new commit point -
>>>> to
>>>>> maximize segment reader reuse.
>>>> 
>>>> That's great!
>>>> 
>>>>>> You can set it (min and max) as high as you want; the only hard
>>>>>> requirement is that max >= 2*(min-1), I believe.
>>>>> 
>>>>> Looks like this is used inside Lucene41PostingsFormat, which simply
>>>> passes
>>>>> in those defaults - so you are effectively saying the minimum (and
>>>>> therefore, maximum) block size can be raised to reuse the size of the
>>>> terms
>>>>> index inside those TreeMap nodes?
>>>> 
>>>> Yes, but it then increases cost at search time to locate a given term,
>>>> because more scanning is then required once we seek to the block that
>>>> might have the term.
>>>> 
>>>> This reduces the size of the FST, but if RAM is being used by
>>>> something else inside BT, it won't help.  But from your screen shot it
>>>> looked like it was almost entirely the FST, which is what I would
>>>> expect.
>>>> 
>>>>>>> We are already using a customized codec though, so perhaps adding
>>>>>>> this to the codec is okay and transparent?
>>>>>> 
>>>>>> Hmmm :)  Customized in what manner?
>>>>> We need to have the ability to turn off stored fields compression, so
>>>> there
>>>>> is one codec in case the system is configured that way. The other one
>>>>> exists for compression on, but there I tweaked stored fields format for
>>>>> bias toward decompression, as well as a smaller chunk size - based on
>>>> some
>>>>> empirical observations in executed tests. I am guessing I'll just add
>>>>> another customization to both that deals with the block sizing for
>>>> postings
>>>>> format, and see what difference that makes...
>>>> 
>>>> Ahh, OK.  Yes, just add this custom terms index block sizing too.
>>>> 
>>>> Mike McCandless
>>>> 
>>>> http://blog.mikemccandless.com
>>>> 



Re: BlockTreeTermsReader consumes crazy amount of memory

Posted by Michael McCandless <lu...@mikemccandless.com>.
Hmm, which Lucene version are you using?  We recently beefed up the
checking in this code, so you ought to be hitting an exception in
newer versions.

But that being said, I think the bug is real: if you try to reopen
from a newer NRT reader down to an older (commit point) reader then
you can hit this.

Can you open an issue and maybe post a test case showing it?  Thanks.

Mike McCandless

http://blog.mikemccandless.com


On Tue, Sep 9, 2014 at 2:30 AM, Vitaly Funstein <vf...@gmail.com> wrote:
> I think I see the bug here, but maybe I'm wrong. Here's my theory:
>
> Suppose no segments at a particular commit point contain any deletes. Now,
> we also hold open an NRT reader into the index, which may end up with some
> deletes, after the commit occurred. Then, according to the following
> conditional in StandardDirectoryReader, we shall get into the two arg ctor
> of SegmentReader:
>
>             if (newReaders[i].getSegmentInfo().getDelGen() ==
> infos.info(i).getDelGen())
> {
>               // only DV updates
>               newReaders[i] = new SegmentReader(infos.info(i),
> newReaders[i], newReaders[i].getLiveDocs(), newReaders[i].numDocs());
>             } else {
>               // both DV and liveDocs have changed
>               newReaders[i] = new SegmentReader(infos.info(i),
> newReaders[i]);
>             }
>
> That constructor looks like this:
>
>   SegmentReader(SegmentCommitInfo si, SegmentReader sr) throws IOException {
>     this(si, sr,
>          si.info.getCodec().liveDocsFormat().readLiveDocs(si.info.dir, si,
> IOContext.READONCE),
>          si.info.getDocCount() - si.getDelCount());
>   }
>
> At this point, the SegmentInfo we're tryng to read live docs on is from the
> commit point, and if there weren't any deletes, then the following results
> in a null for the relevant file name, in
> Lucene40LiveDocsFormat.readLiveDocs():
>
>     String filename = IndexFileNames.fileNameFromGeneration(info.info.name,
> DELETES_EXTENSION, info.getDelGen());
>     final BitVector liveDocs = new BitVector(dir, filename, context);
>
> This is where filename ends up being null, which gets passed all the way
> down to the File constructor.
>
> In a nutshell, I think the bug is that it is assumed that the segments from
> commit point have deletes, when they may not, yet the original
> SegmentReader for the segment that we are trying to reuse does.
>
> What I am not quite clear yet is how we arrive at this point, because the
> test that causes the exception doesn't do any deletes/updates... could
> deletes occur as a result of a segment merge? This might explain the
> sporadic nature of this exception, since merge timings aren't deterministic.
>
>
> On Mon, Sep 8, 2014 at 11:45 AM, Vitaly Funstein <vf...@gmail.com>
> wrote:
>
>> UPDATE:
>>
>> After making the changes we discussed to enable sharing of SegmentReaders
>> between the NRT reader and a commit point reader, specifically calling
>> through to DirectoryReader.openIfChanged(DirectoryReader, IndexCommit), I
>> am seeing this exception, sporadically:
>>
>> Caused by: java.lang.NullPointerException
>>         at java.io.File.<init>(File.java:305)
>>         at
>> org.terracotta.shaded.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:80)
>>         at
>> org.terracotta.shaded.lucene.codecs.lucene40.BitVector.<init>(BitVector.java:327)
>>         at
>> org.terracotta.shaded.lucene.codecs.lucene40.Lucene40LiveDocsFormat.readLiveDocs(Lucene40LiveDocsFormat.java:90)
>>         at
>> org.terracotta.shaded.lucene.index.SegmentReader.<init>(SegmentReader.java:131)
>>         at
>> org.terracotta.shaded.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:194)
>>         at
>> org.terracotta.shaded.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:326)
>>         at
>> org.terracotta.shaded.lucene.index.StandardDirectoryReader$2.doBody(StandardDirectoryReader.java:320)
>>         at
>> org.terracotta.shaded.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:702)
>>         at
>> org.terracotta.shaded.lucene.index.StandardDirectoryReader.doOpenFromCommit(StandardDirectoryReader.java:315)
>>         at
>> org.terracotta.shaded.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:278)
>>         at
>> org.terracotta.shaded.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:260)
>>         at
>> org.terracotta.shaded.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:183)
>>
>> Looking at the source quickly, it appears the child argument to the File
>> ctor is null; is it somehow possible that the segment infos in the commit
>> point wasn't fully written out somehow, on prior commit? Sounds unlikely,
>> yet disturbing... but nothing else has changed in my code, i.e. the way
>> commits are performed and indexes are reopened.
>>
>>
>> On Fri, Aug 29, 2014 at 2:03 AM, Michael McCandless <
>> lucene@mikemccandless.com> wrote:
>>
>>> On Thu, Aug 28, 2014 at 5:38 PM, Vitaly Funstein <vf...@gmail.com>
>>> wrote:
>>> > On Thu, Aug 28, 2014 at 1:25 PM, Michael McCandless <
>>> > lucene@mikemccandless.com> wrote:
>>> >
>>> >>
>>> >> The segments_N file can be different, that's fine: after that, we then
>>> >> re-use SegmentReaders when they are in common between the two commit
>>> >> points.  Each segments_N file refers to many segments...
>>> >>
>>> >>
>>> > Yes, you are totally right - I didn't follow the code far enough the
>>> first
>>> > time around. :) This is an excellent idea, actually - I can probably
>>> > arrange maintained commit points as an MRU data structure (e.g.
>>> > LinkedHashMap with access order), and simply grab the most recently
>>> opened
>>> > reader to pass in when obtaining a new one from the new commit point -
>>> to
>>> > maximize segment reader reuse.
>>>
>>> That's great!
>>>
>>> >> You can set it (min and max) as high as you want; the only hard
>>> >> requirement is that max >= 2*(min-1), I believe.
>>> >>
>>> >
>>> > Looks like this is used inside Lucene41PostingsFormat, which simply
>>> passes
>>> > in those defaults - so you are effectively saying the minimum (and
>>> > therefore, maximum) block size can be raised to reuse the size of the
>>> terms
>>> > index inside those TreeMap nodes?
>>>
>>> Yes, but it then increases cost at search time to locate a given term,
>>> because more scanning is then required once we seek to the block that
>>> might have the term.
>>>
>>> This reduces the size of the FST, but if RAM is being used by
>>> something else inside BT, it won't help.  But from your screen shot it
>>> looked like it was almost entirely the FST, which is what I would
>>> expect.
>>>
>>> >> > We are already using a customized codec though, so perhaps adding
>>> >> > this to the codec is okay and transparent?
>>> >>
>>> >> Hmmm :)  Customized in what manner?
>>> >>
>>> >>
>>> > We need to have the ability to turn off stored fields compression, so
>>> there
>>> > is one codec in case the system is configured that way. The other one
>>> > exists for compression on, but there I tweaked stored fields format for
>>> > bias toward decompression, as well as a smaller chunk size - based on
>>> some
>>> > empirical observations in executed tests. I am guessing I'll just add
>>> > another customization to both that deals with the block sizing for
>>> postings
>>> > format, and see what difference that makes...
>>>
>>> Ahh, OK.  Yes, just add this custom terms index block sizing too.
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>>
>>
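
By the way, one concrete way to wire in that block sizing is a FilterCodec
that only swaps the postings format. A sketch (codec name and block sizes
are made up; note the max >= 2*(min-1) constraint, and the codec still has
to be registered via SPI so it can be resolved at read time):

    import org.apache.lucene.codecs.FilterCodec;
    import org.apache.lucene.codecs.PostingsFormat;
    import org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat;
    import org.apache.lucene.codecs.lucene45.Lucene45Codec;

    // identical to the default codec except for terms index block sizing
    public final class BigBlockCodec extends FilterCodec {
      // larger blocks shrink the in-heap FST, at the cost of more
      // in-block scanning per term lookup
      private final PostingsFormat postings = new Lucene41PostingsFormat(128, 512);

      public BigBlockCodec() {
        super("BigBlockCodec", new Lucene45Codec());
      }

      @Override
      public PostingsFormat postingsFormat() {
        return postings;
      }
    }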

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: BlockTreeTermsReader consumes crazy amount of memory

Posted by Vitaly Funstein <vf...@gmail.com>.
I think I see the bug here, but maybe I'm wrong. Here's my theory:

Suppose no segment at a particular commit point contains any deletes. We
also hold open an NRT reader into the index, which may accumulate deletes
after that commit occurred. Then, per the following conditional in
StandardDirectoryReader, reopening at the commit point takes the two-arg
SegmentReader constructor:

    if (newReaders[i].getSegmentInfo().getDelGen() == infos.info(i).getDelGen()) {
      // only DV updates
      newReaders[i] = new SegmentReader(infos.info(i), newReaders[i],
          newReaders[i].getLiveDocs(), newReaders[i].numDocs());
    } else {
      // both DV and liveDocs have changed
      newReaders[i] = new SegmentReader(infos.info(i), newReaders[i]);
    }

That constructor looks like this:

    SegmentReader(SegmentCommitInfo si, SegmentReader sr) throws IOException {
      this(si, sr,
           si.info.getCodec().liveDocsFormat().readLiveDocs(si.info.dir, si, IOContext.READONCE),
           si.info.getDocCount() - si.getDelCount());
    }

At this point, the SegmentInfo we're trying to read live docs for is from
the commit point, and if it has no deletes, the following in
Lucene40LiveDocsFormat.readLiveDocs() produces a null file name:

    String filename = IndexFileNames.fileNameFromGeneration(info.info.name,
        DELETES_EXTENSION, info.getDelGen());
    final BitVector liveDocs = new BitVector(dir, filename, context);

This is where filename ends up being null: fileNameFromGeneration() returns
null when the generation is -1, i.e. when the segment has never had
deletes. That null then gets passed all the way down to the File
constructor.

In a nutshell, I think the bug is the assumption that a segment from the
commit point has deletes whenever the original SegmentReader we are trying
to reuse for it does, which need not be true.
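
If that's right, the reopen logic would need to check whether the commit
point actually has a deletes file before trying to read it. Purely to
illustrate, here is a sketch of the kind of guard I mean (untested, and
I'm assuming SegmentCommitInfo.hasDeletions() is the right check):

    if (newReaders[i].getSegmentInfo().getDelGen() == infos.info(i).getDelGen()) {
      // only DV updates
      newReaders[i] = new SegmentReader(infos.info(i), newReaders[i],
          newReaders[i].getLiveDocs(), newReaders[i].numDocs());
    } else if (infos.info(i).hasDeletions()) {
      // both DV and liveDocs changed, and the commit wrote a .del file
      // that can actually be read
      newReaders[i] = new SegmentReader(infos.info(i), newReaders[i]);
    } else {
      // delGens differ only because the NRT reader carries deletes the
      // commit point never had; open with no live docs instead of trying
      // to read a nonexistent .del file
      newReaders[i] = new SegmentReader(infos.info(i), newReaders[i],
          null, infos.info(i).info.getDocCount());
    }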

What I'm not quite clear on yet is how we arrive at this point, because
the test that triggers the exception doesn't do any deletes or updates...
could deletes occur as a result of a segment merge? That would explain why
the exception is sporadic, since merge timings aren't deterministic.
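
For reference, this is roughly how I'd try to reproduce it in isolation.
A sketch against the 4.x API (class and path names are made up, null
checks are omitted, and I force the delete explicitly instead of waiting
for a merge):

    import java.io.File;

    import org.apache.lucene.analysis.core.KeywordAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexCommit;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.NoDeletionPolicy;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class ReopenAtCommitRepro {
      public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("/tmp/reopen-repro")); // arbitrary path
        IndexWriterConfig iwc =
            new IndexWriterConfig(Version.LUCENE_45, new KeywordAnalyzer());
        iwc.setIndexDeletionPolicy(NoDeletionPolicy.INSTANCE); // keep all commit points
        IndexWriter writer = new IndexWriter(dir, iwc);

        for (String id : new String[] { "1", "2" }) {
          Document doc = new Document();
          doc.add(new StringField("id", id, Field.Store.NO));
          writer.addDocument(doc);
        }
        writer.commit(); // this commit point has no deletes

        DirectoryReader nrt = DirectoryReader.open(writer, true); // NRT reader

        writer.deleteDocuments(new Term("id", "1")); // visible to NRT only
        DirectoryReader nrt2 = DirectoryReader.openIfChanged(nrt); // now carries deletes

        IndexCommit noDeletesCommit = DirectoryReader.listCommits(dir).get(0);
        // reuses nrt2's SegmentReader (which has deletes) against a commit
        // whose delGen is -1; per the theory, this is where the NPE fires
        DirectoryReader atCommit = DirectoryReader.openIfChanged(nrt2, noDeletesCommit);
      }
    }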

