Posted to dev@lucene.apache.org by Jason Rutherglen <ja...@gmail.com> on 2008/06/24 14:01:53 UTC

SegmentReader with custom setting of deletedDocs, single reusable FieldsReader

One of the bottlenecks I have noticed testing Ocean realtime search is the
delete process, which involves writing several files for what may be a single
delete of a document in SegmentReader.  The best way to handle the deletes
is to simply keep them in memory without flushing them to disk, saving the
cost of writing out an entire BitVector per delete.  The deletes are saved
in the transaction log, which is replayed on recovery.

I am not sure of the best way to approach this, perhaps it is creating a
custom class that inherits from SegmentReader.  It could reuse the existing
reopen and also provide a way to set the deletedDocs BitVector.  Also it
would be able to reuse FieldsReader by providing locking around FieldsReader
for all SegmentReaders of the segment to use.  Otherwise, in the current
architecture, each new SegmentReader opens a new FieldsReader, which is
suboptimal.  The deletes would still be saved to disk, but periodically,
like a checkpoint, rather than per delete.
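A minimal sketch of the in-memory delete approach described above (the class
and method names here are hypothetical, not actual Lucene API; java.util.BitSet
stands in for Lucene's BitVector):

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical sketch: deletes are applied to an in-memory bit set and
// appended to a transaction log; the bit set is only flushed to disk at
// periodic checkpoints instead of once per delete.
public class InMemoryDeletes {
    private final BitSet deletedDocs;               // stands in for BitVector
    private final List<Integer> transactionLog = new ArrayList<>();
    private final int checkpointInterval;
    private int deletesSinceCheckpoint = 0;

    public InMemoryDeletes(int maxDoc, int checkpointInterval) {
        this.deletedDocs = new BitSet(maxDoc);
        this.checkpointInterval = checkpointInterval;
    }

    public synchronized void delete(int docId) {
        deletedDocs.set(docId);
        transactionLog.add(docId);                  // replayed on recovery
        if (++deletesSinceCheckpoint >= checkpointInterval) {
            checkpoint();
        }
    }

    public synchronized boolean isDeleted(int docId) {
        return deletedDocs.get(docId);
    }

    private void checkpoint() {
        // A real implementation would write the bit set to a .del file here
        // and truncate the already-applied portion of the transaction log.
        deletesSinceCheckpoint = 0;
        transactionLog.clear();
    }
}
```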

Re: SegmentReader with custom setting of deletedDocs, single reusable FieldsReader

Posted by Yonik Seeley <yo...@apache.org>.
On Wed, Jun 25, 2008 at 11:30 AM, Jason Rutherglen
<ja...@gmail.com> wrote:
> I read other parts of the email but glanced over this part.  Would terms be
> automatically sorted as they came in?  If implemented it would be nice to be
> able to get an encoded representation (probably byte array) of the document
> and postings which could be written to a log, and then reentered in another
> IndexWriter recreating the document and postings.

I was talking simpler...  If one could open an IndexReader on the
index (including uncommitted documents in the open IndexWriter), then
you can easily search for a document and retrieve its stored fields
in order to re-index it with changes (and still maintain decent
performance).

-Yonik


> On Wed, Jun 25, 2008 at 8:41 AM, Yonik Seeley <yo...@apache.org> wrote:
>>
>> On Wed, Jun 25, 2008 at 6:29 AM, Michael McCandless
>> <lu...@mikemccandless.com> wrote:
>> > We've also discussed at one point creating an IndexReader impl that
>> > searches
>> > the RAM buffer that DocumentsWriter writes to when adding documents.  I
>> > think it's easier than it sounds, on first glance, because
>> > DocumentsWriter
>> > is in fact writing the postings in nearly the same format as is used
>> > when
>> > the segment is flushed.
>>
>> That would be very nice, and should also make it much easier to
>> implement updateable documents (changing/adding/removing single
>> fields).
>>
>> -Yonik
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: SegmentReader with custom setting of deletedDocs, single reusable FieldsReader

Posted by Jason Rutherglen <ja...@gmail.com>.
I read other parts of the email but glanced over this part.  Would terms be
automatically sorted as they came in?  If implemented it would be nice to be
able to get an encoded representation (probably byte array) of the document
and postings which could be written to a log, and then reentered in another
IndexWriter recreating the document and postings.

On Wed, Jun 25, 2008 at 8:41 AM, Yonik Seeley <yo...@apache.org> wrote:

> On Wed, Jun 25, 2008 at 6:29 AM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
> > We've also discussed at one point creating an IndexReader impl that
> searches
> > the RAM buffer that DocumentsWriter writes to when adding documents.  I
> > think it's easier than it sounds, on first glance, because
> DocumentsWriter
> > is in fact writing the postings in nearly the same format as is used when
> > the segment is flushed.
>
> That would be very nice, and should also make it much easier to
> implement updateable documents (changing/adding/removing single
> fields).
>
> -Yonik
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

Re: SegmentReader with custom setting of deletedDocs, single reusable FieldsReader

Posted by Yonik Seeley <yo...@apache.org>.
On Wed, Jun 25, 2008 at 6:29 AM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> We've also discussed at one point creating an IndexReader impl that searches
> the RAM buffer that DocumentsWriter writes to when adding documents.  I
> think it's easier than it sounds, on first glance, because DocumentsWriter
> is in fact writing the postings in nearly the same format as is used when
> the segment is flushed.

That would be very nice, and should also make it much easier to
implement updateable documents (changing/adding/removing single
fields).

-Yonik

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: SegmentReader with custom setting of deletedDocs, single reusable FieldsReader

Posted by Yonik Seeley <yo...@apache.org>.
On Fri, Jun 27, 2008 at 2:43 PM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> We could instead keep these SegmentReaders open and reuse them for
> applying deletes.

We discussed caching SegmentReaders in the original buffered deletes issue too
http://issues.apache.org/jira/browse/LUCENE-565

> Then the IndexWriter could present an IndexReader
> (MultiReader) that reads these segments, plus the IndexReader reading
> buffered docs in RAM.  This would basically be a "combined
> IndexWriter / IndexReader".

And all the lucene developers of the world rejoiced :-)

> I think the IndexReader that reads DocumentWriter's RAM buffer would
> still search a point-in-time snapshot of the index, unlike
> InstantiatedIndexReader, and require an explicit reopen() to refresh.

Right.  That provides the most implementation flexibility too (even
w/o such issues as handling delete-by-query).

-Yonik

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: SegmentReader with custom setting of deletedDocs, single reusable FieldsReader

Posted by Jason Rutherglen <ja...@gmail.com>.
One of the main functions of the Ocean code is, as you properly state, to
aggressively close old IndexReaders, freeing resources.  This is why
deletedDocs from closed SegmentReaders should be reused.

> I do a search, then I call documents() to get a Documents instance, I
interact with that to load all my documents, then I close it?

Exactly.  I implemented the threadlocal version as well however and uploaded
it in LUCENE-1314.

> cloned IndexInputs share the same RandomAccessFile instance

That does seem to be an issue.  This is probably why on a 4 core machine
that is fully maxed with queries I see 75% CPU utilization using a single
IndexReader.  When using a multi threaded Searcher CPU goes to 100% as
expected.  Going to 8 core servers this problem is only exacerbated.

> JSR 203

Like NIO, it probably will not work well enough to use for the first 2-3
versions.

> high-speed SSDs

SSDs have a limited number of writes they can perform before they start to
fail, which removes a lot of the benefits.  I don't know of any rapidly
updating databases that are able to use SSDs right now.  The manufacturers
are addressing the issue.

> we need is a new layer, under oal.store, which would manage when to create
a new file descriptor, per thread

Sounds like the best backwards-compatible solution.  The BufferedIndexInput
buffer could default to a larger size and be a thread local so it can be
reused.  An analogy for users would be a J2EE SQL connection pool.

There would need to be a pool of RandomAccessFiles per file, checked by a
global thread that monitors them for inactivity.  The open-new-file-descriptor
method would check whether it was going over the limit and, if so, wait.
This could solve the contention issues.
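A rough sketch of the descriptor-pool idea, under the assumption that a global
semaphore caps the total number of open descriptors and that a thread over the
limit simply blocks until one is returned (all names here are hypothetical):

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.Semaphore;

// Hypothetical per-file pool of RandomAccessFiles: a global limit caps
// total open descriptors, and acquire() blocks when the limit is reached
// until another thread releases or an idle descriptor is available.
public class DescriptorPool {
    private static final Semaphore GLOBAL_LIMIT = new Semaphore(64);
    private final File file;
    private final Deque<RandomAccessFile> idle = new ArrayDeque<>();

    public DescriptorPool(File file) {
        this.file = file;
    }

    public RandomAccessFile acquire() throws IOException, InterruptedException {
        synchronized (idle) {
            RandomAccessFile raf = idle.poll();   // reuse an idle descriptor
            if (raf != null) {
                return raf;
            }
        }
        GLOBAL_LIMIT.acquire();                   // wait if over the limit
        return new RandomAccessFile(file, "r");
    }

    public void release(RandomAccessFile raf) {
        synchronized (idle) {
            idle.push(raf);
        }
        // A global inactivity-monitor thread could periodically close idle
        // descriptors here and release their GLOBAL_LIMIT permits.
    }
}
```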

On Sun, Jun 29, 2008 at 11:13 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> One overarching question here: I understand, with Ocean, that you'd
> expect to be re-opening IndexReaders very frequently (after each add
> or delete) to be "real-time".  But: wouldn't you also expect to be
> aggressively closing the old ones as well (ie, after the in-flight
> searches have finished with them)?  Ie I would think you would not
> have a great many IndexReaders (SegmentReaders) open at a time.
>
> More stuff below:
>
> Jason Rutherglen <ja...@gmail.com> wrote:
>
> > I've been looking more at how to improve the IndexReader.document call.
> > There are a few options.  I implemented the IndexReader.documents call
> which
> > has the down side of not being backward compatible.
>
> Is this the new Documents class you proposed?  Is the thinking that
> each instance of Documents would only last for one search?  Ie, I do a
> search, then I call documents() to get a Documents instance, I
> interact with that to load all my documents, then I close it?
>
> > Probably the only way
> > to achieve both ends is the threadlocal as I noticed term vectors does
> the
> > same thing.  This raises the issue of too many file descriptors for term
> > vectors if there are many reopens, does it not?
>
> Actually, when you clone a TermVectorsReader, which then clones the 3
> IndexInputs, for FSDirectory this does not result in opening
> additional file descriptors.  Instead, the cloned IndexInputs share
> the same RandomAccessFile instance, and synchronize on it so that no
> two can be reading from the file at once.  Of course, this means
> there's still contention since all threads must share the same
> RandomAccessFile instance (but see LUCENE-753 as Yonik suggested).
>
> I think the best way to eventually solve this is to use asynchronous
> IO (JSR 203, to be in Java 7).  If N threads want to load M documents
> each (to show their page of results) then you really want the OS to
> see all M*N requests at once so that the IO system can best schedule
> things.  Modern hard drives, and I believe the high-speed SSDs as
> well, have substantial concurrency available, so to utilize that you
> really want to get the full queue down to devices.  But this solution
> is quite a ways off!
>
> To "emulate" asynchronous IO, we should be able to allow multiple
> threads to access the same file at once, each with their own private
> RandomAccessFile instance.  But of course we can't generally afford
> that today because we'd quickly run out of file descriptors.  Maybe
> what we need is a new layer, under oal.store, which would manage when
> to create a new file descriptor, per thread, and when not to.  This
> layer would be responsible for keeping total # descriptors under a
> certain limit, but would otherwise be free to go up to that limit if
> it seemed like there was contention.  Not sure if there would be
> enough gains to make this worthwhile...
>
> > It would seem that copying
> > the reference to termVectorsLocal on reopens would help with this.  If
> this
> > is amenable then the same could be done for fieldsReader with a
> > fieldsReaderThreadLocal.
>
> I agree, we should be copying this when we copy fieldsReader over.
> (And the same with termVectorsReader if we take this same approach).
> Can you include that in your new patch as well?  (Or, under a new
> issue).  I'm losing track of all these changes!
>
> > IndexReader.document as it is is really a lame duck.  The
> > IndexReader.document call being synchronized at the top level drags down
> the
> > performance of systems that store data in Lucene.  A single file
> descriptor
> > for all threads on an index that is constantly returning results with
> fields
> > is a serious problem.  Users are always complaining about this issue and
> now
> > I know why.
> >
> > This should be a separate issue from IndexReader.clone.
>
> Agreed.
>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

Re: SegmentReader with custom setting of deletedDocs, single reusable FieldsReader

Posted by Michael McCandless <lu...@mikemccandless.com>.
One overarching question here: I understand, with Ocean, that you'd
expect to be re-opening IndexReaders very frequently (after each add
or delete) to be "real-time".  But: wouldn't you also expect to be
aggressively closing the old ones as well (ie, after the in-flight
searches have finished with them)?  Ie I would think you would not
have a great many IndexReaders (SegmentReaders) open at a time.

More stuff below:

Jason Rutherglen <ja...@gmail.com> wrote:

> I've been looking more at how to improve the IndexReader.document call.
> There are a few options.  I implemented the IndexReader.documents call which
> has the down side of not being backward compatible.

Is this the new Documents class you proposed?  Is the thinking that
each instance of Documents would only last for one search?  Ie, I do a
search, then I call documents() to get a Documents instance, I
interact with that to load all my documents, then I close it?

> Probably the only way
> to achieve both ends is the threadlocal as I noticed term vectors does the
> same thing.  This raises the issue of too many file descriptors for term
> vectors if there are many reopens, does it not?

Actually, when you clone a TermVectorsReader, which then clones the 3
IndexInputs, for FSDirectory this does not result in opening
additional file descriptors.  Instead, the cloned IndexInputs share
the same RandomAccessFile instance, and synchronize on it so that no
two can be reading from the file at once.  Of course, this means
there's still contention since all threads must share the same
RandomAccessFile instance (but see LUCENE-753 as Yonik suggested).
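The sharing described above might be sketched like this (a simplified
stand-in, not actual FSDirectory code): clones carry their own file position
but share one RandomAccessFile, so every read synchronizes on the shared
descriptor.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Simplified sketch: clone() opens no new descriptor; all clones share
// the same RandomAccessFile and serialize their seek+read on it.
public class SharedFileInput implements Cloneable {
    private final RandomAccessFile file;   // shared across all clones
    private long position;                 // private per clone

    public SharedFileInput(RandomAccessFile file) {
        this.file = file;
    }

    public void seek(long pos) {
        this.position = pos;
    }

    public void readBytes(byte[] b, int offset, int len) throws IOException {
        synchronized (file) {              // only one clone may read at a time
            file.seek(position);
            file.readFully(b, offset, len);
            position += len;
        }
    }

    @Override
    public SharedFileInput clone() {
        try {
            return (SharedFileInput) super.clone(); // copies position, shares file
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e);   // cannot happen: we are Cloneable
        }
    }
}
```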

I think the best way to eventually solve this is to use asynchronous
IO (JSR 203, to be in Java 7).  If N threads want to load M documents
each (to show their page of results) then you really want the OS to
see all M*N requests at once so that the IO system can best schedule
things.  Modern hard drives, and I believe the high-speed SSDs as
well, have substantial concurrency available, so to utilize that you
really want to get the full queue down to devices.  But this solution
is quite a ways off!

To "emulate" asynchronous IO, we should be able to allow multiple
threads to access the same file at once, each with their own private
RandomAccessFile instance.  But of course we can't generally afford
that today because we'd quickly run out of file descriptors.  Maybe
what we need is a new layer, under oal.store, which would manage when
to create a new file descriptor, per thread, and when not to.  This
layer would be responsible for keeping total # descriptors under a
certain limit, but would otherwise be free to go up to that limit if
it seemed like there was contention.  Not sure if there would be
enough gains to make this worthwhile...

> It would seem that copying
> the reference to termVectorsLocal on reopens would help with this.  If this
> is amenable then the same could be done for fieldsReader with a
> fieldsReaderThreadLocal.

I agree, we should be copying this when we copy fieldsReader over.
(And the same with termVectorsReader if we take this same approach).
Can you include that in your new patch as well?  (Or, under a new
issue).  I'm losing track of all these changes!

> IndexReader.document as it is is really a lame duck.  The
> IndexReader.document call being synchronized at the top level drags down the
> performance of systems that store data in Lucene.  A single file descriptor
> for all threads on an index that is constantly returning results with fields
> is a serious problem.  Users are always complaining about this issue and now
> I know why.
>
> This should be a separate issue from IndexReader.clone.

Agreed.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: SegmentReader with custom setting of deletedDocs, single reusable FieldsReader

Posted by Jason Rutherglen <ja...@gmail.com>.
To complete the removal of any low-level synchronization issues, it seems
like a good idea to add a clone method to CSIndexInput that clones the base
IndexInput, removing the synchronization in readInternal.

On Sun, Jun 29, 2008 at 1:44 PM, Jason Rutherglen <
jason.rutherglen@gmail.com> wrote:

> I voted for http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265734
> about pread not working on Windows.  After thinking more about the pool of
> RandomAccessFiles, I think LUCENE-753
> (https://issues.apache.org/jira/browse/LUCENE-753) is the best solution.  I
> am not sure how much work it would be, nor whether a pool of
> RandomAccessFiles creates more synchronization problems; if it is only to
> benefit Windows, it does not seem worthwhile.
>
> Seems like LUCENE-753 needs the conditional part on the OS and it can be
> committed.
>
> On Sun, Jun 29, 2008 at 10:47 AM, Yonik Seeley <yo...@apache.org> wrote:
>
>> On Sun, Jun 29, 2008 at 9:42 AM, Jason Rutherglen
>> <ja...@gmail.com> wrote:
>> > IndexReader.document as it is is really a lame duck.  The
>> > IndexReader.document call being synchronized at the top level drags down
>> the
>> > performance of systems that store data in Lucene.  A single file
>> descriptor
>> > for all threads on an index that is constantly returning results with
>> fields
>> > is a serious problem.  Users are always complaining about this issue and
>> now
>> > I know why.
>>
>> Each part of the index (e.g. tis, frq) is actually only covered by a
>> single file descriptor by default - stored fields aren't unique in
>> that regard.
>>
>> It's probably the case that the stored fields of a given document are
>> much less likely to be in OS cache though... and in that case having
>> multiple requests in-flight to the disk could improve things.
>>
>> On anything except Windows, using pread may be the answer (after the
>> other synchronization is also removed of course):
>> https://issues.apache.org/jira/browse/LUCENE-753
>>
>> -Yonik
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>>
>

Re: SegmentReader with custom setting of deletedDocs, single reusable FieldsReader

Posted by Jason Rutherglen <ja...@gmail.com>.
I voted for http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265734 about
pread not working on Windows.  After thinking more about the pool of
RandomAccessFiles, I think LUCENE-753
(https://issues.apache.org/jira/browse/LUCENE-753) is the best solution.  I
am not sure how much work it would be, nor whether a pool of
RandomAccessFiles creates more synchronization problems; if it is only to
benefit Windows, it does not seem worthwhile.

Seems like LUCENE-753 needs the conditional part on the OS and it can be
committed.

On Sun, Jun 29, 2008 at 10:47 AM, Yonik Seeley <yo...@apache.org> wrote:

> On Sun, Jun 29, 2008 at 9:42 AM, Jason Rutherglen
> <ja...@gmail.com> wrote:
> > IndexReader.document as it is is really a lame duck.  The
> > IndexReader.document call being synchronized at the top level drags down
> the
> > performance of systems that store data in Lucene.  A single file
> descriptor
> > for all threads on an index that is constantly returning results with
> fields
> > is a serious problem.  Users are always complaining about this issue and
> now
> > I know why.
>
> Each part of the index (e.g. tis, frq) is actually only covered by a
> single file descriptor by default - stored fields aren't unique in
> that regard.
>
> It's probably the case that the stored fields of a given document are
> much less likely to be in OS cache though... and in that case having
> multiple requests in-flight to the disk could improve things.
>
> On anything except Windows, using pread may be the answer (after the
> other synchronization is also removed of course):
> https://issues.apache.org/jira/browse/LUCENE-753
>
> -Yonik
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

Re: SegmentReader with custom setting of deletedDocs, single reusable FieldsReader

Posted by Jason Rutherglen <ja...@gmail.com>.
bq: Each part of the index (e.g. tis, frq) is actually only covered by a
single file descriptor

The code seems to indicate otherwise.  When a query comes in, a cloned
SegmentTermEnum is used with its own file descriptor.  After the query is
completed, SegmentTermEnum is closed along with the file descriptor.  With
FieldsReader currently a document call is made, along with potentially many
other document calls from other threads, and only one may pass using only
one file descriptor.

public SegmentTermEnum terms() {
  return (SegmentTermEnum) origEnum.clone();
}

protected Object clone() {
  SegmentTermEnum clone = null;
  try {
    clone = (SegmentTermEnum) super.clone();
  } catch (CloneNotSupportedException e) {
    // cannot happen: SegmentTermEnum implements Cloneable
  }
  clone.input = (IndexInput) input.clone(); // new file descriptor
  return clone;
}

bq: using pread may be the answer

Yes.  What about the alternative of increasing the buffer size?  That is a
place where a threadlocal could be used to reuse byte buffers, since creating
new large buffers would be expensive.

On Sun, Jun 29, 2008 at 10:47 AM, Yonik Seeley <yo...@apache.org> wrote:

> On Sun, Jun 29, 2008 at 9:42 AM, Jason Rutherglen
> <ja...@gmail.com> wrote:
> > IndexReader.document as it is is really a lame duck.  The
> > IndexReader.document call being synchronized at the top level drags down
> the
> > performance of systems that store data in Lucene.  A single file
> descriptor
> > for all threads on an index that is constantly returning results with
> fields
> > is a serious problem.  Users are always complaining about this issue and
> now
> > I know why.
>
> Each part of the index (e.g. tis, frq) is actually only covered by a
> single file descriptor by default - stored fields aren't unique in
> that regard.
>
> It's probably the case that the stored fields of a given document are
> much less likely to be in OS cache though... and in that case having
> multiple requests in-flight to the disk could improve things.
>
> On anything except Windows, using pread may be the answer (after the
> other synchronization is also removed of course):
> https://issues.apache.org/jira/browse/LUCENE-753
>
> -Yonik
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

Re: SegmentReader with custom setting of deletedDocs, single reusable FieldsReader

Posted by Yonik Seeley <yo...@apache.org>.
On Sun, Jun 29, 2008 at 9:42 AM, Jason Rutherglen
<ja...@gmail.com> wrote:
> IndexReader.document as it is is really a lame duck.  The
> IndexReader.document call being synchronized at the top level drags down the
> performance of systems that store data in Lucene.  A single file descriptor
> for all threads on an index that is constantly returning results with fields
> is a serious problem.  Users are always complaining about this issue and now
> I know why.

Each part of the index (e.g. tis, frq) is actually only covered by a
single file descriptor by default - stored fields aren't unique in
that regard.

It's probably the case that the stored fields of a given document are
much less likely to be in OS cache though... and in that case having
multiple requests in-flight to the disk could improve things.

On anything except Windows, using pread may be the answer (after the
other synchronization is also removed of course):
https://issues.apache.org/jira/browse/LUCENE-753
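The pread approach can be approximated in pure Java with FileChannel's
positional read, which takes an explicit offset and so needs no shared
file-pointer synchronization (a sketch only, not the actual LUCENE-753 patch):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Sketch: FileChannel.read(ByteBuffer, long) is a positional read (pread
// on most Unix platforms), so multiple threads can read from the same
// descriptor concurrently without synchronizing on a shared file pointer.
public class PositionalRead {
    public static byte[] readAt(FileChannel channel, long pos, int len)
            throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(len);
        while (buf.hasRemaining()) {
            // No seek: the offset is passed with every call, and the
            // buffer's position tracks how much has been read so far.
            int n = channel.read(buf, pos + buf.position());
            if (n < 0) {
                throw new IOException("read past EOF");
            }
        }
        return buf.array();
    }
}
```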

-Yonik

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: SegmentReader with custom setting of deletedDocs, single reusable FieldsReader

Posted by Jason Rutherglen <ja...@gmail.com>.
I've been looking more at how to improve the IndexReader.document call.
There are a few options.  I implemented the IndexReader.documents call which
has the down side of not being backward compatible.  Probably the only way
to achieve both ends is the threadlocal as I noticed term vectors does the
same thing.  This raises the issue of too many file descriptors for term
vectors if there are many reopens, does it not?  It would seem that copying
the reference to termVectorsLocal on reopens would help with this.  If this
is amenable then the same could be done for fieldsReader with a
fieldsReaderThreadLocal.

IndexReader.document as it is is really a lame duck.  The
IndexReader.document call being synchronized at the top level drags down the
performance of systems that store data in Lucene.  A single file descriptor
for all threads on an index that is constantly returning results with fields
is a serious problem.  Users are always complaining about this issue and now
I know why.

This should be a separate issue from IndexReader.clone.

On Sun, Jun 29, 2008 at 5:41 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

>
> Jason Rutherglen wrote:
>
>  A possible solution to FieldsReader is to have an IndexReader.documents()
>> method that returns a Documents class.  The Documents class maintains an
>> underlying FieldsReader and file descriptor that can be closed like TermEnum
>> or TermDocs etc.  Of course it would have a document(int n, FieldSelector
>> selector) method.  The issue is what the default behavior would be for
>> IndexReader.document for the SegmentReader.clone/reopen(boolean force).  I'm
>> not sure how efficient it would be to open and close a FieldsReader per
>> IndexReader.document call.
>>
>
> I don't think we want to open/close FieldsReader per document() call?
>
> I think we should test simply synchronizing FieldsReader, first.  I think
> that's the simplest solution.  Modern JVMs have apparently gotten better
> about synchronized calls, especially when there is little contention.  In
> the typical usage of Lucene there would be no contention.  If the
> performance cost is negligible then it makes SegmentReader.doReopen very
> simple -- no external locking or special subclassing is necessary.
>
>  I was using InstantiatedIndex and performing a commit when changes came
>> in, but realized that during the commit incoming searches could see wrong
>> results.  Now, like the InstantiatedIndex javadocs suggest, each one is
>> immutable.
>>
>> The IndexReader over the RAM buffer sounds good.  As an interim solution
>> it would be beneficial to have the SegmentReader.clone/reopen(boolean force)
>> so that the first version of Ocean can be completed and I can move on to
>> other projects like Tag Index.
>>
>
> I agree we should still implement a SegmentReader.clone.  Jason can you
> update the patch?  (change from boolean force to clone(); synchronization of
> FieldsReader; undoing the mixed up import line shuffling).
>
>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

Re: SegmentReader with custom setting of deletedDocs, single reusable FieldsReader

Posted by Michael McCandless <lu...@mikemccandless.com>.
Jason Rutherglen wrote:

> A possible solution to FieldsReader is to have an  
> IndexReader.documents() method that returns a Documents class.  The  
> Documents class maintains an underlying FieldsReader and file  
> descriptor that can be closed like TermEnum or TermDocs etc.  Of  
> course it would have a document(int n, FieldSelector selector)  
> method.  The issue is what the default behavior would be for  
> IndexReader.document for the SegmentReader.clone/reopen(boolean  
> force).  I'm not sure how efficient it would be to open and close a  
> FieldsReader per IndexReader.document call.

I don't think we want to open/close FieldsReader per document() call?

I think we should test simply synchronizing FieldsReader, first.  I  
think that's the simplest solution.  Modern JVMs have apparently  
gotten better about synchronized calls, especially when there is  
little contention.  In the typical usage of Lucene there would be no  
contention.  If the performance cost is negligible then it makes  
SegmentReader.doReopen very simple -- no external locking or special  
subclassing is necessary.
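A minimal sketch of the synchronized-FieldsReader idea (hypothetical names,
and a fixed-length record layout for brevity; real stored fields are
variable-length and located through the .fdx index file):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch: one shared reader whose document() call is synchronized, so
// concurrent threads serialize on the single underlying descriptor
// instead of each opening their own FieldsReader.
public class SynchronizedFieldsReader {
    private final RandomAccessFile file;
    private final int recordLength;

    public SynchronizedFieldsReader(RandomAccessFile file, int recordLength) {
        this.file = file;
        this.recordLength = recordLength;
    }

    // synchronized keeps the seek() and readFully() pair atomic with
    // respect to other threads sharing this reader.
    public synchronized byte[] document(int docId) throws IOException {
        byte[] record = new byte[recordLength];
        file.seek((long) docId * recordLength);
        file.readFully(record);
        return record;
    }
}
```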

> I was using InstantiatedIndex and performing a commit when changes  
> came in, but realized that during the commit incoming searches could  
> see wrong results.  Now, like the InstantiatedIndex javadocs  
> suggest, each one is immutable.
>
> The IndexReader over the RAM buffer sounds good.  As an interim  
> solution it would be beneficial to have the SegmentReader.clone/ 
> reopen(boolean force) so that the first version of Ocean can be  
> completed and I can move on to other projects like Tag Index.

I agree we should still implement a SegmentReader.clone.  Jason can  
you update the patch?  (change from boolean force to clone();  
synchronization of FieldsReader; undoing the mixed up import line  
shuffling).

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: SegmentReader with custom setting of deletedDocs, single reusable FieldsReader

Posted by Jason Rutherglen <ja...@gmail.com>.
A possible solution to FieldsReader is to have an IndexReader.documents()
method that returns a Documents class.  The Documents class maintains an
underlying FieldsReader and file descriptor that can be closed like TermEnum
or TermDocs etc.  Of course it would have a document(int n, FieldSelector
selector) method.  The issue is what the default behavior would be for
IndexReader.document for the SegmentReader.clone/reopen(boolean force).  I'm
not sure how efficient it would be to open and close a FieldsReader per
IndexReader.document call.

I was using InstantiatedIndex and performing a commit when changes came in,
but realized that during the commit incoming searches could see wrong
results.  Now, like the InstantiatedIndex javadocs suggest, each one is
immutable.

The IndexReader over the RAM buffer sounds good.  As an interim solution it
would be beneficial to have the SegmentReader.clone/reopen(boolean force) so
that the first version of Ocean can be completed and I can move on to other
projects like Tag Index.
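A rough sketch of the proposed Documents API described above (hypothetical
class; the record layout is simplified to a fixed length): each caller opens
its own short-lived Documents over a private descriptor, loads what it needs,
and closes it, much like TermEnum or TermDocs usage.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical Documents class: a per-caller handle over its own private
// file descriptor, closed explicitly when the caller is done.
public class Documents implements Closeable {
    private final RandomAccessFile in;   // private descriptor per instance
    private final int recordLength;

    public Documents(String path, int recordLength) throws IOException {
        this.in = new RandomAccessFile(path, "r");
        this.recordLength = recordLength;
    }

    public byte[] document(int docId) throws IOException {
        byte[] fields = new byte[recordLength];
        in.seek((long) docId * recordLength);
        in.readFully(fields);
        return fields;
    }

    @Override
    public void close() throws IOException {
        in.close();                      // releases the descriptor promptly
    }
}
```

Usage would follow the search-then-load pattern from the thread: do a search,
open a Documents, call document(n) for each hit, then close it.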

On Fri, Jun 27, 2008 at 2:43 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> Jason Rutherglen <ja...@gmail.com> wrote:
>
> > One of the things I do not understand about IndexWriter deletes is
> > it does not reuse an already open TermInfosReader with the tii
> > loaded.  Isn't this slower than deleting using an already open
> > IndexReader?
>
> That's right: every time IW decides to flush deletes (which is
> currently before a merge starts, when autoCommit=false), it visits
> each segment and 1) opens a SegmentReader, 2) translates all buffered
> deletes (by term, by query) into docIDs stored into the deletedDocs of
> that SegmentReader, 3) writes a new _X_N.del file to record the deletes
> for the segment, and then 4) closes the SegmentReader.
>
> We could instead keep these SegmentReaders open and reuse them for
> applying deletes.  Then the IndexWriter could present an IndexReader
> (MultiReader) that reads these segments, plus the IndexReader reading
> buffered docs in RAM.  This would basically be a "combined
> IndexWriter / IndexReader".
>
> I think the IndexReader that reads DocumentsWriter's RAM buffer would
> still search a point-in-time snapshot of the index, unlike
> InstantiatedIndexReader, and require an explicit reopen() to refresh.
> This is because some non-trivial computation is still required when
> there are changes.  EG if a delete-by-query has happened, reopen()
> must resolve that query into docIDs.
>
> Mike
>

Re: SegmentReader with custom setting of deletedDocs, single reusable FieldsReader

Posted by Michael McCandless <lu...@mikemccandless.com>.
Jason Rutherglen <ja...@gmail.com> wrote:

> One of the things I do not understand about IndexWriter deletes is
> it does not reuse an already open TermInfosReader with the tii
> loaded.  Isn't this slower than deleting using an already open
> IndexReader?

That's right: every time IW decides to flush deletes (which is
currently before a merge starts, when autoCommit=false), it visits
each segment and 1) opens a SegmentReader, 2) translates all buffered
deletes (by term, by query) into docIDs stored into the deletedDocs of
that SegmentReader, 3) writes a new _X_N.del file to record the deletes
for the segment, and then 4) closes the SegmentReader.

We could instead keep these SegmentReaders open and reuse them for
applying deletes.  Then the IndexWriter could present an IndexReader
(MultiReader) that reads these segments, plus the IndexReader reading
buffered docs in RAM.  This would basically be a "combined
IndexWriter / IndexReader".

I think the IndexReader that reads DocumentsWriter's RAM buffer would
still search a point-in-time snapshot of the index, unlike
InstantiatedIndexReader, and require an explicit reopen() to refresh.
This is because some non-trivial computation is still required when
there are changes.  EG if a delete-by-query has happened, reopen()
must resolve that query into docIDs.

Mike
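The flush-deletes sequence described above could be sketched roughly as follows.  This is simplified stand-in code, not Lucene internals: the term index is a plain map from term to posting docIDs, and the resulting BitSet plays the role of the segment's deletedDocs that would be written out as the .del file:

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

// Simplified sketch: resolve buffered delete terms to docIDs against one
// segment and mark them in that segment's deletedDocs bits.
public class FlushDeletes {
    // termIndex stands in for the segment's term dictionary + postings
    static BitSet applyDeletes(Map<String, int[]> termIndex,
                               Collection<String> bufferedTerms,
                               int maxDoc) {
        BitSet deletedDocs = new BitSet(maxDoc);
        for (String term : bufferedTerms) {
            int[] postings = termIndex.get(term);
            if (postings == null) continue;   // term not in this segment
            for (int docID : postings) {
                deletedDocs.set(docID);       // mark doc as deleted
            }
        }
        return deletedDocs; // caller would record this as the segment's .del file
    }

    public static void main(String[] args) {
        Map<String, int[]> idx = new HashMap<>();
        idx.put("id:7", new int[] {3});
        idx.put("id:9", new int[] {5});
        BitSet del = applyDeletes(idx, Arrays.asList("id:7", "id:9"), 8);
        System.out.println(del); // prints {3, 5}
    }
}
```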



Re: SegmentReader with custom setting of deletedDocs, single reusable FieldsReader

Posted by Jason Rutherglen <ja...@gmail.com>.
I understand what you are saying.  I am not sure it is worth the "clearly
quite a bit more work" given how easy it is to allow more control over the
IndexReader deletedDocs BitVector, which seems like a feature that should be
there anyway, perhaps even allowing SortedVIntList to be used.  The other
issue with integrating too deeply with IndexWriter is that I am not sure how
to integrate realtime document additions into IndexWriter, which
InstantiatedIndex handles best.  When merging needs to happen in Ocean,
IndexWriter.addIndexes(IndexReader[] readers) is used to merge SegmentReaders
and InstantiatedIndexReaders.

One of the things I do not understand about IndexWriter deletes is that it
does not reuse an already open TermInfosReader with the tii loaded.  Isn't
this slower than deleting using an already open IndexReader?

In any case, setting deletedDocs on SegmentReader via the given patch seems
to work quite well in Ocean now.  Long term there is probably some way to
integrate more with IndexWriter, but that is really more in line with
removing the separate concepts of IndexReader and IndexWriter and creating an
IndexReaderWriter class.

On Wed, Jun 25, 2008 at 6:29 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

>
> Jason Rutherglen wrote:
>
>  One of the bottlenecks I have noticed testing Ocean realtime search is the
>> delete process, which involves writing several files for what may be a single
>> delete of a document in SegmentReader.  The best way to handle the deletes
>> is to simply keep them in memory without flushing them to disk, saving on
>> writing out an entire BitVector per delete.  The deletes are saved in the
>> transaction log, which is replayed on recovery.
>>
>> I am not sure of the best way to approach this; perhaps it is creating a
>> custom class that inherits from SegmentReader.  It could reuse the existing
>> reopen and also provide a way to set the deletedDocs BitVector.  It would
>> also be able to reuse FieldsReader by providing locking around FieldsReader
>> for all SegmentReaders of the segment to use.  Otherwise, in the current
>> architecture, each new SegmentReader opens a new FieldsReader, which is
>> non-optimal.  The deletes would still be saved to disk, but periodically,
>> like a checkpoint, instead of per delete.
>>
>
> Or ... maybe you could do the deletes through IndexWriter (somehow, if we
> can get docIDs properly) and then SegmentReaders could somehow tap into the
> buffered deleted docIDs that IndexWriter already maintains.  IndexWriter is
> already doing this buffering, flush/commit anyway.
>
> We've also discussed at one point creating an IndexReader impl that
> searches the RAM buffer that DocumentsWriter writes to when adding
> documents.  I think it's easier than it sounds, on first glance, because
> DocumentsWriter is in fact writing the postings in nearly the same format as
> is used when the segment is flushed.
>
> So if we had this IndexReader impl, plus extended SegmentReader so it could
> tap into pending deletes buffered in IndexWriter, you could get realtime
> search without having to use Directory as an intermediary.  Though, it is
> clearly quite a bit more work :)
>
> Mike
>

Re: SegmentReader with custom setting of deletedDocs, single reusable FieldsReader

Posted by Michael McCandless <lu...@mikemccandless.com>.
Jason Rutherglen wrote:

> One of the bottlenecks I have noticed testing Ocean realtime search
> is the delete process, which involves writing several files for what
> may be a single delete of a document in SegmentReader.  The best way
> to handle the deletes is to simply keep them in memory without
> flushing them to disk, saving on writing out an entire BitVector per
> delete.  The deletes are saved in the transaction log, which is
> replayed on recovery.
>
> I am not sure of the best way to approach this; perhaps it is
> creating a custom class that inherits from SegmentReader.  It could
> reuse the existing reopen and also provide a way to set the
> deletedDocs BitVector.  It would also be able to reuse FieldsReader
> by providing locking around FieldsReader for all SegmentReaders of
> the segment to use.  Otherwise, in the current architecture, each new
> SegmentReader opens a new FieldsReader, which is non-optimal.  The
> deletes would still be saved to disk, but periodically, like a
> checkpoint, instead of per delete.

Or ... maybe you could do the deletes through IndexWriter (somehow, if  
we can get docIDs properly) and then SegmentReaders could somehow tap  
into the buffered deleted docIDs that IndexWriter already maintains.   
IndexWriter is already doing this buffering, flush/commit anyway.

We've also discussed at one point creating an IndexReader impl that  
searches the RAM buffer that DocumentsWriter writes to when adding  
documents.  I think it's easier than it sounds, on first glance,  
because DocumentsWriter is in fact writing the postings in nearly the  
same format as is used when the segment is flushed.

So if we had this IndexReader impl, plus extended SegmentReader so it  
could tap into pending deletes buffered in IndexWriter, you could get  
realtime search without having to use Directory as an intermediary.   
Though, it is clearly quite a bit more work :)

Mike
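The in-memory delete buffering with periodic checkpointing discussed in this thread might look roughly like the following.  This is a sketch under stated assumptions, not Lucene or Ocean code: the class and its transaction-log shape are invented for illustration, and the checkpoint step only stands in for writing a .del file:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Sketch: deletes are applied to an in-memory bit set and appended to a
// transaction log; the bit set is flushed to disk only at periodic
// checkpoints, and on recovery the logged deletes since the last
// checkpoint are replayed.
public class BufferedDeletes {
    final BitSet deletedDocs = new BitSet();
    final List<Integer> txLog = new ArrayList<>(); // replayed on recovery
    final int checkpointEvery;
    int deletesSinceCheckpoint;

    BufferedDeletes(int checkpointEvery) {
        this.checkpointEvery = checkpointEvery;
    }

    void deleteDocument(int docID) {
        deletedDocs.set(docID);       // visible to searches immediately
        txLog.add(docID);             // durable record of the delete
        if (++deletesSinceCheckpoint >= checkpointEvery) {
            checkpoint();
        }
    }

    void checkpoint() {
        // a real impl would write deletedDocs out as a .del file here
        deletesSinceCheckpoint = 0;
    }

    void recover(List<Integer> loggedDeletes) {
        for (int docID : loggedDeletes) {
            deletedDocs.set(docID);   // replay the transaction log
        }
    }
}
```

The trade-off is exactly the one raised at the top of the thread: one file write per checkpoint instead of per delete, with the transaction log covering the window between checkpoints.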
