You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by Shai Erera <se...@gmail.com> on 2009/06/08 22:07:02 UTC

Some thoughts around the use of reader.isDeleted and hasDeletions

Hi

I recently read CHANGES to learn more about the readOnly parameter
IndexReader now supports, and came across LUCENE-1329 with a comment that
isDeleted was made not synchronized if readOnly=true (e.g.
ReadOnlyIndexReader), which can affect search code, as it is usually the
bottleneck for search operations.

I searched the code and was surprised to see isDeleted and hasDeletions are
not called from any search code. Instead, the code, such as SegmentTermDocs,
uses the deletedDocs instance directly. So in fact isDeleted wasn't the
bottleneck (unless this was part of the changes in 1329). Anyway, doesn't
matter, that's good !

However, I did find out some indexing code whic calls these two, like
SegmentMerger when it merges fields and term vectors. I think that we can
improve that code by writing some specialized code for merging - if the
reader has no deletions, there's no point checking for each document if
there are deletions and/or if the document was deleted. In fact,
SegmentMerger checks for each document: (1) whether the reader has deletions
and if the document was deleted, (2) if the reader has a matching reader and
(3) if checkAbort is not null.

I have a couple of suggestions to simplify that code:
1. Create two specialized copyFieldsWithDeletions/copyFieldsNoDeletions to
get rid of the unnecessary if (hasDeletions) check for each doc.
2. In these, check if the reader has matching reader or not, and execute the
appropriate code in each case.
3. Also, check up front if checkAbort != null.
3.1 (3) duplicates the code - so perhaps instead I can create a dummy
checkAbort, which does nothing? That way we'll always call
checkAbort.work(). But this adds a method call, so I'm not sure.

(same optimizations for mergeVectors()).

In addition, I think something can be done w/ SegmentTermDocs. Either create
a specialized STD based on whether it has deletions or not, or create a
dummy BitVector which returns false for every check. That way we can
eliminate the checks in each next(), skipTo(). Dummy BitVector will leave
the file as-is, but will add a method call, so I think I lean towards the
specialized STD. This can be as simple as making STD abstract, with a static
create() method which creates the right instance.

I believe there are other places where we can make such optimizations. If
the overall direction seems right to you, I can open an issue and start to
work on a patch. Currently I've patched SegmentMerger, and besides the class
getting bigger, nothing bad happened (meaning all tests pass), though it
might be interesting to check how it affects performance.

What do you think?

Shai

Re: Some thoughts around the use of reader.isDeleted and hasDeletions

Posted by Jason Rutherglen <ja...@gmail.com>.

> I searched the code and was surprised to see isDeleted and hasDeletions
are not called from any search code.

It was weeded out over time, MatchAllDocsQuery for example used to call it.
I think it was to offer users (who are using isDeleted) a way to access
deleted docs without a performance hit.

In the future deletedDocs will be a standardized randomaccessfilter with an
modifiable API that can be used across any filter.  Basically genericizing
and modularizing the API.  For example, I believe deleted docs could be
stored in column stride fields (when it's completed)?

On Mon, Jun 8, 2009 at 1:07 PM, Shai Erera <se...@gmail.com> wrote:

> Hi
>
> I recently read CHANGES to learn more about the readOnly parameter
> IndexReader now supports, and came across LUCENE-1329 with a comment that
> isDeleted was made not synchronized if readOnly=true (e.g.
> ReadOnlyIndexReader), which can affect search code, as it is usually the
> bottleneck for search operations.
>
> I searched the code and was surprised to see isDeleted and hasDeletions are
> not called from any search code. Instead, the code, such as SegmentTermDocs,
> uses the deletedDocs instance directly. So in fact isDeleted wasn't the
> bottleneck (unless this was part of the changes in 1329). Anyway, doesn't
> matter, that's good !
>
> However, I did find out some indexing code whic calls these two, like
> SegmentMerger when it merges fields and term vectors. I think that we can
> improve that code by writing some specialized code for merging - if the
> reader has no deletions, there's no point checking for each document if
> there are deletions and/or if the document was deleted. In fact,
> SegmentMerger checks for each document: (1) whether the reader has deletions
> and if the document was deleted, (2) if the reader has a matching reader and
> (3) if checkAbort is not null.
>
> I have a couple of suggestions to simplify that code:
> 1. Create two specialized copyFieldsWithDeletions/copyFieldsNoDeletions to
> get rid of the unnecessary if (hasDeletions) check for each doc.
> 2. In these, check if the reader has matching reader or not, and execute
> the appropriate code in each case.
> 3. Also, check up front if checkAbort != null.
> 3.1 (3) duplicates the code - so perhaps instead I can create a dummy
> checkAbort, which does nothing? That way we'll always call
> checkAbort.work(). But this adds a method call, so I'm not sure.
>
> (same optimizations for mergeVectors()).
>
> In addition, I think something can be done w/ SegmentTermDocs. Either
> create a specialized STD based on whether it has deletions or not, or create
> a dummy BitVector which returns false for every check. That way we can
> eliminate the checks in each next(), skipTo(). Dummy BitVector will leave
> the file as-is, but will add a method call, so I think I lean towards the
> specialized STD. This can be as simple as making STD abstract, with a static
> create() method which creates the right instance.
>
> I believe there are other places where we can make such optimizations. If
> the overall direction seems right to you, I can open an issue and start to
> work on a patch. Currently I've patched SegmentMerger, and besides the class
> getting bigger, nothing bad happened (meaning all tests pass), though it
> might be interesting to check how it affects performance.
>
> What do you think?
>
> Shai
>

Re: Some thoughts around the use of reader.isDeleted and hasDeletions

Posted by Yonik Seeley <yo...@lucidimagination.com>.

On Mon, Jun 8, 2009 at 4:07 PM, Shai Erera <se...@gmail.com> wrote:
> if the
> reader has no deletions, there's no point checking for each document if
> there are deletions and/or if the document was deleted.

If there are no deletions, it's just a null pointer check, right?
Or would there be other benefits?

-Yonik
http://www.lucidimagination.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: Some thoughts around the use of reader.isDeleted and hasDeletions

Posted by Yonik Seeley <yo...@lucidimagination.com>.

On Tue, Jun 9, 2009 at 11:52 AM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> Actually: I think we should also change IndexReader.document to not
> check if it's deleted?  (Renaming it to something like rawDocument(),
> storedDocument(), something, in the process, and deprecating the old
> one).

+1

-Yonik
http://www.lucidimagination.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: Some thoughts around the use of reader.isDeleted and hasDeletions

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Wed, Jun 10, 2009 at 11:16 AM, Shai Erera<se...@gmail.com> wrote:
>> it makes sense because isDeleted() is essentially the *only* thing
>> being done in the loop, and hence we can eliminate the loop entirely
>
> You mean that in case there is a matching segment, we can call
> matchingVectorsReader.rawDocs(rawDocLengths, rawDocLengths2, 0, maxDoc)?

Right, except you'll have to do it in chunks of rawDocLengths.length,
until you get to maxDoc.

> But in case it does not have a matching segment, we'd still need to iterate
> on the docs, and copy the term vectors one by one, right?

Yes.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: Some thoughts around the use of reader.isDeleted and hasDeletions

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Thu, Jun 18, 2009 at 11:18 PM, Earwin Burrfoot<ea...@gmail.com> wrote:
> Runtime change. Hard to imagine people relying on failing document() call.

+1

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: Some thoughts around the use of reader.isDeleted and hasDeletions

Posted by Earwin Burrfoot <ea...@gmail.com>.

Runtime change. Hard to imagine people relying on failing document() call.

On Fri, Jun 19, 2009 at 00:29, Shai Erera<se...@gmail.com> wrote:
> I've made the changes to SegmentMerger and want to make the following
> changes to IndexReader.document(): (1) don't call ensureOpen() and (2) don't
> check isDeleted.
>
> Question is - can I make these changes on the current impls, or do I need to
> deprecate and come up w/ a new name? Here a new name is not a big challenge
> - we can choose: doc() or getDocument() for example. I don't feel
> rawDocument flows nicely (what's "raw" about it?)
>
> IMO, even though these are back-compat changes (to runtime), they are not
> likely to affect anyone. I mean, why would someone deliberately call
> document() when the reader has already been closed (unless he doesn't know
> it at the time of calling document()). For easy migration (if you rely on
> that feature), I can add isClose()/isOpen() w/ a default impl to call
> ensureOpen().
>
> Or why to call document(doc) if the doc is deleted. What's the scenario?
>
> Anyway, those two changes are necessary as our merging code calls them, but
> already check that a doc is deleted or not before. So it's just a question
> of a new method vs. a runtime change.
>
> What do you think?
>
> Shai
>
> On Wed, Jun 10, 2009 at 6:39 PM, Yonik Seeley <yo...@lucidimagination.com>
> wrote:
>>
>> On Wed, Jun 10, 2009 at 11:16 AM, Shai Erera <se...@gmail.com> wrote:
>> >> it makes sense because isDeleted() is essentially the *only* thing
>> >> being done in the loop, and hence we can eliminate the loop entirely
>> >
>> > You mean that in case there is a matching segment, we can call
>> > matchingVectorsReader.rawDocs(rawDocLengths, rawDocLengths2, 0, maxDoc)?
>>
>> Right... or rather directly calculate numDocs and docNum instead of
>> using the loop.
>>
>> > But in case it does not have a matching segment, we'd still need to
>> > iterate
>> > on the docs, and copy the term vectors one by one, right?
>>
>> Right, and that's the case where I think duplicating the code to
>> remove a single branch-predictable boolean flag isn't warranted as it
>> won't result in a measurable performance increase.
>>
>> -Yonik
>> http://www.lucidimagination.com
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>
>



-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: Some thoughts around the use of reader.isDeleted and hasDeletions

Posted by Shai Erera <se...@gmail.com>.

I've made the changes to SegmentMerger and want to make the following
changes to IndexReader.document(): (1) don't call ensureOpen() and (2) don't
check isDeleted.

Question is - can I make these changes on the current impls, or do I need to
deprecate and come up w/ a new name? Here a new name is not a big challenge
- we can choose: doc() or getDocument() for example. I don't feel
rawDocument flows nicely (what's "raw" about it?)

IMO, even though these are back-compat changes (to runtime), they are not
likely to affect anyone. I mean, why would someone deliberately call
document() when the reader has already been closed (unless he doesn't know
it at the time of calling document()). For easy migration (if you rely on
that feature), I can add isClose()/isOpen() w/ a default impl to call
ensureOpen().

Or why to call document(doc) if the doc is deleted. What's the scenario?

Anyway, those two changes are necessary as our merging code calls them, but
already check that a doc is deleted or not before. So it's just a question
of a new method vs. a runtime change.

What do you think?

Shai

On Wed, Jun 10, 2009 at 6:39 PM, Yonik Seeley <yo...@lucidimagination.com>wrote:

> On Wed, Jun 10, 2009 at 11:16 AM, Shai Erera <se...@gmail.com> wrote:
> >> it makes sense because isDeleted() is essentially the *only* thing
> >> being done in the loop, and hence we can eliminate the loop entirely
> >
> > You mean that in case there is a matching segment, we can call
> > matchingVectorsReader.rawDocs(rawDocLengths, rawDocLengths2, 0, maxDoc)?
>
> Right... or rather directly calculate numDocs and docNum instead of
> using the loop.
>
> > But in case it does not have a matching segment, we'd still need to
> iterate
> > on the docs, and copy the term vectors one by one, right?
>
> Right, and that's the case where I think duplicating the code to
> remove a single branch-predictable boolean flag isn't warranted as it
> won't result in a measurable performance increase.
>
> -Yonik
> http://www.lucidimagination.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

Re: Some thoughts around the use of reader.isDeleted and hasDeletions

Posted by Yonik Seeley <yo...@lucidimagination.com>.

On Wed, Jun 10, 2009 at 11:16 AM, Shai Erera <se...@gmail.com> wrote:
>> it makes sense because isDeleted() is essentially the *only* thing
>> being done in the loop, and hence we can eliminate the loop entirely
>
> You mean that in case there is a matching segment, we can call
> matchingVectorsReader.rawDocs(rawDocLengths, rawDocLengths2, 0, maxDoc)?

Right... or rather directly calculate numDocs and docNum instead of
using the loop.

> But in case it does not have a matching segment, we'd still need to iterate
> on the docs, and copy the term vectors one by one, right?

Right, and that's the case where I think duplicating the code to
remove a single branch-predictable boolean flag isn't warranted as it
won't result in a measurable performance increase.

-Yonik
http://www.lucidimagination.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: Some thoughts around the use of reader.isDeleted and hasDeletions

Posted by Shai Erera <se...@gmail.com>.

>
> it makes sense because isDeleted() is essentially the *only* thing
> being done in the loop, and hence we can eliminate the loop entirely
>

You mean that in case there is a matching segment, we can call
matchingVectorsReader.rawDocs(rawDocLengths, rawDocLengths2, 0, maxDoc)?
But in case it does not have a matching segment, we'd still need to iterate
on the docs, and copy the term vectors one by one, right?

I'm not very familiar w/ the code, so I'd like to confirm my understanding.

Shai

On Tue, Jun 9, 2009 at 9:54 PM, Yonik Seeley <
yonik.seeley@lucidimagination.com> wrote:

> 2009/6/9 Shai Erera <se...@gmail.com>:
> >> If there are no deletions, it's just a null pointer check, right?
> >
> > Well ... one null pointer check here, one null pointer check there and at
> > some point you will see a difference. My point wasn't the null pointer
> check
> > itself, but the pointer check for *every* document in mergeFields() and
> > *every* document in mergeVectors().
>
> I like performance, but it does seem like anything that complicates
> the code (duplication and specialization) should result in an actual
> measurable performance increase.
>
> But in this specific case (I just looked at the code for mergeVectors)
> it makes sense because isDeleted() is essentially the *only* thing
> being done in the loop, and hence we can eliminate the loop entirely
> (an algorithmic change, not just eliminating a null pointer check
> per-document in the context of doing something else per-document).
>
> patch away ;-)
>
> -Yonik
> http://www.lucidimagination.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

Re: Some thoughts around the use of reader.isDeleted and hasDeletions

Posted by Yonik Seeley <yo...@lucidimagination.com>.

2009/6/9 Shai Erera <se...@gmail.com>:
>> If there are no deletions, it's just a null pointer check, right?
>
> Well ... one null pointer check here, one null pointer check there and at
> some point you will see a difference. My point wasn't the null pointer check
> itself, but the pointer check for *every* document in mergeFields() and
> *every* document in mergeVectors().

I like performance, but it does seem like anything that complicates
the code (duplication and specialization) should result in an actual
measurable performance increase.

But in this specific case (I just looked at the code for mergeVectors)
it makes sense because isDeleted() is essentially the *only* thing
being done in the loop, and hence we can eliminate the loop entirely
(an algorithmic change, not just eliminating a null pointer check
per-document in the context of doing something else per-document).

patch away ;-)

-Yonik
http://www.lucidimagination.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: Some thoughts around the use of reader.isDeleted and hasDeletions

Posted by Shai Erera <se...@gmail.com>.

>
> Be careful: checkAbort needs to be called "fairly" frequently so
> IndexWriter.close(false) doesn't take too long.
>

Of course - I meant check up front if checkAbort != null, and then always
call work() if it's not null. But I also agree that a dummy impl is the
better approach, since it's not a common case that checkAbort == null.

If there are no deletions, it's just a null pointer check, right?
>

Well ... one null pointer check here, one null pointer check there and at
some point you will see a difference. My point wasn't the null pointer check
itself, but the pointer check for *every* document in mergeFields() and
*every* document in mergeVectors(). So regardless if there are deletions or
not, we do a pointer check (actually a boolean check), and for every
document. For a very large segment, I believe it will amount to something.

Overall I don't think it will improve the performance significantly
(although fixing this in many places can have a certain affect on
performance), but it will clean up the code, and these *redundant* checks
won't stand out (after all, during the iteration the reader won't suddenly
have or not have deletions ...).

So should I open an issue?

Shai

On Tue, Jun 9, 2009 at 7:29 PM, Earwin Burrfoot <ea...@gmail.com> wrote:

> > Actually: I think we should also change IndexReader.document to not
> > check if it's deleted?  (Renaming it to something like rawDocument(),
> > storedDocument(), something, in the process, and deprecating the old
> > one).
> Yup. After all the most common use-case is to load a document after
> finding it in one or another way. Pretty hard to come up with id of a
> deleted document.
>
> --
> Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
> Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
> ICQ: 104465785
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

Re: Some thoughts around the use of reader.isDeleted and hasDeletions

Posted by Earwin Burrfoot <ea...@gmail.com>.

> Actually: I think we should also change IndexReader.document to not
> check if it's deleted?  (Renaming it to something like rawDocument(),
> storedDocument(), something, in the process, and deprecating the old
> one).
Yup. After all the most common use-case is to load a document after
finding it in one or another way. Pretty hard to come up with id of a
deleted document.

-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: Some thoughts around the use of reader.isDeleted and hasDeletions

Posted by Michael McCandless <lu...@mikemccandless.com>.

> I recently read CHANGES to learn more about the readOnly parameter
> IndexReader now supports, and came across LUCENE-1329 with a comment that
> isDeleted was made not synchronized if readOnly=true (e.g.
> ReadOnlyIndexReader), which can affect search code, as it is usually the
> bottleneck for search operations.
>
> I searched the code and was surprised to see isDeleted and hasDeletions are
> not called from any search code. Instead, the code, such as SegmentTermDocs,
> uses the deletedDocs instance directly. So in fact isDeleted wasn't the
> bottleneck (unless this was part of the changes in 1329). Anyway, doesn't
> matter, that's good !

You're right!  (I had also thought it was more widely used). I guess
apps that iterate through many docs (calling document()) might hit
this?  Also, it's only in 2.9 that MatchAllDocsQuery doesn't use
isDeleted.  So I agree the benefit of using readOnly reader is
presumably not much, as of 2.9...

Actually: I think we should also change IndexReader.document to not
check if it's deleted?  (Renaming it to something like rawDocument(),
storedDocument(), something, in the process, and deprecating the old
one).

> However, I did find out some indexing code whic calls these two, like
> SegmentMerger when it merges fields and term vectors. I think that we can
> improve that code by writing some specialized code for merging - if the
> reader has no deletions, there's no point checking for each document if
> there are deletions and/or if the document was deleted. In fact,
> SegmentMerger checks for each document: (1) whether the reader has deletions
> and if the document was deleted, (2) if the reader has a matching reader and
> (3) if checkAbort is not null.
>
> I have a couple of suggestions to simplify that code:
> 1. Create two specialized copyFieldsWithDeletions/copyFieldsNoDeletions to
> get rid of the unnecessary if (hasDeletions) check for each doc.
>
> 2. In these, check if the reader has matching reader or not, and execute the
> appropriate code in each case.

I think this make sense.

> 3. Also, check up front if checkAbort != null.

Be careful: checkAbort needs to be called "fairly" frequently so
IndexWriter.close(false) doesn't take too long.

> 3.1 (3) duplicates the code - so perhaps instead I can create a dummy
> checkAbort, which does nothing? That way we'll always call
> checkAbort.work(). But this adds a method call, so I'm not sure.

I think a dummy impl is fine; it's only null during unit tests and
in IndexWriter.addIndexes(IndexReader[]).

> In addition, I think something can be done w/ SegmentTermDocs. Either create
> a specialized STD based on whether it has deletions or not, or create a
> dummy BitVector which returns false for every check. That way we can
> eliminate the checks in each next(), skipTo(). Dummy BitVector will leave
> the file as-is, but will add a method call, so I think I lean towards the
> specialized STD. This can be as simple as making STD abstract, with a static
> create() method which creates the right instance.
>
> I believe there are other places where we can make such optimizations. If
> the overall direction seems right to you, I can open an issue and start to
> work on a patch. Currently I've patched SegmentMerger, and besides the class
> getting bigger, nothing bad happened (meaning all tests pass), though it
> might be interesting to check how it affects performance.

I think this makes sense.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org