You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by John Wang <jo...@gmail.com> on 2005/08/17 04:46:19 UTC

search performance enhancement

Hi:

   I posted a bug (36147) a few days ago and didn't hear anything, so
I thought I'd try my luck on this list.

   The idea is to avoid score calculations on documents to be filtered
out anyway. (e.g. via Filter object passed to the searcher class)

   This seems to be an easy change.

   Also it would be nice to expose a method to return a score given a
docid, e.g.

   float getScore(int docid)

   on the Scorer class.

   I am gonna make the change locally and do some performance analysis
on it and will post some numbers later.

Thanks

-John

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: search performance enhancement

Posted by Paul Elschot <pa...@xs4all.nl>.

On Wednesday 21 September 2005 03:29, John Wang wrote:
> Hi Paul and other gurus:
> 
> In a related topic, seems lucene is scoring documents that would hit in a 
> "prohibited" boolean clause, e.g. NOT field:value. It doesn't seem to make 
> sense to score a document that is to be excluded from the result. Is this a 
> difficult thing to fix?

In the svn trunk (the development version) excluded clauses of
BooleanQuery are no not scored.
The default scoring method is so cheap on CPU time that this is hardly
noticeable.

> Also in Paul's ealier comment: "... unless you have large indexes, this will 
> probably not make much difference", what is "large" in this case. 

A BitSet takes one bit per document in the index, and a SortedVIntList
takes about 1 byte per document passing the filter.
So when you need many filters, each passing less than 1/8 of the indexed
docs, a SortedVIntList takes less memory.
 
> In our case, say millions of documents match some query, but after either a 
> Filter is applied or after a NOT query (e.g. query with a NOT/prohibited 
> clause) is applied, the resulting hit list has only 10 documents. Seems the 
> millions of calls to score() is wasted, some of the score() call can be 
> computational intensive.

The FilteredQuery posted here:
http://issues.apache.org/jira/browse/LUCENE-330
will score only the documents passing the filter, for both a BitSet and
a SortedVIntList filter. Btw. this is the new place for the related filter
implementations:
http://issues.apache.org/jira/browse/LUCENE-328

> 
> Am I on the right track?

That's easy: you're using Lucene.

Regards,
Paul Elschot


> Thanks
> 
> -John
> 
> On 8/19/05, Paul Elschot <pa...@xs4all.nl> wrote:
> > 
> > On Friday 19 August 2005 18:09, John Wang wrote:
> > > Hi Paul:
> > >
> > > Thanks for the pointer.
> > >
> > > How would I extend from the patch you submitted to filter out
> > > more documents not using a Filter. e.g.
> > >
> > > have a class to skip documents based on a docID: boolean
> > > isValid(int docID)
> > >
> > > My problem is I want to discard documents at query time without
> > > having to construct a BitSet via filter. I have my own memory
> > > structure to help me skip documents based the query and a docid.
> > 
> > Basically, you need to implement the next() and skipTo(int targetDoc)
> > methods from Scorer on your memory structure. They are somewhat
> > redundant (skipTo(doc() + 1) has the same semantics as next(), except
> > initially), but that is for a mix of historical and performance reasons.
> > Have a look at how this is done in the posted FilteredQuery class
> > for SortedVIntList and BitSet.
> > With only isValid(docId) these next() and skipTo() methods would
> > have to count over the document numbers, which is less than ideal.
> > 
> > When you use the posted code, iirc it is only necessary to implement the
> > SkipFilter interface on your memory structure. One can use that interface
> > to build/cache such memory structures using an IndexReader, and
> > from there the DocNrSkipper interface will do the rest (of the top of
> > my head).
> > One slight problem with the current Lucene implementation is that
> > java.lang.BitSet is not interface.
> > 
> > Regards,
> > Paul Elschot.
> > 
> > > Thanks
> > >
> > > -John
> > >
> > > On 8/16/05, Paul Elschot <pa...@xs4all.nl> wrote:
> > > > Hi John,
> > > >
> > > > On Wednesday 17 August 2005 04:46, John Wang wrote:
> > > > > Hi:
> > > > >
> > > > > I posted a bug (36147) a few days ago and didn't hear anything, so
> > > > > I thought I'd try my luck on this list.
> > > > >
> > > > > The idea is to avoid score calculations on documents to be filtered
> > > > > out anyway. (e.g. via Filter object passed to the searcher class)
> > > > >
> > > > > This seems to be an easy change.
> > > >
> > > > Have a look here:
> > > > http://issues.apache.org/bugzilla/show_bug.cgi?id=32965
> > > >
> > > > > Also it would be nice to expose a method to return a score given a
> > > > > docid, e.g.
> > > > >
> > > > > float getScore(int docid)
> > > > >
> > > > > on the Scorer class.
> > > >
> > > > skipTo(int docid) and score() will do that.
> > > >
> > > > > I am gonna make the change locally and do some performance analysis
> > > > > on it and will post some numbers later.
> > > >
> > > > The default score computations are mostly table lookups, and pretty 
> > fast.
> > > > So, unless you have large indexes, this will probably not make
> > > > much difference, but any performance improvement is wellcome.
> > > > In larger indexes, it helps to use skipTo() while searching.
> > > >
> > > > Regards,
> > > > Paul Elschot
> > > >
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > >
> > > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
> > >
> > >
> > 
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> > 
> >
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: search performance enhancement

Posted by John Wang <jo...@gmail.com>.

Hi Paul and other gurus:

In a related topic, seems lucene is scoring documents that would hit in a 
"prohibited" boolean clause, e.g. NOT field:value. It doesn't seem to make 
sense to score a document that is to be excluded from the result. Is this a 
difficult thing to fix?

Also in Paul's ealier comment: "... unless you have large indexes, this will 
probably not make much difference", what is "large" in this case. 

In our case, say millions of documents match some query, but after either a 
Filter is applied or after a NOT query (e.g. query with a NOT/prohibited 
clause) is applied, the resulting hit list has only 10 documents. Seems the 
millions of calls to score() is wasted, some of the score() call can be 
computational intensive.

Am I on the right track?

Thanks

-John

On 8/19/05, Paul Elschot <pa...@xs4all.nl> wrote:
> 
> On Friday 19 August 2005 18:09, John Wang wrote:
> > Hi Paul:
> >
> > Thanks for the pointer.
> >
> > How would I extend from the patch you submitted to filter out
> > more documents not using a Filter. e.g.
> >
> > have a class to skip documents based on a docID: boolean
> > isValid(int docID)
> >
> > My problem is I want to discard documents at query time without
> > having to construct a BitSet via filter. I have my own memory
> > structure to help me skip documents based the query and a docid.
> 
> Basically, you need to implement the next() and skipTo(int targetDoc)
> methods from Scorer on your memory structure. They are somewhat
> redundant (skipTo(doc() + 1) has the same semantics as next(), except
> initially), but that is for a mix of historical and performance reasons.
> Have a look at how this is done in the posted FilteredQuery class
> for SortedVIntList and BitSet.
> With only isValid(docId) these next() and skipTo() methods would
> have to count over the document numbers, which is less than ideal.
> 
> When you use the posted code, iirc it is only necessary to implement the
> SkipFilter interface on your memory structure. One can use that interface
> to build/cache such memory structures using an IndexReader, and
> from there the DocNrSkipper interface will do the rest (of the top of
> my head).
> One slight problem with the current Lucene implementation is that
> java.lang.BitSet is not interface.
> 
> Regards,
> Paul Elschot.
> 
> > Thanks
> >
> > -John
> >
> > On 8/16/05, Paul Elschot <pa...@xs4all.nl> wrote:
> > > Hi John,
> > >
> > > On Wednesday 17 August 2005 04:46, John Wang wrote:
> > > > Hi:
> > > >
> > > > I posted a bug (36147) a few days ago and didn't hear anything, so
> > > > I thought I'd try my luck on this list.
> > > >
> > > > The idea is to avoid score calculations on documents to be filtered
> > > > out anyway. (e.g. via Filter object passed to the searcher class)
> > > >
> > > > This seems to be an easy change.
> > >
> > > Have a look here:
> > > http://issues.apache.org/bugzilla/show_bug.cgi?id=32965
> > >
> > > > Also it would be nice to expose a method to return a score given a
> > > > docid, e.g.
> > > >
> > > > float getScore(int docid)
> > > >
> > > > on the Scorer class.
> > >
> > > skipTo(int docid) and score() will do that.
> > >
> > > > I am gonna make the change locally and do some performance analysis
> > > > on it and will post some numbers later.
> > >
> > > The default score computations are mostly table lookups, and pretty 
> fast.
> > > So, unless you have large indexes, this will probably not make
> > > much difference, but any performance improvement is wellcome.
> > > In larger indexes, it helps to use skipTo() while searching.
> > >
> > > Regards,
> > > Paul Elschot
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
> >
> >
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
>

Re: search performance enhancement

Posted by Paul Elschot <pa...@xs4all.nl>.

On Friday 19 August 2005 18:09, John Wang wrote:
> Hi Paul:
> 
>       Thanks for the pointer.
> 
>        How would I extend from the patch you submitted to filter out
> more documents not using a Filter. e.g.
> 
>       have a class to skip documents based on a docID: boolean
> isValid(int docID)
> 
>       My problem is I want to discard documents at query time without
> having to construct a BitSet via filter. I have my own memory
> structure to help me skip documents based the query and a docid.

Basically, you need to implement the next() and skipTo(int targetDoc)
methods from Scorer on your memory structure. They are somewhat
redundant (skipTo(doc() + 1) has the same semantics as next(), except
initially), but that is for a mix of historical and performance reasons.
Have a look at how this is done in the posted FilteredQuery class
for SortedVIntList and BitSet.
With only isValid(docId) these next() and skipTo() methods would
have to count over the document numbers, which is less than ideal.

When you use the posted code, iirc it is only necessary to implement the
SkipFilter interface on your memory structure. One can use that interface
to build/cache  such memory structures using an IndexReader, and
from there the DocNrSkipper interface will do the rest (of the top of
my head).
One slight problem with the current Lucene implementation is that
java.lang.BitSet is not interface.

Regards,
Paul Elschot.

> Thanks
> 
> -John
> 
> On 8/16/05, Paul Elschot <pa...@xs4all.nl> wrote:
> > Hi John,
> > 
> > On Wednesday 17 August 2005 04:46, John Wang wrote:
> > > Hi:
> > >
> > >    I posted a bug (36147) a few days ago and didn't hear anything, so
> > > I thought I'd try my luck on this list.
> > >
> > >    The idea is to avoid score calculations on documents to be filtered
> > > out anyway. (e.g. via Filter object passed to the searcher class)
> > >
> > >    This seems to be an easy change.
> > 
> > Have a look here:
> > http://issues.apache.org/bugzilla/show_bug.cgi?id=32965
> > 
> > >    Also it would be nice to expose a method to return a score given a
> > > docid, e.g.
> > >
> > >    float getScore(int docid)
> > >
> > >    on the Scorer class.
> > 
> > skipTo(int docid) and score() will do that.
> > 
> > >    I am gonna make the change locally and do some performance analysis
> > > on it and will post some numbers later.
> > 
> > The default score computations are mostly table lookups, and pretty fast.
> > So, unless you have large indexes, this will probably not make
> > much difference, but any performance improvement is wellcome.
> > In larger indexes, it helps to use skipTo() while searching.
> > 
> > Regards,
> > Paul Elschot
> > 
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> > 
> >
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: search performance enhancement

Posted by John Wang <jo...@gmail.com>.

Hi Paul:

      Thanks for the pointer.

       How would I extend from the patch you submitted to filter out
more documents not using a Filter. e.g.

      have a class to skip documents based on a docID: boolean
isValid(int docID)

      My problem is I want to discard documents at query time without
having to construct a BitSet via filter. I have my own memory
structure to help me skip documents based the query and a docid.

Thanks

-John

On 8/16/05, Paul Elschot <pa...@xs4all.nl> wrote:
> Hi John,
> 
> On Wednesday 17 August 2005 04:46, John Wang wrote:
> > Hi:
> >
> >    I posted a bug (36147) a few days ago and didn't hear anything, so
> > I thought I'd try my luck on this list.
> >
> >    The idea is to avoid score calculations on documents to be filtered
> > out anyway. (e.g. via Filter object passed to the searcher class)
> >
> >    This seems to be an easy change.
> 
> Have a look here:
> http://issues.apache.org/bugzilla/show_bug.cgi?id=32965
> 
> >    Also it would be nice to expose a method to return a score given a
> > docid, e.g.
> >
> >    float getScore(int docid)
> >
> >    on the Scorer class.
> 
> skipTo(int docid) and score() will do that.
> 
> >    I am gonna make the change locally and do some performance analysis
> > on it and will post some numbers later.
> 
> The default score computations are mostly table lookups, and pretty fast.
> So, unless you have large indexes, this will probably not make
> much difference, but any performance improvement is wellcome.
> In larger indexes, it helps to use skipTo() while searching.
> 
> Regards,
> Paul Elschot
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: search performance enhancement

Posted by Paul Elschot <pa...@xs4all.nl>.

Hi John,

On Wednesday 17 August 2005 04:46, John Wang wrote:
> Hi:
> 
>    I posted a bug (36147) a few days ago and didn't hear anything, so
> I thought I'd try my luck on this list.
> 
>    The idea is to avoid score calculations on documents to be filtered
> out anyway. (e.g. via Filter object passed to the searcher class)
> 
>    This seems to be an easy change.

Have a look here:
http://issues.apache.org/bugzilla/show_bug.cgi?id=32965

>    Also it would be nice to expose a method to return a score given a
> docid, e.g.
> 
>    float getScore(int docid)
> 
>    on the Scorer class.

skipTo(int docid) and score() will do that.
 
>    I am gonna make the change locally and do some performance analysis
> on it and will post some numbers later.

The default score computations are mostly table lookups, and pretty fast.
So, unless you have large indexes, this will probably not make
much difference, but any performance improvement is wellcome.
In larger indexes, it helps to use skipTo() while searching.

Regards,
Paul Elschot


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org