You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by Roy <ro...@gmail.com> on 2004/10/13 08:41:56 UTC

What's the purpose of hashing docid in BooleanScorer

While studying the search code, I got a question: in BooleanScorer,
when collecting docids and scores from subscorers, the docids are
hashed through buckettable first. What's the purpose of doing the
hashing?

Thanks for your help.

Roy

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

Re: What's the purpose of hashing docid in BooleanScorer; DisjunctionScorer

Posted by Doug Cutting <cu...@apache.org>.

Paul Elschot wrote:
> I have a DisjunctionScorer based on a PriorityQueue lying around,
> but I can't benchmark it myself at the moment. In case there is
> interest, I'll gladly adapt it to org.apache.lucene.search and 
> add it in bugzilla.

This should look a lot like SpanOrQuery.getSpans().

On a related note, I implemented ConjunctionScorer using Java's 
collection classes rather than a Lucene priority queue, just to see if I 
could.  It turns out to have to allocate memory in sortScorers() which 
makes it slower than it could be, but I have not yet gotten around to 
fixing it.  I'd like to re-write this to look like PhraseScorer and 
NearSpans, which operate without allocation.

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

Re: What's the purpose of hashing docid in BooleanScorer; DisjunctionScorer

Posted by Paul Elschot <pa...@xs4all.nl>.

On Monday 18 October 2004 23:04, Doug Cutting wrote:
> Christoph Goller wrote:
> > With the current scorer API one could get rid of buckettable and
> > advance all subscores only by one document each time. I am not sure
> > whether the bucketable implementation is really much more efficient.
> > I only see the advantage of inlining some of the scorer.next and
> > score.score code.
>
> Indeed, sub-scorers could be, e.g., kept in a priority queue.  This is
> done in ConjunctionScorer, PhraseScorer, etc.  However this adds a
> priority queue update to the inner search loop.  With long queries and
> with common terms this overhead can be significant.  With short queries
> and/or with rare terms the current ScoreTable-based implementation may
> indeed be slower, but I believe with longer queries containing common
> terms it is substantially faster.
>
> This algorithm is described in:
>
> http://lucene.sourceforge.net/papers/riao97.ps
>
> If we had a priority-queue-based implementation then we could benchmark
> these.  If we found that one were faster than the other for particular
> classes of queries then we could have a query optimizer which
> automatically selects the most efficient implementation...

I have a DisjunctionScorer based on a PriorityQueue lying around,
but I can't benchmark it myself at the moment. In case there is
interest, I'll gladly adapt it to org.apache.lucene.search and 
add it in bugzilla.

Regards,
Paul Elschot


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

Re: What's the purpose of hashing docid in BooleanScorer

Posted by Doug Cutting <cu...@apache.org>.

Christoph Goller wrote:
> With the current scorer API one could get rid of buckettable and
> advance all subscores only by one document each time. I am not sure
> whether the bucketable implementation is really much more efficient.
> I only see the advantage of inlining some of the scorer.next and
> score.score code.

Indeed, sub-scorers could be, e.g., kept in a priority queue.  This is 
done in ConjunctionScorer, PhraseScorer, etc.  However this adds a 
priority queue update to the inner search loop.  With long queries and 
with common terms this overhead can be significant.  With short queries 
and/or with rare terms the current ScoreTable-based implementation may 
indeed be slower, but I believe with longer queries containing common 
terms it is substantially faster.

This algorithm is described in:

http://lucene.sourceforge.net/papers/riao97.ps

If we had a priority-queue-based implementation then we could benchmark 
these.  If we found that one were faster than the other for particular 
classes of queries then we could have a query optimizer which 
automatically selects the most efficient implementation...

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

Re: What's the purpose of hashing docid in BooleanScorer

Posted by Christoph Goller <go...@detego-software.de>.

Roy schrieb:
> While studying the search code, I got a question: in BooleanScorer,
> when collecting docids and scores from subscorers, the docids are
> hashed through buckettable first. What's the purpose of doing the
> hashing?

The purpose of buckettable is to gather all scores from the subscorers
for a certain interval of documents. Only after evaluating all
subscorers on the current interval of documents we have enough info
to decide whether a document is a hit or not.

With the current scorer API one could get rid of buckettable and
advance all subscores only by one document each time. I am not sure
whether the bucketable implementation is really much more efficient.
I only see the advantage of inlining some of the scorer.next and
score.score code.

Christoph

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org