You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Matt Quail <ma...@cenqua.com> on 2005/06/07 11:42:10 UTC

Documents returned by Scorer

I've been playing around with a custom Query, and I've just realized  
that my Scorer is likely to return the same document more then once.  
Before I delve a bit further, can anyone tell me if this is this a  
Bad Thing?

=Matt

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Doing a Join across indexes [was Documents returned by Scorer]

Posted by Paul Elschot <pa...@xs4all.nl>.

On Wednesday 08 June 2005 01:30, Matt Quail wrote:
> 
> On 08/06/2005, at 1:33 AM, Paul Elschot wrote:
> 
> > On Tuesday 07 June 2005 11:42, Matt Quail wrote:
> >
> >> I've been playing around with a custom Query, and I've just realized
> >> that my Scorer is likely to return the same document more then once.
> >> Before I delve a bit further, can anyone tell me if this is this a
> >> Bad Thing?
> >>
> >
> > Normally, yes. A query is expected to provide a single score for
> > each matching document. The Hits class depends on this.
> > One can suppress later 'hits' by using a BitVector.
> 
> 
> Yes, further delving revealed that returning a document multiple  
> times, and returning doc-ids out-of-order, is a bad idea.
> 
> 
> What I'm actually trying to implement is an efficient join between  
> two indices. I'll describe my current strategy, all comments welcome.
> 
> Imagine you have two indices, and they have some field in common.  
> This might often be a "primary key" field, giving you a 1-1  
> relationship between documents in the two indices, but it general it  
> could be any 1-N, N-1 or M-N relationship.
> 
> (Why are there two indices? There could be many reasons, one that was  
> mentioned on this list recently was an ACL index. You could update/ 
> delete/add from the ACL index without having to re-index the  
> "primary" index).
> 
> Lets call the two indices Left and Right. We want to find documents  
> in Left, where leftdoc.leftfield == rightdoc.rightfield, for some set  
> of rightdoc's satisfying some query.
> 
> Here is the pseudo-code for first attempt, which works but is not  
> necessarily that efficient:
> 
> Inputs:
> - leftJoinField name
> - rightJoinField name
> - rightQuery
> 
> 1) Compute a BitSet of all the docids that match the given query in  
> the right, call it rightDocs.
> 2) Create a TermDocs for the left index
> 3) Create a TermEnum and a TermDocs, for iterating through all the  
> <term,docid> pairs in the right index for the rightJoinField
> 4) for each rterm on the right where there is a <rterm, docid> pair  
> that is in rightDocs:
> 4.1) initialize the left termdoc to new Term(leftField, rterm.text())
> 4.2) iterate through those left docs ids for that term. These are our  
> matching left documents
> 
> 
> Consider the case of a 1-1 relationship, where rightDocs just  
> contains one document. Step 4 requires us to iterate through all the  
> terms on the rightJoinField (effectively, all the documents), just to  
> find the term corresponding to that one document This is not very  
> efficient. For a small number of rightDocs (for some value of  
> 'small'), it would be better to iterate through the Documents, and  
> extract the (stored) term that way.
> 
> However, for a large number of rightDocs, the "bitset" approach in  
> the above algorithm might be better.
> 
> In my second attempt, I might choose between the bitset and extract- 
> document approach, depending on some threshold.
> 
> One cute thing you might be able to do with the extract-document  
> approach is to actually use a Hits object (or HitCollector), which  
> gives you a score for each document. You could feed this score into  
> the join by returning the left documents via a custom Query/Scorer.  
> (okay, I agree this might be troublesome to implement efficiently.)
> 
> 
> Any comments? Has anyone ever attempted something similar?

Using Hits for this is probably not a good idea.
Have a look at the very recent thread on the fastest way to fetch
N documents with unique keys within large numbers of indexes.

Regards,
Paul Elschot


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Doing a Join across indexes [was Documents returned by Scorer]

Posted by Matt Quail <ma...@cenqua.com>.

On 08/06/2005, at 1:33 AM, Paul Elschot wrote:

> On Tuesday 07 June 2005 11:42, Matt Quail wrote:
>
>> I've been playing around with a custom Query, and I've just realized
>> that my Scorer is likely to return the same document more then once.
>> Before I delve a bit further, can anyone tell me if this is this a
>> Bad Thing?
>>
>
> Normally, yes. A query is expected to provide a single score for
> each matching document. The Hits class depends on this.
> One can suppress later 'hits' by using a BitVector.

Yes, further delving revealed that returning a document multiple  
times, and returning doc-ids out-of-order, is a bad idea.

What I'm actually trying to implement is an efficient join between  
two indices. I'll describe my current strategy, all comments welcome.

Imagine you have two indices, and they have some field in common.  
This might often be a "primary key" field, giving you a 1-1  
relationship between documents in the two indices, but it general it  
could be any 1-N, N-1 or M-N relationship.

(Why are there two indices? There could be many reasons, one that was  
mentioned on this list recently was an ACL index. You could update/ 
delete/add from the ACL index without having to re-index the  
"primary" index).

Lets call the two indices Left and Right. We want to find documents  
in Left, where leftdoc.leftfield == rightdoc.rightfield, for some set  
of rightdoc's satisfying some query.

Here is the pseudo-code for first attempt, which works but is not  
necessarily that efficient:

Inputs:
- leftJoinField name
- rightJoinField name
- rightQuery

1) Compute a BitSet of all the docids that match the given query in  
the right, call it rightDocs.
2) Create a TermDocs for the left index
3) Create a TermEnum and a TermDocs, for iterating through all the  
<term,docid> pairs in the right index for the rightJoinField
4) for each rterm on the right where there is a <rterm, docid> pair  
that is in rightDocs:
4.1) initialize the left termdoc to new Term(leftField, rterm.text())
4.2) iterate through those left docs ids for that term. These are our  
matching left documents

Consider the case of a 1-1 relationship, where rightDocs just  
contains one document. Step 4 requires us to iterate through all the  
terms on the rightJoinField (effectively, all the documents), just to  
find the term corresponding to that one document This is not very  
efficient. For a small number of rightDocs (for some value of  
'small'), it would be better to iterate through the Documents, and  
extract the (stored) term that way.

However, for a large number of rightDocs, the "bitset" approach in  
the above algorithm might be better.

In my second attempt, I might choose between the bitset and extract- 
document approach, depending on some threshold.

One cute thing you might be able to do with the extract-document  
approach is to actually use a Hits object (or HitCollector), which  
gives you a score for each document. You could feed this score into  
the join by returning the left documents via a custom Query/Scorer.  
(okay, I agree this might be troublesome to implement efficiently.)

Any comments? Has anyone ever attempted something similar?

=Matt

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Documents returned by Scorer

Posted by Paul Elschot <pa...@xs4all.nl>.

On Tuesday 07 June 2005 11:42, Matt Quail wrote:
> I've been playing around with a custom Query, and I've just realized  
> that my Scorer is likely to return the same document more then once.  
> Before I delve a bit further, can anyone tell me if this is this a  
> Bad Thing?

Normally, yes. A query is expected to provide a single score for
each matching document. The Hits class depends on this.
One can suppress later 'hits' by using a BitVector.

When your scorer implements skipTo it would normally have
to return the documents in document number order.

In the development version all scorers implement skipTo.

Regards,
Paul Elschot

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org