Posted to java-user@lucene.apache.org by Jason Polites <ja...@tpg.com.au> on 2005/01/24 09:14:02 UTC

Duplicate hits using ParallelMultiSearcher

Hello all,

I am looking for a strategy to exclude duplicate entries when searching 
multiple indexes which may contain the same document.  I have an email 
system which archives and indexes emails on a per-recipient basis.  So, each 
email recipient has their own index.  In the case where the same email is 
delivered to more than one recipient, each recipient's index stores a record 
of effectively the same document.  Now, there is a requirement to perform a 
search across multiple indexes, for which I am using the 
ParallelMultiSearcher.  The problem is that this results in duplicate 
entries in the Hits returned.  I can easily transfer the results into some 
form of java.util.Set to guarantee uniqueness; however, the length() of the 
Hits object returned would still count the duplicates.  Ideally I need a way 
of filtering the Hits based on a "no duplicate" rule.  I am aware of the 
Filter object; however, the unique identifier of my document is a field 
within the Lucene document itself (messageid), and I am reluctant to access 
this field using the public API for every Hit, as I fear it will have 
drastic performance implications.

The ideal solution for me would be to specify a field during the search 
which is guaranteed to be unique across the Hits returned.  Does anyone know 
of an elegant way to do this?  Alternatively, is there a way I can de-dupe 
the list myself without loading every document?

Apologies for the length of this question.

P.S.  The separation of indexes per-recipient is a mandatory requirement. 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: Duplicate hits using ParallelMultiSearcher

Posted by Jason Polites <ja...@tpg.com.au>.

Agreed on the "set of unique messages"; however, the problem I have is with 
the "count" of the Hits.  The Hits object may contain 100 results (for 
example), of which only 90 are unique.  Because I am paging through results 
10 at a time, I need to know the total count without loading each document. 
If I get a count of 100 but a Collection of only 90, my paging breaks.
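
To make the arithmetic concrete, here is a rough sketch in plain Java (no 
Lucene calls).  It assumes the messageid values have already been collected 
into a list, which is exactly the per-document cost I am trying to avoid; 
the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

// Hypothetical helper for illustration only; not part of Lucene.
class PagingCount {

    // Number of pages needed to show totalResults items, pageSize per page.
    static int pageCount(int totalResults, int pageSize) {
        return (totalResults + pageSize - 1) / pageSize;
    }

    // Number of distinct message IDs in the list.
    static int uniqueCount(List<String> messageIds) {
        return new HashSet<String>(messageIds).size();
    }

    public static void main(String[] args) {
        // 100 raw hits, of which only 90 carry distinct message IDs.
        List<String> ids = new ArrayList<String>();
        for (int i = 0; i < 100; i++) {
            ids.add("<msg" + (i % 90) + "@example>");
        }
        System.out.println(pageCount(ids.size(), 10));        // 10 pages advertised
        System.out.println(pageCount(uniqueCount(ids), 10));  // 9 pages actually filled
    }
}
```

With the raw count the pager offers a tenth page that has nothing left to 
show once the duplicates are dropped.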

After careful consideration I have decided that the better approach is to 
create a separate "global" index in which all messages are stored.  This 
will not only relieve my duplication issue but should also scale better 
if/when there are several hundred or several thousand distinct indexes.

Thanks,

- JP

----- Original Message ----- 
From: "PA" <pe...@gmail.com>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Monday, January 24, 2005 10:43 PM
Subject: Re: Duplicate hits using ParallelMultiSearcher

>
> On Jan 24, 2005, at 09:14, Jason Polites wrote:
>
>> I am aware of the Filter object however the unique identifier of my 
>> document is a field within the lucene document itself (messageid); and I 
>> am reluctant to access this field using the public API for every Hit as I 
>> fear it will have drastic performance implications.
>
> Well... I don't see any way around that as you basically want to uniquely 
> identify your messages based on their Message-ID.
>
> That said, you don't need to do it during the search itself. You could 
> simply perform your search as you do now and then create a set of unique 
> messages while preserving Lucene Hits sort ordering for "relevance" 
> purposes.
>
> HTH.
>
> Cheers
>
> --
> PA
> http://alt.textdrive.com/

Re: Duplicate hits using ParallelMultiSearcher

Posted by PA <pe...@gmail.com>.

On Jan 24, 2005, at 09:14, Jason Polites wrote:

> I am aware of the Filter object however the unique identifier of my 
> document is a field within the lucene document itself (messageid); and 
> I am reluctant to access this field using the public API for every Hit 
> as I fear it will have drastic performance implications.

Well... I don't see any way around that as you basically want to 
uniquely identify your messages based on their Message-ID.

That said, you don't need to do it during the search itself. You could 
simply perform your search as you do now and then create a set of 
unique messages while preserving Lucene Hits sort ordering for 
"relevance" purposes.

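Something along these lines, as a minimal sketch in plain Java.  It assumes 
you have already pulled the messageid of each hit into a list in rank order 
(so it still costs one field read per hit); HitDeduper and firstOccurrences 
are hypothetical names, not Lucene API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Lucene.
class HitDeduper {

    // Returns the positions of the first occurrence of each message ID,
    // in the original (relevance) order of the input list.
    static List<Integer> firstOccurrences(List<String> messageIds) {
        Set<String> seen = new HashSet<String>();
        List<Integer> kept = new ArrayList<Integer>();
        for (int i = 0; i < messageIds.size(); i++) {
            // Set.add returns false when the ID was already seen.
            if (seen.add(messageIds.get(i))) {
                kept.add(i);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("<a@x>", "<b@x>", "<a@x>", "<c@x>", "<b@x>");
        System.out.println(firstOccurrences(ids)); // prints [0, 1, 3]
    }
}
```

The returned positions can then be used to fetch only the hits you keep, 
and the size of that list is the de-duplicated count.
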
HTH.

Cheers

--
PA
http://alt.textdrive.com/
