Posted to java-user@lucene.apache.org by Ian Lea <ia...@gmail.com> on 2010/03/04 16:07:23 UTC

Re: how to use DuplicateFilter to get unique documents based on a fieldName

If the field you want to use for deduping is ISBN, create a
DuplicateFilter using whatever your ISBN field name is as the field
name and pass that to one of the search methods that takes a filter.
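As a rough illustration of the effect, here is a plain-Java model of deduplicating hits on one field. It is not Lucene code: the class DedupByField, the Hit type and the method uniqueByIsbn are hypothetical names made up for this sketch, and it keeps the first hit per ISBN purely as an assumption. In Lucene itself you would construct the DuplicateFilter with your ISBN field name and hand it to a search method that accepts a filter, as described above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of deduplicating search hits by one field (not Lucene code). */
final class DedupByField {

    /** Hypothetical stand-in for a search hit: one entry per (book, category) pair. */
    static final class Hit {
        final String isbn;
        final String title;
        final String category;

        Hit(String isbn, String title, String category) {
            this.isbn = isbn;
            this.title = title;
            this.category = category;
        }
    }

    /** Keeps the first hit seen for each ISBN, preserving the original hit order. */
    static List<Hit> uniqueByIsbn(List<Hit> hits) {
        Map<String, Hit> firstPerIsbn = new LinkedHashMap<>();
        for (Hit hit : hits) {
            // putIfAbsent ignores later hits that share an ISBN already seen.
            firstPerIsbn.putIfAbsent(hit.isbn, hit);
        }
        return new ArrayList<>(firstPerIsbn.values());
    }
}
```

The size of the returned list is the unique book count, which is the number the raw total hit count overstates when a book appears in several documents.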

If your index is large, I'd be worried about performance and would look
at deduping at indexing time, i.e. having one Lucene document per ISBN.
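A minimal sketch of what "one document per ISBN" could look like before indexing, again in plain Java rather than Lucene code. The class OneDocPerIsbn, the Row type and the method categoriesByIsbn are hypothetical, and it assumes the source data arrives as one row per (ISBN, category) pair. Each resulting entry could then be turned into a single document, for example with the categories stored as a multi-valued field.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Plain-Java model of collapsing database rows to one entry per ISBN (not Lucene code). */
final class OneDocPerIsbn {

    /** Hypothetical database row: the same ISBN may appear once per category. */
    static final class Row {
        final String isbn;
        final String category;

        Row(String isbn, String category) {
            this.isbn = isbn;
            this.category = category;
        }
    }

    /** Groups rows by ISBN and collects each book's distinct categories in encounter order. */
    static Map<String, Set<String>> categoriesByIsbn(List<Row> rows) {
        Map<String, Set<String>> merged = new LinkedHashMap<>();
        for (Row row : rows) {
            // Create the category set on first sight of an ISBN, then add to it.
            merged.computeIfAbsent(row.isbn, key -> new LinkedHashSet<>()).add(row.category);
        }
        return merged;
    }
}
```

With the index built this way, a plain hit count is already a count of distinct books, so no filter is needed at search time.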


--
Ian.


On Thu, Mar 4, 2010 at 12:43 PM, anisha@ekkitab <an...@ekkitab.com> wrote:
>
> Hi there, could someone help me with the usage of DuplicateFilter? Here is
> my problem.
>
> I have created a search index on book ID, title, and author from a database
> of books which fall under various categories. Some books fall under more
> than one category. Now, when I issue a search, I get back 'X' books matching
> the search criteria, some of which are repeated because those books are in
> different documents; this is the expected behaviour.
>
> I use TopFieldDocCollector.getTotalHits() to get the total count, but
> this includes the repeats mentioned above, so it is not the actual
> count. When I issue a search on title or author, I want to get a unique
> count / list of books. How do I use DuplicateFilter to achieve this?
>
> Please help
>
> Regards
> Anish
> --
> View this message in context: http://old.nabble.com/how-to-use-DuplicateFilter-to-get-unique-documents-based-on-a-fieldName-tp27780251p27780251.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>



Re: how to use DuplicateFilter to get unique documents based on a fieldName

Posted by "anisha@ekkitab" <an...@ekkitab.com>.
Hi Ian,

Thanks for your reply. We had actually tried what you suggested first,
and it wasn't working, so I was hoping for some sample code. But then we
found out that the field on which we wanted the duplicate filter to be
applied was not actually indexed when it was added to the document, i.e.
Field.Index was set to NO. We changed this, repopulated the documents, and
the filtering works now.

We have about 3 million records in the database and slightly more documents
than that, and we wouldn't want to do the deduping at indexing time, because
one book (one ISBN) can be available under 2 or more categories (like
fiction, comics & novels, science, etc.).

-Anisha




-- 
View this message in context: http://old.nabble.com/how-to-use-DuplicateFilter-to-get-unique-documents-based-on-a-fieldName-tp27780251p27790381.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

