
Posted to java-user@lucene.apache.org by "anisha@ekkitab" <an...@ekkitab.com> on 2010/03/04 13:43:26 UTC

how to use DuplicateFilter to get unique documents based on a fieldName

Hi there, could someone help me with the usage of DuplicateFilter? Here is
my problem.

I have created a search index on book ID, title, and author from a database
of books which fall under various categories. Some books fall under more
than one category. Now, when I issue a search, I get back 'X' books matching
the search criteria, some of which are repeated, because those books are in
different documents. That is the expected behaviour.

I use TopFieldDocCollector.getTotalHits() to get the total count, but
this includes the repeats mentioned above, so it is not the actual count
of books. When I issue a search on title or author, I want to get a unique
count / list of books. How do I use DuplicateFilter to achieve this?

Please help

Regards
Anish
-- 
View this message in context: http://old.nabble.com/how-to-use-DuplicateFilter-to-get-unique-documents-based-on-a-fieldName-tp27780251p27780251.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: how to use DuplicateFilter to get unique documents based on a fieldName

Posted by "anisha@ekkitab" <an...@ekkitab.com>.

OK, sorry for not explaining my problem clearly earlier. We have around 5
fields in each document: ID, ISBN, author, title, and the category which
the book falls under. (You are right about point 3, we are indeed storing
multiple genres against the book, which means 1 book, 1 doc.)

doc.add(new Field("entityId", book.get("id"), Field.Store.YES, Field.Index.NO));
doc.add(new Field("author", book.get("author"), Field.Store.NO, Field.Index.TOKENIZED));

and so on. The document is then added using the IndexWriter.

When a search is issued, we look for the search term in
title/author/isbn/category, based on some inputs, and a set of books is
returned. (You are right about point 2 as well: since we search only on
title/author/genre, we were only indexing those.) We wanted to lay these
books out so that the user can navigate through the categories that the
matching books belong to, as a way of narrowing the search.

While the count of books for the given search term under a particular
category was correct, the overall count of books was incorrect, because
some books are repeated in various categories. For this reason we wanted a
duplicate filter on the ID, which would give us only the unique books.
There was something wrong in the way it was implemented: the ID in the
document was not indexed, as you can see in the code above. Once this was
fixed it worked as expected, except for some performance issues caused by
the huge index size (3 million books). Anyway, it looks like we have
figured out the solution (moved the filter out of the search and applied it
on the result, or something like that). Thanks so much for your time.

-Anisha
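
The message above mentions moving the de-duplication out of the search and
applying it to the results, but does not show how. The following is only a
plain-Java sketch of that idea, not the poster's code and not a Lucene API:
the Hit class, its fields, and the sample values are hypothetical stand-ins
for whatever is read back from each matching document. It keeps the first
hit per entityId for the overall count, while the per-category counts still
use the raw hits.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class HitDeduper {

    /** Hypothetical search hit: stored book id plus the category of the matching document. */
    static final class Hit {
        final String entityId;
        final String category;

        Hit(String entityId, String category) {
            this.entityId = entityId;
            this.category = category;
        }
    }

    /** Keeps the first hit seen for each entityId, preserving result order. */
    static List<Hit> uniqueByEntityId(List<Hit> hits) {
        Map<String, Hit> firstById = new LinkedHashMap<>();
        for (Hit hit : hits) {
            firstById.putIfAbsent(hit.entityId, hit);
        }
        return new ArrayList<>(firstById.values());
    }

    /** Counts raw hits per category; a book listed in two categories counts in both. */
    static Map<String, Integer> countByCategory(List<Hit> hits) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Hit hit : hits) {
            counts.merge(hit.category, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample data: book 101 appears under two categories.
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("101", "fiction"));
        hits.add(new Hit("101", "comics & novels"));
        hits.add(new Hit("202", "science"));
        hits.add(new Hit("303", "fiction"));

        System.out.println("raw hits:     " + hits.size());
        System.out.println("unique books: " + uniqueByEntityId(hits).size());
        System.out.println("per category: " + countByCategory(hits));
    }
}
```

Note that an exact unique count obtained this way needs every matching hit
to be visited, which may itself be costly on an index of this size.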




-- 
View this message in context: http://old.nabble.com/how-to-use-DuplicateFilter-to-get-unique-documents-based-on-a-fieldName-tp27780251p27793771.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.



Re: how to use DuplicateFilter to get unique documents based on a fieldName

Posted by Anshum <an...@gmail.com>.

Hi Anish,
So am I getting something wrong here? You said "I have created a search
index on book Id , title ,and author from a database of books which fall
under various categories." so those are 3 fields, right?
1. How do you filter the doc types (as in the genres) at search time? Do you
even need to do that? If yes, how?
2. If you're doing that, I'm assuming you're already indexing the genre
somehow. Right?
3. How about a field for the genre having multi-valued entries (multiple
field objects going into the same doc with the same field label)? This would
help you store 1 book as 1 doc having multiple genres instead of duplicate
entries.

I'm still not sure if I've gotten the problem correctly, but hope this is of
help!
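
Point 3 above can be pictured with a small plain-Java sketch that regroups
one-row-per-category data into one record per book before indexing. The Row
and BookRecord classes and the sample ISBNs are hypothetical; in Lucene,
each genre in the resulting set would be added as its own field object with
the same field label on a single document, as described in the message.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class BookGrouper {

    /** Hypothetical database row: one book listed under one genre. */
    static final class Row {
        final String isbn;
        final String title;
        final String genre;

        Row(String isbn, String title, String genre) {
            this.isbn = isbn;
            this.title = title;
            this.genre = genre;
        }
    }

    /** One record per book, carrying every genre the book is listed under. */
    static final class BookRecord {
        final String isbn;
        final String title;
        final Set<String> genres = new LinkedHashSet<>();

        BookRecord(String isbn, String title) {
            this.isbn = isbn;
            this.title = title;
        }
    }

    /** Collapses rows that share an ISBN into a single multi-genre record. */
    static List<BookRecord> groupByIsbn(List<Row> rows) {
        Map<String, BookRecord> byIsbn = new LinkedHashMap<>();
        for (Row row : rows) {
            BookRecord record = byIsbn.get(row.isbn);
            if (record == null) {
                record = new BookRecord(row.isbn, row.title);
                byIsbn.put(row.isbn, record);
            }
            record.genres.add(row.genre);
        }
        return new ArrayList<>(byIsbn.values());
    }

    public static void main(String[] args) {
        // Made-up rows: the first book is listed under two genres.
        List<Row> rows = new ArrayList<>();
        rows.add(new Row("isbn-1", "Title A", "fiction"));
        rows.add(new Row("isbn-1", "Title A", "comics & novels"));
        rows.add(new Row("isbn-2", "Title B", "science"));

        for (BookRecord record : groupByIsbn(rows)) {
            System.out.println(record.isbn + " -> " + record.genres);
        }
    }
}
```

With one record per book, the total hit count is already a count of unique
books, and no duplicate filtering is needed at search time.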

--
Anshum Gupta
Naukri Labs!
http://ai-cafe.blogspot.com

The facts expressed here belong to everybody, the opinions to me. The
distinction is yours to draw............



Re: how to use DuplicateFilter to get unique documents based on a fieldName

Posted by "anisha@ekkitab" <an...@ekkitab.com>.

Hi Zhangchi


Thanks for your reply. 

We have about 3 million records (different ISBNs) in the database and
slightly more documents than that, and we wouldn't want to do the deduping
at indexing time, because one book (one ISBN) can be available under 2 or
more categories (like fiction, comics & novels, science, etc.).

We had actually applied the filter on the primary key, i.e. ID, and it
wasn't working, so I was hoping for some sample code. But then we found out
that the field on which we wanted the duplicate filter to be applied (Id)
was not actually indexed when it was added to the document, i.e. Field.Index
was set to NO. We changed this, repopulated the documents, and the filtering
works now.

Thanks for your time.
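
Why the indexing flag matters can be illustrated with a toy model. This is
not Lucene's implementation: it simply assumes that a duplicate filter
decides which documents to keep by walking the indexed terms of the chosen
field, so a field that is stored but not indexed gives it nothing to work
from. All class and method names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ToyDuplicateFilter {

    /**
     * Builds a term -> document-number map for one field.
     * When the field is not indexed, no terms are recorded at all.
     */
    static Map<String, List<Integer>> termsFor(String[] idPerDoc, boolean indexed) {
        Map<String, List<Integer>> postings = new LinkedHashMap<>();
        if (!indexed) {
            return postings; // stored-only values never become searchable terms
        }
        for (int doc = 0; doc < idPerDoc.length; doc++) {
            postings.computeIfAbsent(idPerDoc[doc], k -> new ArrayList<>()).add(doc);
        }
        return postings;
    }

    /** Allows one document (the first) per distinct term. */
    static BitSet allowedDocs(Map<String, List<Integer>> postings, int docCount) {
        BitSet allowed = new BitSet(docCount);
        for (List<Integer> docs : postings.values()) {
            allowed.set(docs.get(0));
        }
        return allowed;
    }

    public static void main(String[] args) {
        // Made-up ids: documents 0 and 1 describe the same book.
        String[] ids = {"101", "101", "202"};

        System.out.println("indexed:     " + allowedDocs(termsFor(ids, true), ids.length));
        System.out.println("not indexed: " + allowedDocs(termsFor(ids, false), ids.length));
    }
}
```

In this model the unindexed case yields an empty set of allowed documents.
How the real filter behaves on a field with no terms may differ, but either
way it cannot tell duplicates apart without indexed values.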





-- 
View this message in context: http://old.nabble.com/how-to-use-DuplicateFilter-to-get-unique-documents-based-on-a-fieldName-tp27780251p27790391.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.



Re: how to use DuplicateFilter to get unique documents based on a fieldName

Posted by zhangchi <zh...@sohu.com>.

I think you should check the index first. Use lukeall to see whether it
contains duplicate books.


Re: how to use DuplicateFilter to get unique documents based on a fieldName

Posted by "anisha@ekkitab" <an...@ekkitab.com>.

Hi Ian,

Thanks for your reply. We had actually tried what you suggested first, and
it wasn't working, so I was hoping for some sample code. But then we found
out that the field on which we wanted the duplicate filter to be applied
was not actually indexed when it was added to the document, i.e.
Field.Index was set to NO. We changed this, repopulated the documents, and
the filtering works now.

We have about 3 million records in the database and slightly more documents
than that, and we wouldn't want to do the deduping at indexing time, because
one book (one ISBN) can be available under 2 or more categories (like
fiction, comics & novels, science, etc.).

-Anisha




-- 
View this message in context: http://old.nabble.com/how-to-use-DuplicateFilter-to-get-unique-documents-based-on-a-fieldName-tp27780251p27790381.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.



Re: how to use DuplicateFilter to get unique documents based on a fieldName

Posted by Ian Lea <ia...@gmail.com>.

If the field you want to use for deduping is ISBN, create a
DuplicateFilter using whatever your ISBN field name is as the field
name and pass that to one of the search methods that takes a filter.

If your index is large, I'd be worried about performance and would look
at deduping at indexing time, i.e. have one Lucene document per ISBN.


--
Ian.

