Posted to solr-user@lucene.apache.org by Pranav Prakash <pr...@gmail.com> on 2011/06/23 11:26:08 UTC

Removing duplicate documents from search results

How can I remove very similar documents from search results?

My scenario is that there are documents in the index which are almost
identical (people submitting the same content multiple times, sometimes
different people submitting the same content). When a search is performed
for "keyword", the same document quite frequently comes up multiple times in
the top N results. I want to remove those duplicate (or possibly duplicate)
documents, much as Google does when it says "In order to show you most
relevant result, duplicates have been removed". How can I achieve this
with Solr? Does Solr have a built-in feature or a plugin which could
help me with it?


*Pranav Prakash*

"temet nosce"

Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> |
Google <http://www.google.com/profiles/pranny>

Re: Removing duplicate documents from search results

Posted by Paul Libbrecht <pa...@hoplahup.net>.

Mohammad,

just in case you meant it, I would like to discourage you from trying to deduplicate *the search results*.
Many things go wrong if you do that; we had it in one version of the ActiveMath search environment (which uses Lucene):
- paging no longer works properly
- the total count is wrong unless you go through all the results
- performance can get really bad if you try to go through all the results
- performance does get bad for some searches if you try to fill the page (you need to keep fetching until you have found enough)
- you have to go through all the search results again and again when delivering the next pages

So, as others have suggested, please be sure to deduplicate somehow at indexing time.
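
To make the paging problem concrete, here is a minimal Python sketch of what filling one page costs when duplicates are filtered after the search. It is not taken from ActiveMath or Solr: `fetch`, `signature` and the in-memory `raw_hits` list are hypothetical stand-ins for a real search call and a real duplicate key.

```python
def dedup_page(fetch, signature, page, page_size, batch=10):
    """Return one page of de-duplicated hits plus the number of raw hits read.

    fetch(offset, limit) is a hypothetical stand-in for a search call that
    returns raw hits in rank order; signature(hit) returns the value that
    identifies duplicates (for example a content hash).
    """
    seen = set()
    unique = []
    offset = 0
    wanted = (page + 1) * page_size   # pages are numbered from 0
    while len(unique) < wanted:
        hits = fetch(offset, batch)
        if not hits:                  # ran out of raw results
            break
        offset += len(hits)
        for hit in hits:
            sig = signature(hit)
            if sig not in seen:
                seen.add(sig)
                unique.append(hit)
    start = page * page_size
    return unique[start:start + page_size], offset


# Toy data (an assumption): 200 ranked hits, but only 20 distinct documents.
raw_hits = [{"id": i, "hash": i % 20} for i in range(200)]
fetch = lambda offset, limit: raw_hits[offset:offset + limit]
by_hash = lambda hit: hit["hash"]

page0, read0 = dedup_page(fetch, by_hash, page=0, page_size=10)
page1, read1 = dedup_page(fetch, by_hash, page=1, page_size=10)
page2, read2 = dedup_page(fetch, by_hash, page=2, page_size=10)
```

In this toy run every page request starts again from the first raw hit, and the request for the third page has to read all 200 raw hits just to find out that no third page exists. The raw hit count of 200 is also of no use as a total for the user.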

paul

On 28 June 2011 at 14:24, Mohammad Shariq wrote:

> I am making the hash from the URL, but I can't use it as the uniqueKey
> because I am already using a UUID as the uniqueKey.
> Since I am using Solr only as the index engine and Riak (a key-value
> store) as the storage engine, I don't want to overwrite on duplicate.
> I just need to discard the duplicates.
> 
> 
> 
> 2011/6/28 François Schiettecatte <fs...@gmail.com>
> 
>> Create a hash from the URL and use that as the unique key; MD5 or SHA-1
>> would probably be good enough.
>> 
>> Cheers
>> 
>> François
>> 
>> On Jun 28, 2011, at 7:29 AM, Mohammad Shariq wrote:
>> 
>>> I also have the problem of duplicate docs.
>>> I am indexing news articles, and every news article has a source URL.
>>> If two news articles have the same URL, only one needs to be indexed;
>>> I want the duplicates removed at index time.
>>> 
>>> 
>>> 
>>> On 23 June 2011 21:24, simon <mt...@gmail.com> wrote:
>>> 
>>>> Have you checked out the deduplication process that's available at
>>>> indexing time? This includes a fuzzy hash algorithm.
>>>> 
>>>> http://wiki.apache.org/solr/Deduplication
>>>> 
>>>> -Simon
>>>> 
>>>> On Thu, Jun 23, 2011 at 5:55 AM, Pranav Prakash <pr...@gmail.com> wrote:
>>>>> This approach would definitely work if the two documents are *exactly*
>>>>> the same. But it is very fragile: even if one extra space has been
>>>>> added, the whole hash changes. What I am really looking for is some
>>>>> percentage similarity between documents, so that documents which are
>>>>> more than 95% similar can be removed.
>>>>> 
>>>>> *Pranav Prakash*
>>>>> 
>>>>> "temet nosce"
>>>>> 
>>>>> Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> |
>>>>> Google <http://www.google.com/profiles/pranny>
>>>>> 
>>>>> 
>>>>> On Thu, Jun 23, 2011 at 15:16, Omri Cohen <om...@yotpo.com> wrote:
>>>>> 
>>>>>> What you need to do is calculate some hash (using any message digest
>>>>>> algorithm you want: MD5, SHA-1 and so on), then do some reading on
>>>>>> Solr's field collapsing capabilities. It should not be too complicated.
>>>>>> 
>>>>>> *Omri Cohen*
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Thanks and Regards
>>> Mohammad Shariq
>> 
>> 
> 
> 
> -- 
> Thanks and Regards
> Mohammad Shariq

Re: Removing duplicate documents from search results

Posted by François Schiettecatte <fs...@gmail.com>.

Yeah, I read the overview, which suggests that duplicates can be prevented from entering the index, and scanned the rest; it does not look like you can actually drop the document entirely. Maybe I am missing something here.

François

On Jun 28, 2011, at 9:14 AM, Mohammad Shariq wrote:

> Hey François,
> thanks for your suggestion. I followed the same link
> (http://wiki.apache.org/solr/Deduplication).
> 
> They offer two solutions: either make the hash the uniqueKey, or overwrite on
> duplicate. I don't need either.
> 
> I need discard on duplicate.
> 
>> 
>> 
> 
> 
> -- 
> Thanks and Regards
> Mohammad Shariq

Re: Removing duplicate documents from search results

Posted by Mohammad Shariq <sh...@gmail.com>.

Hey François,
thanks for your suggestion. I followed the same link
(http://wiki.apache.org/solr/Deduplication).

They offer two solutions: either make the hash the uniqueKey, or overwrite on
duplicate. I don't need either.

I need discard on duplicate.

>
>
> I have not used it but it looks like it will do the trick.
>
> François
>
>
>


-- 
Thanks and Regards
Mohammad Shariq

Re: Removing duplicate documents from search results

Posted by François Schiettecatte <fs...@gmail.com>.

Indeed, take a look at this:
	
	http://wiki.apache.org/solr/Deduplication

I have not used it but it looks like it will do the trick.

François

On Jun 28, 2011, at 8:44 AM, Pranav Prakash wrote:

> I found the deduplication feature really useful, although I have not yet
> started to work on it, as there is some other low-hanging fruit I have to
> pick first. I will share my thoughts soon.
> 
> 
> *Pranav Prakash*
> 
> "temet nosce"
> 
> Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> |
> Google <http://www.google.com/profiles/pranny>
> 
> 

Re: Removing duplicate documents from search results

Posted by Pranav Prakash <pr...@gmail.com>.

I found the deduplication feature really useful, although I have not yet
started to work on it, as there is some other low-hanging fruit I have to
pick first. I will share my thoughts soon.
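
For the percentage-similarity idea raised earlier in the thread, a rough illustration may help. The sketch below compares two texts by the overlap of their word shingles (Jaccard similarity). This is a generic technique chosen for illustration; it is not claimed to be the fuzzy hash that Solr's deduplication uses, and the function names, the shingle size `k=3` and the sample texts are all assumptions.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of text, ignoring case and spacing."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b, k=3):
    """Jaccard similarity of the shingle sets of two texts, from 0.0 to 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)


original = "Solr is an open source enterprise search platform built on Lucene"
respaced = "Solr  is an open source enterprise   search platform built on Lucene"
different = "Riak is a distributed key-value store"

same = jaccard(original, respaced)    # extra spaces do not change the score
other = jaccard(original, different)  # no shared three-word shingles
```

Unlike an exact MD5 or SHA-1 digest, the score does not change when only whitespace differs, so a threshold such as 0.95 could be applied to it. Comparing every pair of documents this way gets expensive on a large collection, which is one reason to compute a signature once per document at indexing time instead.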


*Pranav Prakash*

"temet nosce"

Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> |
Google <http://www.google.com/profiles/pranny>


2011/6/28 François Schiettecatte <fs...@gmail.com>

> Maybe there is a way to get Solr to reject documents that already exist in
> the index, but I doubt it; maybe someone else can chime in here. You
> could do a search for each document prior to indexing it to see if it is
> already in the index, but that is probably non-optimal. Maybe it is easiest
> to check if the document exists in your Riak repository: if not, add it and
> index it, and drop it if it already exists.
>
> François
>
>
>

Re: Removing duplicate documents from search results

Posted by François Schiettecatte <fs...@gmail.com>.

Maybe there is a way to get Solr to reject documents that already exist in the index, but I doubt it; maybe someone else can chime in here. You could search for each document before indexing it to see whether it is already in the index, but that is probably non-optimal. It may be easiest to check whether the document exists in your Riak repository: if not, add it and index it; if it already exists, drop it.
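
A minimal Python sketch of the check-then-index flow described above. The dict `store` stands in for the Riak repository and the list `indexed` stands in for Solr; both, along with the function name `add_if_new`, are assumptions made for illustration and are not real client calls.

```python
def add_if_new(doc, store, indexed):
    """Store and index doc only if its URL has not been seen before.

    store   -- dict standing in for the key-value repository (hypothetical)
    indexed -- list standing in for the search index (hypothetical)
    Returns True if the document was added, False if it was dropped.
    """
    key = doc["url"]
    if key in store:
        return False        # duplicate: drop it, nothing is overwritten
    store[key] = doc        # new: keep it in the repository...
    indexed.append(doc)     # ...and pass it on to the index
    return True


store, indexed = {}, []
docs = [
    {"url": "http://example.com/a", "title": "First"},
    {"url": "http://example.com/a", "title": "First again"},
    {"url": "http://example.com/b", "title": "Second"},
]
results = [add_if_new(d, store, indexed) for d in docs]
print(results)
```

The existing copy is never touched in this flow, which matches the "discard, don't overwrite" requirement.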

François

On Jun 28, 2011, at 8:24 AM, Mohammad Shariq wrote:

> I am making the hash from the URL, but I can't use it as the uniqueKey because I
> am already using a UUID as the uniqueKey.
> Since I am using Solr only as the index engine and Riak (a key-value
> store) as the storage engine, I don't want to overwrite on duplicates.
> I just need to discard the duplicates.
> 
> 
> 

Re: Removing duplicate documents from search results

Posted by Mohammad Shariq <sh...@gmail.com>.

I am making the hash from the URL, but I can't use it as the uniqueKey because I
am already using a UUID as the uniqueKey.
Since I am using Solr only as the index engine and Riak (a key-value
store) as the storage engine, I don't want to overwrite on duplicates.
I just need to discard the duplicates.



2011/6/28 François Schiettecatte <fs...@gmail.com>

> Create a hash from the URL and use that as the unique key; MD5 or SHA-1
> would probably be good enough.
>
> Cheers
>
> François
>


-- 
Thanks and Regards
Mohammad Shariq

Re: Removing duplicate documents from search results

Posted by François Schiettecatte <fs...@gmail.com>.

Create a hash from the URL and use that as the unique key; MD5 or SHA-1 would probably be good enough.
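
For illustration, a short Python sketch of deriving such a key with the standard `hashlib` module. The function name `url_to_id`, the placeholder URL, and the use of a field called `id` as the unique key are assumptions, not something taken from an actual schema.

```python
import hashlib


def url_to_id(url, algorithm="sha1"):
    """Return the hex digest of the URL, usable as a document key."""
    digest = hashlib.new(algorithm)
    digest.update(url.encode("utf-8"))
    return digest.hexdigest()


url = "http://example.com/news/story-1"   # placeholder URL
doc = {"id": url_to_id(url), "url": url}  # "id" as the unique key is an assumption

md5_len = len(url_to_id(url, "md5"))      # MD5 digests are 32 hex characters
sha1_len = len(url_to_id(url, "sha1"))    # SHA-1 digests are 40 hex characters
print(md5_len, sha1_len)
```

Note that when the hash is the unique key, re-adding a document with the same URL replaces the copy already in the index rather than being rejected.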

Cheers

François

On Jun 28, 2011, at 7:29 AM, Mohammad Shariq wrote:

> I also have the problem of duplicate docs.
> I am indexing news articles, and every news article has a source URL.
> If two news articles have the same URL, only one needs to be indexed,
> so duplicates should be removed at index time.
> 
> 
> 

Re: Removing duplicate documents from search results

Posted by Mohammad Shariq <sh...@gmail.com>.

I also have the problem of duplicate docs.
I am indexing news articles, and every news article has a source URL.
If two news articles have the same URL, only one needs to be indexed,
so duplicates should be removed at index time.



On 23 June 2011 21:24, simon <mt...@gmail.com> wrote:

> Have you checked out the deduplication process that's available at
> indexing time? It includes a fuzzy hash algorithm.
>
> http://wiki.apache.org/solr/Deduplication
>
> -Simon
>



-- 
Thanks and Regards
Mohammad Shariq

Re: Removing duplicate documents from search results

Posted by simon <mt...@gmail.com>.

Have you checked out the deduplication process that's available at
indexing time? It includes a fuzzy hash algorithm.

http://wiki.apache.org/solr/Deduplication
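
To illustrate what a "fuzzy" hash means here, a simplified Python sketch follows: it hashes a normalized token-frequency profile of the text rather than the raw bytes, so changes in whitespace, case, or punctuation do not change the signature. This is not Solr's implementation; the function name `fuzzy_signature` and the parameters `min_token_len` and `quant_rate` are assumptions chosen for the example.

```python
import hashlib
import re
from collections import Counter


def fuzzy_signature(text, min_token_len=2, quant_rate=0.01):
    """Hash a quantized token-frequency profile instead of the raw text."""
    # Lowercase and split on anything that is not a letter or digit.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(t for t in tokens if len(t) >= min_token_len)
    max_freq = max(counts.values(), default=0)
    # Bucket size for counts; small texts end up with a bucket size of 1.
    quant = max(1, int(max_freq * quant_rate))
    # Round each count down to its bucket and drop tokens below one bucket.
    profile = sorted(
        (token, count // quant * quant)
        for token, count in counts.items()
        if count >= quant
    )
    canonical = "\n".join(f"{token} {count}" for token, count in profile)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


a = fuzzy_signature("Solr removes duplicate documents")
b = fuzzy_signature("  solr removes   duplicate documents. ")
c = fuzzy_signature("Solr keeps duplicate documents")
print(a == b, a == c)
```

Texts that differ only in formatting share a signature, while a changed word produces a different one.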

-Simon

On Thu, Jun 23, 2011 at 5:55 AM, Pranav Prakash <pr...@gmail.com> wrote:
> This approach would definitely work if the two documents are *exactly* the
> same, but it is very fragile: if even one extra space is added, the
> whole hash changes. What I am really looking for is a percentage
> similarity between documents, so that I can remove those which are more than
> 95% similar.
>

Re: Removing duplicate documents from search results

Posted by Pranav Prakash <pr...@gmail.com>.

This approach would definitely work if the two documents are *exactly* the
same, but it is very fragile: if even one extra space is added, the
whole hash changes. What I am really looking for is a percentage
similarity between documents, so that I can remove those which are more than
95% similar.
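
One common way to express "percentage similarity" is the Jaccard similarity of word shingles (overlapping word n-grams). The Python sketch below is only an illustration of that idea, not a Solr feature; the function names, the shingle size of 3, and the way the 0.95 threshold is applied are assumptions.

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word n-grams ("shingles") in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(a, b, size=3):
    """Jaccard similarity of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def is_near_duplicate(a, b, threshold=0.95):
    """True when the two texts are at least `threshold` similar."""
    return similarity(a, b) >= threshold


original = "the quick brown fox jumps over the lazy dog"
respaced = "the  quick brown fox jumps over the lazy dog"   # one extra space
unrelated = "something else entirely unrelated to foxes"
print(similarity(original, respaced), similarity(original, unrelated))
```

Because the text is tokenized before comparison, an extra space no longer makes two documents look different. Comparing every pair of documents this way gets expensive on a large index, so it is better suited to small candidate sets.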

*Pranav Prakash*

"temet nosce"

Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> |
Google <http://www.google.com/profiles/pranny>


On Thu, Jun 23, 2011 at 15:16, Omri Cohen <om...@yotpo.com> wrote:

> What you need to do is calculate a hash (using any message digest
> algorithm you like: MD5, SHA-1, and so on), then read up on Solr's
> field collapsing capabilities. It should not be too complicated.
>

Re: Removing duplicate documents from search results

Posted by Omri Cohen <om...@yotpo.com>.

What you need to do is calculate a hash (using any message digest
algorithm you like: MD5, SHA-1, and so on), then read up on Solr's
field collapsing capabilities. It should not be too complicated.
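
As a rough sketch of the query side, the Python snippet below builds a request URL that collapses results on a hash field using Solr's result grouping parameters (`group`, `group.field`, `group.limit`, `group.ngroups`). It assumes a Solr version that supports result grouping; the base URL, the field name `content_hash`, and the helper name `collapsed_query_url` are placeholders for illustration.

```python
from urllib.parse import urlencode


def collapsed_query_url(base_url, query, hash_field, rows=10):
    """Build a select URL that returns one document per distinct hash value."""
    params = [
        ("q", query),
        ("rows", rows),
        ("group", "true"),            # turn result grouping on
        ("group.field", hash_field),  # collapse on the stored hash
        ("group.limit", 1),           # keep one document per group
        ("group.ngroups", "true"),    # also report the number of groups
    ]
    return base_url + "/select?" + urlencode(params)


url = collapsed_query_url("http://localhost:8983/solr", "keyword", "content_hash")
print(url)
```

The hash still has to be computed and stored in that field at indexing time for the grouping to have anything to collapse on.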

*Omri Cohen*



Co-founder @ yotpo.com | omri@yotpo.com | +972-50-7235198 | +972-3-6036295





Re: Removing duplicate documents from search results

Posted by pravesh <su...@yahoo.com>.

Would you even want to index the duplicate documents? Finding duplicates in
content fields is not as easy as in an untokenized/keyword field.
Maybe you could do this filtering at indexing time, before sending the
document to Solr. Then the question becomes: which document from a group of
duplicates should go in? The latest one?
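
If "the latest one" is the answer, the filtering step could look roughly like the Python sketch below. The field names `url_hash` and `timestamp` and the function name `keep_latest` are hypothetical; any duplicate key and any comparable recency value would do.

```python
def keep_latest(docs, key_field="url_hash", time_field="timestamp"):
    """From each group of documents sharing a key, keep the most recent one."""
    latest = {}
    for doc in docs:
        key = doc[key_field]
        if key not in latest or doc[time_field] > latest[key][time_field]:
            latest[key] = doc
    return list(latest.values())


docs = [
    {"url_hash": "a", "timestamp": 1, "title": "old"},
    {"url_hash": "b", "timestamp": 2, "title": "other"},
    {"url_hash": "a", "timestamp": 3, "title": "new"},
]
titles = [d["title"] for d in keep_latest(docs)]
print(titles)
```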

--
View this message in context: http://lucene.472066.n3.nabble.com/Removing-duplicate-documents-from-search-results-tp3099214p3099432.html
Sent from the Solr - User mailing list archive at Nabble.com.