Posted to solr-user@lucene.apache.org by Fred Zimmerman <zi...@gmail.com> on 2011/11/25 02:04:18 UTC

remove answers with identical scores

I have a corpus that has a lot of identical or nearly identical documents.
I'd like to return only the unique ones (excluding the "nearly identical"
which are redirects).  I notice that all the identical/nearly identicals
have identical Solr scores. How can I tell Solr to throw out all the
successive documents in an answer set that have identical scores?

doc 1 score 5.0
doc 2 score 5.0
doc 3 score 5.0
doc 4 score 4.9

skip docs 2 and 3

bring back 10 docs with unique scores

Re: remove answers with identical scores

Posted by Erick Erickson <er...@gmail.com>.
Have you considered removing them at index time? See:
http://wiki.apache.org/solr/Deduplication
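
To illustrate the general idea of index-time deduplication, here is a minimal client-side sketch in Python. It is not Solr's own mechanism (see the wiki page above for that); it only shows the principle of computing a signature over selected fields and skipping documents whose signature has already been seen. The field names "title" and "body" and both function names are assumptions made up for this example. Note that this catches exact duplicates only (after whitespace and case normalization), not near-duplicates.

```python
import hashlib


def content_signature(doc, fields):
    """Hash the normalized values of the chosen fields into one hex digest."""
    digest = hashlib.md5()
    for name in fields:
        # Collapse whitespace and lowercase so trivial differences do not matter.
        value = " ".join(str(doc.get(name, "")).lower().split())
        digest.update(name.encode("utf-8"))
        digest.update(b"\x00")
        digest.update(value.encode("utf-8"))
        digest.update(b"\x00")
    return digest.hexdigest()


def unique_documents(docs, fields):
    """Yield only the first document seen for each signature."""
    seen = set()
    for doc in docs:
        signature = content_signature(doc, fields)
        if signature in seen:
            continue
        seen.add(signature)
        yield doc


docs = [
    {"id": "1", "title": "Solr scoring", "body": "How scores are computed."},
    {"id": "2", "title": "Solr  Scoring", "body": "how scores are computed."},
    {"id": "3", "title": "Faceting", "body": "How facets are computed."},
]
to_index = list(unique_documents(docs, ["title", "body"]))
print([d["id"] for d in to_index])  # ['1', '3']
```

Only the documents in to_index would then be sent to the indexer.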

Best
Erick

On Fri, Nov 25, 2011 at 3:13 PM, Ted Dunning <te...@gmail.com> wrote:
> See http://en.wikipedia.org/wiki/Locality-sensitive_hashing
>
> The obvious thought that I had just after hitting send was that you could
> put the LSH signatures on the documents.  That would let you do the scan at
> low volume and using LSH would make the duplicate scan almost as fast as
> your score scan idea.
>
> Whether Solr will do this for you is really neither here nor there.  Solr
> does an awful lot of stuff for an awful lot of people who find it very
> congenial.  They probably don't have lots of duplicate documents.  If you
> really think that this capability is core, then you can contribute an
> implementation to Solr and all will be made whole.  In the short-term, I
> would recommend you prototype independently.

Re: remove answers with identical scores

Posted by Ted Dunning <te...@gmail.com>.
See http://en.wikipedia.org/wiki/Locality-sensitive_hashing

The obvious thought that I had just after hitting send was that you could
put the LSH signatures on the documents.  That would let you do the scan at
low volume and using LSH would make the duplicate scan almost as fast as
your score scan idea.

Whether Solr will do this for you is really neither here nor there.  Solr
does an awful lot of stuff for an awful lot of people who find it very
congenial.  They probably don't have lots of duplicate documents.  If you
really think that this capability is core, then you can contribute an
implementation to Solr and all will be made whole.  In the short-term, I
would recommend you prototype independently.
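
For an independent prototype of the signature idea, something along these lines might be a starting point. It is a rough sketch of one common LSH scheme (MinHash over word shingles, with banding to find candidate pairs), not anything Solr provides. All function names, the shingle size, and the band/row counts are assumptions chosen for illustration and would need tuning on a real corpus.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=3):
    """Return the set of k-word shingles of a text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(text, num_hashes=16):
    """One minimum per seeded hash function over the shingle set."""
    shingle_set = shingles(text)
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.md5(f"{seed}:{s}".encode("utf-8")).digest()[:8], "big")
            for s in shingle_set))
    return tuple(signature)


def candidate_groups(docs, bands=8, rows=2):
    """Group documents that agree on every row of at least one band."""
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        signature = minhash_signature(text, bands * rows)
        for band in range(bands):
            key = (band, signature[band * rows:(band + 1) * rows])
            buckets[key].add(doc_id)
    return {frozenset(ids) for ids in buckets.values() if len(ids) > 1}


docs = {
    "a": "solr returns documents ranked by relevance score",
    "b": "solr returns documents ranked by relevance score",
    "c": "locality sensitive hashing groups similar items together",
}
print(candidate_groups(docs))  # {frozenset({'a', 'b'})}; element order may vary
```

Storing the signature (or its band keys) on each document is what makes the duplicate scan cheap: only documents that share a bucket need to be compared in full.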

On Fri, Nov 25, 2011 at 4:47 AM, Fred Zimmerman <zi...@gmail.com> wrote:

> Thanks. I did consider postprocessing and may wind up doing that; I was
> hoping there was a way to have Solr do it for me! That I have to ask this
> question is probably not a good sign, but what is LSH clustering?

Re: remove answers with identical scores

Posted by Fred Zimmerman <zi...@gmail.com>.
Thanks. I did consider postprocessing and may wind up doing that; I was
hoping there was a way to have Solr do it for me! That I have to ask this
question is probably not a good sign, but what is LSH clustering?

On Fri, Nov 25, 2011 at 4:34 AM, Ted Dunning <te...@gmail.com> wrote:

> You can do that pretty easily by just retrieving extra documents and post
> processing the results list.
>
> You are likely to have a significant number of apparent duplicates this
> way.
>
> To really get rid of duplicates in results, it might be better to remove
> them from the corpus by deploying something like LSH clustering.

Re: remove answers with identical scores

Posted by Ted Dunning <te...@gmail.com>.
You can do that pretty easily by just retrieving extra documents and post
processing the results list.

You are likely to have a significant number of apparent duplicates this
way.
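
A minimal sketch of that post-processing step, in Python. It assumes the client has already fetched more rows than it needs and has them as (doc id, score) pairs in ranked order; the function name and the limit parameter are made up for this example. Keep in mind that unrelated documents can also tie on score, so this filter may drop some legitimate results.

```python
def drop_repeated_scores(results, limit=10):
    """Keep the first document of each run of identical scores."""
    kept = []
    previous_score = None
    for doc_id, score in results:
        # Skip any document whose score equals the one just before it.
        if previous_score is not None and score == previous_score:
            continue
        kept.append((doc_id, score))
        previous_score = score
        if len(kept) == limit:
            break
    return kept


results = [("doc 1", 5.0), ("doc 2", 5.0), ("doc 3", 5.0), ("doc 4", 4.9)]
print(drop_repeated_scores(results))  # [('doc 1', 5.0), ('doc 4', 4.9)]
```

If fewer than limit documents survive, the client would need to request the next page of results and continue.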

To really get rid of duplicates in results, it might be better to remove
them from the corpus by deploying something like LSH clustering.

On Thu, Nov 24, 2011 at 5:04 PM, Fred Zimmerman <zi...@gmail.com> wrote:

> I have a corpus that has a lot of identical or nearly identical documents.
> I'd like to return only the unique ones (excluding the "nearly identical"
> which are redirects).  I notice that all the identical/nearly identicals
> have identical Solr scores. How can I tell Solr to throw out all the
> successive documents in an answer set that have identical scores?
>
> doc 1 score 5.0
> doc 2 score 5.0
> doc 3 score 5.0
> doc 4 score 4.9
>
> skip docs 2 and 3
>
> bring back 10 docs with unique scores
>