Posted to java-user@lucene.apache.org by Miles Barr <mi...@runtime-collective.com> on 2005/03/14 18:19:24 UTC

Removing similar documents from search results

Has anyone tried to remove similar documents from their search results?
It looks like Google does some on the fly filtering of the results,
hiding pages which it thinks are too similar, i.e. when you see:

"In order to show you the most relevant results, we have omitted some
entries very similar to the 7 already displayed.
If you like, you can repeat the search with the omitted results
included."

at the bottom of the page.

Is there anything in Lucene or one of the contrib packages that compares
two documents?



-- 
Miles Barr <mi...@runtime-collective.com>
Runtime Collective Ltd.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Removing similar documents from search results

Posted by Miles Barr <mi...@runtime-collective.com>.
On Mon, 2005-03-14 at 10:24 -0800, David Spencer wrote:
> Yes, in theory the "similarity" package in the sandbox can help.
> The code generates a query for a source document to find documents that 
> are similar to it - the MoreLikeThis class uses the heuristic that 2 
> docs are similar if they share "interesting" words. "Interesting" words 
> are words that are common in a source doc but not too common in the 
> corpus. If you were to do this you'd do something like this:
> 
> [1] Do your normal query
> [2] As you loop thru the results, for every doc
> [2a]	generate a similarity query
> [2b]	requery the index for similar docs
> [2c]	then, maybe, for every doc from [2b] with a score above some
> threshold, if it's also high up in the results from [2] then "hide" the
> doc a la Google et al.
> 
> Could be tricky coding. Another way is to only show 1 doc from any given 
> domain. Note that instead of 1 query you'll have "1+n" queries for the 
> display of "n" search results.

That sounds like an interesting approach. But I'll probably wait until
Chuck's patch is included. I'm also a bit worried about the performance
of this approach. It might add too much time to each query.



-- 
Miles Barr <mi...@runtime-collective.com>
Runtime Collective Ltd.




Re: Removing similar documents from search results

Posted by David Spencer <da...@tropo.com>.
Miles Barr wrote:

> Has anyone tried to remove similar documents from their search results?
> It looks like Google does some on the fly filtering of the results,
> hiding pages which it thinks are too similar, i.e. when you see:
> 
> "In order to show you the most relevant results, we have omitted some
> entries very similar to the 7 already displayed.
> If you like, you can repeat the search with the omitted results
> included."
> 
> at the bottom of the page.
> 
> Is there anything in Lucene or one of the contrib packages that compares
> two documents?

Yes, in theory the "similarity" package in the sandbox can help.
The code generates a query for a source document to find documents that 
are similar to it - the MoreLikeThis class uses the heuristic that 2 
docs are similar if they share "interesting" words. "Interesting" words 
are words that are common in a source doc but not too common in the 
corpus. If you were to do this you'd do something like this:

[1] Do your normal query
[2] As you loop thru the results, for every doc
[2a]	generate a similarity query
[2b]	requery the index for similar docs
[2c]	then, maybe, for every doc from [2b] with a score above some
threshold, if it's also high up in the results from [2] then "hide" the
doc a la Google et al.

Could be tricky coding. Another way is to only show 1 doc from any given 
domain. Note that instead of 1 query you'll have "1+n" queries for the 
display of "n" search results.
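The hide-the-near-dupes step [2c] could be sketched roughly like this. This is just a toy illustration, not the similarity package's API: the per-doc similarity queries are stubbed out as a precomputed map, where a real implementation would build a MoreLikeThis query for each hit and re-search the index:

```java
import java.util.*;

// Sketch of steps [1]-[2c]: walk the primary results in rank order and
// hide any doc that scores above a threshold in an earlier (kept) doc's
// similarity results.
public class DuplicateHider {

    // primary: doc ids in rank order from the normal query [1].
    // simMap.get(d): docs similar to d with their similarity scores [2a]/[2b],
    // stubbed here as a precomputed map.
    public static List<Integer> filter(List<Integer> primary,
                                       Map<Integer, Map<Integer, Float>> simMap,
                                       float threshold) {
        List<Integer> kept = new ArrayList<>();
        Set<Integer> hidden = new HashSet<>();
        for (int doc : primary) {
            if (hidden.contains(doc)) continue;   // already shadowed by an earlier hit
            kept.add(doc);
            Map<Integer, Float> similar = simMap.getOrDefault(doc, Map.of());
            for (Map.Entry<Integer, Float> e : similar.entrySet()) {
                if (e.getValue() >= threshold) hidden.add(e.getKey()); // step [2c]
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Integer> primary = List.of(1, 2, 3, 4);
        Map<Integer, Map<Integer, Float>> sim = Map.of(
            1, Map.of(3, 0.9f),   // doc 3 is nearly identical to doc 1
            2, Map.of(4, 0.2f));  // doc 4 is only weakly similar to doc 2
        System.out.println(filter(primary, sim, 0.5f)); // [1, 2, 4]
    }
}
```

Note that docs similar to an already-hidden doc are still considered on their own, so each group of near-duplicates keeps exactly one representative: its highest-ranked member.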




Similarity links:

Source control:

	http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/similarity/

My weblog entry about the code being checked in:

	http://searchmorph.com/weblog/index.php?id=44

Javadoc of it that I host:

	http://searchmorph.com/pub/jakarta-lucene-sandbox/contributions/similarity/build/docs/api/org/apache/lucene/search/similar/MoreLikeThis.html


-- Dave


> 
> 
> 




Re: Removing similar documents from search results

Posted by sergiu gordea <gs...@ifit.uni-klu.ac.at>.
Chris Lamprecht wrote:

It's a nice idea, and it makes sense. I think it can break down if
boosting is used and the search is performed on multiple fields, especially
unstored ones. In that case the distance between very similar documents
might be increased.
I also think the duplicates should be searched for within a proximity of
x docs, or x score units.
The point is: you need two loops, even if the second is a short one.

 Best,

 Sergiu

>Miles,
>
>I'm assuming that you want to detect documents that are "almost"
>exactly the same (since if they were identical, you could just do a
>straight string compare or md5 compare, etc).
>
>If you're storing term vectors in your index, you could compare the
>term vectors for the search results, and if their similarity is less
>than some threshold, consider them duplicates.  How you define
>"similarity" is up to you, but using the cosine similarity between
>each document is probably a good start.
>
>You don't want to compare each document in your result set to each
>other document (this would be n^2 comparisons!).  But you know the
>documents are returned in order of relevance to your query (assuming
>you didn't sort on some other field).  So you know that if two or more
>documents are going to be near-duplicates, then they should appear
>consecutively in the Hits.  This saves you not only from the n^2
>comparisons, but from comparing documents whose scores are so
>different that they can't possibly be duplicates.
>
>Thus, I think the following logic may perform fairly efficiently:
>
>final float DUPE_THRESHOLD = 0.05f;  // you'll have to tune this parameter
>float previousScore = Float.MAX_VALUE;
>
>for (int i=0; i < hits.length(); ++i) {
>  Document doc = hits.doc(i);
>  // assumption: scores are non-increasing
>  float delta = previousScore - hits.score(i);
>  if (delta < DUPE_THRESHOLD) {
>    // doc's score is very close, now check for dupes
>    // using cosine similarity or some other measure    
>    ...
>    // probably don't set previous score here, since you
>    // want to compare each doc to the first non-duplicate doc in the group
>  } else {
>    // scores are too far apart to be duplicates
>    // so don't even compute cosine similarity
>    previousScore = hits.score(i);
>  }
>}
>
>
>I haven't tried this approach out yet.  If anyone sees any big flaws
>in it, please let me know, and if you try it, I'd like to hear how it
>works out.
>
>-chris
>
>
>On Mon, 14 Mar 2005 17:19:24 +0000, Miles Barr
><mi...@runtime-collective.com> wrote:
>  
>
>>Has anyone tried to remove similar documents from their search results?
>>It looks like Google does some on the fly filtering of the results,
>>hiding pages which it thinks are too similar, i.e. when you see:
>>
>>"In order to show you the most relevant results, we have omitted some
>>entries very similar to the 7 already displayed.
>>If you like, you can repeat the search with the omitted results
>>included."
>>
>>at the bottom of the page.
>>
>>Is there anything in Lucene or one of the contrib packages that compares
>>two documents?
>>
>>--
>>Miles Barr <mi...@runtime-collective.com>
>>Runtime Collective Ltd.
>>




Re: Removing similar documents from search results

Posted by Chris Lamprecht <cl...@gmail.com>.
Miles,

I'm assuming that you want to detect documents that are "almost"
exactly the same (since if they were identical, you could just do a
straight string compare or md5 compare, etc).

If you're storing term vectors in your index, you could compare the
term vectors for the search results, and if their similarity is less
than some threshold, consider them duplicates.  How you define
"similarity" is up to you, but using the cosine similarity between
each document is probably a good start.

You don't want to compare each document in your result set to each
other document (this would be n^2 comparisons!).  But you know the
documents are returned in order of relevance to your query (assuming
you didn't sort on some other field).  So you know that if two or more
documents are going to be near-duplicates, then they should appear
consecutively in the Hits.  This saves you not only from the n^2
comparisons, but from comparing documents whose scores are so
different that they can't possibly be duplicates.

Thus, I think the following logic may perform fairly efficiently:

final float DUPE_THRESHOLD = 0.05f;  // you'll have to tune this parameter
float previousScore = Float.MAX_VALUE;

for (int i=0; i < hits.length(); ++i) {
  Document doc = hits.doc(i);
  // assumption: scores are non-increasing
  float delta = previousScore - hits.score(i);
  if (delta < DUPE_THRESHOLD) {
    // doc's score is very close, now check for dupes
    // using cosine similarity or some other measure    
    ...
    // probably don't set previous score here, since you
    // want to compare each doc to the first non-duplicate doc in the group
  } else {
    // scores are too far apart to be duplicates
    // so don't even compute cosine similarity
    previousScore = hits.score(i);
  }
}


I haven't tried this approach out yet.  If anyone sees any big flaws
in it, please let me know, and if you try it, I'd like to hear how it
works out.
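For the dupe check left as "..." above, a plain cosine similarity over term-frequency maps might look like the sketch below. It builds the vectors from raw strings for illustration; against a real index you would read the terms and frequencies out of the stored term vectors instead:

```java
import java.util.*;

// Minimal cosine similarity between two term-frequency vectors. The
// vectors are built from raw text here; with stored term vectors you'd
// fill the maps from the index's term/frequency arrays instead.
public class CosineSim {

    public static Map<String, Integer> termFreqs(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : text.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) tf.merge(t, 1, Integer::sum);
        }
        return tf;
    }

    // cos(a, b) = (a . b) / (|a| * |b|); 1.0 = same term distribution.
    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) normB += v * v;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        System.out.println(cosine(termFreqs("the quick brown fox"),
                                  termFreqs("the quick brown fox")));     // 1.0
        System.out.println(cosine(termFreqs("the quick brown fox"),
                                  termFreqs("completely unrelated words here"))); // 0.0
    }
}
```

Two hits would then count as duplicates when `cosine(...)` exceeds some tuned threshold, e.g. 0.95.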

-chris


On Mon, 14 Mar 2005 17:19:24 +0000, Miles Barr
<mi...@runtime-collective.com> wrote:
> Has anyone tried to remove similar documents from their search results?
> It looks like Google does some on the fly filtering of the results,
> hiding pages which it thinks are too similar, i.e. when you see:
> 
> "In order to show you the most relevant results, we have omitted some
> entries very similar to the 7 already displayed.
> If you like, you can repeat the search with the omitted results
> included."
> 
> at the bottom of the page.
> 
> Is there anything in Lucene or one of the contrib packages that compares
> two documents?
> 
> --
> Miles Barr <mi...@runtime-collective.com>
> Runtime Collective Ltd.
> 



Re: Removing similar documents from search results

Posted by David Spencer <da...@tropo.com>.
Miles Barr wrote:
> On Mon, 2005-03-14 at 20:48 +0100, Dawid Weiss wrote:
> 
>>I think what they do at Google is a fancy heuristic -- as David Spencer 
>>mentioned, suburls of a given page, identical snippets, or titles... My 
>>idea was more towards providing a 'realistic overview' of subjects in 
>>pages. So you could pick, say, the first document from each cluster and 
>>show them like that to the user. Then, in every cluster documents 
>>already have mutual similarity (this could be calculated manually, the 
>>clustering algorithm doesn't do it for all pairs of documents), but some 
>>have more and some have less. You could then hide nearly identical 
>>results from the user.
>>
>>Anyway, I think the Google method is just a heuristic based on URLs and
>>nothing as fancy as that.
> 
> 
> At the moment I need something quite simple. To identify a page that
> appears in many forms, e.g.:
> 
> - Normal version
> - Split across several pages
> - Print version
> - From a different section (different styling and navigation elements)
> 
> Basically identical content, presented in different ways.
> 
> I'm beginning to think any on the fly filtering would have to be quite
> simple because of the required response time. Something like looking at
> URLs and titles (which are already separate in the index).
> 
> A more permanent solution would have to process the pages ahead of time,
> trying to break them down into chunks, then maybe build up a list of
> hashes of the chunks for each page. Then each page would have a
> 'fingerprint', and hopefully you could come up with a quick way to
> compare them at query time.

I think the paper "Syntactic Clustering of the Web" describes a 
technique similar to what you suggest:

http://www.std.org/~msm/common/clustering.html
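If I recall the paper correctly, its resemblance measure works on "shingles": slide a fixed-size word window over each document, hash every window, and compare documents by the overlap (Jaccard coefficient) of the resulting hash sets. A rough sketch, with an arbitrary shingle size and Java's String.hashCode standing in for the paper's fingerprints:

```java
import java.util.*;

// Sketch of shingle-based resemblance: hash every overlapping run of
// `shingleWords` words, then compare pages by the Jaccard overlap of
// their hash sets (1.0 = identical shingle sets).
public class ChunkFingerprint {

    public static Set<Integer> fingerprint(String text, int shingleWords) {
        String[] words = text.toLowerCase().split("\\W+");
        Set<Integer> hashes = new HashSet<>();
        for (int i = 0; i + shingleWords <= words.length; i++) {
            hashes.add(String.join(" ",
                Arrays.copyOfRange(words, i, i + shingleWords)).hashCode());
        }
        return hashes;
    }

    public static double resemblance(Set<Integer> a, Set<Integer> b) {
        Set<Integer> inter = new HashSet<>(a); inter.retainAll(b);
        Set<Integer> union = new HashSet<>(a); union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        Set<Integer> p1 = fingerprint("one two three four five six seven", 5);
        Set<Integer> p2 = fingerprint("one two three four five six seven", 5);
        System.out.println(resemblance(p1, p2)); // 1.0 for identical pages
    }
}
```

The fingerprints could be computed once at index time and stored per page, so only the set intersection runs at query time.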




Re: Removing similar documents from search results

Posted by Miles Barr <mi...@runtime-collective.com>.
On Sun, 2005-03-20 at 00:49 -0800, Chris Hostetter wrote:
> Actually, your "Split across several pages" comment implies that you want
> a system which can tell that page 1 of a multipage article should be
> grouped with page 2 -- which may be radically different content.  Most
> multipage documents have very different text on subsequent pages, so I'm
> not sure that a programmatic solution is going to be able to spot that.

Actually I added that in after I saw that Google does it. You're right
that the content is likely to be completely different, so I guess they do
it through some URL matching.

> I may also be reading too much into your message, but it sounds like you
> aren't trying to index generic content -- it sounds like you are trying to
> index content under your control (ie: content on your own web site).
> 
> if that's the case, then presumably you know something about the
> source data and the URL structure -- maybe you could solve this problem
> when you build your index.
> 
> For example, if I look at a site like perl.com, I can see a pattern in the
> way the article URLs look...
> 
> page 1...
> http://www.perl.com/pub/a/2005/02/17/3d_engine.html
> page 2, etc...
> http://www.perl.com/pub/a/2005/02/17/3d_engine.html?page=2
> printable...
> http://www.perl.com/lpt/a/2005/02/17/3d_engine.html
> 
> 
> So instead of putting all of those URLs in the index as separate docs, why
> not create a single doc, with all of those URLs?

I have to index several sites, and the examples above are some of the
problems I've come across so far. I don't control the content for any of
them, and the pages get picked up by a spider, so excluding pages requires
adding special cases.

I'll probably adopt a two stage approach.

1. Prevent duplicate documents from getting into the index in the first
place, e.g. compare MD5 hashes and file sizes, maybe make the spider
configurable to spot certain URL patterns, etc.

2. Try out the various techniques suggested in this thread to spot
similar pages at query time and hide them.
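For stage 1, the MD5 comparison could be as simple as remembering a digest per page while the spider feeds documents in. A minimal sketch (the spider and indexing hooks are left out; only the hashing is shown):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Stage 1 sketch: skip exact duplicates at index time by keeping a set
// of MD5 digests of page content already indexed.
public class ExactDupFilter {
    private final Set<String> seen = new HashSet<>();

    public static String md5Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e); // MD5 is always present in the JDK
        }
    }

    // Returns true only the first time a given content string is seen.
    public boolean shouldIndex(String content) {
        return seen.add(md5Hex(content));
    }

    public static void main(String[] args) {
        ExactDupFilter f = new ExactDupFilter();
        System.out.println(f.shouldIndex("same page body")); // true
        System.out.println(f.shouldIndex("same page body")); // false, duplicate
    }
}
```

Comparing file sizes first, as suggested above, would let most pages skip the digest entirely.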



-- 
Miles Barr <mi...@runtime-collective.com>
Runtime Collective Ltd.




Re: Removing similar documents from search results

Posted by Chris Hostetter <ho...@fucit.org>.
: At the moment I need something quite simple. To identify a page that
: appears in many forms, e.g.:
:
: - Normal version
: - Split across several pages
: - Print version
: - From a different section (different styling and navigation elements)
:
: Basically identical content, presented in different ways.

Actually, your "Split across several pages" comment implies that you want
a system which can tell that page 1 of a multipage article should be
grouped with page 2 -- which may be radically different content.  Most
multipage documents have very different text on subsequent pages, so I'm
not sure that a programmatic solution is going to be able to spot that.

I may also be reading too much into your message, but it sounds like you
aren't trying to index generic content -- it sounds like you are trying to
index content under your control (ie: content on your own web site).

if that's the case, then presumably you know something about the
source data and the URL structure -- maybe you could solve this problem
when you build your index.

For example, if I look at a site like perl.com, I can see a pattern in the
way the article URLs look...

page 1...
http://www.perl.com/pub/a/2005/02/17/3d_engine.html
page 2, etc...
http://www.perl.com/pub/a/2005/02/17/3d_engine.html?page=2
printable...
http://www.perl.com/lpt/a/2005/02/17/3d_engine.html


So instead of putting all of those URLs in the index as separate docs, why
not create a single doc, with all of those URLs?
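e.g. a hypothetical canonicalizer for that perl.com pattern -- strip the ?page=N suffix and fold the printable /lpt/ path into /pub/ -- so every variant maps to one article key (the regexes below only cover this one site's layout):

```java
// Collapse known URL variants of the same perl.com article to one
// canonical key before indexing. Site-specific by design.
public class UrlCanonicalizer {
    public static String canonical(String url) {
        return url.replaceFirst("\\?page=\\d+$", "")  // drop "page 2" etc.
                  .replaceFirst("/lpt/", "/pub/");    // fold printable version
    }

    public static void main(String[] args) {
        System.out.println(canonical("http://www.perl.com/pub/a/2005/02/17/3d_engine.html?page=2"));
        System.out.println(canonical("http://www.perl.com/lpt/a/2005/02/17/3d_engine.html"));
        // both print http://www.perl.com/pub/a/2005/02/17/3d_engine.html
    }
}
```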




-Hoss




Re: Removing similar documents from search results

Posted by Miles Barr <mi...@runtime-collective.com>.
On Mon, 2005-03-14 at 20:48 +0100, Dawid Weiss wrote:
> I think what they do at Google is a fancy heuristic -- as David Spencer 
> mentioned, suburls of a given page, identical snippets, or titles... My 
> idea was more towards providing a 'realistic overview' of subjects in 
> pages. So you could pick, say, the first document from each cluster and 
> show them like that to the user. Then, in every cluster documents 
> already have mutual similarity (this could be calculated manually, the 
> clustering algorithm doesn't do it for all pairs of documents), but some 
> have more and some have less. You could then hide nearly identical 
> results from the user.
> 
> Anyway, I think the Google method is just a heuristic based on URLs and
> nothing as fancy as that.

At the moment I need something quite simple. To identify a page that
appears in many forms, e.g.:

- Normal version
- Split across several pages
- Print version
- From a different section (different styling and navigation elements)

Basically identical content, presented in different ways.

I'm beginning to think any on the fly filtering would have to be quite
simple because of the required response time. Something like looking at
URLs and titles (which are already separate in the index).

A more permanent solution would have to process the pages ahead of time,
trying to break them down into chunks, then maybe build up a list of
hashes of the chunks for each page. Then each page would have a
'fingerprint', and hopefully you could come up with a quick way to
compare them at query time.


-- 
Miles Barr <mi...@runtime-collective.com>
Runtime Collective Ltd.




Re: Removing similar documents from search results

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
I think what they do at Google is a fancy heuristic -- as David Spencer 
mentioned, suburls of a given page, identical snippets, or titles... My 
idea was more towards providing a 'realistic overview' of subjects in 
pages. So you could pick, say, the first document from each cluster and 
show them like that to the user. Then, in every cluster documents 
already have mutual similarity (this could be calculated manually, the 
clustering algorithm doesn't do it for all pairs of documents), but some 
have more and some have less. You could then hide nearly identical 
results from the user.

Anyway, I think the Google method is just a heuristic based on URLs and
nothing as fancy as that.

D.

Miles Barr wrote:
> Hi Dawid,
> 
> On Mon, 2005-03-14 at 18:55 +0100, Dawid Weiss wrote:
> 
>>I can imagine if you apply clustering to search results anyway then the 
>>information about clusters can help you determine 'similar' results and 
>>reorder the output list.
> 
> 
> That's an interesting idea. How easy is it to 'tighten' the clustering
> cones? So say we take a very narrow cone around each result and any
> other documents within that cone can be considered similar enough, and
> hence not displayed. Then we'd take the document closest to the centre
> of the cloud and make that the 'original' copy and display it.
> 
> Or would that approach be too expensive to calculate for each search?



Re: Removing similar documents from search results

Posted by Miles Barr <mi...@runtime-collective.com>.
Hi Dawid,

On Mon, 2005-03-14 at 18:55 +0100, Dawid Weiss wrote:
> I can imagine if you apply clustering to search results anyway then the 
> information about clusters can help you determine 'similar' results and 
> reorder the output list.

That's an interesting idea. How easy is it to 'tighten' the clustering
cones? So say we take a very narrow cone around each result and any
other documents within that cone can be considered similar enough, and
hence not displayed. Then we'd take the document closest to the centre
of the cloud and make that the 'original' copy and display it.

Or would that approach be too expensive to calculate for each search?



-- 
Miles Barr <mi...@runtime-collective.com>
Runtime Collective Ltd.




Re: Removing similar documents from search results

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
Hi Miles :)

I can imagine if you apply clustering to search results anyway then the 
information about clusters can help you determine 'similar' results and 
reorder the output list.

Just a thought.

D.


Miles Barr wrote:
> Has anyone tried to remove similar documents from their search results?
> It looks like Google does some on the fly filtering of the results,
> hiding pages which it thinks are too similar, i.e. when you see:
> 
> "In order to show you the most relevant results, we have omitted some
> entries very similar to the 7 already displayed.
> If you like, you can repeat the search with the omitted results
> included."
> 
> at the bottom of the page.
> 
> Is there anything in Lucene or one of the contrib packages that compares
> two documents?
