Posted to user@nutch.apache.org by qi wu <ch...@gmail.com> on 2007/03/20 18:18:21 UTC

Any way for removing pages with same title in index?

Hi,
I have found many pages that share the same title and whose contents are almost identical. I would like to index pages with the same title only once. How can I recognize pages with the same title during the indexing process?
How does Nutch remove pages with the same content, and in which class/package can I find the code?

Thanks
-Qi

Re: Any way for removing pages with same title in index?

Posted by Enis Soztutar <en...@gmail.com>.
qi wu wrote:
> Hi,
> I have found many pages that share the same title and whose contents are almost identical. I would like to index pages with the same title only once. How can I recognize pages with the same title during the indexing process?
> How does Nutch remove pages with the same content, and in which class/package can I find the code?
>
> Thanks
> -Qi
>   
Hi,

Normally, in the Nutch processing sequence, you can run the dedup
command after indexing to delete duplicate entries from the index. The
DeleteDuplicates class does this in two phases: in the first phase,
documents with the same URL are deleted, and in the second, documents
with the same content are deleted. In your case, I assume that the
document URLs are different but the contents are "nearly the same".
Whether two documents count as having the same content is decided by
comparing signatures, computed with either MD5Signature or
TextProfileSignature. MD5Signature computes a value from the exact
content of the page, so if two pages' contents are not exactly the same,
it generates distinct signatures. TextProfileSignature, however,
generates a signature based on the most frequent terms of the content,
so pages with similar content will usually generate the same signature.
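
To make the difference concrete, here is a rough, self-contained sketch.
It is not the Nutch code: SignatureSketch, md5Hex, profileSignature and
the fixed quant parameter are names and simplifications I made up for
illustration, only loosely following the "most frequent terms" idea.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not the Nutch implementation.
class SignatureSketch {

    // Hex-encoded MD5 of the exact text: any difference changes the result.
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Simplified "text profile": count terms, round each count down to a
    // multiple of quant (an assumed fixed parameter), drop terms that round
    // to zero, and hash the resulting profile instead of the raw text.
    static String profileSignature(String text, int quant) {
        if (quant < 1) {
            throw new IllegalArgumentException("quant must be >= 1");
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.length() >= 2) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> profile = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int quantized = (e.getValue() / quant) * quant;
            if (quantized > 0) {
                profile.add(new AbstractMap.SimpleEntry<>(e.getKey(), quantized));
            }
        }
        // Most frequent terms first; the sort is stable, so ties stay alphabetical.
        profile.sort((x, y) -> Integer.compare(y.getValue(), x.getValue()));
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : profile) {
            sb.append(e.getKey()).append(' ').append(e.getValue()).append('\n');
        }
        return md5Hex(sb.toString());
    }

    public static void main(String[] args) {
        String a = "nutch crawl nutch index crawl nutch page one";
        String b = "nutch crawl nutch index crawl nutch page two";
        System.out.println("same MD5:     " + md5Hex(a).equals(md5Hex(b)));
        System.out.println("same profile: "
                + profileSignature(a, 2).equals(profileSignature(b, 2)));
    }
}
```

The two sample strings differ only in a word that occurs once, so their
exact hashes differ while the quantized profiles match. The price is
that pages sharing only their frequent terms can also collide.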

I can recommend two options. The first is to use
TextProfileSignature (you can change the signature implementation in
the configuration); the other is to modify the DeleteDuplicates code to
delete duplicates by title. IMO the former is more sensible.
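
For the first option, if I remember right, the signature class is
selected with the db.signature.class property (override it in
nutch-site.xml), but please check nutch-default.xml in your version. For
the second option, the core logic would look roughly like the sketch
below. It is only the grouping idea in plain Java: TitleDedupSketch,
Page, normalizeTitle and keepFirstPerTitle are hypothetical names, and
it does not touch the real DeleteDuplicates or index APIs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not part of Nutch.
class TitleDedupSketch {

    // Minimal stand-in for an indexed document.
    static final class Page {
        final String url;
        final String title;

        Page(String url, String title) {
            this.url = url;
            this.title = title;
        }
    }

    // Treat titles as equal if they differ only in case or whitespace.
    static String normalizeTitle(String title) {
        return title.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    // Keep the first page seen for each normalized title, in input order.
    static List<Page> keepFirstPerTitle(List<Page> pages) {
        Map<String, Page> byTitle = new LinkedHashMap<>();
        for (Page page : pages) {
            byTitle.putIfAbsent(normalizeTitle(page.title), page);
        }
        return new ArrayList<>(byTitle.values());
    }
}
```

Keeping the first page seen is an arbitrary choice here; in a real
change you would probably want to pick the survivor by score or fetch
time instead.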