Posted to user@nutch.apache.org by kiran chitturi <ch...@gmail.com> on 2012/09/13 23:44:29 UTC

nutch dedup on content of the html

Hi,

When crawling and indexing documents, I have seen that Nutch creates a
signature and runs dedup on Solr, where it shows up as digest.

Can anyone point out how the signature is computed? Is it based on the
entire text of the file?

Can I create the signature from only one field, such as 'content', so that
Solr can dedup files with the same content but different URLs?

Many thanks for your help,
-- 
Kiran Chitturi

RE: nutch dedup on content of the html

Posted by Markus Jelsma <ma...@openindex.io>.

It depends on which signature implementation you use, which is configured in
your nutch-site.xml:

http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/crawl/MD5Signature.java?view=markup

http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/crawl/TextProfileSignature.java?view=markup
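
To illustrate the difference between the two approaches linked above, here is
a minimal standalone sketch. It is not the Nutch source: the class name
SignatureSketch and the methods exactSignature and textSignature are
hypothetical, and the text profile is deliberately simplified (for example, it
skips the frequency quantization that the real TextProfileSignature applies).
As far as I know, the implementation is selected with the db.signature.class
property, but please check nutch-default.xml for your version.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not part of Nutch.
class SignatureSketch {

    // Hex-encoded MD5 digest of the given bytes.
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Exact signature: any difference in the raw content changes the digest.
    static String exactSignature(String rawContent) {
        return md5Hex(rawContent.getBytes(StandardCharsets.UTF_8));
    }

    // Simplified text-profile signature: hash a sorted list of token
    // frequencies, so case, punctuation and word order are ignored.
    static String textSignature(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
        for (String token : tokens) {
            if (token.length() >= 2) {          // drop empty and 1-char tokens
                counts.merge(token, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Most frequent first; the sort is stable, so ties keep the
        // alphabetical order provided by the TreeMap.
        entries.sort((a, b) -> b.getValue() - a.getValue());
        StringBuilder profile = new StringBuilder();
        for (Map.Entry<String, Integer> e : entries) {
            profile.append(e.getKey()).append(' ').append(e.getValue()).append('\n');
        }
        return md5Hex(profile.toString().getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String a = "Hello  World, hello!";
        String b = "hello world HELLO";
        System.out.println(exactSignature(a).equals(exactSignature(b)));
        System.out.println(textSignature(a).equals(textSignature(b)));
    }
}
```

With an exact digest, two pages dedup only if their content is identical; a
text-based profile lets pages with the same text but different URLs or minor
formatting differences share a signature.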
 
-----Original message-----
> From:kiran chitturi <ch...@gmail.com>
> Sent: Thu 13-Sep-2012 23:48
> To: user@nutch.apache.org
> Subject: nutch dedup on content of the html
> 
> Hi,
> 
> When crawling and indexing the document i have seen that Nutch is creating
> singature and running dedup on solr which it shows as digest.
> 
> Can anyone point how the signature is computed, is it based on the entire
> text in the file ?
> 
> Can i create signature based on only one field like 'content' so that solr
> can dedup files with same content but different urls ?
> 
> Many Thanks for your help,
> -- 
> Kiran Chitturi
>