Posted to user@nutch.apache.org by kiran chitturi <ch...@gmail.com> on 2012/09/13 23:44:29 UTC
nutch dedup on content of the html
Hi,
When crawling and indexing documents, I have seen that Nutch creates a
signature (shown in Solr as digest) and runs dedup on Solr.
Can anyone point out how the signature is computed? Is it based on the entire
text in the file?
Can I create a signature based on only one field, like 'content', so that Solr
can dedup files with the same content but different URLs?
Many thanks for your help,
--
Kiran Chitturi
RE: nutch dedup on content of the html
Posted by Markus Jelsma <ma...@openindex.io>.
It depends on the implementation you use, configured in your nutch-site.xml:
http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/crawl/MD5Signature.java?view=markup
http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/crawl/TextProfileSignature.java?view=markup
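To illustrate the difference between the two approaches, here is a minimal Python sketch. It is not Nutch's code. It only shows the general idea of a content-based signature: an exact MD5 over the raw bytes, next to a looser signature over normalized text. The function names, the example URLs, and the normalization (lowercasing and collapsing whitespace) are assumptions made for illustration, and the normalization is much simpler than what TextProfileSignature does. As far as I know, the implementation Nutch uses is selected with the db.signature.class property.

```python
import hashlib
import re


def exact_signature(content: bytes) -> str:
    """MD5 hex digest of the raw bytes; any byte difference changes it."""
    return hashlib.md5(content).hexdigest()


def normalized_signature(text: str) -> str:
    """MD5 of lowercased, whitespace-collapsed text (hypothetical normalization)."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


# Hypothetical pages: /a and /b serve identical bytes, /c differs slightly.
pages = {
    "http://example.com/a": b"<html><body>Hello Nutch</body></html>",
    "http://example.com/b": b"<html><body>Hello Nutch</body></html>",
    "http://example.com/c": b"<html><body>Hello  nutch</body></html>",
}

# The URL is not part of the hash, so identical content at different URLs
# yields the same signature and can be treated as a duplicate.
signatures = {url: exact_signature(body) for url, body in pages.items()}
```

With an exact signature, /a and /b collapse to one document while /c stays separate. A signature over normalized text would tolerate small differences such as case or spacing, at the cost of occasionally merging pages that are not truly identical.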