Posted to dev@nutch.apache.org by Doug Cutting <cu...@nutch.org> on 2006/01/02 19:39:30 UTC

Re: svn commit: r359822 - in /lucene/nutch/trunk: bin/ conf/ src/java/org/apache/nutch/crawl/ src/java/org/apache/nutch/fetcher/ src/java/org/apache/nutch/indexer/ src/java/org/apache/nutch/parse/ src/java/org/apache/nutch/segment/ src/java/org/apache/nutc...

ab@apache.org wrote:
> Now users can select their own page signature implementation, possibly
> with better properties than the old one.
> 
> Two implementations are provided:
> 
> * MD5Signature: backward-compatible with the old schema.
> 
> * TextProfileSignature: an example implementation of a signature, which
>   gives the same values for near-duplicate pages. Please see Javadoc for
>   more information.

This looks great!  Thanks!

Shouldn't this also be used in DeleteDuplicates.java?

Doug

Posted by Andrzej Bialecki <ab...@getopt.org>.
Doug Cutting wrote:

> ab@apache.org wrote:
>
>> Now users can select their own page signature implementation, possibly
>> with better properties than the old one.
>>
>> Two implementations are provided:
>>
>> * MD5Signature: backward-compatible with the old schema.
>>
>> * TextProfileSignature: an example implementation of a signature, which
>>   gives the same values for near-duplicate pages. Please see Javadoc for
>>   more information.
>
>
> This looks great!  Thanks!
>
> Shouldn't this also be used in DeleteDuplicates.java?


Yes, I missed that. No harm done (yet), because the two existing 
implementations both produce an MD5 digest, just differently. I'll fix it.
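
A minimal sketch of what "an MD5 digest, just differently" could look like. This is not the Nutch code: the class SignatureSketch, the interface PageSignature, and the fields EXACT and PROFILE are hypothetical names, and the token profile here is far cruder than what the TextProfileSignature Javadoc describes. It only shows that both approaches end in a 16-byte MD5 value, while the profile-based one maps near-duplicate text to the same value.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class SignatureSketch {

    /** Hypothetical stand-in for a pluggable page signature (not the Nutch API). */
    public interface PageSignature {
        byte[] calculate(String pageText);
    }

    /** MD5 over the UTF-8 bytes of a string; always 16 bytes long. */
    public static byte[] md5(String s) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    /** Exact signature: any change in the text changes the digest. */
    public static final PageSignature EXACT = SignatureSketch::md5;

    /**
     * Profile signature (assumed, simplified): lowercase the text, split on
     * anything that is not a letter or digit, drop tokens shorter than three
     * characters, count the rest, and hash the sorted token/count table.
     */
    public static final PageSignature PROFILE = text -> {
        Map<String, Integer> counts = new TreeMap<>();
        for (String tok : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (tok.length() >= 3) {
                counts.merge(tok, 1, Integer::sum);
            }
        }
        return md5(counts.toString());
    };

    public static void main(String[] args) {
        String a = "Nutch is an open source crawler.";
        String b = "NUTCH is an  open-source crawler!";
        System.out.println("exact equal:   "
                + Arrays.equals(EXACT.calculate(a), EXACT.calculate(b)));
        System.out.println("profile equal: "
                + Arrays.equals(PROFILE.calculate(a), PROFILE.calculate(b)));
    }
}
```

Since both results have the same type and length, code that compares raw MD5 values keeps working with either one, but it only gets near-duplicate detection if it asks the configured implementation for the value.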

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com