Posted to dev@nutch.apache.org by Doug Cutting <cu...@nutch.org> on 2006/01/02 19:39:30 UTC
Re: svn commit: r359822 - in /lucene/nutch/trunk: bin/ conf/ src/java/org/apache/nutch/crawl/
src/java/org/apache/nutch/fetcher/ src/java/org/apache/nutch/indexer/ src/java/org/apache/nutch/parse/
src/java/org/apache/nutch/segment/ src/java/org/apache/nutc...
ab@apache.org wrote:
> Now users can select their own page signature implementation, possibly
> with better properties than the old one.
>
> Two implementations are provided:
>
> * MD5Signature: backward-compatible with the old schema.
>
> * TextProfileSignature: an example implementation of a signature, which
> gives the same values for near-duplicate pages. Please see Javadoc for
> more information.
This looks great! Thanks!
Shouldn't this also be used in DeleteDuplicates.java?
Doug
Posted by Andrzej Bialecki <ab...@getopt.org>.
Doug Cutting wrote:
> ab@apache.org wrote:
>
>> Now users can select their own page signature implementation, possibly
>> with better properties than the old one.
>>
>> Two implementations are provided:
>>
>> * MD5Signature: backward-compatible with the old schema.
>>
>> * TextProfileSignature: an example implementation of a signature, which
>> gives the same values for near-duplicate pages. Please see Javadoc for
>> more information.
>
> This looks great! Thanks!
>
> Shouldn't this also be used in DeleteDuplicates.java?
Yes, I missed that. No harm done (yet), because the two existing
implementations both produce an MD5 digest, just differently. I'll fix it.
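
A minimal sketch of the point about both implementations yielding an MD5 digest by different routes. This is not the Nutch code: SignatureSketch, PageSignature, EXACT and PROFILE are hypothetical names, and the profile used here (lowercased terms of three or more characters, deduplicated and sorted) is an assumed simplification. The Javadoc describes what TextProfileSignature actually does.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; these names are not part of Nutch.
final class SignatureSketch {

    /** Assumed plug-in point: any strategy that maps page text to a digest. */
    public interface PageSignature {
        byte[] calculate(String text);
    }

    /** MD5 of the UTF-8 bytes of a string; always 16 bytes long. */
    static byte[] md5(String s) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** Exact variant: digest of the text as given, so any change alters it. */
    public static final PageSignature EXACT = SignatureSketch::md5;

    /**
     * Profile variant (assumed simplification): digest of the sorted set of
     * lowercased terms, so case, punctuation, spacing, word order and very
     * short words do not affect the result.
     */
    public static final PageSignature PROFILE = text -> {
        Set<String> terms = new TreeSet<>();
        for (String tok : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (tok.length() >= 3) {
                terms.add(tok);
            }
        }
        return md5(String.join(" ", terms));
    };

    public static void main(String[] args) {
        String a = "Hello, world! Welcome to Nutch.";
        String b = "hello world -- welcome to   NUTCH";
        System.out.println("exact match:   "
                + Arrays.equals(EXACT.calculate(a), EXACT.calculate(b)));
        System.out.println("profile match: "
                + Arrays.equals(PROFILE.calculate(a), PROFILE.calculate(b)));
    }
}
```

Both variants return a value of the same size and kind, so code that only compares digests would presumably keep working with either one. Only the notion of "same page" differs.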
--
Best regards,
Andrzej Bialecki <><