Posted to dev@nutch.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2010/07/01 14:12:51 UTC

[jira] Resolved: (NUTCH-835) document deduplication (exact duplicates) failed using MD5Signature

     [ https://issues.apache.org/jira/browse/NUTCH-835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  resolved NUTCH-835.
-------------------------------------

         Assignee: Andrzej Bialecki 
    Fix Version/s: 2.0
       Resolution: Fixed

Fixed in rev. 959629. Thanks!

> document deduplication (exact duplicates) failed using MD5Signature
> -------------------------------------------------------------------
>
>                 Key: NUTCH-835
>                 URL: https://issues.apache.org/jira/browse/NUTCH-835
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.0.0, 1.1
>         Environment: Linux, Ubuntu 10.04, Java 1.6.0_20
>            Reporter: Sebastian Nagel
>            Assignee: Andrzej Bialecki 
>             Fix For: 2.0
>
>
> The MD5Signature class calculates different signatures for identical documents.
> The reason is that
>   byte[] data = content.getContent();
>   ... StringBuilder().append(data) ...
> uses java.lang.Object.toString() to get a string representation of the (binary) content.
> For an array this is the type name plus the identity hash code (e.g., [B@30dc9065),
> which generally differs even for two byte arrays with identical content.
> A solution would be to take the MD5 sum of the binary content as the first part of the
> final signature calculation (the parsed content is the second part):
>   ... .append(StringUtil.toHexString(MD5Hash.digest(data).getDigest())).append(parse.getText());
> Of course, there are many other solutions...
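
The problem and the proposed fix described above can be reproduced with the JDK alone. The following is a minimal sketch, not the code committed in rev. 959629: the class SignatureSketch and the helpers toHex, md5, buggySignature and fixedSignature are hypothetical names used only for illustration, with toHex and md5 standing in for Nutch's StringUtil.toHexString and Hadoop's MD5Hash.digest(...).getDigest(). Encoding the combined string as UTF-8 before hashing is likewise an assumption of this sketch.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class SignatureSketch {

    // Hex-encode a byte array (hypothetical stand-in for StringUtil.toHexString).
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // MD5 digest of a byte array (hypothetical stand-in for MD5Hash.digest(data).getDigest()).
    static byte[] md5(byte[] input) {
        try {
            return MessageDigest.getInstance("MD5").digest(input);
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Reported behaviour: append(byte[]) resolves to append(Object), so the
    // string contains "[B@<identity hash>" instead of the actual content.
    static String buggySignature(byte[] data, String text) {
        String combined = new StringBuilder().append(data).append(text).toString();
        return toHex(md5(combined.getBytes(StandardCharsets.UTF_8)));
    }

    // Proposed behaviour: hash the binary content first, then append the parsed text.
    static String fixedSignature(byte[] data, String text) {
        String combined = new StringBuilder().append(toHex(md5(data))).append(text).toString();
        return toHex(md5(combined.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) {
        byte[] a = "same content".getBytes(StandardCharsets.UTF_8);
        byte[] b = "same content".getBytes(StandardCharsets.UTF_8);

        // What the buggy variant actually hashes: something like "[B@30dc9065".
        System.out.println(new StringBuilder().append(a));
        System.out.println(new StringBuilder().append(b));

        // Usually differ, because they depend on object identity.
        System.out.println(buggySignature(a, "parsed text"));
        System.out.println(buggySignature(b, "parsed text"));

        // Equal, because they depend only on the content.
        System.out.println(fixedSignature(a, "parsed text"));
        System.out.println(fixedSignature(b, "parsed text"));
    }
}
```

Hashing the raw bytes before building the string keeps the signature a function of the content only, which is what exact-duplicate detection needs. As the report notes, other approaches would work too, for example feeding the bytes and the text into a single digest.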
