Posted to dev@nutch.apache.org by "Sebastian Nagel (JIRA)" <ji...@apache.org> on 2010/06/25 13:01:54 UTC
[jira] Created: (NUTCH-835) document deduplication (exact
duplicates) failed using MD5Signature
document deduplication (exact duplicates) failed using MD5Signature
-------------------------------------------------------------------
Key: NUTCH-835
URL: https://issues.apache.org/jira/browse/NUTCH-835
Project: Nutch
Issue Type: Bug
Affects Versions: 1.1, 1.0.0
Environment: Linux, Ubuntu 10.04, Java 1.6.0_20
Reporter: Sebastian Nagel
The MD5Signature class calculates different signatures for identical documents.
The reason is that
    byte[] data = content.getContent();
    ... StringBuilder().append(data) ...
resolves to StringBuilder.append(Object) and therefore uses java.lang.Object.toString()
to get a string representation of the (binary) content. For an array this is the type
name plus the identity hash code (e.g., [B@30dc9065), which generally differs even for
two byte arrays with identical content.
A solution would be to take the MD5 sum of the binary content as the first part of the
final signature calculation (the parsed content is the second part):
    ... .append(StringUtil.toHexString(MD5Hash.digest(data).getDigest())).append(parse.getText());
Of course, there are many other solutions...
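The effect can be reproduced outside Nutch. The following is a minimal sketch, not the MD5Signature source: it uses java.security.MessageDigest in place of Hadoop's MD5Hash, and the class name SignatureSketch and the helpers toHex, md5, brokenInput, fixedInput and signature are hypothetical names introduced only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class SignatureSketch {

    /** Hex-encodes bytes (hypothetical stand-in for a helper like StringUtil.toHexString). */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** MD5 digest via the JDK (stand-in for MD5Hash.digest(data).getDigest()). */
    static byte[] md5(byte[] input) {
        try {
            return MessageDigest.getInstance("MD5").digest(input);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Reported behaviour: append(Object) stringifies the array reference, not its bytes. */
    static String brokenInput(byte[] data, String text) {
        return new StringBuilder().append(data).append(text).toString();
    }

    /** Proposed variant: hex MD5 of the raw bytes first, parsed text second. */
    static String fixedInput(byte[] data, String text) {
        return new StringBuilder().append(toHex(md5(data))).append(text).toString();
    }

    /** Final signature over the assembled string. */
    static String signature(String input) {
        return toHex(md5(input.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) {
        byte[] a = "same page".getBytes(StandardCharsets.UTF_8);
        byte[] b = "same page".getBytes(StandardCharsets.UTF_8);
        String text = "same page";
        // Each line starts with "[B@" followed by an identity hash code.
        System.out.println(brokenInput(a, text));
        System.out.println(brokenInput(b, text));
        // Compares the signatures built from the content-based input.
        System.out.println(signature(fixedInput(a, text)).equals(signature(fixedInput(b, text))));
    }
}
```

With brokenInput the two arrays contribute their identity hash codes, so the resulting signatures normally differ from run to run and from array to array. With fixedInput only the bytes and the parsed text matter.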
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-835) document deduplication (exact
duplicates) failed using MD5Signature
Posted by "Hudson (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884540#action_12884540 ]
Hudson commented on NUTCH-835:
------------------------------
Integrated in Nutch-trunk #1195 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1195/])
NUTCH-835 Document deduplication failed using MD5Signature (Sebastian Nagel via ab)
[jira] Commented: (NUTCH-835) document deduplication (exact
duplicates) failed using MD5Signature
Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884630#action_12884630 ]
Andrzej Bialecki commented on NUTCH-835:
-----------------------------------------
Sorry, I should've been more precise - I committed this to branch-1.2 as well (r95963).
[jira] Commented: (NUTCH-835) document deduplication (exact
duplicates) failed using MD5Signature
Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884255#action_12884255 ]
Andrzej Bialecki commented on NUTCH-835:
-----------------------------------------
Yes, this is a bug. In fact the implementation makes things even worse by appending the parsed text, contrary to its specification that says it should use just the raw content... I'll fix this shortly.
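As an illustration of "just the raw content", here is a rough sketch of a signature computed from the fetched bytes alone. It is not the committed patch: the class RawContentSignature and its methods calculate and sameDocument are hypothetical, and java.security.MessageDigest is used instead of Hadoop's MD5Hash to keep the example self-contained.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

class RawContentSignature {

    /** Signature over the fetched bytes only; the parsed text plays no part. */
    static byte[] calculate(byte[] rawContent) {
        try {
            return MessageDigest.getInstance("MD5").digest(rawContent);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Digests are byte arrays too, so compare them by content, not by reference. */
    static boolean sameDocument(byte[] rawA, byte[] rawB) {
        return Arrays.equals(calculate(rawA), calculate(rawB));
    }

    public static void main(String[] args) {
        byte[] first = "<html><body>hello</body></html>".getBytes(StandardCharsets.UTF_8);
        byte[] second = first.clone();
        byte[] other = "<html><body>bye</body></html>".getBytes(StandardCharsets.UTF_8);
        System.out.println(sameDocument(first, second));
        System.out.println(sameDocument(first, other));
    }
}
```

A raw-content signature of this kind only matches byte-identical documents. Pages with the same visible text but different markup would not be treated as exact duplicates.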
[jira] Updated: (NUTCH-835) document deduplication (exact
duplicates) failed using MD5Signature
Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrzej Bialecki updated NUTCH-835:
------------------------------------
Fix Version/s: 1.2
[jira] Resolved: (NUTCH-835) document deduplication (exact
duplicates) failed using MD5Signature
Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrzej Bialecki resolved NUTCH-835.
-------------------------------------
Assignee: Andrzej Bialecki
Fix Version/s: 2.0
Resolution: Fixed
Fixed in rev. 959629. Thanks!
[jira] Commented: (NUTCH-835) document deduplication (exact
duplicates) failed using MD5Signature
Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884624#action_12884624 ]
Julien Nioche commented on NUTCH-835:
-------------------------------------
This patch has been marked for 1.2 but has been committed to trunk only (2.0).
Shall we also apply it to /nutch/branches/branch-1.2?