Posted to dev@nutch.apache.org by "David Johnson (JIRA)" <ji...@apache.org> on 2017/06/05 18:09:04 UTC
[jira] [Created] (NUTCH-2391) Spurious Duplications for MD5
David Johnson created NUTCH-2391:
------------------------------------
Summary: Spurious Duplications for MD5
Key: NUTCH-2391
URL: https://issues.apache.org/jira/browse/NUTCH-2391
Project: Nutch
Issue Type: Bug
Components: commoncrawl
Affects Versions: 1.11
Reporter: David Johnson
Priority: Minor
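We're occasionally seeing a large number of documents being spuriously marked as duplicates in our crawl.
We traced it back to one of the crawl plugins returning an empty array for the content field. Because the content is empty but not null, the URL fallback is never used, so every such document gets the same MD5 signature.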
We'd like to propose changing the MD5 signature generation from:
public byte[] calculate(Content content, Parse parse) {
  byte[] data = content.getContent();
  if (data == null)
    data = content.getUrl().getBytes();
  return MD5Hash.digest(data).getDigest();
}
to:
public byte[] calculate(Content content, Parse parse) {
  byte[] data = content.getContent();
  if ((data == null) || (data.length == 0))
    data = content.getUrl().getBytes();
  return MD5Hash.digest(data).getDigest();
}
to address the issue.
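
For illustration, here is a minimal standalone sketch of the two behaviours. It is not the Nutch code itself: the class name SignatureSketch and both method names are hypothetical, java.security.MessageDigest stands in for Hadoop's MD5Hash, and the URL is encoded as UTF-8 here, whereas the snippets above use the platform default charset. The point it shows is that MD5 of an empty array is always the same value (d41d8cd98f00b204e9800998ecf8427e), so without the length check two unrelated URLs with empty content end up with identical signatures.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical stand-in for the signature logic discussed above.
final class SignatureSketch {

  // Plain MD5 over a byte array (assumed equivalent to
  // MD5Hash.digest(data).getDigest() for the purposes of this sketch).
  static byte[] md5(byte[] data) {
    try {
      return MessageDigest.getInstance("MD5").digest(data);
    } catch (NoSuchAlgorithmException e) {
      // MD5 is required on every Java platform, so this should not happen.
      throw new IllegalStateException(e);
    }
  }

  // Current behaviour: fall back to the URL only when content is null.
  static byte[] currentSignature(byte[] content, String url) {
    byte[] data = content;
    if (data == null)
      data = url.getBytes(StandardCharsets.UTF_8);
    return md5(data);
  }

  // Proposed behaviour: also fall back to the URL when content is empty.
  static byte[] proposedSignature(byte[] content, String url) {
    byte[] data = content;
    if ((data == null) || (data.length == 0))
      data = url.getBytes(StandardCharsets.UTF_8);
    return md5(data);
  }
}
```

With an empty array as content, currentSignature returns the same bytes for any two URLs, while proposedSignature returns different bytes for different URLs.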
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)