Posted to dev@nutch.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2007/01/11 23:02:27 UTC

[jira] Closed: (NUTCH-420) DeleteDuplicates.HashPartitioner depends on the order of IndexDocs

     [ https://issues.apache.org/jira/browse/NUTCH-420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  closed NUTCH-420.
-----------------------------------

       Resolution: Fixed
    Fix Version/s: 0.9.0
         Assignee: Andrzej Bialecki 

Fixed in rev. 495397. Thank you!

> DeleteDuplicates.HashPartitioner depends on the order of IndexDocs
> ------------------------------------------------------------------
>
>                 Key: NUTCH-420
>                 URL: https://issues.apache.org/jira/browse/NUTCH-420
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 0.9.0
>            Reporter: Dogacan Güney
>         Assigned To: Andrzej Bialecki 
>            Priority: Minor
>             Fix For: 0.9.0
>
>         Attachments: dedup-v2.patch, dedup-v3.patch, dedup.patch, index.tar.gz
>
>
> DeleteDuplicates.HashPartitioner.reduce():
> // byScore case
> if (value.score > highest.score) {
>   highest.keep = false;
>   LOG.debug("-discard " + highest + ", keep " + value);
>   output.collect(highest.url, highest);     // delete highest
>   highest = value;
> }
> // !byScore is also similar
> So assume two docs with the same hash are in values. If the second has a higher score than the first, the first doc will be deleted. But if the second has a lower or equal score, neither will be deleted. AFAICS, there should be an else branch that deletes value and keeps highest as it is.
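
The else branch proposed in the report can be sketched roughly as below. This is only an illustration of the idea, not the change committed in rev. 495397: DedupSketch, discardDuplicates and the cut-down IndexDoc are hypothetical stand-ins, the Hadoop OutputCollector is replaced by a returned list, and keeping the first-seen doc on a score tie is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

class DedupSketch {
  /** Hypothetical stand-in for Nutch's IndexDoc; only the fields used here. */
  static final class IndexDoc {
    final String url;
    final float score;
    boolean keep = true;

    IndexDoc(String url, float score) {
      this.url = url;
      this.score = score;
    }
  }

  /**
   * byScore case only. Takes all docs that share one content hash and
   * returns the docs to discard (each marked keep = false), so that exactly
   * one doc survives regardless of the order of the input.
   */
  static List<IndexDoc> discardDuplicates(List<IndexDoc> values) {
    List<IndexDoc> discarded = new ArrayList<>();
    IndexDoc highest = null;
    for (IndexDoc value : values) {
      if (highest == null) {
        highest = value;          // first doc seen is the initial candidate
        continue;
      }
      IndexDoc loser;
      if (value.score > highest.score) {
        loser = highest;          // existing branch: the old candidate loses
        highest = value;
      } else {
        loser = value;            // proposed else branch: the new doc loses
      }
      loser.keep = false;
      discarded.add(loser);       // stands in for output.collect(loser.url, loser)
    }
    return discarded;
  }
}
```

With two docs of the same hash, the lower-scoring one ends up in the returned list whether it arrives first or second, which is the order independence the issue title asks for.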
