Posted to dev@nutch.apache.org by "Sebastian Nagel (JIRA)" <ji...@apache.org> on 2018/06/28 09:39:00 UTC

[jira] [Updated] (NUTCH-2237) DeduplicationJob: Add extra order criteria based on slug

     [ https://issues.apache.org/jira/browse/NUTCH-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-2237:
-----------------------------------
    Fix Version/s:     (was: 1.15)
                   1.16

> DeduplicationJob: Add extra order criteria based on slug
> --------------------------------------------------------
>
>                 Key: NUTCH-2237
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2237
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Ron van der Vegt
>            Priority: Major
>             Fix For: 1.16
>
>         Attachments: NUTCH-2237.patch, NUTCH-2237.patch
>
>
> Currently, users can select the main document among documents with identical signatures based on score, URL length, and fetch time. The quality of the slug, judged mainly by the number of meaningful characters, could give users more flexibility to distinguish between slugified URLs and URLs based on a page ID.
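
To make the proposed criterion concrete, here is a minimal sketch of one way a slug could be scored. It is not taken from the attached patch: the class name SlugQuality, the methods lastSegment, score and preferred, and the choice to count letters in the last path segment as "meaningful characters" are all assumptions made for illustration.

```java
import java.net.URI;

// Hypothetical helper, not part of Nutch or of NUTCH-2237.patch.
public class SlugQuality {

  /** Returns the last non-empty path segment of the URL, or "" if there is none. */
  static String lastSegment(String url) {
    String path = URI.create(url).getPath();
    if (path == null) {
      return "";
    }
    String[] parts = path.split("/");
    for (int i = parts.length - 1; i >= 0; i--) {
      if (!parts[i].isEmpty()) {
        return parts[i];
      }
    }
    return "";
  }

  /** Assumed scoring rule: count the letters in the last path segment. */
  static int score(String url) {
    int letters = 0;
    for (char c : lastSegment(url).toCharArray()) {
      if (Character.isLetter(c)) {
        letters++;
      }
    }
    return letters;
  }

  /** Picks the URL with the higher slug score; on a tie the first URL is kept. */
  static String preferred(String a, String b) {
    return score(b) > score(a) ? b : a;
  }

  public static void main(String[] args) {
    String slugified = "https://example.com/blog/how-to-configure-nutch";
    String byPageId = "https://example.com/node/12345";
    System.out.println(preferred(byPageId, slugified));
  }
}
```

Under this assumed rule, a URL ending in a descriptive slug outranks one ending in a numeric page ID. A real implementation would still need to decide how this criterion is ordered relative to score, URL length, and fetch time.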



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)