Posted to dev@nutch.apache.org by "Sebastian Nagel (JIRA)" <ji...@apache.org> on 2016/03/03 21:46:18 UTC

[jira] [Commented] (NUTCH-2237) DeduplicationJob: Add extra order criteria based on slug

    [ https://issues.apache.org/jira/browse/NUTCH-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15178587#comment-15178587 ] 

Sebastian Nagel commented on NUTCH-2237:
----------------------------------------

Good idea! Nice patch, including unit tests. A few comments for possible improvements:
* maybe URLUtil.java would be the better place for the slug functions, next to chooseRepr(...), which provides similar functionality
* URLs are now always decoded, even when the decision which URL/document to keep is made solely by comparing score or fetch time. Since decoding URLs isn't a cheap computation,
*# it should be done lazily, and
*# the result could be cached for later comparisons if there are more than 2 duplicates. Caching would improve on the current state, but it should cover both the decoded URL string and the slug length.
* Is it safe to first decode the URL string and then parse the resulting string as a URL? After decoding, the string may contain forbidden or reserved characters, so the URL path and query may not be parsed properly.
* no branch of this if clause is reachable given that compareUrlSlug(...) returns -1, 0, or 1:
{code}
if (compareUrlSlug(urlExisting, urlnewDoc) > 1) {
  // mark new one as duplicate
  ...
} else if (compareUrlSlug(urlnewDoc, urlExisting) > 1) {
{code}
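To illustrate the last point together with the lazy decoding suggested above, here is a minimal sketch, not the code from the patch. The names CachedUrl, SlugOrder and keep(...) are hypothetical, and the slug measure (number of letters in the last path segment) is only an assumption made for illustration. What matters is that the decoded string and the slug length are computed at most once, and that the result of compareUrlSlug(...) is tested against 0 rather than 1.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

/** Hypothetical wrapper: decodes a URL at most once, and only on demand. */
class CachedUrl {
  private final String raw;
  private String decoded;      // cached decoded form, null until first use
  private int slugLength = -1; // cached slug length, -1 until first use
  int decodeCalls = 0;         // instrumentation for this sketch only

  CachedUrl(String raw) { this.raw = raw; }

  String decoded() {
    if (decoded == null) {
      decodeCalls++;
      try {
        decoded = URLDecoder.decode(raw, "UTF-8");
      } catch (UnsupportedEncodingException | IllegalArgumentException e) {
        decoded = raw; // fall back to the raw string on malformed escapes
      }
    }
    return decoded;
  }

  /** Assumed slug measure: number of letters in the last path segment. */
  int slugLength() {
    if (slugLength < 0) {
      String d = decoded();
      String last = d.substring(d.lastIndexOf('/') + 1);
      int n = 0;
      for (int i = 0; i < last.length(); i++) {
        if (Character.isLetter(last.charAt(i))) n++;
      }
      slugLength = n;
    }
    return slugLength;
  }
}

class SlugOrder {
  /** Returns -1, 0, or 1; a positive result means a has the better slug. */
  static int compareUrlSlug(CachedUrl a, CachedUrl b) {
    return Integer.signum(Integer.compare(a.slugLength(), b.slugLength()));
  }

  /** Picks the URL to keep; the result is tested against 0, not 1. */
  static CachedUrl keep(CachedUrl existing, CachedUrl newDoc) {
    int cmp = compareUrlSlug(existing, newDoc);
    if (cmp > 0) return existing; // mark new one as duplicate
    if (cmp < 0) return newDoc;   // mark existing one as duplicate
    return existing;              // tie: left to the remaining criteria
  }
}
```

With more than two duplicates, each CachedUrl is decoded only once no matter how many comparisons it takes part in, and not at all if score or fetch time already decide.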

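The concern about decoding before parsing can be reproduced with the JDK alone. The following sketch uses made-up URLs on example.com; DecodeOrder and its methods are hypothetical names, not part of Nutch. It compares decoding the whole URL string before parsing with parsing first and decoding only the path component.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;

class DecodeOrder {
  static String decode(String s) {
    try {
      return URLDecoder.decode(s, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException(e); // UTF-8 is always available
    }
  }

  /** Decode the whole URL string first, then parse it. */
  static String pathDecodeFirst(String url) {
    return URI.create(decode(url)).getRawPath();
  }

  /** Parse first, then decode only the path component. */
  static String pathParseFirst(String url) {
    return decode(URI.create(url).getRawPath());
  }

  /** True if the fully decoded string can still be parsed as a URI. */
  static boolean parsesAfterDecoding(String url) {
    try {
      URI.create(decode(url));
      return true;
    } catch (IllegalArgumentException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    String url = "http://example.com/a%3Fb?x=1";
    System.out.println(pathDecodeFirst(url)); // prints /a
    System.out.println(pathParseFirst(url));  // prints /a?b
    // A decoded space is not a legal URI character, so parsing fails.
    System.out.println(parsesAfterDecoding("http://example.com/a%20b")); // prints false
  }
}
```

In the first case the decoded '?' moves the boundary between path and query, so a slug computed from the path would be based on the wrong segment.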

> DeduplicationJob: Add extra order criteria based on slug
> --------------------------------------------------------
>
>                 Key: NUTCH-2237
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2237
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Ron van der Vegt
>             Fix For: 1.12
>
>         Attachments: NUTCH-2237.patch
>
>
> Currently the user can select the main document among documents with the same signature based on score, URL length and fetch time. The quality of the slug, based mainly on the number of meaningful characters, could give users more flexibility to distinguish between slugified URLs and URLs based on page ID.


