Posted to dev@nutch.apache.org by "Sebastian Nagel (JIRA)" <ji...@apache.org> on 2014/04/09 16:30:24 UTC

[jira] [Commented] (NUTCH-710) Support for rel="canonical" attribute

    [ https://issues.apache.org/jira/browse/NUTCH-710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964202#comment-13964202 ] 

Sebastian Nagel commented on NUTCH-710:
---------------------------------------

Thanks, [~Sertac Turkel]! My comments:
* every page containing a canonical link is now rejected. That's a rather harsh decision. It should be configurable whether pages containing correct (non-empty, not self-referential, etc.) canonical links
*# are unconditionally rejected
*# are removed later only if the target is indexed. This is close to deduplication, and it's what canonical links are intended for: giving webmasters a chance to support and influence deduplication.
*# are only recorded (as outlinks and/or as indexed fields)
This point is the most challenging one: you need to take care of all the nasty situations found "in the wild", e.g. a canonical link pointing to a redirect which leads you back to the current page, etc. Chains of canonical links have to be "resolved" in combination with redirects, see Julien's comment and [1|http://mail-archives.apache.org/mod_mbox/nutch-user/201203.mbox/%3CCA+-fM0sg=rvuNxzoez5NLFmhNJHta=qP5qHTfRJ8ii55fB2mJA@mail.gmail.com%3E].
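To illustrate what "resolving" such chains could look like, here is a minimal sketch. It is not taken from the patch or from Nutch: the class {{CanonicalResolver}}, its two in-memory maps and the {{maxHops}} limit are assumptions made up for this example. It follows redirects and canonical links from a start URL and gives up if the chain loops back or gets too long.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Hypothetical helper (not part of Nutch): follows canonical links and redirects. */
final class CanonicalResolver {
    // Assumed inputs: what a crawler might have recorded for each URL.
    private final Map<String, String> canonical = new HashMap<>(); // page -> rel="canonical" target
    private final Map<String, String> redirect = new HashMap<>();  // page -> redirect target
    private final int maxHops;

    CanonicalResolver(int maxHops) {
        this.maxHops = maxHops;
    }

    void addCanonical(String from, String to) {
        canonical.put(from, to);
    }

    void addRedirect(String from, String to) {
        redirect.put(from, to);
    }

    /**
     * Returns the final target reached from start, or empty if start has no
     * usable canonical link/redirect, the chain contains a cycle (including one
     * leading back to start), or the chain is longer than maxHops.
     */
    Optional<String> resolve(String start) {
        Set<String> seen = new HashSet<>();
        seen.add(start);
        String current = start;
        int hops = 0;
        while (true) {
            // A redirect takes precedence: a redirected URL has no content of its own.
            String next = redirect.containsKey(current)
                    ? redirect.get(current)
                    : canonical.get(current);
            if (next == null || next.equals(current)) {
                // End of chain. If we never left start, there is nothing to resolve.
                return current.equals(start) ? Optional.empty() : Optional.of(current);
            }
            if (++hops > maxHops || !seen.add(next)) {
                return Optional.empty(); // too long, or a URL was visited twice
            }
            current = next;
        }
    }
}
```

With this sketch, a page whose canonical link points to a redirect leading back to the page itself resolves to "empty", so the page would be kept rather than rejected. Whether a redirect should take precedence over a canonical link, and what hop limit is sensible, are design choices left open here.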
* is it really necessary to handle canonical links explicitly in DbUpdateMapper and mark them as injected? Couldn't this be done by simply adding them as outlinks? By default, links of "link" elements are added as outlinks, cf. parser.html.outlinks.ignore_tags. Of course, canonical links should be added even if "link" elements are ignored.
* extraction of canonical links: at least the following points are missing: relative URLs, and canonical links inside HTTP headers (required for anything which is not HTML). I'll try to support you on this point because there's already some work done.
* keep the names of the test classes in parallel?
{code}src/plugin/parse-html/.../TestDOMContentUtils.java
src/plugin/parse-tika/.../DOMContentUtilsTest.java
{code}
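
Regarding the two missing extraction cases (relative URLs and canonical links in HTTP headers), a rough sketch of the idea follows. It is an assumption-laden illustration, not the existing work mentioned above: {{CanonicalLinkExtractor}} and its methods are hypothetical names, the header parsing is a simplified regular expression rather than a full parser for the Link header syntax, and "self-referential" is checked by plain string comparison.

```java
import java.net.URI;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Nutch): extracts and resolves canonical links. */
final class CanonicalLinkExtractor {
    // One link-value of a Link header: <target> followed by its parameters.
    // Simplification: parameters are assumed not to contain '<'.
    private static final Pattern LINK_VALUE = Pattern.compile("<([^>]*)>([^<]*)");
    // The rel parameter, either quoted or as a bare token.
    private static final Pattern REL = Pattern.compile(
            ";\\s*rel\\s*=\\s*(?:\"([^\"]*)\"|([^\\s;,\"]+))", Pattern.CASE_INSENSITIVE);

    private CanonicalLinkExtractor() {
    }

    /** Resolves a (possibly relative) href against the URL of the page. */
    static Optional<String> resolve(String pageUrl, String href) {
        String trimmed = href == null ? "" : href.trim();
        if (trimmed.isEmpty()) {
            return Optional.empty(); // empty canonical links are not usable
        }
        try {
            String target = URI.create(pageUrl).resolve(trimmed).normalize().toString();
            // Drop self-referential links (plain string comparison, an assumption).
            return target.equals(pageUrl) ? Optional.empty() : Optional.of(target);
        } catch (IllegalArgumentException e) {
            return Optional.empty(); // syntactically invalid URL
        }
    }

    /** Looks for rel="canonical" in the value of an HTTP Link header. */
    static Optional<String> fromLinkHeader(String pageUrl, String headerValue) {
        Matcher link = LINK_VALUE.matcher(headerValue);
        while (link.find()) {
            Matcher rel = REL.matcher(link.group(2));
            if (!rel.find()) {
                continue;
            }
            String relValue = rel.group(1) != null ? rel.group(1) : rel.group(2);
            // rel may hold several space-separated relation types.
            for (String type : relValue.trim().split("\\s+")) {
                if (type.equalsIgnoreCase("canonical")) {
                    return resolve(pageUrl, link.group(1));
                }
            }
        }
        return Optional.empty();
    }
}
```

For an HTML page, {{resolve}} would be called with the href of the "link" element (a document-level "base" element is ignored in this sketch). For non-HTML content such as a PDF, {{fromLinkHeader}} would be called with the value of the Link response header.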

... and some useful references:
[http://en.wikipedia.org/wiki/Canonical_link_element]
[http://tools.ietf.org/html/rfc6596]
[https://support.google.com/webmasters/answer/139066]
[http://www.mattcutts.com/blog/rel-canonical-html-head/]
[http://googlewebmastercentral.blogspot.de/2011/06/supporting-relcanonical-http-headers.html]


> Support for rel="canonical" attribute
> -------------------------------------
>
>                 Key: NUTCH-710
>                 URL: https://issues.apache.org/jira/browse/NUTCH-710
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 1.1
>            Reporter: Frank McCown
>            Priority: Minor
>             Fix For: 1.9
>
>         Attachments: NUTCH-710.patch, canonical.patch
>
>
> There is a new rel="canonical" attribute which is
> now being supported by Google, Yahoo, and Live:
> http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
> Adding support for this attribute value will potentially reduce the number of URLs crawled and indexed and reduce duplicate page content.


