Posted to dev@nutch.apache.org by "Julien Nioche (JIRA)" <ji...@apache.org> on 2010/04/19 15:39:53 UTC

[jira] Commented: (NUTCH-710) Support for rel="canonical" attribute

    [ https://issues.apache.org/jira/browse/NUTCH-710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858487#action_12858487 ] 

Julien Nioche commented on NUTCH-710:
-------------------------------------

Shall we treat pages with a canonical tag as a form of redirection? We know there is no point in indexing such a page, and we would be better off making sure that the page it refers to is fetched, parsed and indexed. That would not prevent these entries from being put in the crawlDB, but it should limit the size of the index and, more importantly, improve its quality.
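To make the first option concrete, here is a minimal sketch in plain Java. It is not based on any existing Nutch API: the class CanonicalRedirectSketch and its methods canonicalTarget and shouldSkipIndexing are hypothetical names, and the regex-based extraction is only a stand-in for whatever the HTML parser would really provide. The idea is simply that a page declaring a canonical URL different from its own is skipped at indexing time, and the target is what we would want fetched instead.

```java
import java.net.URI;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: decides whether a page should be
// treated like a redirect because it declares a different canonical URL.
class CanonicalRedirectSketch {

    // Assumption: a regex scan is good enough for illustration; a real
    // implementation would read the <link> elements from the parsed DOM.
    private static final Pattern LINK_TAG =
        Pattern.compile("<link\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern REL_CANONICAL =
        Pattern.compile("\\brel\\s*=\\s*[\"']canonical[\"']", Pattern.CASE_INSENSITIVE);
    private static final Pattern HREF =
        Pattern.compile("\\bhref\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    // Returns the canonical URL declared by the page, resolved against the
    // page's own URL, or empty if none is declared or it cannot be parsed.
    static Optional<String> canonicalTarget(String pageUrl, String html) {
        Matcher tags = LINK_TAG.matcher(html);
        while (tags.find()) {
            String tag = tags.group();
            if (!REL_CANONICAL.matcher(tag).find()) {
                continue; // some other kind of <link>, e.g. a stylesheet
            }
            Matcher href = HREF.matcher(tag);
            if (href.find()) {
                try {
                    URI base = URI.create(pageUrl);
                    return Optional.of(base.resolve(href.group(1).trim()).toString());
                } catch (IllegalArgumentException e) {
                    return Optional.empty(); // malformed URL: ignore the tag
                }
            }
        }
        return Optional.empty();
    }

    // True when the page points at a canonical URL other than itself, i.e.
    // when option 1 would skip indexing it and follow the target instead.
    static boolean shouldSkipIndexing(String pageUrl, String html) {
        return canonicalTarget(pageUrl, html)
            .filter(target -> !target.equals(pageUrl))
            .isPresent();
    }

    public static void main(String[] args) {
        String html = "<html><head><link rel=\"canonical\" href=\"/item\"/></head></html>";
        String url = "http://example.com/item?sort=asc";
        System.out.println(canonicalTarget(url, html));
        System.out.println(shouldSkipIndexing(url, html));
    }
}
```

A page whose canonical URL is its own URL is indexed as usual, so only the variants get dropped.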

Alternatively, we could keep the content of the page for indexing and rely on de-duplication later. This would allow something to be returned in search results even if the target of the canonical tag has not been indexed yet (or does not exist).

The first option would be easier to implement. The second option would require some changes to the DeleteDuplicates and SolrDeleteDuplicates classes.
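For comparison, the second option could be sketched roughly as below. This is only an illustration of the selection rule, under my own assumptions: the CanonicalDedupSketch class, its Doc type and the survivors method are hypothetical, and none of it reflects how DeleteDuplicates or SolrDeleteDuplicates are actually written. Documents are grouped by their canonical URL (or their own URL if they declare none); within a group the canonical target wins if it has been indexed, otherwise the first variant seen is kept so that search still returns something.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the Nutch de-duplication code.
class CanonicalDedupSketch {

    // Minimal stand-in for an indexed document.
    static final class Doc {
        final String url;
        final String canonical; // null when the page declares no canonical URL

        Doc(String url, String canonical) {
            this.url = url;
            this.canonical = canonical;
        }

        // Documents sharing the same key are considered duplicates.
        String key() {
            return canonical != null ? canonical : url;
        }
    }

    // Returns the URLs to keep in the index, one per canonical group.
    static List<String> survivors(List<Doc> indexed) {
        Map<String, Doc> best = new LinkedHashMap<>();
        for (Doc d : indexed) {
            Doc current = best.get(d.key());
            boolean currentIsTarget = current != null && current.url.equals(current.key());
            boolean candidateIsTarget = d.url.equals(d.key());
            // Keep the first document seen for a group, but let the canonical
            // target itself replace a variant that was kept earlier.
            if (current == null || (!currentIsTarget && candidateIsTarget)) {
                best.put(d.key(), d);
            }
        }
        List<String> urls = new ArrayList<>();
        for (Doc d : best.values()) {
            urls.add(d.url);
        }
        return urls;
    }

    public static void main(String[] args) {
        List<Doc> docs = new ArrayList<>();
        docs.add(new Doc("http://example.com/a?s=1", "http://example.com/a"));
        docs.add(new Doc("http://example.com/a", null));
        docs.add(new Doc("http://example.com/b?x=1", "http://example.com/b"));
        System.out.println(survivors(docs));
    }
}
```

In the example in main, the variant of /a is dropped in favour of /a itself, while the variant of /b survives because its canonical target has not been indexed.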

Any thoughts on this?

> Support for rel="canonical" attribute
> -------------------------------------
>
>                 Key: NUTCH-710
>                 URL: https://issues.apache.org/jira/browse/NUTCH-710
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 1.1
>            Reporter: Frank McCown
>            Priority: Minor
>
> There is a new rel="canonical" attribute value which is
> now supported by Google, Yahoo, and Live:
> http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
> Adding support for this attribute value will potentially reduce the number of URLs crawled and indexed and reduce duplicate page content.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.