Posted to dev@nutch.apache.org by "Andrew McCall (JIRA)" <ji...@apache.org> on 2009/02/18 22:00:02 UTC

[jira] Created: (NUTCH-693) Add configurable option for treating nofollow behaviour.

Add configurable option for treating nofollow behaviour.
--------------------------------------------------------

                 Key: NUTCH-693
                 URL: https://issues.apache.org/jira/browse/NUTCH-693
             Project: Nutch
          Issue Type: New Feature
            Reporter: Andrew McCall
            Priority: Minor
         Attachments: nutch.nofollow.patch

For my purposes I'd like to follow links even if they're marked nofollow. Ideally I'd like to follow them but not pass any link juice between them.

I've attached a patch that adds a configuration element, parser.html.outlinks.ignore_nofollow, which allows the parser to ignore the nofollow markers on a page.
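
As a rough illustration of the idea only (this is not the attached patch), a boolean option like this could gate outlink collection as sketched below. ParsedLink and OutlinkFilterSketch are hypothetical names, java.util.Properties stands in for the real Nutch configuration object, the default of false is an assumption, and only rel="nofollow" on individual links is modelled, not a page-level robots meta tag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical stand-in for a parsed anchor; not a Nutch class.
final class ParsedLink {
    final String url;
    final String rel; // raw value of the rel attribute, may be null

    ParsedLink(String url, String rel) {
        this.url = url;
        this.rel = rel;
    }

    // True if any whitespace-separated rel token equals "nofollow".
    boolean isNofollow() {
        if (rel == null) {
            return false;
        }
        for (String token : rel.trim().split("\\s+")) {
            if (token.equalsIgnoreCase("nofollow")) {
                return true;
            }
        }
        return false;
    }
}

final class OutlinkFilterSketch {
    // Property name from the issue description; the default of false is an assumption.
    static final String IGNORE_NOFOLLOW = "parser.html.outlinks.ignore_nofollow";

    // Returns the URLs that would be kept as outlinks under the configured policy.
    static List<String> collectOutlinks(List<ParsedLink> links, Properties conf) {
        boolean ignoreNofollow =
            Boolean.parseBoolean(conf.getProperty(IGNORE_NOFOLLOW, "false"));
        List<String> outlinks = new ArrayList<>();
        for (ParsedLink link : links) {
            if (link.isNofollow() && !ignoreNofollow) {
                continue; // strict behaviour: drop the link entirely
            }
            outlinks.add(link.url);
        }
        return outlinks;
    }
}
```

With the property unset, nofollow links are dropped; setting it to true keeps every link.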

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (NUTCH-693) Add configurable option for treating nofollow behaviour.

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  updated NUTCH-693:
------------------------------------

    Assignee:     (was: Otis Gospodnetic)



[jira] Assigned: (NUTCH-693) Add configurable option for treating nofollow behaviour.

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic reassigned NUTCH-693:
--------------------------------------

    Assignee: Otis Gospodnetic



[jira] Commented: (NUTCH-693) Add configurable option for treating nofollow behaviour.

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12713862#action_12713862 ] 

Otis Gospodnetic commented on NUTCH-693:
----------------------------------------

I think I see some formatting that's a bit off (looks off in the patch itself at least), but more importantly, is everyone OK with allowing this behaviour?

+1 from me -- let the operators decide.




[jira] Commented: (NUTCH-693) Add configurable option for treating nofollow behaviour.

Posted by "Andrew McCall (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847153#action_12847153 ] 

Andrew McCall commented on NUTCH-693:
-------------------------------------

[http://en.wikipedia.org/wiki/Nofollow]

I don't think there is really any consensus on this standard, to be honest. Most search engines don't index nofollow links per se, but they do follow them for crawling. Even Google, who first proposed nofollow, sometimes does follow such links, according to some tests linked in the Wikipedia article. The results show that if the link is already in the index (e.g. it has been followed elsewhere) then it does get followed and indexed.

The nofollow is really just a keyword to point out that the link isn't being endorsed by the author. It's more a content guideline than a strict order for robots to obey. So I disagree that you're breaking standards or creating a robot that's not well behaved by ignoring it.

I would have liked to do a bit more with this, so that I could have respected nofollows but injected the URL as a brand new seed URL, but other commitments took over and I never got around to it. Since the ideal nofollow behaviour is somewhere between ignoring them and not ignoring them, I figured the option to ignore them was a good start and submitted the patch, but I'm not precious about it.
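
The middle ground described above, which was never implemented, might look roughly like the sketch below: nofollow links are kept out of the page's outlinks and collected separately as candidates for injection as fresh seeds, so no link relationship is recorded between the two pages. NofollowSplitSketch and its fields are hypothetical names, and how the candidates would actually be injected is left open.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical result holder; not part of Nutch or of the attached patch.
final class NofollowSplitSketch {
    // Links that stay attached to the source page as regular outlinks.
    final List<String> outlinks = new ArrayList<>();
    // Nofollow links, to be handled as brand new seed URLs instead.
    final List<String> seedCandidates = new ArrayList<>();

    // nofollow.get(i) says whether urls.get(i) was marked nofollow.
    static NofollowSplitSketch split(List<String> urls, List<Boolean> nofollow) {
        if (urls.size() != nofollow.size()) {
            throw new IllegalArgumentException("urls and nofollow flags must align");
        }
        NofollowSplitSketch result = new NofollowSplitSketch();
        for (int i = 0; i < urls.size(); i++) {
            if (nofollow.get(i)) {
                result.seedCandidates.add(urls.get(i));
            } else {
                result.outlinks.add(urls.get(i));
            }
        }
        return result;
    }
}
```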



[jira] Commented: (NUTCH-693) Add configurable option for treating nofollow behaviour.

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847291#action_12847291 ] 

Andrzej Bialecki  commented on NUTCH-693:
-----------------------------------------

Thanks for the pointer to the article. Indeed, the issue is muddy at best. So far Nutch has adhered to a strict interpretation, where links with this attribute are deleted from the page's outlinks immediately (so they are not only not followed but also don't affect out-degree metrics). If there is general agreement in the Nutch community towards relaxing this behavior we can develop this patch further; at the moment I don't see such support. Consequently, I propose to discuss it and, in the meantime, to move this issue to a later release.
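
To make the out-degree point concrete, here is a small sketch assuming a deliberately simplistic model in which a page's score is split equally among its counted outlinks. That model, and the OutDegreeSketch name, are assumptions for illustration and not a description of Nutch's scoring.

```java
// Hypothetical helper; the equal-split scoring model is an assumption.
final class OutDegreeSketch {
    // Score share each counted outlink receives from the source page.
    static double sharePerOutlink(double pageScore, int totalLinks,
                                  int nofollowLinks, boolean ignoreNofollow) {
        // Strict mode deletes nofollow links, so they do not count towards out-degree.
        int counted = ignoreNofollow ? totalLinks : totalLinks - nofollowLinks;
        return counted == 0 ? 0.0 : pageScore / counted;
    }

    public static void main(String[] args) {
        // Page with score 1.0, four links, one of them nofollow.
        System.out.println(sharePerOutlink(1.0, 4, 1, false)); // split over 3 outlinks
        System.out.println(sharePerOutlink(1.0, 4, 1, true));  // split over 4 outlinks
    }
}
```

Under this model, relaxing the behaviour changes not only which pages get fetched but also how much score every other outlink on the page receives.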



[jira] Updated: (NUTCH-693) Add configurable option for treating nofollow behaviour.

Posted by "Andrew McCall (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew McCall updated NUTCH-693:
--------------------------------

    Attachment: nutch.nofollow.patch

Here is the patch.



[jira] Commented: (NUTCH-693) Add configurable option for treating nofollow behaviour.

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847074#action_12847074 ] 

Andrzej Bialecki  commented on NUTCH-693:
-----------------------------------------

This patch is controversial in the sense that a) Nutch strives to adhere to Internet standards and netiquette, which say that robots should obey nofollow, and b) most Nutch users want a well-behaved robot. You are of course free to modify the source as you did. Therefore I think that this functionality is not applicable to the majority of Nutch users, and I vote -1 on including it in Nutch.
