Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2015/03/17 00:19:38 UTC
[jira] [Commented] (NUTCH-1941) Optional rolling http.agent.name's

    [ https://issues.apache.org/jira/browse/NUTCH-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364183#comment-14364183 ] 

Markus Jelsma commented on NUTCH-1941:
--------------------------------------

Hi - yes, this is in essence a useful feature. The question is, similar to disabling the robots.txt check, whether this is what we want in ASF-distributed software. I understand its usefulness for (academic) research purposes and/or potentially clandestine crawls, but I do, again, want to raise the point of whether this is what we want to have in our distribution. So I am +0 for this feature.

Regarding the feature itself, is rotating per time interval the ideal choice for avoiding either clandestine crawl detection or automated systems detecting bots? Do any of you have access to such detection systems, or know how they operate? My gut tells me that a very irregular fetch interval and a much more sophisticated generator (hopefully nothing more than a FetchSchedule impl.) would get us farther, combined, of course, with a rotating UserAgent and probably IP rotation.

Lewis, the hyperlink you reference is a very static approach for blocking bots that actually identify themselves. Their solution is easily mitigated by announcing one's crawler as a regular web browser.

Regarding the patch, it contains a lot of clutter about class paths which I am unfamiliar with. It doesn't look like a trunk patch and I don't remember 2.x having these files. Do we need them?

> Optional rolling http.agent.name's
> ----------------------------------
>
>                 Key: NUTCH-1941
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1941
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher, protocol
>            Reporter: Lewis John McGibbney
>            Priority: Trivial
>         Attachments: nutch.patch
>
>
> In some scenarios, even whilst adhering to fetcher.crawl.delay, web admins can block your fetcher based merely on your crawler name. 
> I propose the ability to implement rolling http.agent.name's which could be substituted every 5 seconds for example. This would mean that successive requests to the same domain would be sent with different http.agent.name. 
> This behavior should be off by default.
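
For illustration only, the time-based rolling described above could be sketched roughly as follows. This is a minimal sketch under stated assumptions: the class RollingAgentName, its constructor arguments and the injected clock are hypothetical names made up for this example, and none of it is taken from nutch.patch or from the Nutch code base.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

/** Hypothetical helper, not part of Nutch or of the attached patch. */
final class RollingAgentName {
    private final List<String> names;
    private final long intervalMillis;
    private final LongSupplier clock; // injected so the rotation can be tested

    RollingAgentName(List<String> names, long intervalMillis, LongSupplier clock) {
        if (names.isEmpty() || intervalMillis <= 0) {
            throw new IllegalArgumentException(
                "need at least one agent name and a positive interval");
        }
        this.names = new ArrayList<>(names); // defensive copy
        this.intervalMillis = intervalMillis;
        this.clock = clock;
    }

    /** Returns the agent name that belongs to the current time slot. */
    String current() {
        // Every intervalMillis a new slot starts; slots cycle through the names.
        long slot = Math.floorDiv(clock.getAsLong(), intervalMillis);
        return names.get((int) Math.floorMod(slot, (long) names.size()));
    }
}
```

With System::currentTimeMillis as the clock and an interval of 5000, current() would switch to the next configured name every 5 seconds. A single configured name degenerates to today's fixed-name behavior, which is one way to keep the feature off by default.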



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)