Posted to user@nutch.apache.org by Nayanish Hinge <na...@gmail.com> on 2010/09/05 13:09:29 UTC

Why was the robots/IP blocking code recently removed from Nutch lib-http?

Hi,
I wanted to understand and use the blockAddr functionality of Nutch lib-http
(HttpBase.java), but I recently found that the whole code has been removed.
It seems that code only worked with the old Fetcher. How is this handled in
the new Fetcher?

http://www.mail-archive.com/commits@nutch.apache.org/msg00152.html
http://svn.apache.org/viewvc/nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java?view=diff&r1=984336&r2=984337&pathrev=984337
https://issues.apache.org/jira/browse/NUTCH-876?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel

My use case:
-------------------
I want to use that kind of 'blocking' for my specific purpose: responsive
throttling based on
1. HTTP status 503
2. the presence of a CAPTCHA etc. in the response

Based on these signals I want to maintain block queues where fetch requests
for a host wait for a delay (which would increase up to some max_delay)
whenever the remote server throttles us; a rough sketch of what I have in
mind is below. I do not wish to change the Nutch core (Fetcher); lib-http is
a better place for me to make changes and deploy a plugin.
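
For what it's worth, here is roughly the kind of helper I am thinking of. It
is only a sketch, not real Nutch code: the class and method names
(HostThrottle, awaitPermission, recordThrottled, recordSuccess) are made up
by me, and my plugin would have to call them around its own request/response
handling after checking the status code and response body for a CAPTCHA.

import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical per-host throttle tracker -- not part of the Nutch API.
 * A lib-http based protocol plugin could consult it before issuing a
 * request and update it after inspecting the response.
 */
public class HostThrottle {

    private static final long INITIAL_DELAY_MS = 5_000L;   // first back-off
    private static final long MAX_DELAY_MS     = 300_000L; // cap (max_delay)

    /** Per-host state: when the host may be contacted again, current back-off. */
    private static final class State {
        volatile long nextAllowedTime = 0L;
        volatile long currentDelayMs  = 0L;
    }

    private final ConcurrentHashMap<String, State> hosts = new ConcurrentHashMap<>();

    /** Block the calling fetch thread until the host is allowed again. */
    public void awaitPermission(String host) throws InterruptedException {
        State s = hosts.computeIfAbsent(host, h -> new State());
        long wait = s.nextAllowedTime - System.currentTimeMillis();
        if (wait > 0) {
            Thread.sleep(wait);
        }
    }

    /** Call when the server signals throttling (HTTP 503, CAPTCHA page, ...). */
    public void recordThrottled(String host) {
        State s = hosts.computeIfAbsent(host, h -> new State());
        long next = (s.currentDelayMs == 0L)
                ? INITIAL_DELAY_MS
                : Math.min(s.currentDelayMs * 2, MAX_DELAY_MS); // exponential back-off, capped
        s.currentDelayMs = next;
        s.nextAllowedTime = System.currentTimeMillis() + next;
    }

    /** Call after a normal response so the back-off resets. */
    public void recordSuccess(String host) {
        State s = hosts.get(host);
        if (s != null) {
            s.currentDelayMs = 0L;
            s.nextAllowedTime = 0L;
        }
    }
}

Whether something like this can live entirely inside the lib-http plugin, or
whether the new Fetcher's own per-host queues get in the way, is exactly what
I am unsure about.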

Could somebody shed some light here? How could I do this?
Thanks
-- 
Nayanish
Hyderabad