Posted to dev@nutch.apache.org by "Lewis John McGibbney (JIRA)" <ji...@apache.org> on 2012/11/03 16:45:12 UTC

[jira] [Comment Edited] (NUTCH-208) http: proxy exception list:

    [ https://issues.apache.org/jira/browse/NUTCH-208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13490042#comment-13490042 ] 

Lewis John McGibbney edited comment on NUTCH-208 at 11/3/12 3:44 PM:
---------------------------------------------------------------------

The attached patches address this issue for trunk and 2.x. They have been used successfully when crawling from behind a university proxy and with a local tinyproxy configuration. Logs from Nutch trunk confirm this:

{code}
lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk_head/runtime/local$ ./bin/nutch fetch crawldb/segment/20121103152653 -threads 5
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2012-11-03 15:27:53
Fetcher: segment: crawldb/segment/20121103152653
Using queue mode : byHost
Fetcher: threads: 5
Fetcher: time-out divisor: 2
QueueFeeder finished: total 3 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
fetching http://www.heraldscotland.com/
Using queue mode : byHost
fetching http://www.theoatmeal.com/
Using queue mode : byHost
fetching http://www.bbc.co.uk/
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=3
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=3
-activeThreads=3, spinWaiting=0, fetchQueues.totalSize=0
fetch of http://www.bbc.co.uk/ failed with: Http code=403, url=http://www.bbc.co.uk/
fetch of http://www.heraldscotland.com/ failed with: Http code=403, url=http://www.heraldscotland.com/
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-11-03 15:28:05, elapsed: 00:00:11
{code}

Here you can see that both "http://www.bbc.co.uk/" and "http://www.heraldscotland.com/" fail with 403s. This is because they are blocked by my proxy, whereas http://www.theoatmeal.com is not.

Any comments? One outstanding issue is that there is no accompanying JUnit test; I am currently unsure how to implement one.
                
> http: proxy exception list:
> ---------------------------
>
>                 Key: NUTCH-208
>                 URL: https://issues.apache.org/jira/browse/NUTCH-208
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>    Affects Versions: 0.8, 1.3, nutchgora
>            Reporter: Matthias Günter
>            Assignee: Lewis John McGibbney
>            Priority: Trivial
>              Labels: patch
>             Fix For: 1.6
>
>         Attachments: NUTCH-208-2.x.patch, NUTCH-208-branch-1.4-20110210-v3.patch, NUTCH-208-branch-1.4-20110807.patch, NUTCH-208-branch-1.4-20110809-v2.patch, NUTCH-208.patch, NUTCH-208-trunk-2.0-20110810.patch, NUTCH-208-trunk-2.0-20110810-v2.patch, patch.txt, patch.txt, proxy_exception_list-0.8.diff
>
>
> I suggest that a parameter is added to nutch-default.xml which allows a proxy exception list to be defined.
> <property>
>   <name>http.proxy.exception.list</name>
>   <value></value>
>   <description>URL's and hosts that don't use the proxy (e.g. intranets)</description>
> </property>
> This is useful when scanning intranet/internet combinations from behind a firewall. A preliminary patch is attached to this request, showing the changes. We will test it and update it if necessary. This also reflects the reality in web browsers, where in most cases there is an exception list.
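
To illustrate the idea behind http.proxy.exception.list, here is a minimal sketch of how such a list could be consulted before a fetch. It is not the code from the attached patches: the class name ProxyExceptionSketch, both method names, the example hosts, and the assumption that the property value is a comma-separated list of host names are all hypothetical.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.Collections;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical helper; NOT the implementation from the NUTCH-208 patches. */
public class ProxyExceptionSketch {

  /**
   * Parses the property value into a set of lower-cased host names.
   * Assumption: the value is a comma-separated list of hosts.
   */
  public static Set<String> parseExceptions(String value) {
    if (value == null) {
      return Collections.emptySet();
    }
    return Arrays.stream(value.split(","))
        .map(s -> s.trim().toLowerCase(Locale.ROOT))
        .filter(s -> !s.isEmpty())
        .collect(Collectors.toSet());
  }

  /** Returns true if the URL should go through the proxy, false if its host is excepted. */
  public static boolean useProxy(String url, Set<String> exceptions) {
    String host = URI.create(url).getHost();
    if (host == null) {
      return true; // no recognisable host: fall back to the proxy
    }
    return !exceptions.contains(host.toLowerCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    // Example hosts are made up for illustration.
    Set<String> exceptions = parseExceptions("intranet.example.org, wiki.example.org");
    System.out.println(useProxy("http://intranet.example.org/", exceptions));
    System.out.println(useProxy("http://www.bbc.co.uk/", exceptions));
  }
}
```

With the example value above, the intranet host would bypass the proxy while the public site would still be fetched through it. Exact host matching is only one possible design; suffix or pattern matching may be closer to what browsers offer.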
