Posted to user@nutch.apache.org by "McGibbney, Lewis John" <Le...@gcu.ac.uk> on 2011/03/03 16:20:38 UTC

Http Authentication problem...

Hi list,

I have trawled the mail archives for something that could help me with this one, and although there are some interesting past use
cases, I have not seen any queries or answers that help me.

I am using Nutch 1.2 to crawl the following website

http://www.scotland.gov.uk

It automatically redirects to www.scotland.gov.uk/Home, so I thought that experimenting with http.redirect.max and http.verbose in nutch-site.xml would shed some light. However, this then flagged up the following in yesterday's hadoop.log:

2011-03-02 15:58:19,165 INFO  fetcher.Fetcher - fetching http://www.scotland.gov.uk/Home
2011-03-02 15:58:19,166 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=8
2011-03-02 15:58:19,166 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=8
2011-03-02 15:58:19,166 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=6
2011-03-02 15:58:19,166 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=5
2011-03-02 15:58:19,166 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=4
2011-03-02 15:58:19,166 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=2
2011-03-02 15:58:19,166 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=3
2011-03-02 15:58:19,170 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=2
2011-03-02 15:58:19,170 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2011-03-02 15:58:19,211 INFO  http.Http - http.proxy.host = null
2011-03-02 15:58:19,211 INFO  http.Http - http.proxy.port = 8080
2011...
blahblahblah
2011-03-02 15:58:26,220 INFO  plugin.PluginRepository -         Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2011-03-02 15:58:26,241 INFO  fetcher.Fetcher - fetching https://citfil1.enterprise.gcal.ac.uk:8081/AuthenticationServer/AuthenticationForm.jsp?URL=http:/www.scotland.gov.uk/Home&IP=10.15.5.246
2011-03-02 15:58:26,241 INFO  fetcher.Fetcher - fetch of https://citfil1.enterprise.gcal.ac.uk:8081/AuthenticationServer/AuthenticationForm.jsp?URL=http:/www.scotland.gov.uk/Home&IP=10.15.5.246 failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=https
2011-03-02 15:58:26,241 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=9
2011-03-02 15:58:26,242 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=8
2011-03-02 15:58:26,242 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=7
2011-03-02 15:58:26,242 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=6
2011-03-02 15:58:26,242 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=5
2011-03-02 15:58:26,242 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=4
2011-03-02 15:58:26,246 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
2011-03-02 15:58:26,246 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2011-03-02 15:58:26,246 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=2
2011-03-02 15:58:26,246 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=3
2011-03-02 15:58:27,241 INFO  fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
2011-03-02 15:58:27,242 INFO  fetcher.Fetcher - -activeThreads=0

It was at this stage that I realised some sort of authentication scheme was in place; however, I am still puzzled as to what type it is and how I can work around it. Today I reconfigured Nutch 1.2 to crawl using the httpclient protocol as opposed to the http protocol; however, I am now no longer able to replicate the "org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=https" error. In implementing the httpclient protocol I have followed all the steps advised in the wiki entry HttpAuthenticationSchemes, apart from setting credentials in httpclient-auth.xml (as I don't know what they are).
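
For anyone trying to reproduce this, the failing fetch can be picked out of hadoop.log with a short script. This is only a rough sketch, assuming the Fetcher line format shown in the log above ("fetch of <url> failed with: <exception>: <message>"); the function name failed_fetches is made up for illustration and is not part of Nutch.

```python
import re
from urllib.parse import urlsplit

# Assumed line format, taken from the log excerpt above:
#   "... fetcher.Fetcher - fetch of <url> failed with: <exception>: <message>"
FAILED = re.compile(r"fetch of (?P<url>\S+) failed with: (?P<error>\S+?):")


def failed_fetches(lines):
    """Yield (scheme, host, exception) for each failed fetch in the given log lines."""
    for line in lines:
        match = FAILED.search(line)
        if match:
            parts = urlsplit(match.group("url"))
            yield parts.scheme, parts.hostname, match.group("error")


# Two lines copied from the log above: one normal fetch, one failure.
sample = [
    "2011-03-02 15:58:19,165 INFO  fetcher.Fetcher - fetching http://www.scotland.gov.uk/Home",
    "2011-03-02 15:58:26,241 INFO  fetcher.Fetcher - fetch of "
    "https://citfil1.enterprise.gcal.ac.uk:8081/AuthenticationServer/AuthenticationForm.jsp"
    "?URL=http:/www.scotland.gov.uk/Home&IP=10.15.5.246 failed with: "
    "org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=https",
]

for scheme, host, error in failed_fetches(sample):
    print(scheme, host, error)
```

On the sample lines this reports the scheme and host of the URL that failed, which makes it easier to see that the failure is on the https authentication form rather than on www.scotland.gov.uk itself.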

I hope I have explained this thoroughly enough to justify the post.

Thank you Lewis
