Posted to user@nutch.apache.org by "McGibbney, Lewis John" <Le...@gcu.ac.uk> on 2011/03/03 16:20:38 UTC
Http Authentication problem...
Hi list,
I have trawled the mail archives for something which could help me on this one, and although there are some interesting past use cases, I have not seen any queries or answers which help me.
I am using Nutch 1.2 to crawl the following website
http://www.scotland.gov.uk
It has an automatic redirect to www.scotland.gov.uk/Home, so I thought that experimenting with http.redirect.max and http.verbose in nutch-site.xml would shed some light. However, this then flagged up the following in yesterday's hadoop.log:
2011-03-02 15:58:19,165 INFO fetcher.Fetcher - fetching http://www.scotland.gov.uk/Home
2011-03-02 15:58:19,166 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=8
2011-03-02 15:58:19,166 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=8
2011-03-02 15:58:19,166 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=6
2011-03-02 15:58:19,166 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=5
2011-03-02 15:58:19,166 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=4
2011-03-02 15:58:19,166 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=2
2011-03-02 15:58:19,166 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=3
2011-03-02 15:58:19,170 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=2
2011-03-02 15:58:19,170 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2011-03-02 15:58:19,211 INFO http.Http - http.proxy.host = null
2011-03-02 15:58:19,211 INFO http.Http - http.proxy.port = 8080
2011...
blahblahblah
2011-03-02 15:58:26,220 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2011-03-02 15:58:26,241 INFO fetcher.Fetcher - fetching https://citfil1.enterprise.gcal.ac.uk:8081/AuthenticationServer/AuthenticationForm.jsp?URL=http:/www.scotland.gov.uk/Home&IP=10.15.5.246
2011-03-02 15:58:26,241 INFO fetcher.Fetcher - fetch of https://citfil1.enterprise.gcal.ac.uk:8081/AuthenticationServer/AuthenticationForm.jsp?URL=http:/www.scotland.gov.uk/Home&IP=10.15.5.246 failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=https
2011-03-02 15:58:26,241 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=9
2011-03-02 15:58:26,242 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=8
2011-03-02 15:58:26,242 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=7
2011-03-02 15:58:26,242 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=6
2011-03-02 15:58:26,242 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=5
2011-03-02 15:58:26,242 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=4
2011-03-02 15:58:26,246 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
2011-03-02 15:58:26,246 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2011-03-02 15:58:26,246 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=2
2011-03-02 15:58:26,246 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=3
2011-03-02 15:58:27,241 INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
2011-03-02 15:58:27,242 INFO fetcher.Fetcher - -activeThreads=0
It was at this stage that I realised that some sort of authentication scheme was in place; however, I am still puzzled as to what type it is and how I can work around it. Today I reconfigured nutch-1.2 to crawl using the httpclient protocol as opposed to the http protocol, but I am now no longer able to replicate the org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=https error. In implementing the httpclient protocol I have undertaken all steps advised in the wiki entry HttpAuthenticationSchemes, apart from setting credentials in httpclient-auth.xml (as I don't know what they are).
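As a quick check on where the failed fetch was actually pointing, the redirect URL from the log above can be pulled apart with a few lines of Python. This is only a sketch: the variable names are illustrative, and it does nothing beyond parsing the string already shown in hadoop.log. It does not contact any server or tell us which authentication scheme is in use.

```python
from urllib.parse import urlsplit, parse_qs

# The URL is copied verbatim from the failed fetch line in hadoop.log.
redirect = (
    "https://citfil1.enterprise.gcal.ac.uk:8081/AuthenticationServer/"
    "AuthenticationForm.jsp?URL=http:/www.scotland.gov.uk/Home&IP=10.15.5.246"
)

parts = urlsplit(redirect)
params = parse_qs(parts.query)

# Scheme, host and port of the page the fetcher was sent to.
print("scheme:", parts.scheme)
print("host:  ", parts.hostname)
print("port:  ", parts.port)

# Query parameters carried by the redirect (original target and client IP).
print("URL param:", params["URL"][0])
print("IP param: ", params["IP"][0])
```

The output suggests that the redirect target is served over https from a host other than www.scotland.gov.uk, with the original page passed along in the URL parameter, which would be consistent with the "protocol not found for url=https" message appearing when only the http protocol plugin is enabled.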
I hope I have explained this thoroughly enough to justify the post.
Thank you,
Lewis