Posted to user@nutch.apache.org by Eric Haszlakiewicz <Er...@twosigma.com> on 2014/05/27 22:52:34 UTC

using kerberos with nutch

I was able to follow the Nutch tutorial and get the bin/crawl command working with sites that don't require authentication, including loading the results into a Solr installation.  I also checked that I could query the Solr index and get back the expected information.

However, I can't figure out how to get Nutch to use Kerberos authentication when fetching URLs.
I'm using apache-nutch-1.8, which appears to have the necessary version of Apache HttpClient (httpclient-4.1.1.jar).

Here's what I see:

./bin/nutch org.apache.nutch.parse.ParserChecker https://myhost.example.com
fetching: https://myhost.example.com
Fetch failed with protocol status: access_denied(17), lastModified=0: Authentication required: https://myhost.example.com


In logs/hadoop.log:
2014-05-27 20:35:53,866 INFO  parse.ParserChecker - fetching: https://myhost.example.com
2014-05-27 20:35:54,071 ERROR protocol.RobotRulesParser - Agent we advertise (My Nutch Spider) not listed first in 'http.robots.agents' property!
2014-05-27 20:35:54,071 INFO  httpclient.Http - http.proxy.host = null
2014-05-27 20:35:54,071 INFO  httpclient.Http - http.proxy.port = 8080
2014-05-27 20:35:54,071 INFO  httpclient.Http - http.timeout = 10000
2014-05-27 20:35:54,071 INFO  httpclient.Http - http.content.limit = 65536
2014-05-27 20:35:54,071 INFO  httpclient.Http - http.agent = My Nutch Spider/Nutch-1.8
2014-05-27 20:35:54,071 INFO  httpclient.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
2014-05-27 20:35:54,071 INFO  httpclient.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2014-05-27 20:35:54,651 WARN  httpclient.HttpMethodDirector - Unable to respond to any of these challenges: {negotiate=Negotiate}

I enabled protocol-httpclient in conf/nutch-default.xml.  I expect I need to put something in conf/httpclient-auth.xml, but I can't figure out what.  I found the http://wiki.apache.org/nutch/HttpAuthenticationSchemes page, but all the examples there seem to assume that credentials consist of a username and password, which is of course not the case with Kerberos.
How do I tell Nutch to use Negotiate authentication?
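
For reference, the plugin change mentioned above amounts to swapping protocol-httpclient in for protocol-http in the plugin.includes property. The fragment below is only a sketch: the rest of the plugin list is an assumption based on the stock 1.8 defaults and should match whatever the installation already uses. Local overrides normally go in conf/nutch-site.xml rather than conf/nutch-default.xml.

```xml
<!-- Sketch only: the plugin list other than protocol-httpclient is assumed
     from the stock 1.8 defaults; keep your existing entries. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

This only selects the HttpClient-based protocol plugin. It does not by itself configure Negotiate authentication, which is the open question here.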

Thanks,
Eric

RE: using kerberos with nutch

Posted by Eric Haszlakiewicz <Er...@twosigma.com>.

Any thoughts?  I've heard that there is a bug in Apache HttpClient that prevents
Negotiate authentication from working, but even if that is fixed, I'm not clear
on how to configure the httpclient-auth.xml file.  Can someone point me in
the right direction?

Thanks,
Eric
