Posted to user@nutch.apache.org by Robert Sanford <rs...@smbology.com> on 2009/06/16 18:26:44 UTC

NTLM Authentication Not Occurring...

I have Nutch 1.0 running on Windows 2003 Server, crawling a local SharePoint site under IIS that has been configured to require domain authentication. The top-level URL, shown in the logs below, loads correctly in IE7, Firefox 3.x, and Google Chrome 2.x, both on the server itself and from external machines.

In my httpclient-auth.xml file I have the following:
<auth-configuration>
    <credentials username="EdgeSearch" password="SearchPassword">
      <default scheme="ntlm" realm="smb-edge-dev" />
    </credentials>
</auth-configuration>

Note that I have also tried leaving out the "realm" attribute, fully qualifying the username as "smb-edge-dev\EdgeSearch", using the username without the domain, and leaving out the "scheme" attribute. The result is the same every time.

Those are the *only* credentials specified in the httpclient-auth.xml file. I do not have any other credentials for any other sites in the config. I'm only crawling this one site.

I set the general log level to DEBUG to get more information from the log. The lines from the log that are of interest to me are:
2009-06-16 11:01:42,487 DEBUG http.Http - fetching http://smb-edge-dev:8082/default.aspx
2009-06-16 11:01:42,487 DEBUG http.Http - fetched 1656 bytes from http://smb-edge-dev:8082/default.aspx
2009-06-16 11:01:42,539 DEBUG http.Http - 401 Authentication Required
2009-06-16 11:01:55,087 DEBUG crawl.Generator - -shouldFetch rejected 'http://smb-edge-dev:8082/default.aspx', fetchTime=1249056102539, curTime=1245168109986

What that is telling me, please correct me if I am wrong, is that Nutch is hitting the target site as requested and, as expected, is receiving a 401 requesting authentication.

There is *nothing* in the log file that indicates that authentication has failed. There are no ERROR level messages anywhere in the log. It goes from the 401 to rejecting the page and I have no idea why.
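
One detail in the Generator line above can be checked directly. The fetchTime and curTime values look like epoch timestamps in milliseconds (an assumption based on their magnitude, not something the log states). A minimal Python sketch of the arithmetic, using only the two values from that log line:

```python
from datetime import datetime, timezone

# Values copied from the "-shouldFetch rejected" log line.
# Assumption: both are milliseconds since the Unix epoch.
FETCH_TIME_MS = 1249056102539
CUR_TIME_MS = 1245168109986

def to_utc(ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

def days_between(later_ms, earlier_ms):
    """Difference between two epoch-millisecond values, in days."""
    return (later_ms - earlier_ms) / 86_400_000

gap_days = days_between(FETCH_TIME_MS, CUR_TIME_MS)
print("curTime  :", to_utc(CUR_TIME_MS).isoformat())
print("fetchTime:", to_utc(FETCH_TIME_MS).isoformat())
print("gap      : %.2f days" % gap_days)
```

If that reading is right, the next scheduled fetch lies roughly 45 days after the current time. That would be consistent with the Generator skipping the URL because it is not yet due: the 401 appears to have been recorded as the outcome of the fetch and the page rescheduled, so the rejection looks like a scheduling consequence rather than a second authentication failure.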

Suggestions are *more* than welcome.

rjsjr

RE: NTLM Authentication Not Occurring...

Posted by Robert Sanford <rs...@smbology.com>.
It wasn't the most fun I've ever had but it has been a learning experience :)

I'm hopeful that the doc change will help save someone else from suffering as well.

Enjoy your vacation! Small towns without connectivity are the best places to be when you're supposed to not be working.

rjsjr

-----Original Message-----
From: Susam Pal [mailto:susam.pal@gmail.com] 
Sent: Wednesday, June 17, 2009 12:52 PM
To: nutch-user@lucene.apache.org
Subject: Re: NTLM Authentication Not Occuring...

On Wed, Jun 17, 2009 at 11:03 PM, Robert Sanford<rs...@smbology.com> wrote:
> The issue was that I had not modified the plugins. The documentation was less than clear that this was a prerequisite, so after I figured out exactly what was wrong I edited the HTTPAuthentication page in the wiki to clarify that it was required.
>
> rjsjr
>

Okay. I suspected, but was not sure, that it was you who
edited the article to include the 'Prerequisites' section, since the Nutch
Wiki update for the edit arrived shortly after your mail requesting
help. Thanks for improving the article, and sorry for not being
able to help you sooner; I am on vacation in a small town
with poor internet connectivity.

Regards,
Susam Pal

RE: NTLM Authentication Not Occurring...

Posted by Robert Sanford <rs...@smbology.com>.
The issue was that I had not modified the plugins. The documentation was less than clear that this was a prerequisite, so after I figured out exactly what was wrong I edited the HTTPAuthentication page in the wiki to clarify that it was required.

rjsjr

-----Original Message-----
From: Susam Pal [mailto:susam.pal@gmail.com] 
Sent: Wednesday, June 17, 2009 11:59 AM
To: nutch-user@lucene.apache.org
Subject: Re: NTLM Authentication Not Occuring...

On Wed, Jun 17, 2009 at 8:37 PM, Robert Sanford<rs...@smbology.com> wrote:
> I installed "Fiddler" as a proxy on the server and compared the sessions from IE and Nutch. When IE receives the 401 it will then create a new request with the NTLM authentication tokens for which it receives a 200. When Nutch receives the 401 it does not make another request.
>
> This implies to me that the credentials that I've added to httpclient-auth.xml are being ignored.
>
> Is there something that I need to set in nutch-site.xml to enable authentication? Is there another configuration option that I've missed somewhere?
>
> Many thanks!
>
> rjsjr

Hi Robert,

Please provide the following information:

1. How are you running the Nutch crawler on Windows 2003 Server?
Please mention the tools used and the commands invoked. e.g. Cygwin,
java commands if any, etc.

2. Have you modified 'conf/nutch-site.xml' to include
'protocol-httpclient' in the 'plugin.includes' property?

3. There should be more messages in the log file pertaining to HTTP
authentication, e.g. log messages containing the words "Credentials",
"auth.AuthChallengeProcessor", etc. Please send these log messages as
well. If they are not present, you have probably not included
'protocol-httpclient'.
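
The log check in point 3 can be sketched in a few lines of Python. This is only an illustration: the two keywords are the ones named above, the sample input is taken from the DEBUG lines in the original report, and the log path in the trailing comment is a hypothetical placeholder.

```python
# Keywords suggested in point 3 above.
AUTH_KEYWORDS = ("Credentials", "auth.AuthChallengeProcessor")

def auth_log_lines(lines, keywords=AUTH_KEYWORDS):
    """Return the log lines that mention any of the given keywords."""
    return [line.rstrip("\n") for line in lines
            if any(k in line for k in keywords)]

# Sample input: DEBUG lines quoted in the original report.
sample = [
    "2009-06-16 11:01:42,487 DEBUG http.Http - fetching http://smb-edge-dev:8082/default.aspx",
    "2009-06-16 11:01:42,539 DEBUG http.Http - 401 Authentication Required",
]
matches = auth_log_lines(sample)
print(len(matches), "authentication-related line(s) found")

# To scan a real log, pass an open file instead (the path is hypothetical):
# with open("logs/hadoop.log", encoding="utf-8", errors="replace") as fh:
#     print("\n".join(auth_log_lines(fh)))
```

An empty result for a crawl that received a 401 would point toward the situation described in point 3, i.e. that 'protocol-httpclient' is probably not in use.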

I would suggest that you go through the "Prerequisites" section of
this article: http://wiki.apache.org/nutch/HttpAuthenticationSchemes
to make sure that you have configured 'conf/nutch-site.xml' properly.
You need to ensure that you have replaced 'protocol-http' with
'protocol-httpclient' in the 'plugin.includes' property of
'conf/nutch-site.xml'.
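
A rough way to verify this setting is sketched below in Python. It assumes the usual <configuration>/<property>/<name>/<value> layout of 'conf/nutch-site.xml'. The sample 'plugin.includes' value is shortened and purely illustrative, not a recommended plugin list, and the helper names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

def plugin_includes(xml_text):
    """Return the value of the plugin.includes property, or None if absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == "plugin.includes":
            return prop.findtext("value", "").strip()
    return None

def uses_httpclient(value):
    """True if protocol-httpclient is listed and plain protocol-http is not.

    This is a simple text check; it does not evaluate the value as a regex.
    """
    if value is None:
        return False
    has_client = "protocol-httpclient" in value
    has_plain = re.search(r"protocol-http(?!client)", value) is not None
    return has_client and not has_plain

# Illustrative configuration, not a complete or recommended one.
SAMPLE = """<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-(text|html)|index-basic</value>
  </property>
</configuration>"""

value = plugin_includes(SAMPLE)
print(value, "->", uses_httpclient(value))
```

The check deliberately fails when both 'protocol-http' and 'protocol-httpclient' appear, since the advice above is to replace one with the other.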

Next, please go through the "Need Help?" section of the same article
and see if it helps you to troubleshoot your issue. If not, please
mail again with the information I have requested above.

Regards,
Susam Pal

RE: NTLM Authentication Not Occurring...

Posted by Robert Sanford <rs...@smbology.com>.
I installed "Fiddler" as a proxy on the server and compared the sessions from IE and Nutch. When IE receives the 401 it will then create a new request with the NTLM authentication tokens for which it receives a 200. When Nutch receives the 401 it does not make another request.
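
The comparison described above can be written down as a small check over an ordered list of request/response pairs, such as one exported from a proxy trace. This is a sketch only: the function name is hypothetical, the two sessions are illustrative data modeled on the description rather than captured traffic, and the exchange is simplified to a single retry as described.

```python
def retried_with_auth(exchanges):
    """True if any request sent after a 401 response carries an Authorization header.

    `exchanges` is an ordered list of (request_headers, response_status) pairs.
    """
    seen_401 = False
    for request_headers, status in exchanges:
        names = {name.lower() for name in request_headers}
        if seen_401 and "authorization" in names:
            return True
        if status == 401:
            seen_401 = True
    return False

# Illustrative sessions modeled on the description above, not captured traffic.
ie_session = [({}, 401), ({"Authorization": "NTLM <token>"}, 200)]
nutch_session = [({}, 401)]

print("IE retried with credentials:   ", retried_with_auth(ie_session))
print("Nutch retried with credentials:", retried_with_auth(nutch_session))
```

A client that never sends a follow-up request after the 401 is not attempting authentication at all, which is a different problem from credentials being rejected.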

This implies to me that the credentials that I've added to httpclient-auth.xml are being ignored.

Is there something that I need to set in nutch-site.xml to enable authentication? Is there another configuration option that I've missed somewhere?

Many thanks!

rjsjr
