Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2009/03/31 18:54:09 UTC

[Nutch Wiki] Update of "HttpAuthenticationSchemes" by susam

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by susam:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes

------------------------------------------------------------------------------
  === Important Points ===
   1. For the <authscope> tag, the 'host' and 'port' attributes should always be specified. The 'realm' and 'scheme' attributes may or may not be specified, depending on your needs. If you are tempted to omit the 'host' and 'port' attributes because you want the credentials to be used for any host and any port for that realm/scheme, please use the 'default' tag instead. That is what the 'default' tag is meant for.
   1. One authentication scope should not be defined twice as different <authscope> tags under different <credentials> tags. However, if this is done by mistake, the credentials for the last defined <authscope> tag would be used. This is because the XML parsing code reads the file from top to bottom and sets the credentials for each authentication scope as it goes. If the same authentication scope is encountered again, it is overwritten with the new credentials. However, one should not rely on this behavior, as it might change with further development.
-  1. Do not define multiple authscope tags with the same host, port but different realms if the server requires NTLM authentication. This can means there should not be multiple tags with same host, port, scheme="NTLM" but different realms. If you are omitting the scheme attribute and the server requires NTLM authentication, then there should not be multiple tags with same host, port but different realms. This is discussed more in the next section.
+  1. Do not define multiple authscope tags with the same host, port but different realms if the server requires NTLM authentication. This means there should not be multiple tags with same host, port, scheme="NTLM" but different realms. If you are omitting the scheme attribute and the server requires NTLM authentication, then there should not be multiple tags with same host, port but different realms. This is discussed more in the next section.
   1. If you are using NTLM scheme, you should also set the 'http.agent.host' property in conf/nutch-site.xml
  
  === A note on NTLM domains ===
@@ -104, +104 @@

   1. Do you see Nutch trying to fetch the pages you were expecting in 'logs/hadoop.log'? You should see log lines like "fetching http://www.example.com/expectedpage.html" where the URL is the page you were expecting to be fetched. If you don't see such lines for the pages you were expecting, the error is outside the scope of this feature. This feature comes into action only when the crawler is fetching a page and the page requires authentication.
   1. With debug logs enabled, check whether there are logs beginning with "Credentials" in 'logs/hadoop.log'. The lines would look like "Credentials - username someuser; set ...". For every entry in 'conf/httpclient-auth.xml' you should find a corresponding log line. If they are absent, you probably haven't included 'plugin.includes'. If you patched the Nutch 0.9 source code manually, this can also happen if you have not built the project.
   1. Do you see logs like this: "auth.!AuthChallengeProcessor - basic authentication scheme selected"? Instead of the word 'basic', you might see 'digest' or 'NTLM', depending on the scheme supported by the page being fetched. If you do not see it at all, the web server or the page being fetched probably does not require authentication. In that case, the crawler would not try to authenticate. If you were expecting authentication for the page, something probably needs to be fixed on the server side.
-  1. You should also see some logs that begin with: "Pre-configured credentials with scope". It is very unlikely that this should happen after you have ensured all the above points. If it happens, please let us know in the mailing list.
  
  Once you have checked the items listed above, if you are still unable to fix the problem or are confused about any point listed above, please mail the issue with the following information:
  
   1. Version of Nutch you are running.
-  1. Did you get this feature directly from subversion or did you download the patch separately and apply?
+  1. Complete contents of the 'conf/httpclient-auth.xml' file.
   1. Relevant portion from 'logs/hadoop.log' file. If you are clueless, send the complete file.
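
The log checks in the troubleshooting list above can be done by hand with a text search, but a small script may make them quicker to repeat. The following is a minimal sketch, not part of Nutch: the function name `summarize_auth_log` and the regular expressions are assumptions based only on the example log fragments quoted above ("fetching ...", "Credentials - username ...", "... authentication scheme selected"). The real log lines may carry different prefixes or wording, so adjust the patterns to what your 'logs/hadoop.log' actually contains.

```python
import re
from collections import Counter

# Patterns derived from the log fragments quoted in the checklist.
# They are assumptions about the log format, not a documented Nutch interface.
FETCH_RE = re.compile(r"fetching (\S+)")
CRED_RE = re.compile(r"Credentials - username (\S+?);")
SCHEME_RE = re.compile(r"(\w+) authentication scheme selected")


def summarize_auth_log(lines, expected_url):
    """Scan log lines and report the three things the checklist asks about.

    Returns a dict with:
      fetched          - True if a "fetching <expected_url>" line was seen
      credential_users - usernames found in "Credentials - username ..." lines
      schemes          - Counter of scheme names from "... scheme selected" lines
    """
    summary = {"fetched": False, "credential_users": [], "schemes": Counter()}
    for line in lines:
        match = FETCH_RE.search(line)
        if match and match.group(1) == expected_url:
            summary["fetched"] = True

        match = CRED_RE.search(line)
        if match:
            summary["credential_users"].append(match.group(1))

        match = SCHEME_RE.search(line)
        if match:
            summary["schemes"][match.group(1)] += 1
    return summary
```

To use it, open the log file and pass the file object together with the URL you expected to be fetched, for example `summarize_auth_log(open("logs/hadoop.log"), "http://www.example.com/expectedpage.html")`. An empty `credential_users` list corresponds to the second check failing, and an empty `schemes` counter corresponds to the third.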