Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2009/06/17 18:04:53 UTC

[Nutch Wiki] Update of "HttpAuthenticationSchemes" by wobbet

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by wobbet:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes

------------------------------------------------------------------------------
  
  == Configuration ==
  The example and explanation provided as comments in 'conf/httpclient-auth.xml' are very brief, so this section explains the configuration in a little more detail. In all the examples below, the root element <auth-configuration> has been omitted for the sake of clarity.
+ 
+ === Prerequisites ===
+ To use HTTP authentication, your Nutch installation must be configured to use 'protocol-httpclient' instead of the default 'protocol-http'. To make this change, copy the 'plugin.includes' property from 'conf/nutch-default.xml' into 'conf/nutch-site.xml' and, within that property, replace 'protocol-http' with 'protocol-httpclient'. If you have made no other changes, it will look as follows:
+ {{{
+ <property>
+   <name>plugin.includes</name>
+   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
+   <description>Regular expression naming plugin directory names to
+   include.  Any plugin not matching this expression is excluded.
+   In any case you need at least include the nutch-extensionpoints plugin. By
+   default Nutch includes crawling just HTML and plain text via HTTP,
+   and basic indexing and search plugins. In order to use HTTPS please enable 
+   protocol-httpclient, but be aware of possible intermittent problems with the 
+   underlying commons-httpclient library.
+   </description>
+ </property>
+ }}}
+ 
+ === Optional ===
+ By default, Nutch reads credentials from 'httpclient-auth.xml'. To use a different file, copy the 'http.auth.file' property from 'conf/nutch-default.xml' into 'conf/nutch-site.xml' and modify its '<value>' element. The default property appears as follows:
+ {{{
+ <property>
+   <name>http.auth.file</name>
+   <value>httpclient-auth.xml</value>
+   <description>Authentication configuration file for 'protocol-httpclient' plugin.</description>
+ </property>
+ }}}
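+ 
+ As a minimal sketch of such an override, the copy in 'conf/nutch-site.xml' might look like the following. The file name 'my-auth.xml' is hypothetical and used only for illustration; it is assumed here to sit in the 'conf' directory alongside the default file.
+ {{{
+ <property>
+   <name>http.auth.file</name>
+   <!-- 'my-auth.xml' is a hypothetical file name, not one shipped with Nutch -->
+   <value>my-auth.xml</value>
+   <description>Authentication configuration file for 'protocol-httpclient' plugin.</description>
+ </property>
+ }}}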
+ 
  
  === Crawling an Intranet with Default Authentication Scope ===
  Suppose all pages of an intranet are protected by basic, digest or NTLM authentication and a single set of credentials is to be used for every page in the intranet. In that case, a configuration as described below is enough. This is also the simplest possible configuration for authentication schemes.