
Posted to user@nutch.apache.org by Nidhi malik <ni...@gmail.com> on 2008/01/01 19:25:03 UTC

Http-407 - authentication problem on Nutch -0.8

I am forwarding my nutch-site.xml.
Please correct it.

---------- Forwarded message ----------
From: Nidhi malik <ni...@gmail.com>
Date: Jan 1, 2008 11:47 PM
Subject: nutch-site.xml
To: susam.pal@gmail.com

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<property>
  <name>http.agent.name</name>
  <value>digvijay</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version

  and set their values appropriately.

  </description>
</property>

<property>
  <name>http.agent.description</name>
  <value>digvijay crawler</value>
  <description>Further description of our bot - this text is used in
  the User-Agent header.  It appears in parentheses after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value>http://google.com</value>
  <description>A URL to advertise in the User-Agent header.  This will
   appear in parentheses after the agent name. Custom dictates that this
   should be a URL of a page explaining the purpose and behavior of this
   crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value>digvijayy@it.iitb.ac.in</value>
  <description>An email address to advertise in the HTTP 'From' request
   header and User-Agent header. A good practice is to mangle this
   address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>

<property>
  <name>http.proxy.host</name>
  <value>netmon.iitb.ac.in</value>
  <description>The proxy hostname.  If empty, no proxy is used.</description>
</property>

<property>
  <name>http.proxy.port</name>
  <value>80</value>
  <description>The proxy port.</description>
</property>

<property>
  <name>http.proxy.username</name>
  <value>xyz</value>
  <description>Username for proxy. This will be used by
  'protocol-httpclient', if the proxy server requests basic, digest
  and/or NTLM authentication. To use this, 'protocol-httpclient' must
  be present in the value of 'plugin.includes' property.
  NOTE: For NTLM authentication, do not prefix the username with the
  domain, i.e. 'susam' is correct whereas 'DOMAIN\susam' is incorrect.
  </description>
</property>

<property>
  <name>http.proxy.password</name>
  <value>xyz</value>
  <description>Password for proxy. This will be used by
  'protocol-httpclient', if the proxy server requests basic, digest
  and/or NTLM authentication. To use this, 'protocol-httpclient' must
  be present in the value of 'plugin.includes' property.
  </description>
</property>

<property>
  <name>http.proxy.realm</name>
  <value>Squid proxy-caching web server</value>
  <description>Authentication realm for proxy. Do not define a value
  if realm is not required or authentication should take place for any
  realm. NTLM does not use the notion of realms. Specify the domain name
  of NTLM authentication as the value for this property. To use this,
  'protocol-httpclient' must be present in the value of
  'plugin.includes' property.
  </description>
</property>

<property>
  <name>http.agent.host</name>
  <value>10.129.30.14</value>
  <description>Name or IP address of the host on which the Nutch crawler
  would be running. Currently this is used by 'protocol-httpclient'
  plugin.
  </description>
</property>

<property>
  <name>searcher.dir</name>
  <value>crawl</value>
  <description>
  Path to root of crawl.  This directory is searched (in
  order) for either the file search-servers.txt, containing a list of
  distributed search servers, or the directory "index" containing
  merged indexes, or the directory "segments" containing segment
  indexes.
  </description>
</property>

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|pdf|mp3|oo|msexcel|mspowerpoint|msword|pdf|rss|swf|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need to include at least the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>

<property>
  <name>http.timeout</name>
  <value>1000000</value>
  <description>The default network timeout, in milliseconds.</description>
</property>

</configuration>

Re: Http-407 - authentication problem on Nutch -0.8

Posted by Susam Pal <su...@gmail.com>.

Your configuration seems fine. Ideally http.agent.url should point to
a page where you describe your crawler, but that shouldn't cause an
error.
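
As a rough way to double-check a nutch-site.xml like this one before
crawling, a small Python script can read the file and confirm that the
proxy properties are non-empty and that plugin.includes matches
protocol-httpclient. This is only a sketch: the function names are made
up for illustration, and Python's re module stands in for the regex
matching that Nutch applies to plugin.includes, which is an assumption
that should hold for simple patterns like the one in this thread.

```python
import re
import xml.etree.ElementTree as ET


def load_properties(xml_text):
    """Return {name: value} for every <property> element in the config text."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name:
            props[name] = value
    return props


def check_proxy_auth(props):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    # These four properties carry the proxy address and credentials.
    for key in ("http.proxy.host", "http.proxy.port",
                "http.proxy.username", "http.proxy.password"):
        if not props.get(key):
            problems.append(f"{key} is missing or empty")
    # Assumption: treat plugin.includes as a regex that must match the
    # whole plugin name, approximated here with Python's re.fullmatch.
    includes = props.get("plugin.includes", "")
    try:
        if not re.fullmatch(includes, "protocol-httpclient"):
            problems.append("plugin.includes does not match protocol-httpclient")
    except re.error as exc:
        problems.append(f"plugin.includes is not a valid regex: {exc}")
    return problems
```

To use it, read the file contents into a string (the path, for example
conf/nutch-site.xml, depends on your installation) and print the result
of check_proxy_auth(load_properties(text)). An empty list only means
these basic checks passed, not that the proxy will accept the
credentials.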

If you are facing any problem, please post the relevant logs from
logs/hadoop.log and describe your problem in detail.
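
To narrow logs/hadoop.log down to the lines worth posting, something
like the following Python sketch may help. The keyword list is a guess
at what is relevant to a 407 problem, not anything Nutch defines, and a
plain substring match on "407" can also pick up unrelated lines (for
example timestamps), so review the output before sending it.

```python
def relevant_log_lines(lines, keywords=("407", "proxy", "httpclient", "exception")):
    """Yield (line_number, text) for lines containing any keyword, ignoring case."""
    lowered = tuple(k.lower() for k in keywords)
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if any(k in text.lower() for k in lowered):
            yield number, text
```

For example, open logs/hadoop.log, pass the file object to
relevant_log_lines, and print each numbered line that comes back.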

Regards,
Susam Pal

