Posted to user@nutch.apache.org by Nidhi malik <ni...@gmail.com> on 2008/01/01 19:25:03 UTC
HTTP 407 - authentication problem on Nutch 0.8
I am forwarding my nutch-site.xml.
Please correct it.
---------- Forwarded message ----------
From: Nidhi malik <ni...@gmail.com>
Date: Jan 1, 2008 11:47 PM
Subject: nutch-site.xml
To: susam.pal@gmail.com
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>digvijay</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
NOTE: You should also check other related properties:
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
and set their values appropriately.
</description>
</property>
<property>
<name>http.agent.description</name>
<value>digvijay crawler</value>
<description>Further description of our bot- this text is used in
the User-Agent header. It appears in parenthesis after the agent name.
</description>
</property>
<property>
<name>http.agent.url</name>
<value>http://google.com</value>
<description>A URL to advertise in the User-Agent header. This will
appear in parenthesis after the agent name. Custom dictates that this
should be a URL of a page explaining the purpose and behavior of this
crawler.
</description>
</property>
<property>
<name>http.agent.email</name>
<value>digvijayy@it.iitb.ac.in</value>
<description>An email address to advertise in the HTTP 'From' request
header and User-Agent header. A good practice is to mangle this
address (e.g. 'info at example dot com') to avoid spamming.
</description>
</property>
<property>
<name>http.proxy.host</name>
<value>netmon.iitb.ac.in</value>
<description>The proxy hostname. If empty, no proxy is
used.</description>
</property>
<property>
<name>http.proxy.port</name>
<value>80</value>
<description>The proxy port.</description>
</property>
<property>
<name>http.proxy.username</name>
<value>xyz</value>
<description>Username for proxy. This will be used by
'protocol-httpclient', if the proxy server requests basic, digest
and/or NTLM authentication. To use this, 'protocol-httpclient' must
be present in the value of 'plugin.includes' property.
NOTE: For NTLM authentication, do not prefix the username with the
domain, i.e. 'susam' is correct whereas 'DOMAIN\susam' is incorrect.
</description>
</property>
<property>
<name>http.proxy.password</name>
<value>xyz</value>
<description>Password for proxy. This will be used by
'protocol-httpclient', if the proxy server requests basic, digest
and/or NTLM authentication. To use this, 'protocol-httpclient' must
be present in the value of 'plugin.includes' property.
</description>
</property>
<property>
<name>http.proxy.realm</name>
<value>Squid proxy-caching web server</value>
<description>Authentication realm for proxy. Do not define a value
if realm is not required or authentication should take place for any
realm. NTLM does not use the notion of realms. Specify the domain name
of NTLM authentication as the value for this property. To use this,
'protocol-httpclient' must be present in the value of
'plugin.includes' property.
</description>
</property>
<property>
<name>http.agent.host</name>
<value>10.129.30.14</value>
<description>Name or IP address of the host on which the Nutch crawler
would be running. Currently this is used by 'protocol-httpclient'
plugin.
</description>
</property>
<property>
<name>searcher.dir</name>
<value>crawl</value>
<description>
Path to root of crawl. This directory is searched (in
order) for either the file search-servers.txt, containing a list of
distributed search servers, or the directory "index" containing
merged indexes, or the directory "segments" containing segment
indexes.
</description>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|pdf|mp3|oo|msexcel|mspowerpoint|msword|pdf|rss|swf|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable
protocol-httpclient, but be aware of possible intermittent problems with
the underlying commons-httpclient library.
</description>
</property>
<property>
<name>http.timeout</name>
<value>1000000</value>
<description>The default network timeout, in milliseconds.</description>
</property>
</configuration>
Re: HTTP 407 - authentication problem on Nutch 0.8
Posted by Susam Pal <su...@gmail.com>.
Your configuration seems fine. Ideally http.agent.url should point to
a page where you describe your crawler, but that shouldn't cause an
error.
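For anyone who wants to repeat this check on their own file: the property descriptions in the posted nutch-site.xml say the proxy username and password are only used by 'protocol-httpclient', and only when it appears in 'plugin.includes'. Below is a minimal sketch of that check in Python. The function names and the conf/nutch-site.xml path are assumptions made for illustration, not part of Nutch, and passing the check does not guarantee the proxy will accept the credentials.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Proxy properties that appear in the posted nutch-site.xml.
PROXY_KEYS = ["http.proxy.host", "http.proxy.port",
              "http.proxy.username", "http.proxy.password"]


def read_properties(xml_source):
    """Parse Nutch-style <property> elements into a name -> value dict."""
    root = ET.fromstring(xml_source)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def proxy_auth_problems(props):
    """List the settings that would keep the proxy credentials from being used."""
    problems = [f"{key} is missing or empty"
                for key in PROXY_KEYS if not props.get(key)]
    if "protocol-httpclient" not in props.get("plugin.includes", ""):
        problems.append("plugin.includes does not mention protocol-httpclient")
    return problems


if __name__ == "__main__":
    conf_path = Path("conf/nutch-site.xml")  # assumed location of the file
    if conf_path.exists():
        found = proxy_auth_problems(read_properties(conf_path.read_bytes()))
        print("\n".join(found) if found else "proxy settings look complete")
    else:
        print(f"{conf_path} not found")
```

Run against the file posted in this thread, the check should report nothing missing, which matches the assessment above.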
If you are facing any problem, please post the relevant logs from
logs/hadoop.log and describe your problem in detail.
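One way to pull the relevant part out of a large log before posting is sketched below in Python. The exact wording of the log messages is not shown in this thread, so the pattern (the status code 407 or the word "proxy") is only an assumption and may need adjusting; the function name is likewise just illustrative.

```python
import re
from pathlib import Path

# Assumption: lines of interest mention the status code 407 or the word
# "proxy"; the real message format in hadoop.log is not shown in this thread.
PATTERN = re.compile(r"\b407\b|proxy", re.IGNORECASE)


def relevant_lines(lines, context=2):
    """Return matching lines plus `context` lines around each match, in order."""
    lines = [line.rstrip("\n") for line in lines]
    keep = set()
    for i, line in enumerate(lines):
        if PATTERN.search(line):
            start = max(0, i - context)
            end = min(len(lines), i + context + 1)
            keep.update(range(start, end))
    return [lines[i] for i in sorted(keep)]


if __name__ == "__main__":
    log_path = Path("logs/hadoop.log")  # path named in the reply above
    if log_path.exists():
        with log_path.open(errors="replace") as handle:
            for line in relevant_lines(handle):
                print(line)
    else:
        print(f"{log_path} not found")
```

A few lines of context are kept around each match because a stack trace or the URL being fetched usually sits next to the error line.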
Regards,
Susam Pal