Posted to user@nutch.apache.org by karan thakral <ka...@gmail.com> on 2007/06/15 16:49:33 UTC
fetch failing while crawling
I'm running the crawl under Cygwin on Windows,
but the crawl output is not correct.
During fetch it says: fetch: the document could not be fetched, java runtime
exception, agent not configured.
My nutch-site.xml is as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value></value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
NOTE: You should also check other related properties:
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
and set their values appropriately.
</description>
</property>
<property>
<name>http.agent.description</name>
<value></value>
<description>Further description of our bot- this text is used in
the User-Agent header. It appears in parenthesis after the agent name.
</description>
</property>
<property>
<name>http.agent.url</name>
<value></value>
<description>A URL to advertise in the User-Agent header. This will
appear in parenthesis after the agent name. Custom dictates that this
should be a URL of a page explaining the purpose and behavior of this
crawler.
</description>
</property>
<property>
<name>http.agent.email</name>
<value></value>
<description>An email address to advertise in the HTTP 'From' request
header and User-Agent header. A good practice is to mangle this
address (e.g. 'info at example dot com') to avoid spamming.
</description>
</property>
</configuration>
But the error is still there.
Also, please shed some light on searching for information through the web
interface once the crawl succeeds.
--
With Regards
Karan Thakral
Re: fetch failing while crawling
Posted by Briggs <ac...@gmail.com>.
Oh and as for the web interface, take a look at the wiki page:
http://wiki.apache.org/nutch/NutchTutorial
The bottom of the page has a section on searching.
--
"Conscious decisions by conscious minds are what make reality real"
Re: fetch failing while crawling
Posted by Briggs <ac...@gmail.com>.
Yeah, you still don't have the agent configured. All your values for
the agent are blank (each <value></value> needs a value). So, you need
to at least configure an agent name.
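As a minimal sketch of what that could look like in nutch-site.xml, assuming a made-up agent name (MyNutchCrawler is only a placeholder, not a value from the original post), the first property would become something like:

```xml
<property>
  <name>http.agent.name</name>
  <!-- "MyNutchCrawler" is a hypothetical placeholder; use a single word
       uniquely related to your own organization -->
  <value>MyNutchCrawler</value>
</property>
```

The other http.agent.* properties (description, url, email) can be filled in the same way.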
--
"Conscious decisions by conscious minds are what make reality real"