Posted to user@nutch.apache.org by karan thakral <ka...@gmail.com> on 2007/06/15 16:49:33 UTC

fetch failing while crawling

I am running a crawl under Cygwin on Windows, but the crawl output is not correct.

During the fetch step it reports that the document could not be fetched, with a Java runtime exception saying the agent is not configured.

My nutch-site.xml is as follows:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
  <name>http.agent.name</name>
  <value></value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

  http.robots.agents
  http.agent.description
  http.agent.url
  http.agent.email
  http.agent.version

  and set their values appropriately.

  </description>
</property>

<property>
  <name>http.agent.description</name>
  <value></value>
  <description>Further description of our bot- this text is used in
  the User-Agent header.  It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value></value>
  <description>A URL to advertise in the User-Agent header.  This will
   appear in parenthesis after the agent name. Custom dictates that this
   should be a URL of a page explaining the purpose and behavior of this
   crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value></value>
  <description>An email address to advertise in the HTTP 'From' request
   header and User-Agent header. A good practice is to mangle this
   address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>
</configuration>

But the error still occurs.

Could you also shed some light on how to search the crawled content through
the web interface once the crawl completes successfully?
-- 
With Regards
Karan Thakral

Re: fetch failing while crawling

Posted by Briggs <ac...@gmail.com>.
Oh and as for the web interface, take a look at the wiki page:

http://wiki.apache.org/nutch/NutchTutorial

The bottom of the page has a section on searching.
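
In short, the search webapp gets deployed into a servlet container and pointed at your crawl output. As a rough sketch (the /home/user/crawl path below is only a placeholder for wherever your crawl directory ends up), the webapp's nutch-site.xml would get a searcher.dir override along these lines:

<property>
  <name>searcher.dir</name>
  <!-- Placeholder path: point this at the directory produced by the crawl command. -->
  <value>/home/user/crawl</value>
  <description>Directory containing the crawl data that the search webapp
  should use.</description>
</property>

The tutorial covers the rest (deploying the Nutch web application and restarting the container), so follow it for the exact steps.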

-- 
"Conscious decisions by conscious minds are what make reality real"

Re: fetch failing while crawling

Posted by Briggs <ac...@gmail.com>.
Yeah, you still don't have the agent configured.  All of your agent values
are blank (each <value></value> element needs a value).  So, at the very
least, you need to configure an agent name.
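
For example (MyNutchSpider below is just a made-up name; use a single word that identifies your own crawler), the http.agent.name property in nutch-site.xml should end up looking something like:

<property>
  <name>http.agent.name</name>
  <!-- Made-up example value: replace with a single word identifying your crawler. -->
  <value>MyNutchSpider</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty.</description>
</property>

Filling in http.agent.description, http.agent.url and http.agent.email as well is good manners, but the empty http.agent.name is what triggers that exception.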



-- 
"Conscious decisions by conscious minds are what make reality real"