Posted to user@nutch.apache.org by d e <cr...@gmail.com> on 2007/03/10 10:59:59 UTC

Oops! Nothing fetched when attempting to crawl anything other than the Apache site!

I am a VERY new Nutch user. I thought I had made some progress when I was
able to crawl the Apache site. The problem is I have *not* been able to
crawl anything else.

The crawl command fires up and produces some console output, but nothing is
ever actually fetched. I know this because the lines "fetching: http...."
that occur when crawling the Apache site never appear - and of course I
don't get any hits when attempting to search my resulting database.

What could be wrong?

Here are the urls that worked for me:

http://lucene.apache.org/
http://lucene.apache.org/Nutch/

Here are the ones that did not:

http://www.birminghamfreepress.com/
http://www.bhamnews.com/

http://www.irs.gov

Am I setting up these links correctly?
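
(In case it matters: the seed list is just a plain text file in the urls
directory that gets passed to the crawl command, one URL per line, roughly
like this:)

http://www.birminghamfreepress.com/
http://www.bhamnews.com/
http://www.irs.gov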


There is one thing I did a bit differently. I put my input url directories
and output crawl directories outside of the Nutch home directory, and used a
symbolic link to switch which of the outputs would be the active 'searcher'
directory. This is the purpose of the first property below in my
nutch-site.xml. Could that be my problem?
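
(Roughly, the setup looks like this - the dated crawl directory name is just
an example, and searchdir is the symlink that the searcher.dir property
points at:)

# crawl-20070310 is a made-up example name for one crawl output
ln -sfn /home/clipper/crawl/crawl-20070310 /home/clipper/crawl/searchdir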

What follows is the text of my config file.

Thanks for your help!


<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->
<configuration>

<property>
 <name>searcher.dir</name>
 <value>/home/clipper/crawl/searchdir</value>
 <description>
   Path to root of crawl - searcher looks here to find its index
   (oversimplified description: see nutch-default.xml)
 </description>
</property>



<!-- file properties -->

<property>
  <name>file.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be
  truncated; otherwise, no truncation at all.
  </description>
</property>

<!-- HTTP properties -->

<property>
  <name>http.agent.name</name>
  <value>newscrawler</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.
  </description>
</property>

<property>
  <name>http.robots.agents</name>
  <value>clipper,*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>

<property>
  <name>http.agent.description</name>
  <value>news search engine</value>
  <description>Further description of our bot - this text is used in
  the User-Agent header.  It appears in parentheses after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value>http://decisionsmith.com</value>
  <description>A URL to advertise in the User-Agent header.  This will
   appear in parentheses after the agent name. Custom dictates that this
   should be a URL of a page explaining the purpose and behavior of this
   crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value>clipper twenty nine at gmail dot com</value>
  <description>An email address to advertise in the HTTP 'From' request
   header and User-Agent header. A good practice is to mangle this
   address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>

<property>
  <name>http.verbose</name>
  <value>false</value>
  <description>If true, HTTP will log more verbosely.</description>
</property>

<!-- web db properties -->

<property>
  <name>db.default.fetch.interval</name>
  <value>1</value>
  <description>The default number of days between re-fetches of a page.
  </description>
</property>

<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
  <description>If true, when adding new links to a page, links from
  the same host are ignored.  This is an effective way to limit the
  size of the link database, keeping only the highest quality
  links.
  </description>
</property>

<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>

</configuration>

Re: Oops! Nothing fetched when attempting to crawl anything other than the Apache site!

Posted by d e <cr...@gmail.com>.
Thanks folks! Very helpful!

On 3/10/07, Michael Wechner <mi...@wyona.com> wrote:
>
> Have you added your domains to the url filters?
>
> HTH
>
> Michael

Re: Oops! Nothing fetched when attempting to crawl anything other than the Apache site!

Posted by Michael Wechner <mi...@wyona.com>.
d e wrote:

> I am a VERY new Nutch user. I thought I had made some progress when I was
> able to crawl the Apache site. The problem is I have *not* been able to
> crawl anything else.
>
> The crawl command fires up and produces some console output, but nothing is
> ever actually fetched. I know this because the lines "fetching: http...."
> that occur when crawling the Apache site never appear - and of course I
> don't get any hits when attempting to search my resulting database.
>
> What could be wrong?


Have you added your domains to the url filters?
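
(With the one-shot "crawl" command that filter is normally
conf/crawl-urlfilter.txt, and the stock file ends with a catch-all "-." rule
that rejects anything not matched by an earlier "+" line - which would also
explain why a filter set up for apache.org fetches apache.org and nothing
else. As a rough sketch, using the hostnames from your mail, the accept
lines would look something like this:)

# accept the seeded hosts, alongside whatever "+" line made the apache crawl work
+^http://([a-z0-9]*\.)*birminghamfreepress.com/
+^http://([a-z0-9]*\.)*bhamnews.com/
+^http://([a-z0-9]*\.)*irs.gov/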

HTH

Michael



-- 
Michael Wechner
Wyona      -   Open Source Content Management   -    Apache Lenya
http://www.wyona.com                      http://lenya.apache.org
michael.wechner@wyona.com                        michi@apache.org
+41 44 272 91 61