Posted to user@nutch.apache.org by d e <cr...@gmail.com> on 2007/03/10 10:13:50 UTC

Nothing Fetched when attempting to crawl other than the apache site!

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->
<configuration>

<property>
 <name>searcher.dir</name>
 <value>/home/clipper/crawl/searchdir</value>
 <description>
   Path to root of crawl - searcher looks here to find its index
   (oversimplified description: see nutch-default.xml)
 </description>
</property>



<!-- file properties -->

<property>
  <name>file.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than this limit will be
  truncated; otherwise, no truncation at all.
  </description>
</property>

<!-- HTTP properties -->

<property>
  <name>http.agent.name</name>
  <value>newscrawler</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.
  </description>
</property>

<property>
  <name>http.robots.agents</name>
  <value>clipper,*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>

<property>
  <name>http.agent.description</name>
  <value>news search engine</value>
  <description>Further description of our bot - this text is used in
  the User-Agent header. It appears in parentheses after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value>http://decisionsmith.com</value>
  <description>A URL to advertise in the User-Agent header.  This will
   appear in parentheses after the agent name. Custom dictates that this
   should be a URL of a page explaining the purpose and behavior of this
   crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value>clipper twenty nine at gmail dot com</value>
  <description>An email address to advertise in the HTTP 'From' request
   header and User-Agent header. A good practice is to mangle this
   address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>

<property>
  <name>http.verbose</name>
  <value>false</value>
  <description>If true, HTTP will log more verbosely.</description>
</property>

<!-- web db properties -->

<property>
  <name>db.default.fetch.interval</name>
  <value>1</value>
  <description>The default number of days between re-fetches of a page.
  </description>
</property>

<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
  <description>If true, when adding new links to a page, links from
  the same host are ignored.  This is an effective way to limit the
  size of the link database, keeping only the highest quality
  links.
  </description>
</property>

<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>

</configuration>

Re: Nothing Fetched when attempting to crawl other than the apache site!

Posted by rubdabadub <ru...@gmail.com>.
You need to know regular expressions.
You need to edit the file crawl-urlfilter.txt under conf (or the file
regex-urlfilter.txt, depending on how you run the crawl) so that it
reflects the sites you plan to crawl.
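
For example - just a sketch, using decisionsmith.com from the config above
purely as a placeholder for whichever sites you actually want to fetch - the
accept rule in conf/crawl-urlfilter.txt would look something like:

# accept hosts in the domain you want to crawl
+^http://([a-z0-9]*\.)*decisionsmith.com/

# skip everything else
-.

Any URL that does not match a + rule before the final "-." line is dropped,
so only hosts listed in a + rule will ever be fetched. (If I recall
correctly, crawl-urlfilter.txt is the one read by the one-step
bin/nutch crawl command, while regex-urlfilter.txt is used when you run the
inject/generate/fetch steps individually.)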

Regards
