Posted to user@nutch.apache.org by Paul M Lieberman <pa...@alum.mit.edu> on 2006/09/15 22:52:43 UTC

nutch-0.8 intranet crawls & logs

I've just switched from nutch-0.7.2 to nutch-0.8.

I'm attempting to do an intranet crawl of a single site. The setup I've 
used in nutch-0.7.2 translates well to nutch-0.8 with two exceptions:

1. the crawl is no longer staying within the website. Why?

The single text file in my url directory contains the root URL:
http://www.psychologymatters.org/
and conf/crawl-urlfilter.txt has one line for accepting hosts:
+^http://([a-z0-9]*\.)*psychologymatters.org/

and
-.
to skip all else. So, why does nutch-0.8 pursue links outside this domain?
Here's how I invoke the crawl:
nohup bin/nutch crawl url -dir /d01/nutch/psychologymatters9 -depth 9 >& 
logs/psychologymatters9.log &

2. The other question relates to log files. As you see above, I want to 
redirect to a log file specific to this crawl. In nutch-0.7.2, it does 
just that, but with nutch-0.8, all log messages are appended to 
logs/hadoop.log. How can I change this?

- Paul M Lieberman
American Psychological Association

Re: nutch-0.8 intranet crawls & logs

Posted by Zaheed Haque <za...@gmail.com>.

On 9/15/06, Paul M Lieberman <pa...@alum.mit.edu> wrote:
> I've just switched from nutch-0.7.2 to nutch-0.8.
>
> I'm attempting to do an intranet crawl of a single site. The setup I've
> used in nutch-0.7.2 translates well to nutch-0.8 with two exceptions:
>
> 1. the crawl is no longer staying within the website. Why?

> The single text file in my url directory contains the root URL:
> http://www.psychologymatters.org/
> and conf/crawl-urlfilter.txt has one line for accepting hosts:
> +^http://([a-z0-9]*\.)*psychologymatters.org/

> and
> -.
> to skip all else. So, why does nutch-0.8 pursue links outside this domain?
> Here's how I invoke the crawl:

This should work; I cannot reproduce this bug on 0.8. Is your
regex-urlfilter.txt the same as your crawl-urlfilter.txt? Just wondering.

You might also want to set the following property to true in
nutch-site.xml:
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>
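
To rule out the filter rules themselves, here is a minimal standalone
sketch (not Nutch code) that mimics how I understand the regex URL filter
to work: rules are tried in order, the first pattern found in the URL
decides, and a URL matching no rule is rejected. The class name
UrlFilterCheck, the accepts() helper, and the example.com sample URL are
made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class UrlFilterCheck {

    // Hypothetical helper: returns true if the first rule whose pattern is
    // found in the URL starts with '+', false if it starts with '-' or if
    // no rule matches at all.
    public static boolean accepts(List<String> rules, String url) {
        for (String rule : rules) {
            boolean include = rule.charAt(0) == '+';
            Pattern pattern = Pattern.compile(rule.substring(1));
            if (pattern.matcher(url).find()) {
                return include;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The two rules quoted from conf/crawl-urlfilter.txt above.
        List<String> rules = Arrays.asList(
            "+^http://([a-z0-9]*\\.)*psychologymatters.org/",
            "-.");

        String[] samples = {
            "http://www.psychologymatters.org/",
            "http://www.psychologymatters.org/index.html",
            "http://www.example.com/"   // assumed external URL
        };
        for (String url : samples) {
            System.out.println((accepts(rules, url) ? "accept " : "reject ") + url);
        }
    }
}
```

If the external URL is rejected here, the rules are fine and the problem
is more likely that a different filter file is being read.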

> nohup bin/nutch crawl url -dir /d01/nutch/psychologymatters9 -depth 9 >&
> logs/psychologymatters9.log &
>
> 2. The other question relates to log files. As you see above, I want to
> redirect to a log file specific to this crawl. In nutch-0.7.2, it does
> just that, but with nutch-0.8, all log messages are appended to
> logs/hadoop.log. How can I change this?

You need to edit the file conf/log4j.properties. There are a bunch of
options you can tweak. Please refer to the log4j documentation for
details.

http://logging.apache.org/log4j/docs/documentation.html
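
As a rough sketch of what such an edit might look like, assuming log4j
1.x property syntax: the appender name "crawl" is made up, and the
log4j.properties shipped with Nutch may already define its own appenders
that you would adapt instead.

```properties
# Hypothetical appender named "crawl"; send everything at INFO and above to it.
log4j.rootLogger=INFO, crawl

# Write to a file specific to this crawl instead of logs/hadoop.log.
log4j.appender.crawl=org.apache.log4j.FileAppender
log4j.appender.crawl.File=logs/psychologymatters9.log
log4j.appender.crawl.layout=org.apache.log4j.PatternLayout
log4j.appender.crawl.layout.ConversionPattern=%d{ISO8601} %-5p %c{2} - %m%n
```

Note that this changes the log file for every run that uses this conf
directory, so the file name would need editing for each crawl.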

> - Paul M Lieberman
> American Psychological Association
>