Posted to user@nutch.apache.org by Paul M Lieberman <pa...@alum.mit.edu> on 2006/09/15 22:52:43 UTC
nutch-0.8 intranet crawls & logs
I've just switched from nutch-0.7.2 to nutch-0.8.
I'm attempting to do an intranet crawl of a single site. The setup I've
used in nutch-0.7.2 translates well to nutch-0.8 with two exceptions:
1. the crawl is no longer staying within the website. Why?
The single text file in my url directory contains the root URL:
http://www.psychologymatters.org/
and conf/crawl-urlfilter.txt has one line for accepting hosts:
+^http://([a-z0-9]*\.)*psychologymatters.org/
and
-.
to skip all else. So, why does nutch-0.8 pursue links outside this domain?
Here's how I invoke the crawl:
nohup bin/nutch crawl url -dir /d01/nutch/psychologymatters9 -depth 9 >&
logs/psychologymatters9.log &
2. The other question relates to log files. As you see above, I want to
redirect to a log file specific to this crawl. In nutch-0.7.2, it does
just that, but with nutch-0.8, all log messages are appended to
logs/hadoop.log. How can I change this?
- Paul M Lieberman
American Psychological Association
Re: nutch-0.8 intranet crawls & logs
Posted by Zaheed Haque <za...@gmail.com>.
On 9/15/06, Paul M Lieberman <pa...@alum.mit.edu> wrote:
> I've just switched from nutch-0.7.2 to nutch-0.8.
>
> I'm attempting to do an intranet crawl of a single site. The setup I've
> used in nutch-0.7.2 translates well to nutch-0.8 with two exceptions:
>
> 1. the crawl is no longer staying within the website. Why?
> The single text file in my url directory contains the root URL:
> http://www.psychologymatters.org/
> and conf/crawl-urlfilter.txt has one line for accepting hosts:
> +^http://([a-z0-9]*\.)*psychologymatters.org/
> and
> -.
> to skip all else. So, why does nutch-0.8 pursue links outside this domain?
> Here's how I invoke the crawl:
This should work; I cannot reproduce this behavior on 0.8. Is your
regex-urlfilter the same as the crawl-urlfilter? Just wondering.
You might also want to set the following property to true in
nutch-site.xml:
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>
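To rule out the accept pattern itself, it can help to run a few sample URLs through it outside Nutch. The following is only a rough sketch: it assumes grep -E treats this particular pattern the same way Nutch's regex filter does, the function name check_url is made up for illustration, and www.example.com is a hypothetical external link, not one from the actual crawl.

```shell
#!/bin/sh
# Accept pattern copied from the "+" line in conf/crawl-urlfilter.txt above.
ACCEPT='^http://([a-z0-9]*\.)*psychologymatters.org/'

# Mimic the two rules: the "+pattern" line accepts a URL,
# and the trailing "-." line rejects everything else.
# check_url is a hypothetical helper, not part of Nutch.
check_url() {
  if printf '%s\n' "$1" | grep -Eq "$ACCEPT"; then
    echo "accept $1"
  else
    echo "reject $1"
  fi
}

check_url 'http://www.psychologymatters.org/'
check_url 'http://www.example.com/'   # hypothetical external link
```

If the external URL is rejected here but still fetched by the crawl, the pattern is probably not the culprit, and it is worth checking which filter file the crawl is actually reading.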
> nohup bin/nutch crawl url -dir /d01/nutch/psychologymatters9 -depth 9 >&
> logs/psychologymatters9.log &
>
> 2. The other question relates to log files. As you see above, I want to
> redirect to a log file specific to this crawl. In nutch-0.7.2, it does
> just that, but with nutch-0.8, all log messages are appended to
> logs/hadoop.log. How can I change this?
You need to edit the file conf/log4j.properties. There are a bunch of
options you can tweak. Please refer to the log4j documentation for
details:
http://logging.apache.org/log4j/docs/documentation.html
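As a rough illustration only, a log4j 1.x configuration that sends all output to a crawl-specific file could look like the fragment below. The appender name "crawlfile" and the conversion pattern are arbitrary choices for this sketch, and the file path is borrowed from the redirect in the original command; the stock conf/log4j.properties shipped with nutch-0.8 may be organized differently, so treat this as a starting point rather than a drop-in replacement.

```properties
# Sketch: route all INFO-and-above messages to one file for this crawl.
# "crawlfile" is an illustrative appender name, not a Nutch default.
log4j.rootLogger=INFO,crawlfile

# Plain file appender from log4j 1.x; the path is taken from the
# redirect target in the original crawl command.
log4j.appender.crawlfile=org.apache.log4j.FileAppender
log4j.appender.crawlfile.File=logs/psychologymatters9.log

# Timestamp, level, short logger name, message.
log4j.appender.crawlfile.layout=org.apache.log4j.PatternLayout
log4j.appender.crawlfile.layout.ConversionPattern=%d{ISO8601} %-5p %c{2} - %m%n
```

Because the file name is hard-coded here, it would have to be edited before each crawl.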