Posted to dev@nutch.apache.org by Piotr Kosiorowski <pk...@gmail.com> on 2005/08/08 14:37:59 UTC

Tutorial

Hello,
Some time ago someone mentioned on the list a problem with the Nutch
tutorial (I cannot find that email now). I checked it today and
he/she was right. If you follow the Nutch Intranet Crawling tutorial
you will end up with a not very interesting index.
This is because it tells users to set up the urlfilter and the urls file
for the nutch.org domain, but www.nutch.org redirects to
http://lucene.apache.org/nutch, so all links are rejected by the
urlfilter.

So I suggest changing it as follows:
the urls file will contain: http://lucene.apache.org/nutch
crawl-urlfilter.txt will contain:
+^http://([a-z0-9]*\.)*apache.org/
I would also add pdf and png to the list of rejected extensions in the
crawl-urlfilter.txt file so users are not confused by errors in the
log file. The pdf parsing plugin is disabled in the default configuration.
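
To illustrate the effect, here is a minimal Python sketch of how such
+/- regex rules behave. It is not Nutch code: parse_rules, accepts,
OLD_RULES and NEW_RULES are hypothetical names, the rule sets are
simplified stand-ins for crawl-urlfilter.txt, and it assumes that the
first matching rule decides, that patterns may match anywhere in the
URL, and that a URL matching no rule is rejected.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines into (accept, pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Assumption: every remaining line starts with '+' or '-'.
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """First matching rule decides; no match means the URL is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):  # assumption: match anywhere in the URL
            return accept
    return False


# Simplified stand-in for the filter the tutorial currently recommends.
OLD_RULES = parse_rules(r"""
+^http://([a-z0-9]*\.)*nutch.org/
-.
""")

# Simplified stand-in for the proposed filter, with pdf and png rejected.
NEW_RULES = parse_rules(r"""
-\.(pdf|png)$
+^http://([a-z0-9]*\.)*apache.org/
-.
""")

# The logo.png URL is a made-up example, not a known page.
for url in (
    "http://www.nutch.org/",
    "http://lucene.apache.org/nutch",
    "http://lucene.apache.org/nutch/logo.png",
):
    print(url, accepts(OLD_RULES, url), accepts(NEW_RULES, url))
```

Under these assumptions the old rules accept only the www.nutch.org
start page and reject everything it redirects to, while the new rules
accept pages under lucene.apache.org and skip the png link.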
I can commit these changes for the 0.7 release (which means today) if I get
positive feedback from other committers.
Regards
Piotr

Re: Tutorial

Posted by Andrzej Bialecki <ab...@getopt.org>.

Piotr Kosiorowski wrote:

> I can commit these changes for the 0.7 release (which means today) if I get
> positive feedback from other committers.

+1


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: Tutorial

Posted by Doug Cutting <cu...@nutch.org>.

+1
