Posted to user@nutch.apache.org by reinhard schwab <re...@aon.at> on 2009/07/26 13:55:03 UTC
crawl-tool.xml
I tried Susam Pal's recrawl script and wondered why
URL filtering no longer works.
http://wiki.apache.org/nutch/Crawl
The explanation for the mystery is that
only Crawl.java adds crawl-tool.xml to the NutchConfiguration:
Configuration conf = NutchConfiguration.create();
conf.addResource("crawl-tool.xml");
Fetcher.java and all the other tools that filter the outlinks do not
add it, so they never see the settings in crawl-tool.xml.
This really confused me, and it took me some time to figure out.
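To make the effect concrete, here is a minimal sketch that models the layering with plain java.util.Properties instead of the real Hadoop Configuration. It assumes that the default config sets urlfilter.regex.file to regex-urlfilter.txt and that crawl-tool.xml overrides it with crawl-urlfilter.txt; please check your own conf directory. The class and method names are made up for illustration.

```java
import java.util.Properties;

public class ConfigOverlaySketch {

    // Stand-in for NutchConfiguration.create(): only the default resources are loaded.
    static Properties createDefaults() {
        Properties defaults = new Properties();
        // Assumed default value; verify against your nutch-default.xml.
        defaults.setProperty("urlfilter.regex.file", "regex-urlfilter.txt");
        return defaults;
    }

    // Stand-in for conf.addResource("crawl-tool.xml"): a resource added later
    // overrides values from the resources loaded before it.
    static Properties withCrawlToolOverlay(Properties base) {
        Properties overlay = new Properties(base); // base acts as the fallback
        // Assumed override; verify against your crawl-tool.xml.
        overlay.setProperty("urlfilter.regex.file", "crawl-urlfilter.txt");
        return overlay;
    }

    public static void main(String[] args) {
        // What a tool like Fetcher sees when it does not add crawl-tool.xml.
        Properties fetcherConf = createDefaults();
        // What Crawl sees after adding crawl-tool.xml.
        Properties crawlConf = withCrawlToolOverlay(createDefaults());

        System.out.println("Fetcher reads: " + fetcherConf.getProperty("urlfilter.regex.file"));
        System.out.println("Crawl reads:   " + crawlConf.getProperty("urlfilter.regex.file"));
    }
}
```

If that assumption holds, a recrawl script that calls the individual tools would end up filtering with regex-urlfilter.txt, which would explain why rules placed in crawl-urlfilter.txt appear to be ignored.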
regards
reinhard
Re: crawl-tool.xml
Posted by reinhard schwab <re...@aon.at>.
It's not only confusing me;
it's also confusing FrankMcCown, the author of the Nutch tutorial:
http://wiki.apache.org/nutch/NutchTutorial
Crawl Command: Configuration
To configure things for the crawl command you must:
* Create a directory with a flat file of root urls. For example, to
  crawl the nutch site you might start with a file named urls/nutch
  containing the url of just the Nutch home page. All other Nutch
  pages should be reachable from this page. The urls/nutch file
  would thus contain:
    http://lucene.apache.org/nutch/
* Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME
  with the name of the domain you wish to crawl. For example, if you
  wished to limit the crawl to the apache.org domain, the line
  should read:
    +^http://([a-z0-9]*\.)*apache.org/
  This will include any url in the domain apache.org.
* Until someone can explain this: when I use the file
  crawl-urlfilter.txt the filter doesn't work. Instead, use the file
  conf/regex-urlfilter.txt and change the last line from "+." to "-."
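The workaround in the last bullet can be illustrated with a small sketch of how such a rule file behaves. It uses only java.util.regex and is not the Nutch filter plugin itself; it assumes the usual semantics that rules are tried top to bottom, the first matching rule decides, and a URL that matches no rule is rejected. The class name and helper methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class UrlFilterSketch {

    // One rule line: '+' (accept) or '-' (reject), followed by a regular expression.
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(String line) {
            this.accept = line.charAt(0) == '+';
            this.pattern = Pattern.compile(line.substring(1));
        }
    }

    static List<Rule> parse(String... lines) {
        List<Rule> rules = new ArrayList<>();
        for (String line : lines) {
            rules.add(new Rule(line));
        }
        return rules;
    }

    // Assumed semantics: first matching rule wins; no match means reject.
    static boolean accepts(List<Rule> rules, String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The domain rule from the tutorial, followed by "-." as the last line.
        List<Rule> rules = parse("+^http://([a-z0-9]*\\.)*apache.org/", "-.");

        System.out.println(accepts(rules, "http://lucene.apache.org/nutch/"));
        System.out.println(accepts(rules, "http://example.com/"));
    }
}
```

With "+." as the last line instead, the second URL would be accepted as well, which is why the note asks to change it to "-." when restricting a crawl to one domain.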
reinhard schwab wrote:
> i have tried the recrawl script of susam pal and have wondered why
> url filtering no longer works.
> http://wiki.apache.org/nutch/Crawl
>
> the mystery is
>
> only Crawl.java adds crawl-tool.xml to the NutchConfiguration.
>
> Configuration conf = NutchConfiguration.create();
> conf.addResource("crawl-tool.xml");
>
> Fetcher.java and all the other tools which filter the outlinks do not
> add this.
> this is really confusing me and i have spent some time to figure this out.
>
> regards
> reinhard