Posted to user@nutch.apache.org by Wolf Fischer <Wo...@informatik.uni-augsburg.de> on 2009/04/02 17:00:47 UTC
Problem with Crawler and Parent Directories
Hi there,
I'm currently trying to use Nutch to crawl a local file directory. The URL
of the directory looks like the following:
file:///C:/test/
In crawl-urlfilter.txt I added +.* for testing purposes; however, this
resulted in the famous "bug" of also crawling through the parent
directories. So I looked into the FAQ as well as the mailing list
archive and found the suggested solution: simply add something like
+^file:///c:/top/directory/^
-.
to the urlfilter.txt. So I did:
+^file:///c:/test/
-.
However, with these rules the fetcher does not get any URLs at all and
immediately exits with "no more URLs to fetch."
I have no idea why this is not working. I've tried several other solutions
and simply can't get it to work the way I want it to. Can somebody
please give me a hint on what I am doing wrong?
Thanks in advance!
Wolf
Re: Problem with Crawler and Parent Directories
Posted by Hannu Väisänen <hv...@joyx.joensuu.fi>.
On Thu, Apr 02, 2009 at 05:00:47PM +0200, Wolf Fischer wrote:
> +^file:///c:/test/
> -.
Try this:
+^file:///c:/test/
+^file:/c:/test/
-.
That is, include both the three-slash and the one-slash form after "file:".
That worked for me.
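A plausible explanation for why both forms are needed: the crawler may normalize file: URLs to a single-slash form internally, so a rule anchored on file:/// alone never matches (this is an assumption based on the behavior described above, not a statement about Nutch internals). Here is a minimal sketch, not Nutch's actual RegexURLFilter, of how a first-match-wins filter list like the one in the reply would behave:

```python
import re

# First-match-wins rule list, mirroring the reply's crawl-urlfilter.txt:
# '+' accepts the URL, '-' rejects it; rules are tried top to bottom.
rules = [
    ("+", re.compile(r"^file:///c:/test/")),
    ("+", re.compile(r"^file:/c:/test/")),
    ("-", re.compile(r".")),
]

def accepts(url):
    """Return True if the first rule matching `url` is a '+' rule."""
    for sign, pattern in rules:
        if pattern.search(url):
            return sign == "+"
    return False

# Whichever slash form the crawler presents, one of the '+' rules matches:
print(accepts("file:///c:/test/readme.txt"))  # True  (three-slash form)
print(accepts("file:/c:/test/readme.txt"))    # True  (one-slash form)
print(accepts("file:///c:/other/"))           # False (caught by '-.')
```

With only the three-slash rule present, the one-slash URLs would fall through to the catch-all `-.` rule and be rejected, which matches the "no more URLs to fetch" symptom.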