Posted to user@nutch.apache.org by oh...@cox.net on 2009/07/16 22:57:46 UTC

Question about crawling local filesystem and directories

Hi,

I know that there is an issue with Nutch when crawling the local filesystem, where it also crawls the parent directory.

So I ran a test today in which I put the directory that I actually wanted to crawl inside a parent directory, e.g.:

/testfiles
/testfiles/foo ==> contained content to be crawled.

My thinking was that, even with the Nutch issue, it would just crawl the /testfiles directory, which was otherwise empty, and we'd be OK.

However, when I reviewed the Nutch log, I saw that it was also fetching directories like /opt/, /tmp/, etc. Mind you, it didn't fetch any of the CONTENTS of those directories, but it did fetch the directories themselves.

Has anyone else noticed this behavior? 

Also, would the suggested change to "org.apache.nutch.protocol.file.FileResponse.getDirAsHttpResponse" at:

http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch

fix this problem?

Thanks,
Jim