Posted to user@nutch.apache.org by John George <jt...@yahoo.com> on 2006/10/30 01:50:06 UTC

On local file system crawl, why does nutch crawl parent directories?

I'm crawling a directory on my local Windows file
system. However, Nutch crawls all of the top-level
directories on my C: drive, not just the directory I
told it to crawl. Is this a bug or expected behavior?
If it is expected behavior, why?

I created a directory of sample documents at
c:\nutch-0.8.1\input. This directory contains a single
Word document and a subdirectory with additional Word
and PDF documents. I also created a single URL file,
which I pass to the crawler. It has the following
entry: file:///C:/nutch-0.8.1/input

During the crawl, I notice 404 errors. For example:

	fetching file:/C:/nutch-0.8.1/DOC1.doc
	fetching file:/C:/nutch-0.8.1/fin/
	fetch of file:/C:/nutch-0.8.1/fin/ failed with: org.apache.nutch.protocol.file.FileError: File Error: 404
	fetch of file:/C:/nutch-0.8.1/DOC1.doc failed with: org.apache.nutch.protocol.file.FileError: File Error: 404

	
Fetcher: done


Why is Nutch looking for "DOC1.doc" in C:\nutch-0.8.1?
Where did it get the idea to look for that document in
the wrong location? It should only look for it in
c:\nutch-0.8.1\input (and it does eventually find it
there).
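
One possible explanation for the misplaced DOC1.doc lookups can be
sketched as follows. This assumes (it is not confirmed here) that
Nutch resolves the relative links in a directory listing against the
seed URL using standard URL resolution rules. Under those rules, a
base URL without a trailing slash has its last path segment ("input")
treated as a file name and dropped. Python's urljoin is used only to
illustrate the rule; it is not what Nutch runs, and the trailing-slash
variant of the seed is a hypothetical for comparison.

```python
from urllib.parse import urljoin

# Seed URL exactly as written in the url file (no trailing slash).
seed = "file:///C:/nutch-0.8.1/input"
# Hypothetical variant with a trailing slash, for comparison only.
seed_with_slash = "file:///C:/nutch-0.8.1/input/"

# A relative link as it might appear in a generated directory listing.
link = "DOC1.doc"

# Without the trailing slash, "input" is replaced by the link target.
without_slash = urljoin(seed, link)
# With the trailing slash, the link resolves inside the directory.
with_slash = urljoin(seed_with_slash, link)
# A parent-directory link ("../") points above the crawl root either way.
parent = urljoin(seed_with_slash, "../")

print(without_slash)
print(with_slash)
print(parent)
```

If this is what is happening, the first printed URL points at
C:/nutch-0.8.1/DOC1.doc, which matches the 404 in the fetch log, and
the third shows how a parent link in a listing could lead the crawl
upward toward the drive root.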

After the crawl is finished, I can see that top-level
C: folders and documents were picked up by the crawl.
For example, in the crawldb's dump, here is an entry
that should not have been crawled:

	file:/C:/CONFIG.SYS	Version: 4
	Status: 1 (DB_unfetched)
	Fetch time: Sun Oct 29 00:56:27 PDT 2006
	Modified time: Wed Dec 31 16:00:00 PST 1969
	Retries since fetch: 0
	Retry interval: 30.0 days
	Score: 0.004761905
	Signature: null
	Metadata: null

Why did C:\Config.sys get crawled when I specified the
crawl directory as c:\nutch-0.8.1\input?

For what it's worth, I've set the following in my
crawl-urlfilter.txt: 

+^file://*
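
For reference, here is a minimal sketch of what that pattern matches
when read as a regular expression. It assumes the filter accepts a URL
whenever the pattern is found in it; the helper `accepted` and the
tighter `narrow` pattern are hypothetical and only for illustration.
Read this way, `^file://*` means "file:/" followed by zero or more
extra slashes, so it accepts every file: URL, including those outside
the input directory.

```python
import re

# The rule from crawl-urlfilter.txt, without its leading "+" marker.
broad = re.compile(r"^file://*")
# Hypothetical tighter rule limited to the crawl directory (an assumption,
# not taken from any Nutch documentation).
narrow = re.compile(r"^file:/+C:/nutch-0\.8\.1/input(/|$)")

# URLs taken from the fetch log and crawldb dump, plus the expected location.
urls = [
    "file:/C:/nutch-0.8.1/input/DOC1.doc",
    "file:/C:/nutch-0.8.1/DOC1.doc",
    "file:/C:/CONFIG.SYS",
]

def accepted(pattern, url):
    # Assumption: a URL passes when the pattern is found anywhere in it.
    return pattern.search(url) is not None

broad_hits = [u for u in urls if accepted(broad, u)]
narrow_hits = [u for u in urls if accepted(narrow, u)]

print(broad_hits)   # all three URLs pass the broad rule
print(narrow_hits)  # only the URL under input/ passes the narrow rule
```

Under these assumptions, the current rule does nothing to keep the
crawl inside c:\nutch-0.8.1\input, whereas a rule anchored on the full
directory path would reject the parent folders and C:\CONFIG.SYS.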


Finally, someone posted a very helpful resource on
crawling the local filesystem:
http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch

In item #7, this person suggests changing the
org.apache.nutch.protocol.file.FileResponse.getDirAsHttpResponse(java.io.File f)
method and recompiling to get rid of this behavior.


Thank you,
John 


 