Posted to user@nutch.apache.org by Godmar Back <go...@gmail.com> on 2010/01/07 02:04:56 UTC

Nutch crawls parent directories and ignores the url filters added to prevent this in crawl-urlfilter.txt

... if you followed the wrong instructions in the old FAQ, which I took the
liberty of correcting:

http://wiki.apache.org/nutch/FAQ?action=diff&rev1=113&rev2=115

I am proud to report that Nutch has now indexed an entire directory of PDF
files and actually returns search results.

 - Godmar

keyword: nutch crawls parent directories, indexing local filesystem,
urlfilter-regexp, plugin.include