Posted to user@nutch.apache.org by Teruhiko Kurosaka <Ku...@basistech.com> on 2006/03/09 21:08:29 UTC

Why does the crawler skip some files and scan others with the same suffix?

I placed a bunch of files in a directory under the Apache web server's
htdocs directory, and had Nutch crawl that directory.

But according to the output from the "nutch crawl" command, some files
were scanned while others were not.  For example, these were scanned:
jp5-fwroman_UTF8B.txt
jp5_EUCJP.html
jp5-UTF8.html
jp5-fwroman_SJIS.txt

These were not:
jp5-fwroman.ppt
jp5.ppt
jp5_EUCJP.txt
jp5_SJIS.html
jp5_SJIS.txt
jp5_UTF8B.txt

I understand why the .ppt files were skipped, since .ppt is filtered out by
the crawl-urlfilter.txt file, but I don't understand why some .txt and
.html files were scanned while the others weren't. (I modified the default
crawl-urlfilter.txt to replace the MY.DOMAIN.NAME placeholder and to add
"|rtf" to the list of suffixes to skip.)
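
One way to rule the filter file in or out is to replay its rules against
the skipped URLs by hand. The following is a minimal sketch in Python, not
Nutch code. It assumes that each rule is a "+" or "-" followed by a regular
expression, that the first rule found anywhere in the URL decides, and that
a URL matching no rule is ignored. SAMPLE_RULES and the example.com host are
hypothetical stand-ins for the real crawl-urlfilter.txt, and Python's regex
syntax may differ from Java's in corner cases.

```python
import re

# Hypothetical rules in the style of crawl-urlfilter.txt; substitute the
# contents of the real file when testing.
SAMPLE_RULES = r"""
# skip some suffixes (hypothetical list)
-\.(gif|jpg|png|ppt|rtf)$
# accept URLs under the hypothetical test host
+^http://([a-z0-9]*\.)*example\.com/
# skip everything else
-.
"""


def load_rules(text):
    """Parse rule lines into (accept, compiled_regex, source_line) tuples."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern), line))
    return rules


def first_match(rules, url):
    """Return (accepted, rule_line) for the first rule found in the URL.

    Assumption: a URL that matches no rule is treated as rejected.
    """
    for accept, regex, source in rules:
        if regex.search(url):
            return accept, source
    return False, None


if __name__ == "__main__":
    rules = load_rules(SAMPLE_RULES)
    for name in ("jp5.ppt", "jp5_SJIS.txt", "jp5_EUCJP.html"):
        url = "http://www.example.com/docs/" + name
        accepted, rule = first_match(rules, url)
        print(url, "ACCEPT" if accepted else "REJECT", "by", rule)
```

If every skipped .txt and .html URL comes out as accepted here, the filter
file is probably not what caused them to be skipped.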

How can I trace the reasons why the Nutch crawler decided to skip some
files?

-kuro