Posted to user@nutch.apache.org by 盖世豪侠 <ma...@gmail.com> on 2006/02/03 14:13:51 UTC

How to crawl only a specific type of file?

Nutch always crawls from a parsed file to the URLs contained in that
file. However, if we want to crawl only a specific type of file (e.g. RSS
files), this causes some difficulty. Links to the actual RSS files are usually
contained in HTML/HTM entry pages, so there are no direct URLs from
one RSS file to another. If we want to index RSS files, we have to index
many HTML/HTM files first.
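
To make the problem concrete, here is a minimal sketch of the behaviour I am
after, written in plain Python rather than against Nutch's API. The function
name and the suffix lists are hypothetical, and deciding by URL suffix alone
is an assumption (a real crawler would more likely check the content type):
HTML/HTM pages are followed for their links but not indexed, while RSS files
are indexed but not followed.

```python
from urllib.parse import urlparse

# Hypothetical suffix lists, chosen only for illustration.
FOLLOW_SUFFIXES = (".html", ".htm", "/")   # pages worth parsing for links
INDEX_SUFFIXES = (".rss",)                 # files we actually want indexed


def classify(url):
    """Return (follow, index) for a URL, judged by its path suffix alone."""
    # An empty path (e.g. "http://example.com") is treated as the root page.
    path = urlparse(url).path.lower() or "/"
    follow = path.endswith(FOLLOW_SUFFIXES)
    index = path.endswith(INDEX_SUFFIXES)
    return follow, index


if __name__ == "__main__":
    for u in ("http://example.com/news/index.html",
              "http://example.com/feeds/news.rss",
              "http://example.com/logo.png"):
        print(u, classify(u))
```

With this split, the entry pages only serve link discovery, and only the RSS
files would end up in the index.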

--
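《盖世豪侠》 drew rave reviews and kept TVB's ratings high. TVB was delighted, yet still did not give him leading roles. Stephen Chow was never one to stay in a small pond: with his comedic talent already on show, he would not accept being sidelined, so he moved into film and made his mark on the big screen. Having gained a prize horse and then lost it, TVB naturally regretted it too late.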