Posted to user@nutch.apache.org by te...@gmail.com on 2006/10/03 14:58:04 UTC

Nutch crawler ignores sites without default page

I'm using intranet crawling. The URLs in the URL files include the
filenames, e.g.

http://somedomain.com/page1.htm
http://otherdomain.com/page2.htm

Neither site has an index.htm page. After crawling, when I use the
CrawlDbReader tool to view the list of crawled pages, one of the
pages is marked as fetched and the other as gone.

I guess this may depend on the status code the server returns when
connecting to http://somedomain.com or http://otherdomain.com,
i.e. whether it is 403 or 404.
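
One way to check this guess outside Nutch is to ask each server
directly for the status of the page and of the site root. The
following is only a rough sketch in Python, not anything Nutch itself
does; the helper names site_root, http_status and compare are made up
for illustration, and it assumes the servers answer HEAD requests.
Note that urllib follows redirects, so a 3xx answer will show up as
the status of the final target.

```python
import urllib.error
import urllib.request
from urllib.parse import urlsplit, urlunsplit


def site_root(url):
    """Return the scheme://host/ root for a page URL (hypothetical helper)."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, "/", "", ""))


def http_status(url, timeout=10):
    """Return the HTTP status code of a HEAD request, or None if unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx answers (e.g. 403, 404) are raised as HTTPError.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout and similar.
        return None


def compare(page_url):
    """Print the status of the page itself and of its site root."""
    for url in (page_url, site_root(page_url)):
        print(url, http_status(url))
```

Calling compare("http://somedomain.com/page1.htm") and then
compare("http://otherdomain.com/page2.htm") would show whether the two
servers differ only at the root or also for the page itself.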

But shouldn't Nutch just ignore the main page and request only
page1.htm or page2.htm?