Posted to user@nutch.apache.org by te...@gmail.com on 2006/10/03 14:58:04 UTC
Nutch crawler ignores sites without default page
I'm using intranet crawling. The URLs in the URL files include the
filenames, e.g.
http://somedomain.com/page1.htm
http://otherdomain.com/page2.htm
Neither site has an index.htm page. When I use the CrawlDbReader
tool after crawling to view the list of crawled pages, one of the
pages is fetched and the other is marked as gone.
I guess this may depend on the status code the server returns when
connecting to http://somedomain.com or http://otherdomain.com,
whether it is 403 or 404.
But shouldn't Nutch just ignore the main page and request only
page1.htm or page2.htm?
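
One way to test the guess about status codes, independent of Nutch,
would be to compare what each server returns for the site root and
for the listed page. Below is a minimal sketch in Python using only
the standard library. The helper names (root_url, status_of, compare)
are made up for illustration, and the URLs are the placeholders from
above, not real hosts.

```python
from urllib.parse import urlsplit, urlunsplit
import urllib.error
import urllib.request


def root_url(page_url):
    """Return the scheme://host/ root for a page URL (illustrative helper)."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/", "", ""))


def status_of(url, timeout=10):
    """Return the HTTP status code for a GET of url, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Error responses such as 403 or 404 are raised as HTTPError.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return None


def compare(page_url):
    """Map the site root and the page itself to their status codes."""
    return {u: status_of(u) for u in (root_url(page_url), page_url)}
```

Calling compare("http://somedomain.com/page1.htm") and
compare("http://otherdomain.com/page2.htm") would show whether the two
servers answer differently for the root than for the page. This only
checks the servers' responses; it does not show what Nutch itself
requests during a crawl.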