Posted to dev@nutch.apache.org by karthik085 <ka...@gmail.com> on 2007/10/03 23:15:44 UTC

Failed Fetch Pages - Index Verification and Optimization

I crawled a website. Out of 1000 links, 100 failed to fetch because they
were invalid links, there was a network error, or some other error occurred.
When I searched for any of these 100 failed pages, I didn't find them in the
search results.
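
For example, I could check what status Nutch recorded for one of the failed
URLs in the crawldb with something like this (the paths and the URL are just
placeholders from my setup):

  # Overall counts by status in the crawldb
  bin/nutch readdb crawl/crawldb -stats

  # Status of one specific URL that failed to fetch
  bin/nutch readdb crawl/crawldb -url http://www.example.com/some-failed-page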

So I think Nutch deletes these kinds of URLs from the index, but some page
info about them still exists in the segments. Is this right?
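
To see whether the segments still hold anything for such a URL, I suppose I
could dump a segment or look up the URL in it directly, along these lines
(again, the segment name and URL are placeholders):

  # Dump everything stored in one segment to a plain-text directory
  bin/nutch readseg -dump crawl/segments/20071003231544 seg_dump

  # Or get just the record for a single URL from that segment
  bin/nutch readseg -get crawl/segments/20071003231544 http://www.example.com/some-failed-page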

If so, will having these failed pages in the segments affect performance in
any way? I am assuming not, but any clarity on this matter is appreciated.
Will the size of the segments (storing info about these failed pages takes
some space) affect performance too?
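
To get a rough idea of how much the failed pages contribute to segment size,
I was thinking of listing the segments, something like:

  # List generated/fetched/parsed counts for all segments
  bin/nutch readseg -list -dir crawl/segments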

Also, I am curious about the inner details of how indexing works in Nutch:
how does Nutch index a segment, and what does it index and not index? I could
use Luke. Are there any other tools or techniques?
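
For instance, I guess I could point Luke at the merged index directory
(crawl/index in my case) and also test queries from the command line, roughly
like this (assuming searcher.dir points at my crawl directory):

  # Run a query against the index without going through the web app
  bin/nutch org.apache.nutch.searcher.NutchBean someterm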

Thanks.

-- 
View this message in context: http://www.nabble.com/Failed-Fetch-Pages---Index-Verification-and-Optimization-tf4564385.html#a13027939
Sent from the Nutch - Dev mailing list archive at Nabble.com.