Posted to user@nutch.apache.org by Ian Reardon <ir...@gmail.com> on 2005/05/22 14:53:09 UTC

Problem with crawl

I was crawling a site, and at around 10,000 documents it looks like
it deleted the segments and started over from 0.  I think it might
have merged them into the db?

I see this in the log:

050522 000938 Processing pagesByMD5: Merged to new DB containing 9324
records in 0.07 seconds
050522 000938 Processing pagesByMD5: Merged 133200.0 records/second
.
.
.
050522 000953 Overall processing: Sorted 1.2441012441012442E-5 entries/second
050522 000953 FetchListTool completed
050522 000953 logging at INFO

Then it starts over:

050522 003516 status: segment 20050522000951, 100 pages, 0 errors,
456114 bytes, 1523290 ms


So I lost all the 10,000 pages I had already fetched.  Is this normal?
How can I make it so that it never deletes the segments?

Re: Problem with crawl

Posted by Olaf Thiele <ol...@gmail.com>.
Hi Ian,
What type of search are you doing, internet or intranet?
Merging will not delete information. You should be able
to open the new database with Luke; inspecting the index
there will show you what it contains.
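
To check on disk whether the earlier segments are still there, a minimal
sketch along these lines could help. It assumes the crawl keeps its segments
as timestamp-named subdirectories (like 20050522000951 in the log above)
under a directory such as crawl/segments; that path and the helper name
list_segments are assumptions for illustration, not part of Nutch.

```python
import re
import sys
from pathlib import Path

# Segment names in the log look like 20050522000951: a 14-digit timestamp.
SEGMENT_NAME = re.compile(r"^\d{14}$")


def list_segments(segments_dir):
    """Return timestamp-named subdirectories of segments_dir, oldest first.

    Hypothetical helper: it only inspects directory names and does not
    read or validate the segment contents.
    """
    root = Path(segments_dir)
    if not root.is_dir():
        return []
    return sorted(
        p.name
        for p in root.iterdir()
        if p.is_dir() and SEGMENT_NAME.match(p.name)
    )


if __name__ == "__main__":
    # "crawl/segments" is an assumed location; pass your own path instead.
    path = sys.argv[1] if len(sys.argv) > 1 else "crawl/segments"
    for name in list_segments(path):
        print(name)
```

Run it with the path to your own segments directory as the first argument.
If older timestamps are still listed, the earlier segments are still on disk.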
Olaf

On 5/22/05, Ian Reardon <ir...@gmail.com> wrote:
> I was crawling a site, and at around 10,000 documents it looks like
> it deleted the segments and started over from 0.  I think it might
> have merged them into the db?
> 
> I see
> 
> 050522 000938 Processing pagesByMD5: Merged to new DB containing 9324
> records in 0.07 seconds
> 050522 000938 Processing pagesByMD5: Merged 133200.0 records/second
> .
> .
> .
> 050522 000953 Overall processing: Sorted 1.2441012441012442E-5 entries/second
> 050522 000953 FetchListTool completed
> 050522 000953 logging at INFO
> 
> Then it starts over
> 
> 050522 003516 status: segment 20050522000951, 100 pages, 0 errors,
> 456114 bytes, 1523290 ms
> 
> 
> So I lost all the 10,000 pages I had already fetched.  Is this normal?
> How can I make it so that it never deletes the segments?
> 


-- 

<SimpleHuman gender="male">
   <Physical name="Olaf Thiele" />
   <Virtual adress="http://www.olafthiele.de" />
</SimpleHuman>