Posted to user@nutch.apache.org by sdeck <sc...@gmail.com> on 2006/12/20 20:01:23 UTC

Fun question for index merge

Thanks for everyone's help so far with my postings.
Here is another question.

I am currently merging my crawls, but I am wondering whether I can skip a few
steps and, if so, how.
I inject a whole slew of URLs into a crawl each time, and then merge it with
the previous crawl.
The injected URLs are the same each time.

Now my merged segments directory is getting larger and the indexing
is getting slower. However, I only use the generated Lucene index
for my website, not any of the segments, etc. Plus, I restart the crawl
every time. So, would I be able to give the deduper the two Lucene
index directories I have, then use IndexMerger to combine the indexes
into a new Lucene index, and skip the merge of the linkdb, crawldb, and
segments?

Thanks,
S
-- 
View this message in context: http://www.nabble.com/Fun-question-for-index-merge-tf2861621.html#a7995826
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Fun question for index merge

Posted by sdeck <sc...@gmail.com>.
I tested this last night, so in case anyone wants to know the answer: yes,
this can be done.
If all you need are the Lucene indexes for your website, you can do the
crawl, do another crawl, and then run
IndexMerger (in the org.apache.nutch.indexer package) on the two indexes.
Then run DeleteDuplicates on that new, merged index.

Whamo: a new index with both crawls' data.
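
For anyone who wants to picture what the two steps do, here is a rough conceptual sketch in Python. It is not Nutch or Lucene code: plain dicts stand in for indexed documents, the function names and sample data are made up for illustration, and the rule "keep the most recently fetched document per URL" is an assumption about how the dedup step picks a winner. On the command line, the real steps should correspond to the merge and dedup commands of bin/nutch.

```python
# Conceptual sketch only: plain dicts stand in for Lucene documents.
# This is NOT Nutch's IndexMerger or DeleteDuplicates implementation.

def merge_indexes(*indexes):
    """Concatenate the documents of several indexes into one new list."""
    merged = []
    for index in indexes:
        merged.extend(index)
    return merged


def delete_duplicates(index):
    """Keep one document per URL, preferring the most recent fetch.

    Assumption for illustration: the newest fetch wins.
    """
    newest = {}
    for doc in index:
        kept = newest.get(doc["url"])
        if kept is None or doc["fetched"] > kept["fetched"]:
            newest[doc["url"]] = doc
    return list(newest.values())


# Two crawls that started from the same injected URLs (hypothetical data).
crawl_1 = [
    {"url": "http://example.com/a", "fetched": 1, "title": "A (old)"},
    {"url": "http://example.com/b", "fetched": 1, "title": "B"},
]
crawl_2 = [
    {"url": "http://example.com/a", "fetched": 2, "title": "A (new)"},
    {"url": "http://example.com/c", "fetched": 2, "title": "C"},
]

# Merge first, then dedup the merged result, as described above.
combined = delete_duplicates(merge_indexes(crawl_1, crawl_2))
for doc in sorted(combined, key=lambda d: d["url"]):
    print(doc["url"], doc["title"])
```

The point is that the linkdb, crawldb, and segments never enter into it; only the two indexes are read.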
S



