Posted to user@nutch.apache.org by Chris Schneider <Sc...@TransPac.com> on 2006/02/12 18:11:52 UTC

Injecting into existing DB

Nutch colleagues,

I'm wondering how you inject new URLs into an existing MapReduce 
crawldb in a way that guarantees they'll end up on the next fetch 
list. The db.score.injected property used to control the score of 
newly injected URLs, but I don't see that getting loaded anywhere in 
the Nutch 0.8 code. It looks like the score of the CrawlDatum added 
by Injector.java will just be 1.0. Since that's the minimum score in 
the 45M unfetched pages I currently have in my crawldb, it doesn't 
seem likely that the 21 new URLs I'd like to inject will end up in 
the topN=500K URLs in the first fetch list.
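To make the concern concrete, here is a toy, scaled-down sketch of score-ordered top-N selection. It is not Nutch's Generator code; the URLs, scores, and sizes are made up for illustration, and it only assumes that the fetch list takes the highest-scoring unfetched entries first.

```python
import heapq

def select_top_n(scores, top_n):
    """Return the top_n URLs by score, highest first (toy stand-in for fetch-list selection)."""
    best = heapq.nlargest(top_n, scores.items(), key=lambda kv: kv[1])
    return [url for url, _ in best]

# Hypothetical crawldb: 1000 unfetched URLs whose scores are all >= 1.0.
crawldb = {f"http://old.example/{i}": 1.0 + (i % 50) * 0.1 for i in range(1000)}

# Newly injected URLs get the minimum score, 1.0.
injected = {f"http://new.example/{i}": 1.0 for i in range(3)}
crawldb.update(injected)

fetch_list = select_top_n(crawldb, top_n=100)
selected_new = [url for url in fetch_list if url in injected]
print(len(fetch_list), "selected;", len(selected_new), "of them newly injected")
```

With every injected URL tied at the bottom of the score range, none of them makes the cut in this toy setup.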

Of course, I could just modify the code to honor the 
db.score.injected property, then set it to something like 2.0. 
However, I'm not sure I want to do this either. I'm guessing that 
would also bias the minimum score for all of the pages I get to from 
these new URLs.

What I'd like is to put this injection set on roughly equal footing 
with my original injection set. Thus, it seems like the proper way to 
handle this is to mark the injected URLs in some way that ensures 
that the Generator will put them on the first fetch list.

However, I'm probably missing something important here.

Ideas?

- Chris
-- 
------------------------
Chris Schneider
TransPac Software, Inc.
Schmed@TransPac.com
------------------------

Re: Injecting into existing DB

Posted by Stefan Groschupf <sg...@media-style.com>.
I normally use a simple trick in such situations.
I create a new, empty db, inject the URLs, create my segment, and fetch
the segment.
Then I inject the URLs a second time into my original db and update
the db with the segment.
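
The steps above can be sketched as a small Python helper that only builds the command lines and does not run anything. The directory names are hypothetical. The subcommand names and argument order (inject, generate, fetch, updatedb) are assumptions based on the Nutch 0.8-era command-line tool and should be checked against the usage output of bin/nutch. The segment directory is named by the generate step, so it is passed in here as a parameter.

```python
def side_db_commands(urls_dir, main_db, tmp_db, tmp_segments, segment):
    """Build the command sequence for the 'temporary db' trick (nothing is executed)."""
    nutch = "bin/nutch"  # assumed launcher path, relative to the Nutch install
    return [
        [nutch, "inject", tmp_db, urls_dir],        # 1. inject new URLs into an empty db
        [nutch, "generate", tmp_db, tmp_segments],  # 2. create a segment from that db
        [nutch, "fetch", segment],                  # 3. fetch the generated segment
        [nutch, "inject", main_db, urls_dir],       # 4. inject the same URLs into the original db
        [nutch, "updatedb", main_db, segment],      # 5. update the original db with the segment
    ]

# Hypothetical paths, for illustration only.
commands = side_db_commands(
    urls_dir="new_urls",
    main_db="crawl/crawldb",
    tmp_db="tmp/crawldb",
    tmp_segments="tmp/segments",
    segment="tmp/segments/SEGMENT_NAME",
)
for cmd in commands:
    print(" ".join(cmd))
```

Because the temporary db contains only the new URLs, they are the only candidates for its fetch list, so their score relative to the existing pages does not matter.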

Stefan

On 12.02.2006 at 18:11, Chris Schneider wrote:

> Nutch colleagues,
>
> I'm wondering how you inject new URLs into an existing MapReduce  
> crawldb in a way that guarantees they'll end up on the next fetch  
> list. The db.score.injected property used to control the score of  
> newly injected URLs, but I don't see that getting loaded anywhere  
> in the Nutch 0.8 code. It looks like the score of the CrawlDatum  
> added by Injector.java will just be 1.0. Since that's the minimum  
> score in the 45M unfetched pages I currently have in my crawldb, it  
> doesn't seem likely that the 21 new URLs I'd like to inject will  
> end up in the topN=500K URLs in the first fetch list.
>
> Of course, I could just modify the code to honor the  
> db.score.injected property, then set it to something like 2.0.  
> However, I'm not sure I want to do this either. I'm guessing that  
> would also bias the minimum score for all of the pages I get to  
> from these new URLs.
>
> What I'd like is to put this injection set on roughly equal footing  
> with my original injection set. Thus, it seems like the proper way  
> to handle this is to mark the injected URLs in some way that  
> ensures that the Generator will put them on the first fetch list.
>
> However, I'm probably missing something important here.
>
> Ideas?
>
> - Chris
> -- 
> ------------------------
> Chris Schneider
> TransPac Software, Inc.
> Schmed@TransPac.com
> ------------------------
>

---------------------------------------------
George Orwell was an Optimist
blog: http://www.find23.org
company: http://www.media-style.com