Posted to user@nutch.apache.org by Chris Schneider <Sc...@TransPac.com> on 2006/02/12 18:11:52 UTC
Injecting into existing DB
Nutch colleagues,
I'm wondering how you inject new URLs into an existing MapReduce
crawldb in a way that guarantees they'll end up on the next fetch
list. The db.score.injected property used to control the score of
newly injected URLs, but I don't see that getting loaded anywhere in
the Nutch 0.8 code. It looks like the score of the CrawlDatum added
by Injector.java will just be 1.0. Since that's the minimum score in
the 45M unfetched pages I currently have in my crawldb, it doesn't
seem likely that the 21 new URLs I'd like to inject will end up in
the topN=500K URLs in the first fetch list.
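To make the concern above concrete, here is a toy model of a score-ordered topN cut, scaled down by a factor of 1000 (45K existing pages, topN=500). It is only a sketch and not the Nutch Generator: the spread of existing scores, the example.* hostnames, and the function names are all assumptions made for illustration, and tie-breaking among pages that sit exactly at 1.0 is not modeled.

```python
import heapq
import random


def top_n(entries, n):
    """Return the n highest-scoring (score, url) entries."""
    return heapq.nlargest(n, entries, key=lambda e: e[0])


def simulate(existing=45_000, new=21, n=500, seed=0):
    """Count how many newly injected URLs survive a score-ordered topN cut."""
    rng = random.Random(seed)
    # Assumption: existing unfetched pages have scores spread above the 1.0 floor.
    entries = [(1.0 + rng.random(), f"http://old.example/{i}") for i in range(existing)]
    # Newly injected URLs all receive the default score of 1.0.
    entries += [(1.0, f"http://new.example/{i}") for i in range(new)]
    selected = top_n(entries, n)
    return sum(1 for _, url in selected if url.startswith("http://new.example/"))


print(simulate())
```

Under these assumptions none of the injected URLs should make the cut, because every slot is taken by a page scoring above the floor.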
Of course, I could just modify the code to honor the
db.score.injected property, then set it to something like 2.0.
However, I'm not sure I want to do this either. I'm guessing that
would also bias the minimum score for all of the pages I get to from
these new URLs.
What I'd like is to put this injection set on roughly equal footing
with my original injection set. Thus, it seems like the proper way to
handle this is to mark the injected URLs in some way that ensures
that the Generator will put them on the first fetch list.
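One way to picture the "mark the injected URLs" idea is a selection key that looks at a flag first and the score second. The sketch below is purely hypothetical: the `injected` flag, the tuple layout, and `select_fetch_list` are illustrative names and not part of Nutch. The stored score is left untouched, so in this toy nothing is biased for pages discovered later.

```python
import heapq


def select_fetch_list(entries, n):
    """entries are (url, score, injected) tuples.

    Flagged (injected) entries sort ahead of everything else;
    within each group, higher scores come first.
    """
    return heapq.nlargest(n, entries, key=lambda e: (e[2], e[1]))


entries = [
    ("http://old.example/a", 3.5, False),
    ("http://old.example/b", 2.0, False),
    ("http://new.example/x", 1.0, True),  # freshly injected, default score
]
fetch_list = [url for url, _, _ in select_fetch_list(entries, 2)]
print(fetch_list)
```

The flag would need to be cleared once the URL has been fetched, otherwise it would keep jumping the queue.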
However, I'm probably missing something important here.
Ideas?
- Chris
--
------------------------
Chris Schneider
TransPac Software, Inc.
Schmed@TransPac.com
------------------------
Re: Injecting into existing DB
Posted by Stefan Groschupf <sg...@media-style.com>.
I normally use a simple trick in such situations.
I create a new, empty db, inject the URLs, create my segment, and fetch
the segment.
Then I inject the URLs a second time into my original db and update the
db with the segment.
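The steps above might be scripted roughly as follows. This is a sketch under assumptions: it uses the inject, generate, fetch, and updatedb commands of bin/nutch, assumes it is run from the Nutch install directory against the local filesystem, and all directory names are made up. With dry_run=True it only prints the commands and uses a "<segment>" placeholder, since the real segment name is only known after generate has run.

```python
import os
import subprocess

NUTCH = "bin/nutch"  # assumption: run from the Nutch install directory


def run(cmd, dry_run):
    """Print the command; execute it only when dry_run is False."""
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd


def newest_segment(segments_dir):
    """Pick the last segment by name (assumes timestamp-style segment names)."""
    return f"{segments_dir}/{sorted(os.listdir(segments_dir))[-1]}"


def inject_via_temp_db(urls_dir, temp_dir, main_crawldb, dry_run=True):
    temp_db = f"{temp_dir}/crawldb"
    segments = f"{temp_dir}/segments"
    cmds = []
    # 1. Inject the new URLs into a fresh, empty crawldb.
    cmds.append(run([NUTCH, "inject", temp_db, urls_dir], dry_run))
    # 2. Generate a segment from that small db, so every new URL is selected.
    cmds.append(run([NUTCH, "generate", temp_db, segments], dry_run))
    # 3. Fetch the segment that generate just created.
    segment = f"{segments}/<segment>" if dry_run else newest_segment(segments)
    cmds.append(run([NUTCH, "fetch", segment], dry_run))
    # 4. Inject the same URLs into the original crawldb.
    cmds.append(run([NUTCH, "inject", main_crawldb, urls_dir], dry_run))
    # 5. Update the original crawldb with the fetched segment.
    cmds.append(run([NUTCH, "updatedb", main_crawldb, segment], dry_run))
    return cmds


commands = inject_via_temp_db("new_urls", "tmp_crawl", "crawl/crawldb")
```

Because the temporary db contains only the new URLs, the score ordering in the large db never comes into play for them.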
Stefan
On 12.02.2006 at 18:11, Chris Schneider wrote:
> Nutch colleagues,
>
> I'm wondering how you inject new URLs into an existing MapReduce
> crawldb in a way that guarantees they'll end up on the next fetch
> list. The db.score.injected property used to control the score of
> newly injected URLs, but I don't see that getting loaded anywhere
> in the Nutch 0.8 code. It looks like the score of the CrawlDatum
> added by Injector.java will just be 1.0. Since that's the minimum
> score in the 45M unfetched pages I currently have in my crawldb, it
> doesn't seem likely that the 21 new URLs I'd like to inject will
> end up in the topN=500K URLs in the first fetch list.
>
> Of course, I could just modify the code to honor the
> db.score.injected property, then set it to something like 2.0.
> However, I'm not sure I want to do this either. I'm guessing that
> would also bias the minimum score for all of the pages I get to
> from these new URLs.
>
> What I'd like is to put this injection set on roughly equal footing
> with my original injection set. Thus, it seems like the proper way
> to handle this is to mark the injected URLs in some way that
> ensures that the Generator will put them on the first fetch list.
>
> However, I'm probably missing something important here.
>
> Ideas?
>
> - Chris
> --
> ------------------------
> Chris Schneider
> TransPac Software, Inc.
> Schmed@TransPac.com
> ------------------------
>
---------------------------------------------
George Orwell was an Optimist
blog: http://www.find23.org
company: http://www.media-style.com