Posted to dev@nutch.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2006/05/11 14:48:40 UTC

Interleaved (parallel) fetch cycles

Hi,

I'm planning to work on adding support in 0.8 for interleaved fetch cycles.

What this means is that (within some limits) you can generate multiple 
fetchlists, fetch them at different times, and then update the crawldb 
not necessarily in the original sequence as they were generated. You can 
also generate more fetchlists before any updatedb is run.

This functionality was supported in 0.7.x. When FetchListTool selected a 
Page for fetching, its next fetch time was pushed 1 week into the future. 
This was a simple and effective way to keep the same Pages off the next 
fetchlist, while letting that hold "time out" after 1 week if, e.g., 
fetching failed or the segment was lost. Please note that this method 
requires modifying the WebDB.

If fetching was completed and an updatedb was run, the original 
fetchTime/fetchInterval could be recovered from a copy of the Page 
inside the FetcherOutput.
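
A minimal sketch of the 0.7 trick described above, under stated assumptions: PageSketch, HOLD, isDue and selectForFetch are hypothetical names for illustration, not the real Page or FetchListTool API. It only shows the idea of pushing the stored next fetch time one week out while keeping a copy of the original state alongside the fetchlist.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical stand-in for a 0.7-style Page record; not the real Nutch class.
final class PageSketch {
    // Hold period taken from the description above: one week.
    static final Duration HOLD = Duration.ofDays(7);

    final String url;
    Instant nextFetch;

    PageSketch(String url, Instant nextFetch) {
        this.url = url;
        this.nextFetch = nextFetch;
    }

    // A page is eligible for a fetchlist once its next fetch time has arrived.
    boolean isDue(Instant now) {
        return !nextFetch.isAfter(now);
    }

    // Put the page on a fetchlist: return a snapshot of its original state
    // (to travel with the fetchlist) and push the stored next fetch time
    // one week out, so the next generate run skips this page.
    PageSketch selectForFetch(Instant now) {
        PageSketch original = new PageSketch(url, nextFetch);
        nextFetch = now.plus(HOLD);
        return original;
    }
}
```

If the fetch never completes, the page simply becomes due again once the week has passed; if it does complete, the snapshot carries the original fetch time back to the update step.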

Now, in 0.8 we do it differently. Generator doesn't modify CrawlDB, so we 
have no way of recording which CrawlDatums ended up on a fetchlist. This 
means that two "generate" operations run in sequence, without an 
intervening updatedb, will produce exactly the same fetchlists.

Generator would have to be modified to use the same trick as in 0.7. 
Unfortunately, this probably means that it will have to run a sort of 
updatedb, using its output fetchlist to mark entries in CrawlDB. This 
adds another map-reduce job to an already long-ish job (Generator 
already uses two map-reduce jobs). This also means that Generator will 
have to put a lock on CrawlDB for the duration of this job, so that no 
other "generate" or "updatedb" can update it at the same time.
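
The locking requirement could look roughly like the sketch below. It is an illustration only: CrawlDbLockSketch and the ".locked" file name are hypothetical, and it uses the local filesystem through java.nio.file as a stand-in for whatever filesystem layer the real jobs run on. It relies on Files.createFile failing when the file already exists.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical lock-file guard around a CrawlDB directory (local filesystem only).
final class CrawlDbLockSketch implements AutoCloseable {
    private final Path lockFile;

    private CrawlDbLockSketch(Path lockFile) {
        this.lockFile = lockFile;
    }

    // Try to take the lock; fail if another generate/updatedb already holds it.
    static CrawlDbLockSketch acquire(Path crawlDbDir) throws IOException {
        Path lock = crawlDbDir.resolve(".locked"); // assumed file name
        try {
            // createFile checks for existence and creates in one atomic step.
            Files.createFile(lock);
        } catch (FileAlreadyExistsException e) {
            throw new IOException("CrawlDB is locked by another job: " + lock, e);
        }
        return new CrawlDbLockSketch(lock);
    }

    // Release the lock once the job has finished.
    @Override
    public void close() throws IOException {
        Files.deleteIfExists(lockFile);
    }
}
```

Used in a try-with-resources block around the marking job, the lock is released even if the job throws. A job that crashes hard would still leave a stale lock file behind, which this sketch does not handle.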

Then, when running an updatedb, the issue of scores and metadata comes 
into question. We can imagine now that there were some other updatedb-s 
run in the meantime, not necessarily with earlier fetchlists - so the 
score and metadata info could be actually newer in the latest CrawlDB 
than what we have inside the current segment. In such a case, we will get 
the following in CrawlDbReducer:

* "old" value from CrawlDb (which could be actually newer!). Even if 
it's old, its fetchTime could be in the future due to the trick 
described above. We could also get null here, if we just discovered a 
new page.

* "original" value from CrawlDb, which was recorded in fetchlist. This, 
for once, has a true fetch time, and its metadata and score are 
snapshots of that information at the time of "generate".

* "new" value from Fetcher, with new score / metadata information. We 
will also get "new" values from redirects, which might not match any of 
the above values (i.e. they could use unique urls).

* "linked" values from parsers, with score / metadata contributions.

Now, the question is how to update the score, metadata, fetchTime and 
fetchInterval information. We need a way to determine if the "new" value 
we have is in fact newer or older than the "old" value - I'm not sure 
how to do this, fetchTime and fetchInterval could have been modified so 
they are not reliable... Perhaps we should add a "generation ID" to 
CrawlDatum? Anyway, assuming we have a way to know this:

* if "new" is newer than "old", then we take all metadata from "old", 
overwrite all info with the values from "new", and we keep "new".

* if "new" is older than "old", then we overwrite its metadata with all 
values from "old". We do the same with fetchTime and fetchInterval. What 
about the score? I think that for new score calculations we should take 
the latest available score info from the "old" value.
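
The two rules above can be sketched as follows, assuming the "generation ID" idea is adopted as the way to tell which value is newer. DatumSketch and its fields are hypothetical stand-ins, not the real CrawlDatum, and the handling of metadata keys present only in the older value is one possible reading of the rules.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for CrawlDatum with an added generation ID.
final class DatumSketch {
    final long generationId;   // assumption: larger means more recent
    final long fetchTime;
    final long fetchInterval;
    final float score;
    final Map<String, String> metadata;

    DatumSketch(long generationId, long fetchTime, long fetchInterval,
                float score, Map<String, String> metadata) {
        this.generationId = generationId;
        this.fetchTime = fetchTime;
        this.fetchInterval = fetchInterval;
        this.score = score;
        this.metadata = metadata;
    }

    // Merge the "old" CrawlDb value with the "new" value from the fetcher.
    static DatumSketch merge(DatumSketch old, DatumSketch fresh) {
        if (old == null) {
            return fresh; // newly discovered page, nothing to merge with
        }
        Map<String, String> meta = new HashMap<>();
        if (fresh.generationId > old.generationId) {
            // "new" is newer: start from old metadata, let new values win.
            meta.putAll(old.metadata);
            meta.putAll(fresh.metadata);
            return new DatumSketch(fresh.generationId, fresh.fetchTime,
                    fresh.fetchInterval, fresh.score, meta);
        }
        // "new" is older: old metadata, fetchTime, fetchInterval and score win.
        meta.putAll(fresh.metadata);
        meta.putAll(old.metadata);
        return new DatumSketch(old.generationId, old.fetchTime,
                old.fetchInterval, old.score, meta);
    }
}
```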

Updatedb would also have to lock CrawlDB so that no other updatedb or 
generate could run while we modify it.

That's probably all at the moment ... Any comments or suggestions 
appreciated!

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: Interleaved (parallel) fetch cycles

Posted by Doug Cutting <cu...@apache.org>.

Andrzej Bialecki wrote:
> I'm planning to work on adding support in 0.8 for interleaved fetch cycles.

Great!

> Then, when running an updatedb, the issue of scores and metadata comes 
> into question. We can imagine now that there were some other updatedb-s 
> run in the meantime, not necessarily with earlier fetchlists - so the 
> score and metadata info could be actually newer in the latest CrawlDB 
> than what we have inside the current segment. In such a case, we will get 
> the following in CrawlDbReducer:
> 
> * "old" value from CrawlDb (which could be actually newer!). Even if 
> it's old, its fetchTime could be in the future due to the trick 
> described above. We could also get null here, if we just discovered a 
> new page.
> 
> * "original" value from CrawlDb, which was recorded in fetchlist. This, 
> for once, has a true fetch time, and its metadata and score are 
> snapshots of that information at the time of "generate".
> 
> * "new" value from Fetcher, with new score / metadata information. We 
> will also get "new" values from redirects, which might not match any of 
> the above values (i.e. they could use unique urls).
> 
> * "linked" values from parsers, with score / metadata contributions.
> 
> Now, the question is how to update the score, metadata, fetchTime and 
> fetchInterval information. We need a way to determine if the "new" value 
> we have is in fact newer or older than the "old" value - I'm not sure 
> how to do this, fetchTime and fetchInterval could have been modified so 
> they are not reliable... Perhaps we should add a "generation ID" to 
> CrawlDatum?

Would it work to, when generating, set the fetch time for generated 
items to the current time?  That way, the "new" value will always be a 
bit after the "old" time.  In 0.7 we stored not the fetched-time but the 
time-to-next-fetch, so we had to set it into the future.  But if we 
instead just mark it as fetched now, so that it won't be re-generated 
until its fetch interval has expired, that would resolve this, no?
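
A rough sketch of that suggestion, with hypothetical names (FetchTimeSketch, markGenerated, isDue, isNewer) and times as plain milliseconds. It assumes the stored fetchTime means "last fetched (or generated) at", so an entry is due again only once fetchTime plus fetchInterval has passed.

```java
// Hypothetical helpers illustrating "mark as fetched now" at generate time.
final class FetchTimeSketch {
    // At generate time, the entry's fetch time is simply set to "now".
    static long markGenerated(long nowMillis) {
        return nowMillis;
    }

    // An entry is selected again only after its fetch interval has expired.
    static boolean isDue(long fetchTimeMillis, long fetchIntervalMillis, long nowMillis) {
        return fetchTimeMillis + fetchIntervalMillis <= nowMillis;
    }

    // A value fetched after the generate mark compares as newer than the db entry.
    static boolean isNewer(long fetchedAtMillis, long dbFetchTimeMillis) {
        return fetchedAtMillis > dbFetchTimeMillis;
    }
}
```

With this reading, a second generate run right after the first skips the marked entries, and the actual fetch, which happens later, carries a time a bit after the one stored in the db.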

>  Anyway, assuming we have a way to know this:
> 
> * if "new" is newer than "old", then we take all metadata from "old", 
> overwrite all info with the values from "new", and we keep "new".
> 
> * if "new" is older than "old", then we overwrite its metadata with all 
> values from "old". We do the same with fetchTime and fetchInterval.

That sounds right to me.  When is "original" used, if at all?

> What about the score? I think that for new score calculations we should take 
> the latest available score info from the "old" value.

That also sounds right.  The crawl db should own the scores.  Scores 
should not be updated by the fetcher, but only by crawldb updates.

> Updatedb would also have to lock CrawlDB so that no other updatedb or 
> generate could run while we modify it.

Yes, that sounds right too.

Thanks for working on this!

Doug