Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2008/04/23 07:58:42 UTC

[Nutch Wiki] Update of "FetchCycleOverlap" by OtisGospodnetic

The following page has been changed by OtisGospodnetic:
http://wiki.apache.org/nutch/FetchCycleOverlap

New page:
Without overlapping jobs, people running Nutch are likely not utilizing their clusters fully.  Here is a recipe for overlapping jobs:

0. imagine a cluster with at most M map slots and R reduce slots (say M=R=8)

1. run the generate job with -numFetchers equal to M-2 (6 in this example)

2. run a fetch job (it uses M-2 map slots and, later, all R reduce slots)

3. at this point there are 2 open map slots for something else to run, say the updatedb job for the previously fetched/parsed segment

4. when the updatedb job is done, the cluster can take on more jobs.  Any completed tasks (C) from the running fetch job represent "open work slots"

5. start another fetch job.  At first it will be able to use only C slots, but C will grow as the first job's tasks finish, eventually reaching M-2 open slots.

6. at some point, the fetch job from step 2 will complete, opening up 2 map slots, so updatedb can be run (even in the background) and execution can go back to step 1
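
The commands for one pass through the recipe can be sketched roughly as follows. This is only an illustration in Python: the helper name cycle_commands, the two reserved slots as a parameter, and all paths are hypothetical, and the sketch only builds the argument lists without running anything. It assumes the usual bin/nutch generate, fetch and updatedb commands.

```python
def cycle_commands(crawldb, segments_dir, new_segment, prev_segment,
                   max_maps, reserved=2):
    """Build the command lines for one pass through the recipe.

    max_maps is M from step 0; reserved is the number of map slots
    left free for maintenance jobs (2 in the recipe). All paths are
    supplied by the caller and are purely illustrative.
    """
    num_fetchers = max_maps - reserved
    if num_fetchers < 1:
        raise ValueError("cluster too small to reserve %d map slots" % reserved)
    return [
        # step 1: generate a fetchlist split into M-2 parts
        ["bin/nutch", "generate", crawldb, segments_dir,
         "-numFetchers", str(num_fetchers)],
        # step 2: fetch the newly generated segment
        ["bin/nutch", "fetch", new_segment],
        # step 3: meanwhile, update the CrawlDb from the previous segment
        ["bin/nutch", "updatedb", crawldb, prev_segment],
    ]


if __name__ == "__main__":
    for cmd in cycle_commands("crawl/crawldb", "crawl/segments",
                              "crawl/segments/NEW", "crawl/segments/PREV",
                              max_maps=8):
        print(" ".join(cmd))
```

The fetch and updatedb commands are meant to run concurrently; the list only fixes which commands belong to one cycle, not how they are launched.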

Because a URL is "locked out" for 7 days after the generate step includes it in a fetchlist, the above cycle needs to complete within 7 days.  In more detail:

Generate updates the CrawlDb so that URLs selected for the latest fetchlist become "locked out" for the next 7 days.  This means that you can happily generate multiple fetchlists, fetch them out of order, and then do the DB updates out of order, as you see fit, so long as every update is done within the 7-day "lock out" period.
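
A minimal sketch of the deadline arithmetic, assuming the 7-day lock-out described above and that you record when each fetchlist was generated. The function names are hypothetical and nothing here reads Nutch's own data.

```python
from datetime import datetime, timedelta

# Lock-out period as stated in the text above.
LOCK_OUT = timedelta(days=7)


def update_deadline(generated_at):
    """Latest time by which updatedb should have run for this fetchlist."""
    return generated_at + LOCK_OUT


def within_lock_out(generated_at, now):
    """True while the URLs of this fetchlist are still locked out."""
    return now <= update_deadline(generated_at)


if __name__ == "__main__":
    generated = datetime(2008, 4, 23, 7, 58, 42)
    print(update_deadline(generated))
    print(within_lock_out(generated, datetime(2008, 4, 29)))
```

Tracking this per segment makes it easy to see which pending segments must be updated first when several fetchlists are in flight.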

This means that it's practical to limit numFetchers to a number below your cluster's map capacity, because then you can run other maintenance jobs in parallel with the currently running fetch job (such as updatedb and generating the next fetchlists).
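
The slot accounting behind this advice can be illustrated with a small calculation. It is a simplified model, not a description of how the scheduler actually assigns tasks: it assumes every fetch map task holds exactly one map slot until it completes, and the function name is hypothetical.

```python
def spare_map_slots(max_maps, num_fetchers, completed_fetch_tasks=0):
    """Map slots not held by the running fetch job (simplified model).

    max_maps is M, num_fetchers is the value passed to generate, and
    completed_fetch_tasks is C, the fetch map tasks already finished.
    """
    if not 0 <= completed_fetch_tasks <= num_fetchers:
        raise ValueError("completed tasks must be between 0 and num_fetchers")
    still_running = num_fetchers - completed_fetch_tasks
    return max_maps - still_running


if __name__ == "__main__":
    # M=8 and numFetchers=6, as in the recipe above
    print(spare_map_slots(8, 6))      # slots free while all fetch tasks run
    print(spare_map_slots(8, 6, 3))   # slots free once 3 fetch tasks finish
```

With M=8 and numFetchers=6, two slots are free from the start, and each fetch task that finishes frees one more for updatedb, generate, or the next fetch job.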