Posted to commits@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2006/05/10 18:51:41 UTC

[Nutch Wiki] Update of "nutch-0.8-dev/bin/nutch mergedb" by LukasVlcek

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by LukasVlcek:
http://wiki.apache.org/nutch/nutch-0%2e8-dev/bin/nutch_mergedb

New page:
= "mergedb" is an alias for "org.apache.nutch.crawl.CrawlDbMerger" =

== Merges several CrawlDb(s) together. URLFilters can optionally be used to filter out specific content. ==

You can merge several existing DBs into one. This is useful if you ran several
partial crawls and would like to combine the resulting DBs. Optionally, you can
run the current URLFilters on the URLs in the databases to filter out unwanted
URLs. This also works if you run it with just one input DB, which means
that you can use this tool to weed out unwanted URLs from a single DB.

It is possible to use this tool just for filtering. In that case,
specify only one input crawldb in the arguments, together with -filter.
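
As a rough illustration of what filtering means here, the following
self-contained Java sketch drops every URL that any filter rejects. It does
not use the Nutch classes: the filter type, the NO_IMAGES filter and the
example URLs are all hypothetical, and the "return null to reject" convention
is assumed to mirror the way URLFilters behave.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

public class FilterSketch {

    // Hypothetical filter: rejects GIF images by returning null, passes everything else.
    static final UnaryOperator<String> NO_IMAGES =
            url -> url.endsWith(".gif") ? null : url;

    /** Runs a URL through each filter in turn; a null result means "rejected". */
    static String applyFilters(String url, List<UnaryOperator<String>> filters) {
        for (UnaryOperator<String> filter : filters) {
            if (url == null) {
                break; // an earlier filter already rejected this URL
            }
            url = filter.apply(url);
        }
        return url;
    }

    /** Returns only the URLs that every filter accepted, in their original order. */
    static List<String> keepAccepted(List<String> urls, List<UnaryOperator<String>> filters) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            if (applyFilters(url, filters) != null) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> urls = List.of("http://example.com/", "http://example.com/logo.gif");
        // Only the first URL survives the hypothetical NO_IMAGES filter.
        System.out.println(keepAccepted(urls, List.of(NO_IMAGES)));
    }
}
```

In the real tool the filters come from the Nutch configuration rather than
from code like this; the sketch only shows the accept/reject logic.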

If more than one !CrawlDb contains information about the same URL,
only the most recent version is retained, as determined by the
value of org.apache.nutch.crawl.!CrawlDatum.getFetchTime().
However, all metadata information from all versions is accumulated,
with newer values taking precedence over older values.
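
The merge rule above can be sketched in a few lines of Java. This is only a
minimal model under stated assumptions, not the tool's actual code: the Entry
class is a hypothetical stand-in for !CrawlDatum, its fetchTime field plays the
role of getFetchTime(), and metadata is simplified to a map of strings.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MergeSketch {

    /** Hypothetical stand-in for one CrawlDb entry; not the real CrawlDatum class. */
    static final class Entry {
        final long fetchTime;                 // plays the role of CrawlDatum.getFetchTime()
        final Map<String, String> metadata;   // simplified metadata

        Entry(long fetchTime, Map<String, String> metadata) {
            this.fetchTime = fetchTime;
            this.metadata = metadata;
        }
    }

    /**
     * Merges all versions of the entry for a single URL.
     * Assumes the list is non-empty.
     */
    static Entry merge(List<Entry> versions) {
        // Sort oldest first so that metadata from newer versions overwrites older values.
        List<Entry> sorted = new ArrayList<>(versions);
        sorted.sort(Comparator.comparingLong(e -> e.fetchTime));

        Map<String, String> merged = new LinkedHashMap<>();
        for (Entry e : sorted) {
            merged.putAll(e.metadata);
        }

        // The most recent version wins; only the metadata is accumulated.
        Entry newest = sorted.get(sorted.size() - 1);
        return new Entry(newest.fetchTime, merged);
    }

    public static void main(String[] args) {
        Entry older = new Entry(1000L, Map.of("score", "0.5", "lang", "en"));
        Entry newer = new Entry(2000L, Map.of("score", "0.9"));
        Entry result = merge(List.of(older, newer));
        System.out.println(result.fetchTime + " " + result.metadata);
    }
}
```

In this example the merged entry keeps the newer fetch time (2000), takes
"score" from the newer version, and still carries "lang" from the older one.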

=== Usage ===
 nutch-0.8-dev/bin/nutch org.apache.nutch.crawl.!CrawlDbMerger output_crawldb crawldb1 [crawldb2 crawldb3 ...] [-filter]

  '''output_crawldb:''' Output !CrawlDb.[[BR]]
  '''crawldb1 [crawldb2 crawldb3 ...]:''' One or more input !CrawlDb(s).[[BR]]
  '''-filter:''' Apply the currently configured URLFilters to the URLs in the input !CrawlDb(s).[[BR]]

=== Configuration Files ===
 hadoop-default.xml[[BR]]
 hadoop-site.xml[[BR]]
 nutch-default.xml[[BR]]
 nutch-site.xml[[BR]]

=== Other Files ===
 None.

=== Caveats and Notes ===
 The index.done file is not created.

DevelopmentCommandLineOptions