Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2011/06/28 22:17:31 UTC

[Nutch Wiki] Trivial Update of "bin/nutch mergedb" by LewisJohnMcgibbney

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "bin/nutch mergedb" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/bin/nutch%20mergedb?action=diff&rev1=1&rev2=2

Comment:
trivial formatting

Mergedb is an alias for org.apache.nutch.crawl.CrawlDbMerger

- This tool merges several CrawlDb's into one, optionally filtering URLs through the current URLFilters, to skip prohibited pages. It is possible to use this tool just for filtering - in that case only one CrawlDb should be specified in arguments. If more than one CrawlDb contains information about the same URL, only the most recent version is retained, as determined by the value of {@link org.apache.nutch.crawl.CrawlDatum#getFetchTime()}. However, all metadata information from all versions is accumulated, with newer values taking precedence over older values.
+ This tool merges several crawldb's into one, optionally filtering URLs through the current URLFilters, to skip prohibited pages. It is possible to use this tool just for filtering - in that case only one crawldb should be specified in arguments. If more than one CrawlDb contains information about the same URL, only the most recent version is retained, as determined by the value of ''''' org.apache.nutch.crawl.CrawlDatum#getFetchTime()'''''. However, all metadata information from all versions is accumulated, with newer values taking precedence over older values.
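
The merge rule described above (keep the most recent version by fetch time, accumulate metadata with newer values overriding older ones) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual CrawlDbMerger code: Entry and CrawlDbMergeSketch are hypothetical stand-ins, metadata is simplified to a map of strings, and the handling of equal fetch times (input order is kept) is an assumption.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one CrawlDb record of a single URL.
// This is NOT the real org.apache.nutch.crawl.CrawlDatum class.
final class Entry {
    final long fetchTime;                // plays the role of getFetchTime()
    final Map<String, String> metadata;  // simplified metadata

    Entry(long fetchTime, Map<String, String> metadata) {
        this.fetchTime = fetchTime;
        this.metadata = new HashMap<>(metadata);
    }
}

final class CrawlDbMergeSketch {
    // Merges all versions of one URL: the newest fetch time wins,
    // metadata from every version is accumulated, newer values override older.
    static Entry merge(List<Entry> versions) {
        if (versions.isEmpty()) {
            throw new IllegalArgumentException("at least one version is required");
        }
        // Sort oldest first so that later putAll calls overwrite older values.
        // The sort is stable, so equal fetch times keep input order (assumption).
        List<Entry> sorted = new ArrayList<>(versions);
        sorted.sort(Comparator.comparingLong((Entry e) -> e.fetchTime));

        Map<String, String> merged = new HashMap<>();
        for (Entry e : sorted) {
            merged.putAll(e.metadata);
        }
        Entry newest = sorted.get(sorted.size() - 1);
        return new Entry(newest.fetchTime, merged);
    }

    public static void main(String[] args) {
        Map<String, String> older = new HashMap<>();
        older.put("source", "seed");
        older.put("score", "0.1");
        Map<String, String> newer = new HashMap<>();
        newer.put("score", "0.9");

        List<Entry> versions = new ArrayList<>();
        versions.add(new Entry(200L, newer));
        versions.add(new Entry(100L, older));

        Entry result = merge(versions);
        System.out.println("fetchTime = " + result.fetchTime);
        System.out.println("score     = " + result.metadata.get("score"));
        System.out.println("source    = " + result.metadata.get("source"));
    }
}
```

Sorting oldest first and applying putAll in that order is one simple way to get "newer values take precedence"; the real tool may implement this differently.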

Usage: